One Useless Coding Trick a Day

Server Usage

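An NVMe (Non-Volatile Memory Express) SSD can speed up file reads and writes on disk.
First, check whether the machine has an NVMe SSD:
ls /dev/nvme*
Then format the NVMe device and mount it where it is needed:
sudo mkfs.ext4 /dev/nvme1n1 # format the drive (erases any existing data on it)
sudo mkdir /mnt/data1
sudo mount /dev/nvme1n1 /mnt/data1
RAM disk: instead of actually reading from and writing to disk, keep the files in memory. [reference link]
If logging in to a server over ssh runs into problems, use ssh -vv to see the verbose log.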

The parallel-ssh and parallel-scp commands make it quick to run a command on several servers or to copy files to several servers at once.
Install:

sudo apt install pssh

Usage:

parallel-ssh -i -t 0 -h hostfile "hostname"
parallel-scp -h hostfile -l <user> src dst

If this fails with

Usage: parallel-scp [OPTIONS] local remote
parallel-scp: error: Hosts not specified.

use the long options instead:

parallel-scp --hosts=hostfile --user=<user> src dst

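If the server cannot reach the internet but you have a ClashX proxy on your own machine, you can set up a reverse proxy through an SSH reverse tunnel.
On the local machine:
ssh -NfR 7820:127.0.0.1:7890 username@server_ip
Note that the first port number must differ from the proxy's port number. The second port number, 7890, is the default ClashX port; replace it to match your setup.
Then ssh to the server as usual.
Open ClashX, press Command+C to copy the proxy command, paste it into the server's shell, and change every port number in it to the first port set above (7820). The server can then reach the internet. Also note that /etc/ssh/sshd_config needs to be configured as described in this blog post.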
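Shared storage with NFS
Install the NFS service on the server:
sudo apt install nfs-kernel-server
On the server, add this line to /etc/exports:
/dir *(rw,sync,no_subtree_check,no_root_squash)
Then run on the server:
sudo service nfs-kernel-server restart
On the client, run (replace <server_ip> with the address of the NFS server):
sudo apt install nfs-common --force-yes
sudo mount -t nfs <server_ip>:/dir /dir
When the share is no longer needed, stop the service on the server and unmount it on the client.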
Check port usage:
netstat -tunlp | grep <port_number>
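The same check can be done from a script. This is a minimal sketch, assuming only TCP matters and the service listens on the local machine; `port_in_use` is a hypothetical helper name, not part of netstat or any library.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

Unlike netstat, this does not tell you which process owns the port, and it does not cover UDP.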
Do not casually run conda install python==x.x to change the Python version of an existing environment: the pip packages installed earlier will be overwritten.

Count the physical CPUs:

cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l

Count the logical CPUs:

cat /proc/cpuinfo | grep "processor" | wc -l

Show the number of cores per CPU:

cat /proc/cpuinfo | grep "cpu cores" | uniq

Show the CPU model:

cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
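For scripts, the same counts can be obtained by parsing the file directly. This is a rough sketch, assuming the x86 layout of /proc/cpuinfo (fields named processor, physical id and cpu cores); `summarize_cpuinfo` is a hypothetical helper.

```python
def summarize_cpuinfo(text: str) -> dict:
    """Count physical CPUs, logical CPUs and cores per CPU from cpuinfo text."""
    physical_ids = set()
    logical = 0
    cores_per_cpu = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_ids.add(value)
        elif key == "cpu cores":
            cores_per_cpu.add(int(value))
    return {
        "physical_cpus": len(physical_ids),
        "logical_cpus": logical,
        "cores_per_cpu": sorted(cores_per_cpu),
    }
```

On a Linux machine it could be called as `summarize_cpuinfo(open("/proc/cpuinfo").read())`.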

Show the Linux kernel version:

uname -a

Show memory usage:

free -m

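GPU Usage

Persistence mode can speed up computation and memory operations on the GPU, since it keeps the driver loaded between jobs:
sudo nvidia-smi -pm 1
The dmesg command shows past hardware error messages from the kernel log.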

GPU devices not detected: this may be caused by a fabric manager version mismatch.
To check:

sudo service nvidia-fabricmanager status

If the status reports a version mismatch, reinstall fabric manager:

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/nvidia-fabricmanager-450_450.80.02-1_amd64.deb 
sudo apt install ./nvidia-fabricmanager-450_450.80.02-1_amd64.deb 
sudo systemctl enable nvidia-fabricmanager 
sudo systemctl restart nvidia-fabricmanager

On some machines, fabric manager updates itself automatically; automatic updates must be disabled to keep the versions matched. [script to stop automatic updates]

Change the GPU power limit: nvidia-smi -i %s -pl [power_upper_limit]
Change the GPU application clocks (memory, graphics): "nvidia-smi -i %s -ac %s,%s" % (gid, mem, gra)
Query the GPU application clocks: nvidia-smi --query-gpu=clocks.applications.graphics,clocks.applications.mem -i %s --format=csv,noheader,nounits
Restore the default clocks: nvidia-smi -rac
Lock the GPU at a fixed frequency: nvidia-smi -lgc $frequency -i $i
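The %s placeholders above are filled in from Python. One possible way to assemble these commands as argument lists, which avoids shell quoting issues, is sketched below. The helper names are hypothetical, and nothing here calls nvidia-smi by itself.

```python
def power_limit_cmd(gpu_id: int, watts: int) -> list:
    """Arguments for setting the power limit of one GPU."""
    return ["nvidia-smi", "-i", str(gpu_id), "-pl", str(watts)]


def application_clocks_cmd(gpu_id: int, mem_mhz: int, graphics_mhz: int) -> list:
    """Arguments for setting application clocks; memory comes first, then graphics."""
    return ["nvidia-smi", "-i", str(gpu_id), "-ac", f"{mem_mhz},{graphics_mhz}"]


def query_clocks_cmd(gpu_id: int) -> list:
    """Arguments for reading back the application clocks as bare CSV values."""
    return [
        "nvidia-smi",
        "--query-gpu=clocks.applications.graphics,clocks.applications.mem",
        "-i", str(gpu_id),
        "--format=csv,noheader,nounits",
    ]


def lock_graphics_clock_cmd(gpu_id: int, mhz: int) -> list:
    """Arguments for locking the graphics clock of one GPU to a single value."""
    return ["nvidia-smi", "-lgc", str(mhz), "-i", str(gpu_id)]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Which clock pairs are valid depends on the GPU model, so any concrete numbers would be placeholders.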
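On AWS EC2 servers the GPU may temporarily not work out of the box, possibly because NVIDIA updated its public key while the AWS system update has not yet updated the driver. In that case, install the driver yourself before using the GPU.
Set the environment variables:
BASE_URL=https://us.download.nvidia.com/tesla
DRIVER_VERSION=450.80.02 # 510.47.03 for A100
Then run:
curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
Alternatively, fix apt-get first and then install the driver by following https://blog.csdn.net/qq_28256407/article/details/115548675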

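Deep Learning Development (mainly PyTorch + NCCL distributed training)

PyTorch installed with pip install uses a statically linked NCCL. To update the NCCL version, use the following commands (for A100 GPUs):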

git clone -b v2.11.4-1 https://github.com/NVIDIA/nccl /nccl-2.11.4 
cd /nccl-2.11.4 
make -j src.build TRACE=1 NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" 
make install

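Cleaning up GPU memory after using PyTorch DDP:
import torch
import torch.distributed as dist

# ddp_model, dataset and optimizer are the objects created during training
torch.cuda.synchronize()       # wait for pending GPU work to finish
del ddp_model
del dataset
del optimizer
torch.cuda.empty_cache()       # release cached blocks held by the allocator
dist.destroy_process_group()
However, in the PyTorch versions current at the time of writing, dist.destroy_process_group(group=subgroup) does not actually free the memory used by the subgroup, and this leak is not caught by the usual memory leak detection methods.
(memory leak detection method 1, memory leak detection method 2)
Using NCCL_DEBUG=TRACE requires rebuilding NCCL with the TRACE=1 flag and then running make install.
The NCCL install path must also be in LD_LIBRARY_PATH. Each trace line is prefixed with a timestamp in milliseconds.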

If you hit strange bugs with NCCL (such as a segmentation fault), check that the environment variables are set correctly. Even if the log printed with NCCL_DEBUG=INFO looks completely normal, set the environment variables again.

export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_SOCKET_IFNAME=eth0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_NET_GDR_LEVEL=5
export LD_PRELOAD=/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so # path to libnccl-net.so
export LD_LIBRARY_PATH=/opt/hpcx/nccl_rdma_sharp_plugin/lib:$LD_LIBRARY_PATH # the directory containing libnccl-net.so

In particular, if NCCL_IB_PCI_RELAXED_ORDERING, CUDA_DEVICE_ORDER and NCCL_NET_GDR_LEVEL are not set, communication between GPUs (on a single machine) can be very slow.
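A training script can fail early when these variables are missing instead of silently running slowly. The following is a small sketch, assuming the three variables named above are the ones that matter on your cluster; `missing_vars` and `REQUIRED_VARS` are hypothetical names.

```python
import os

# Variables whose absence was observed to slow down multi-GPU communication
REQUIRED_VARS = (
    "NCCL_IB_PCI_RELAXED_ORDERING",
    "CUDA_DEVICE_ORDER",
    "NCCL_NET_GDR_LEVEL",
)


def missing_vars(env=None, required=REQUIRED_VARS) -> list:
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_vars()` at the start of each rank and logging the result also makes it easier to compare settings across machines.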

If NCCL can set up the process group but reports an error during communication:

RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, internal error, NCCL version 21.1.4
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

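Check that the master IP and port are set correctly. In a multi-machine setup, also check that the environment variables and the NCCL version are consistent across machines.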

If the job hangs for a while and then fails with:

Traceback (most recent call last):
  File "test.py", line 112, in <module>
    ddp_model = DDP(model, device_ids=[args.rank% torch.cuda.device_count()], output_device=args.rank% torch.cuda.device_count())
  File "/data/gdd/software/miniconda3/envs/env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

then the NCCL_SOCKET_IFNAME environment variable needs to be set.

If NCCL cannot set up the process group and reports:

ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Check whether the CUDA_VISIBLE_DEVICES environment variable differs between the processes (ranks) on the same machine.
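One way to run this check is to collect the value seen by each local rank and look for duplicates. The sketch below assumes the goal is that no two ranks on a machine end up on the same GPU; `ranks_sharing_devices` is a hypothetical helper, and gathering the values from the ranks is left out.

```python
from collections import defaultdict


def ranks_sharing_devices(visible_by_rank: dict) -> dict:
    """Map each CUDA_VISIBLE_DEVICES value used by several ranks to those ranks."""
    ranks_by_value = defaultdict(list)
    for rank, value in sorted(visible_by_rank.items()):
        ranks_by_value[value].append(rank)
    # Keep only the values that more than one rank reported
    return {value: ranks for value, ranks in ranks_by_value.items() if len(ranks) > 1}
```

An empty result means every rank reported a different value.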

If PyTorch training fails with:
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, xxxx])
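Check whether the DataLoader sets drop_last=True, and whether the batch size is greater than 1 (required if the model contains batch normalization layers). The error typically appears when the last batch holds a single sample.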
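Whether a given dataset size can trigger this error is simple arithmetic. The sketch below mirrors how batches are cut from a dataset; `last_batch_size` is a hypothetical helper, not a PyTorch function.

```python
def last_batch_size(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Size of the final batch of an epoch (0 if the epoch yields no batch)."""
    full_batches, remainder = divmod(num_samples, batch_size)
    if remainder == 0 or drop_last:
        # Either the data divides evenly or the incomplete batch is dropped
        return batch_size if full_batches > 0 else 0
    return remainder
```

For example, 101 samples with a batch size of 4 leave one sample in the final batch unless drop_last is enabled.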
PyTorch Profiler is a handy tool for analyzing performance.

If the PyTorch DataLoader reports:

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f733cbf6f70>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
    self._shutdown_workers()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1493, in _shutdown_workers
    if w.is_alive():
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

the dataset may be corrupted.

Graceful Python Projects

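When using the logging package, once a handler has been added in one file, there is no need to add it again in the other .py files that this file imports. Each of them only needs: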
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)
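This works because records propagate from a module logger up to the handlers of its ancestors. A small self-contained demonstration, assuming the handler is attached to the root logger; the logger name `mypkg.worker` and the function name are made up for the example.

```python
import io
import logging


def log_through_root_handler() -> str:
    """Attach one handler to the root logger and log from a child logger."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)  # done once, e.g. in the entry-point file
    try:
        log = logging.getLogger("mypkg.worker")  # what an imported module would do
        log.setLevel(logging.INFO)
        log.info("hello from worker")
    finally:
        root.removeHandler(handler)  # keep the demo free of side effects
    return stream.getvalue()
```

The child logger never adds a handler of its own, yet its record reaches the stream attached to the root logger.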

Delete the compiled bytecode files (.pyc) found in __pycache__:

find . -name "*.pyc" -type f -print -exec rm -rf {} \;
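The find command removes the .pyc files but leaves the empty __pycache__ directories behind. A possible Python alternative that removes both is sketched here; `remove_bytecode` is a hypothetical helper and it deletes files, so try it on a scratch directory first.

```python
import shutil
from pathlib import Path


def remove_bytecode(root: str) -> int:
    """Delete __pycache__ directories and stray .pyc files under root.

    Returns the number of directories and files removed.
    """
    removed = 0
    root_path = Path(root)
    # Materialize the matches first so deletion does not disturb the traversal
    for cache_dir in list(root_path.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed += 1
    for pyc_file in list(root_path.rglob("*.pyc")):
        pyc_file.unlink()
        removed += 1
    return removed
```

Calling `remove_bytecode(".")` covers the same directory tree as the find command above.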

Docker Usage

Start a container from a remote image:

sudo docker run -it -d --name=<name> --privileged --net=host --ipc=host --gpus=all -v /opt:/opt2 repo/tag

Open a bash shell inside the container:
sudo docker exec -it <name> bash

Build an image from a Dockerfile:

sudo docker build -t <name> -f Dockerfile .

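Push the image of the current container to a remote repo
First make sure you are logged in to Docker; if not, log in:
docker login -u <username> -p <password>

sudo docker commit <container_id> <repo>/<tag>
sudo docker push <repo>/<tag>

Before deleting an image, delete its containers first: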

sudo docker ps
sudo docker rm <container>
sudo docker image ls
sudo docker rmi <image>

Restart the Docker service:
systemctl restart docker
If images cannot be pulled even though the network works normally, the Docker proxy configuration may be the problem. See:
https://forums.docker.com/t/docker-pull-results-in-request-canceled-while-waiting-for-connection-client-timeout-exceeded-while-awaiting-headers/73064/26
Two common fixes:

Docker proxy configuration: https://docs.docker.com/config/daemon/systemd/#httphttps-proxy
System DNS settings: https://datawookie.dev/blog/2018/10/dns-on-ubuntu/

Use docker info to check whether a default proxy is configured. If editing the files under /etc/systemd/system/docker.service.d/ and restarting Docker does not override the proxy settings, they may be written in /lib/systemd/system/docker.service, in which case edit that file instead.

Kubernetes Usage

Initialize: sudo kubeadm init --pod-network-cidr=192.168.0.0/16
Let the master node also act as a worker and take part in scheduling:
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl get nodes -o wide # list the nodes
The first command removes the node-role.kubernetes.io/master taint from every node (some taints carry a NoSchedule effect; kubectl describe node <node-name> lists the taints on a node).
Show a detailed description of a resource: kubectl describe
Show the status of submitted jobs: kubectl get pods
Show the log of a specific pod: kubectl logs <pod-name>

Before deleting a pod, delete the deployment, stateful set, daemonset, etc. that owns it; otherwise the pod keeps being restarted after it is deleted.

kubectl get deployment -n <namespace>
kubectl delete deployment <deployment> -n <namespace>
kubectl delete pod <pod-name> -n <namespace>

If more than 80% of the space under / is in use, the node runs into disk pressure and the disk needs to be cleaned up.
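A quick way to check how full the root filesystem is can be sketched with the standard library. The 0.8 default mirrors the 80% figure above and is an assumption about the cluster's configuration, not a value read from Kubernetes; `used_fraction` and `over_threshold` are hypothetical names.

```python
import shutil


def used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding path that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems may report a size of zero
    return usage.used / usage.total


def over_threshold(path: str = "/", threshold: float = 0.8) -> bool:
    """Return True if the used fraction exceeds the threshold."""
    return used_fraction(path) > threshold
```

The result can differ slightly from the percentage shown by df, which accounts for reserved blocks differently.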
Kubernetes inter-pod networking test: https://projectcalico.docs.tigera.io/getting-started/kubernetes/hardway/test-networking
Using InfiniBand communication inside DL jobs on Kubernetes: https://github.com/gudiandian/k8s-rdma-sriov-dev-plugin