Today, while training with deepspeed, I wanted to use GPUs 4, 5, 6, and 7, but setting the following environment variable had no effect:

export CUDA_VISIBLE_DEVICES=4,5,6,7

I finally solved it by specifying the GPUs in the deepspeed configuration (the launch script) instead. Along the way I ran into this error:

[2023-07-29 09:29:29,308] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
  File "/home/mapengsen/anaconda3/envs/drug/bin/deepspeed", line 6, in <module>
    main()
  File "/home/mapengsen/anaconda3/envs/drug/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 405, in main
    raise ValueError("Cannot specify num_nodes/gpus with include/exclude")
ValueError: Cannot specify num_nodes/gpus with include/exclude

The error was caused by a misconfiguration in the deepspeed shell script.
Incorrect:

run_cmd="deepspeed --include localhost:4,5,6,7 --num_nodes ${DLWS_NUM_WORKER} --master_port=${MASTER_PORT} --num_gpus ${DLWS_NUM_GPU_PER_WORKER} \
    train_retrieval.py ${full_options} ${custom_train_options}"
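Correct:

run_cmd="deepspeed --include localhost:4,5,6,7 --master_port=${MASTER_PORT} train_retrieval.py ${full_options} ${custom_train_options}"

Since the GPU IDs are already specified with --include, there is no need to also set --num_nodes or --num_gpus; the launcher rejects the combination, as the ValueError shows.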
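To avoid mixing the two styles by accident, the launch script can build the command so that --include and --num_nodes/--num_gpus are never emitted together. The following is only a minimal sketch under stated assumptions: the function name build_deepspeed_cmd and its positional arguments are hypothetical, port 29500 is just an example value, and the function only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: assemble a deepspeed launch command that never combines
# --include with --num_nodes/--num_gpus.
# build_deepspeed_cmd and its arguments are hypothetical names for illustration.

build_deepspeed_cmd() {
    local include_gpus="$1"   # e.g. "4,5,6,7"; empty string means no explicit GPU IDs
    local master_port="$2"    # example value only
    local num_nodes="$3"
    local num_gpus="$4"

    local cmd="deepspeed"
    if [ -n "${include_gpus}" ]; then
        # Explicit GPU IDs: pass only --include.
        cmd="${cmd} --include localhost:${include_gpus}"
    else
        # No explicit IDs: fall back to node and GPU counts.
        cmd="${cmd} --num_nodes ${num_nodes} --num_gpus ${num_gpus}"
    fi
    cmd="${cmd} --master_port=${master_port} train_retrieval.py"
    printf '%s\n' "${cmd}"
}

# Print (do not run) the command for GPUs 4-7.
build_deepspeed_cmd "4,5,6,7" 29500 1 4
# prints: deepspeed --include localhost:4,5,6,7 --master_port=29500 train_retrieval.py
```

Passing an empty first argument makes the function fall back to the count-based flags, so either style stays available while the invalid combination cannot be produced.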