Cannot run Tutorial_2.1_MatterGPT_eform #10

Open
funihang opened this issue Aug 27, 2024 · 3 comments

@funihang

Hi,
I’m encountering some issues with Tutorial 2.1 and wanted to ask if there might be something missing or incorrect.
image

Additionally, when I comment out the show_progress() function, it leads to another error:
image
image

Could you please correct that? Thanks for your help.

@xiaohang007
Owner

Thank you for identifying the bug. I've resolved it by adding the following lines to ./data/mp20_nonmetal/workflow/0_run.pbs:
"export PATH=/opt/conda/bin:$PATH
source activate chgnet"
This activates the chgnet environment for the data generation script, which I had forgotten to do initially. Now it's working perfectly.
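To confirm that the job really runs in the intended environment, a small check like the one below can be run with the same interpreter the job script uses. This is only a sketch: the helper names `module_available` and `environment_report` are made up for illustration, and it assumes the package is importable under the name `chgnet`.

```python
import importlib.util
import shutil
import sys


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


def environment_report(modules=("chgnet",)) -> dict:
    """Collect the interpreter path, the conda path, and the import status of each module."""
    return {
        "python": sys.executable,
        "conda": shutil.which("conda"),  # None if conda is not on PATH
        "modules": {m: module_available(m) for m in modules},
    }


if __name__ == "__main__":
    # Assumption: the environment is expected to provide the "chgnet" package.
    print(environment_report())
```

If `modules["chgnet"]` is reported as False, the environment was most likely not activated before the script started.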

@funihang
Author

Thank you for the update, but the issue in Tutorial 2.1 doesn't seem to be solved. The problem seems to originate in line 67 of the "1. building training set" cell, specifically with the show_progress() function. Additionally, even after commenting out this function, another error was reported. Lines 68 and 70 clean up the data instead of collecting it. I guess utils.py may need to be modified to resolve this.

@xiaohang007
Owner

xiaohang007 commented Aug 29, 2024

I noticed that your conda location is ~/.conda, which indicates you're not using the provided Docker image. If you wish to run this tutorial on your own machine, you'll need to install not only SLICES but also the Slurm queue system to manage calculations, which can be a challenging task. I suggest you follow the steps below to set up the Jupyter backend with the provided Docker image.

Jupyter backend setup
(1) Download this repo and unzip it.

(2) Put your new Materials Project API key in "APIKEY.ini".

(3) Edit "CPUs" in "slurm.conf" to set the number of CPU threads available to the Docker container.

(4) Run the following commands in a terminal (Linux or WSL2 Ubuntu on Win11):

# Download the SLICES Docker image, which has SLICES and other relevant packages pre-installed.
docker pull xiaohang07/slices:v9
# Make entrypoint_set_cpus_jupyter.sh executable.
sudo chmod +x entrypoint_set_cpus_jupyter.sh
# Replace "[]" with the absolute path of this repo's unzipped folder to set up the shared folder for the Docker container.
docker run -it -p 8888:8888 -h workq --shm-size=0.5gb --gpus all -v /[]:/crystal xiaohang07/slices:v9 /crystal/entrypoint_set_cpus_jupyter.sh
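For step (3), the value of "CPUs" can also be set programmatically. The sketch below is an illustration only: `set_cpus` is a hypothetical helper, and the sample node line is made up, so the real slurm.conf in the repo may look different.

```python
import os
import re


def set_cpus(conf_text: str, cpus: int) -> str:
    """Replace every `CPUs=<number>` value in slurm.conf text with `cpus`."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return re.sub(r"\bCPUs=\d+", f"CPUs={cpus}", conf_text)


# Hypothetical node line, for illustration only; the real slurm.conf may differ.
example = "NodeName=workq CPUs=4 State=UNKNOWN\n"

# os.cpu_count() can return None, so fall back to 1 in that case.
available = os.cpu_count() or 1
print(set_cpus(example, available), end="")
```

To apply it to the real file, read slurm.conf, pass its text through `set_cpus`, and write the result back before starting the container.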

If you want to install Slurm on your own machine, follow these steps:

apt update \
&& apt install munge slurm-wlm slurm-wlm-doc slurm-wlm-torque -y \
&& rm -rf /var/spool/slurm-llnl \
&& mkdir /var/spool/slurm-llnl \
&& chown -R slurm:slurm /var/spool/slurm-llnl \
&& rm -rf /var/run/slurm-llnl/ \
&& mkdir /var/run/slurm-llnl/ \
&& chown -R slurm:slurm /var/run/slurm-llnl/

# Edit slurm.conf (set the number of CPUs and change the hostname to your own hostname), then
cp ./slurm.conf /etc/slurm-llnl/

service munge restart \
&& service slurmctld restart \
&& service slurmd restart

In addition, you should modify the 0_run.pbs files to match your environment.
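After installing, it can help to check that the scheduler commands are actually on PATH before running the tutorial. A minimal sketch, assuming `qsub` and `sinfo` are the commands of interest (`missing_tools` is a hypothetical helper name):

```python
import shutil


def missing_tools(tools=("qsub", "sinfo")) -> list:
    """Return the names in `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Scheduler commands found.")
```

This only checks that the commands exist; it does not verify that the munge, slurmctld, and slurmd services are running.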

Another workaround:
If you don't want to install the Slurm workload manager, modify utils.py by replacing 'qsub 0_run.pbs' with 'python 0_run.py' inside the splitRun function. Also make sure the number of threads does not exceed the number of CPU threads on your computer; otherwise the jobs will compete for CPU resources.
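The idea behind this workaround can be sketched as follows. This is not the actual splitRun code from utils.py: `cap_threads`, `run_job`, and `run_jobs_locally` are hypothetical names, and the sketch assumes each job directory contains its own 0_run.py.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def cap_threads(requested: int) -> int:
    """Clamp the requested worker count to the range [1, number of logical CPUs]."""
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


def run_job(job_dir: str) -> int:
    """Run 0_run.py inside job_dir with the current interpreter; return its exit code."""
    result = subprocess.run([sys.executable, "0_run.py"], cwd=job_dir)
    return result.returncode


def run_jobs_locally(job_dirs, threads: int) -> list:
    """Run each job directory's 0_run.py, with at most `threads` jobs at a time."""
    workers = cap_threads(threads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, job_dirs))
```

Here `sys.executable` stands in for `python` so the jobs use the same environment as the calling process, and the thread pool only waits on the child processes, so the cap on workers is what limits CPU usage.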

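Thank you for raising this issue. I have added these detailed instructions to the beginning of the tutorial, which may help others avoid similar problems.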

BTW, I have sent you a private message on LinkedIn with my WeChat ID.
