Cannot run Tutorial_2.1_MatterGPT_eform #10

funihang · 2024-08-27T20:41:39Z

Hi,
I’m encountering some issues with Tutorial 2.1 and wanted to ask if there might be something missing or incorrect.

Additionally, commenting out the show_progress() function leads to another error.

Could you please correct that? Thanks for your help.

xiaohang007 · 2024-08-27T23:55:44Z

Thank you for identifying the bug. I've resolved it by adding the following lines to ./data/mp20_nonmetal/workflow/0_run.pbs:

export PATH=/opt/conda/bin:$PATH
source activate chgnet

This activates the chgnet environment for the data generation script, which I had forgotten to do initially. Now it's working perfectly.

funihang · 2024-08-28T19:32:02Z

Thank you for the update, but the issue in Tutorial 2.1 does not seem to be solved. The problem appears to originate in line 67 of the "1. building training set" cell, specifically in the show_progress() function. Even after I comment out this function, another error is reported. Lines 68 and 70 clean up the data instead of collecting it. I suspect utils.py needs to be modified to resolve this.

xiaohang007 · 2024-08-29T02:02:02Z

I noticed that your conda location is ~/.conda, which indicates that you are not using the provided Docker image. To run this tutorial on your own machine, you need to install not only SLICES but also the Slurm queue system to manage calculations, which can be challenging. I suggest following the steps below to set up the Jupyter backend with the provided Docker image.

Jupyter backend setup
(1) Download this repo and unzip it.

(2) Put Materials Project's new API key in "APIKEY.ini".

(3) Edit "CPUs" in "slurm.conf" to set the number of CPU threads available to the Docker container.

(4) Run the following commands in a terminal (Linux or WSL2 Ubuntu on Win11):

# Download SLICES_docker with pre-installed SLICES and other relevant packages.
docker pull xiaohang07/slices:v9
# Make entrypoint_set_cpus_jupyter.sh executable
sudo chmod +x entrypoint_set_cpus_jupyter.sh
# Replace "[]" with the absolute path of this repo's unzipped folder to set up the shared folder for the Docker container.
docker run -it -p 8888:8888 -h workq  --shm-size=0.5gb --gpus all -v /[]:/crystal xiaohang07/slices:v9 /crystal/entrypoint_set_cpus_jupyter.sh
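
For step (3), a small script can set the CPUs value from the machine's actual thread count instead of editing it by hand. This is only a sketch under assumptions: it assumes slurm.conf uses Slurm's standard `CPUs=<n>` node parameter, and both `set_cpus` and the sample line are hypothetical, made up for illustration rather than taken from the repo.

```python
import os
import re


def set_cpus(conf_text: str, cpus: int) -> str:
    """Return conf_text with every `CPUs=<number>` value replaced by `cpus`."""
    if cpus < 1:
        raise ValueError("cpus must be a positive integer")
    return re.sub(r"\bCPUs=\d+", f"CPUs={cpus}", conf_text)


# Hypothetical node line; the real slurm.conf in the repo may look different.
sample = "NodeName=workq CPUs=4 State=UNKNOWN\n"

# os.cpu_count() can return None, so fall back to a single thread.
patched = set_cpus(sample, os.cpu_count() or 1)
print(patched, end="")
```

To apply it to the real file, read slurm.conf into a string, pass it through `set_cpus`, and write the result back before starting the container.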

If you want to install Slurm on your own machine, follow these steps:

apt update \
&& apt install munge slurm-wlm slurm-wlm-doc slurm-wlm-torque -y \
&& rm -rf  /var/spool/slurm-llnl \
&& mkdir /var/spool/slurm-llnl \
&& chown -R slurm.slurm /var/spool/slurm-llnl \
&& rm -rf /var/run/slurm-llnl/ \
&& mkdir /var/run/slurm-llnl/ \
&& chown -R slurm.slurm /var/run/slurm-llnl/

# Edit slurm.conf (set the number of CPUs and change the hostname to your own hostname), then:
cp ./slurm.conf /etc/slurm-llnl/

service munge restart \
&& service slurmctld restart \
&& service slurmd restart

In addition, you should modify the 0_run.pbs files to match your environment.

Another workaround:
If you don't want to install the Slurm workload manager, you need to modify utils.py, replacing 'qsub 0_run.pbs' with 'python 0_run.py' inside the splitRun function. Also make sure that the number of threads does not exceed the number of CPU threads on your computer; otherwise the jobs will compete for computing resources.
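
A rough sketch of the idea behind that workaround is below. It is not the actual splitRun code from utils.py: `run_jobs_locally` is a hypothetical helper, and it assumes each job lives in its own subfolder that contains a 0_run.py. It runs each script with the current Python interpreter and caps the number of concurrent jobs at the CPU thread count.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_jobs_locally(workflow_dir, script="0_run.py", max_workers=None):
    """Run `script` inside each job subfolder, at most `max_workers` at a time.

    Returns a dict mapping each job folder name to its process return code.
    """
    cpu_threads = os.cpu_count() or 1
    # Never start more concurrent jobs than there are CPU threads.
    workers = min(max_workers or cpu_threads, cpu_threads)
    # Assumed layout: <workflow_dir>/<job folder>/0_run.py
    job_dirs = sorted(p.parent for p in Path(workflow_dir).glob(f"*/{script}"))

    def run_one(job_dir):
        # Same effect as `cd <job_dir> && python 0_run.py`.
        result = subprocess.run([sys.executable, script], cwd=job_dir)
        return job_dir.name, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_one, job_dirs))
```

Threads are enough here because each worker only waits on a child process; the heavy computation happens in the separate `python 0_run.py` processes. A non-zero return code in the result marks a job that failed and needs to be rerun.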

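Thank you for raising this issue. I have added these detailed instructions at the beginning of the tutorial, which should help others avoid similar problems.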

By the way, I have sent you a private message on LinkedIn with my WeChat ID.
