
New container relion 4.0.1.sm61 #620

Closed
vnm-neurodesk opened this issue Mar 18, 2024 · 20 comments

@vnm-neurodesk (Contributor)

There is a new container by @stebo85, use this command to test:

bash /neurocommand/local/fetch_and_run.sh relion 4.0.1.sm61 20240318

If the test was successful, add it to apps.json to release:
https://github.com/NeuroDesk/neurocommand/edit/main/neurodesk/apps.json

Please close this issue when completed :)

@stebo85 (Contributor) commented Mar 18, 2024

@vennand - could you test this container and see if it all works as expected?

@vennand (Contributor) commented Mar 20, 2024

How do I transfer data to the neurodesktop to test? It opens as expected, but I need to launch a job to know if it'll work.

@stebo85 (Contributor) commented Mar 20, 2024

Are you running neurodesktop locally in docker? If yes, you have a shared directory between the desktop and the host.

Alternatively, you can drag and drop files onto the desktop and Guacamole will upload them (it has to be a single file, not a directory).
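As a concrete sketch of the shared-directory route (the docker run commands in this thread bind-mount the host directory ~/neurodesktop-storage to /neurodesktop-storage inside the container; the data file name below is a hypothetical stand-in):

```shell
# The host directory ~/neurodesktop-storage is bind-mounted into the
# Neurodesktop container at /neurodesktop-storage, so anything copied
# there on the host is visible inside the desktop (and vice versa).
mkdir -p ~/neurodesktop-storage

# stand-in for a real data file (hypothetical name)
touch my_particles.mrcs
cp my_particles.mrcs ~/neurodesktop-storage/

# inside the desktop, the same file appears at
# /neurodesktop-storage/my_particles.mrcs
ls ~/neurodesktop-storage/my_particles.mrcs
```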

@vennand (Contributor) commented Mar 21, 2024

I'm trying locally in docker, and I just noticed the directory, thanks!

Is it possible to do a GPU passthrough with the local docker? I'm pretty sure I won't be able to test if the GPU settings work otherwise. Though so far, there was no error message saying it was CPU only.

Though I'm not convinced it compiled with GPU support if the machine that built the container didn't have a GPU. With the new version of relion (5.0), they explicitly state that the compiler tries to detect a GPU and, if none is found, compiles for CPU only, even if a GPU architecture is provided.

@stebo85 (Contributor) commented Mar 21, 2024

Dear @vennand

yes, you can pass your GPU into the docker container:

sudo docker run \
  --shm-size=1gb -it --privileged --user=root --name neurodesktop \
  -v ~/neurodesktop-storage:/neurodesktop-storage \
  -e NB_UID="$(id -u)" -e NB_GID="$(id -g)" \
  --gpus all \
  -p 8888:8888 -e NEURODESKTOP_VERSION=2024-01-12 \
  vnmd/neurodesktop:2024-01-12

to check if it worked, run nvidia-smi in the desktop container afterwards

that would be annoying if it needs a GPU to compile. We do not have the ability to run a GPU node for building containers.

@vennand (Contributor) commented Mar 22, 2024

We might just be limited to version 4 for now then. As far as I can tell, version 5 is still in beta, so it might not be advisable to use it for research anyway.

I tried running your command, but I get the following error message
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]. ERRO[0000] error waiting for container: context canceled

Didn't find anything relevant with a very quick Google. Any idea what could cause this?

Also, I won't be able to touch this until the 16th of April unfortunately, but I plan on getting back to it.

@stebo85 (Contributor) commented Mar 22, 2024

Dear @vennand

Did you install the nvidia-container-toolkit beforehand?

#RHEL/CentOS (yum-based)
sudo yum install nvidia-container-toolkit -y
#Ubuntu/Debian (apt-based)
sudo apt install nvidia-container-toolkit -y

@vennand (Contributor) commented Mar 22, 2024

I had not, but I get the same error after installing it.

@stebo85 (Contributor) commented Mar 22, 2024

what are you getting when you run nvidia-smi on your host system?

@vennand (Contributor) commented Mar 22, 2024

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:01:00.0 Off |                  Off |
| N/A   18C    P8               9W / 250W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

@stebo85 (Contributor) commented Mar 22, 2024

can you try this? https://www.howtogeek.com/devops/how-to-use-an-nvidia-gpu-with-docker-containers/

It needs a restart of the Docker daemon and potentially apt-get install -y nvidia-docker2

@vennand (Contributor) commented Mar 22, 2024

I've installed nvidia-docker2, but I've also run this:
sudo nvidia-ctk runtime configure --runtime=docker

I don't know which one did it, but it worked. I'll try to test it now, but I don't know if I'll have time.

@vennand (Contributor) commented May 14, 2024

Hi @stebo85,

I've finished testing. Relion works as intended, but none of the jobs showed up when running "nvidia-smi", even though we could see the GPU being used. Not sure if that's an issue with the GPU passthrough, but it is using the GPU.

Another important issue is that one of the third-party programs I install along with relion doesn't work. Basically, CTFFIND 4.1.14 fails if it's compiled with GCC 8 or above. The fix I've found is to modify the code, which doesn't seem practical or elegant to do in the neurodesk script.
What would be the best approach around this? Should I host a "fixed" copy of the code on my own GitHub? (though I'm not sure if the license agreement allows this)

@stebo85 (Contributor) commented May 16, 2024

Dear @vennand,
which command did you use for testing the GPUs? I have seen a similar behaviour once using the old flag. Can you try with --gpus all ? Another check: what comes up when you run which nvidia-smi?

Fixing software for a container is a tricky one. I have done various things in the past depending on the project:

  1. apply a sed command that fixes a few single lines in the neurocontainer build script - would that work for you?
  2. provide a fixed source-code file in the neurocontainers repository along with the build script and copy it into the container during the build to overwrite the upstream file
  3. fork the software, fix it there, and use the fix inside the container + provide the fix upstream in the hope they merge it.
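If option 1 works for you, the shape of the fix in a build script would be something like the following sketch. The file name and the patched line are hypothetical stand-ins, not the real CTFFIND source; the real change would target whatever lines break under GCC 8:

```shell
# Sketch of option 1: a sed one-liner in the container build script that
# patches the offending source line in place before compiling.
# (Hypothetical file and line, for illustration only.)
mkdir -p /tmp/ctffind-demo/src
cat > /tmp/ctffind-demo/src/example.cpp <<'EOF'
// hypothetical line that fails to compile under GCC >= 8
float value = wanted_binning_factor;
EOF

# in the build script, rewrite the line before running make
sed -i 's/float value = wanted_binning_factor;/float value = static_cast<float>(wanted_binning_factor);/' \
    /tmp/ctffind-demo/src/example.cpp

grep 'static_cast' /tmp/ctffind-demo/src/example.cpp
```

The advantage over hosting a forked copy is that the patch lives next to the build script, so it is obvious what was changed relative to upstream.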

@vennand (Contributor) commented May 17, 2024

@stebo85

To test the GPU, I simply watched nvidia-smi (watch -n 1 nvidia-smi) while running relion. Relion launches python scripts that normally show up there. They didn't in the VM, but they were listed on the main machine (the one I'm running neurodesk from).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:01:00.0 Off |                  Off |
| N/A   31C    P0              74W / 250W |  24256MiB / 24576MiB |     67%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1632      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A      7835      C   ...relion-4.0.1.sm61/bin/relion_refine    24250MiB |
+---------------------------------------------------------------------------------------+

I don't know exactly how the code accesses the GPU, but I can probably find out if that's relevant.

When I run which nvidia-smi I get /usr/bin/nvidia-smi

Regarding fixing the software, I think I'll go with option 2, since the source code is only 11MB. Do you want me to push the fix now, or should we investigate the GPU "issue" before?

@stebo85 (Contributor) commented May 21, 2024

Interesting. I don't know what causes this behaviour, but I guess if it works it works no matter where the GPU tasks show up.

Happy for you to push the fix now :) Let's see if we can get this to work!

@vennand (Contributor) commented May 27, 2024

@stebo85 Would you know what this error means?

$ bash build.sh -ds
Entering Debug mode
WARNING: Skipping neurodocker as it is not installed.
Defaulting to user installation because normal site-packages is not writeable
Collecting https://github.com/ReproNim/neurodocker/tarball/master
  Downloading https://github.com/ReproNim/neurodocker/tarball/master
     - 77.3 kB 10.0 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      /tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/setuptools_scm/_integration/setuptools.py:31: RuntimeWarning:
      ERROR: setuptools==59.6.0 is used in combination with setuptools_scm>=8.x

      Your build configuration is incomplete and previously worked by accident!
      setuptools_scm requires setuptools>=61

      Suggested workaround if applicable:
       - migrating from the deprecated setup_requires mechanism to pep517/518
         and using a pyproject.toml to declare build dependencies
         which are reliably pre-installed before running the build tools

        warnings.warn(
      Traceback (most recent call last):
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 164, in prepare_metadata_for_build_wheel
          return hook(metadata_directory, config_settings)
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/build.py", line 112, in prepare_metadata_for_build_wheel
          directory = os.path.join(metadata_directory, f'{builder.artifact_project_id}.dist-info')
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/builders/wheel.py", line 825, in artifact_project_id
          self.project_id
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/builders/plugin/interface.py", line 374, in project_id
          self.__project_id = f'{self.normalize_file_name_component(self.metadata.core.name)}-{self.metadata.version}'
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/metadata/core.py", line 149, in version
          self._version = self._get_version()
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/metadata/core.py", line 248, in _get_version
          version = self.hatch.version.cached
        File "/tmp/pip-build-env-aeraeba5/overlay/local/lib/python3.10/dist-packages/hatchling/metadata/core.py", line 1466, in cached
          raise type(e)(message) from None
      LookupError: Error getting the version from source `vcs`: setuptools-scm was unable to detect version for /tmp/pip-req-build-wu94yd8o.

      Make sure you're either building from a fully intact git repository or PyPI tarballs. Most other sources (such as GitHub's tarballs, a git checkout without the .git folder) don't contain the necessary metadata and will not work.

      For example, if you're using pip, instead of https://github.com/user/proj/archive/master.zip use git+https://github.com/user/proj.git#egg=proj
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

@stebo85 (Contributor) commented May 27, 2024 via email

@vennand (Contributor) commented Jun 17, 2024

@stebo85

Hey, I'm back working on this. I'll start implementing the other software soon.

But first, I tested this version of relion on our other GPUs, and it runs without issues. Perhaps the default setting (sm35) is too old, but this one works. I'm thinking it would be simpler for users if we only package this one.
If you think this is a good idea, how do we go about it? Only put this one in the JSON, with Exec: relion?
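For the apps.json side, an entry along these lines might be the shape of it. The field names and nesting here are illustrative guesses, not the actual schema; copy the structure of an existing entry in the NeuroDesk/neurocommand apps.json rather than this sketch:

```json
{
  "relion": {
    "apps": {
      "relion 4.0.1.sm61": {
        "version": "20240318",
        "exec": "relion"
      }
    }
  }
}
```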

@stebo85 (Contributor) commented Jun 18, 2024

Great to hear that Relion is working :)

ok, makes sense that the newer version works better. CUDA is usually quite backwards compatible, so if you have fairly current driver versions that makes sense.

Yes, put the version you found working best in the apps.json and this will trigger the release process.

Thank you for getting this to work !!!

stebo85 closed this as completed Aug 21, 2024
github-project-automation bot moved this from New to Completed in NeuroDesk Aug 21, 2024