diff --git a/doc/changelog.d/3513.documentation.md b/doc/changelog.d/3513.documentation.md
new file mode 100644
index 0000000000..b2f60a9a07
--- /dev/null
+++ b/doc/changelog.d/3513.documentation.md
@@ -0,0 +1 @@
+docs: adding-sbatch-support
\ No newline at end of file
diff --git a/doc/source/user_guide/hpc/launch_mapdl_entrypoint.rst b/doc/source/user_guide/hpc/launch_mapdl_entrypoint.rst
new file mode 100644
index 0000000000..09efdc20b7
--- /dev/null
+++ b/doc/source/user_guide/hpc/launch_mapdl_entrypoint.rst
@@ -0,0 +1,234 @@

.. _ref_pymapdl_interactive_in_cluster_hpc:

.. _ref_pymapdl_interactive_in_cluster_hpc_from_login:


Interactive MAPDL instance launched from the login node
=======================================================

Starting the instance
---------------------

If you are already logged in to a login node, you can launch an MAPDL instance as a SLURM job and
connect to it.
To accomplish this, run these commands on the login node:

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> mapdl = launch_mapdl(launch_on_hpc=True)

PyMAPDL submits a job to the scheduler using the appropriate commands.
In the case of SLURM, it uses the ``sbatch`` command with the ``--wrap`` argument
to pass the command line that starts MAPDL.
Other scheduler arguments can be specified using the ``scheduler_options``
argument as a Python :class:`dict`:

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> scheduler_options = {"nodes": 10, "ntasks-per-node": 2}
    >>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=20, scheduler_options=scheduler_options)


.. note::
    PyMAPDL cannot infer the number of CPUs that you are requesting from the scheduler.
    Hence, you must specify this value using the ``nproc`` argument.

The double dash (``--``) common in the long version of some scheduler commands
is added automatically if PyMAPDL detects that it is missing and the specified
command is longer than one character.
For instance, the ``ntasks-per-node`` argument is submitted as ``--ntasks-per-node``.

Alternatively, you can submit the scheduler options as a single Python string (:class:`str`):

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> scheduler_options = "-N 10"
    >>> mapdl = launch_mapdl(launch_on_hpc=True, scheduler_options=scheduler_options)

.. warning::
    Because PyMAPDL is already using the ``--wrap`` argument, this argument
    cannot be used again.

The value of each scheduler argument is wrapped in single quotes (``'``).
This might cause parsing issues that make the job fail after a successful
submission.

PyMAPDL passes all the environment variables of the
user to the new job and to the MAPDL instance.
This is usually convenient because many environment variables are
needed to run the job or the MAPDL command.
For instance, the license server is normally stored in the :envvar:`ANSYSLMD_LICENSE_FILE` environment variable.
If you prefer not to pass these environment variables to the job, use the SLURM
``--export`` argument to specify the desired environment variables.
For more information, see the `SLURM documentation `_.


Working with the instance
-------------------------

Once the :class:`Mapdl ` object has been created,
it does not differ from a normal :class:`Mapdl `
instance.
You can retrieve the IP of the MAPDL instance as well as its hostname:

.. code:: pycon

    >>> mapdl.ip
    '123.45.67.89'
    >>> mapdl.hostname
    'node0'

You can also retrieve the job ID:

.. code:: pycon
    >>> mapdl.jobid
    10001

If you want to check whether the instance has been launched using a scheduler,
you can use the :attr:`mapdl_on_hpc `
attribute:

.. code:: pycon

    >>> mapdl.mapdl_on_hpc
    True


Sharing files
^^^^^^^^^^^^^

Most HPC clusters share the login node filesystem with the compute nodes,
which means that you do not need to do extra work to upload files to or download files from the MAPDL
instance. You only need to copy them to the location where MAPDL is running.
You can obtain this location with the
:attr:`directory ` attribute.

If no location is specified in the :func:`launch_mapdl() `
function, then a temporary location is selected.
It is a good idea to set the ``run_location`` argument to a directory that is accessible
from all the compute nodes.
Normally, anything under ``/home/user`` is available to all compute nodes.
If you are unsure where you should launch MAPDL, contact your cluster administrator.

Additionally, you can use methods like :meth:`upload `
and :meth:`download ` to
upload files to and download files from the MAPDL instance, respectively.
You do not need ``ssh`` or a similar connection.
However, for large files, you might want to consider alternatives.


Exiting MAPDL
-------------

Exiting MAPDL, either intentionally or unintentionally, stops the job.
This behavior occurs because MAPDL is the main process of the job. Thus, when it
finishes, the scheduler considers the job done.

To exit MAPDL, you can use the :meth:`exit() ` method.
This method exits MAPDL and sends a signal to the scheduler to cancel the job.

.. code-block:: python

    mapdl.exit()

When the Python process you are running PyMAPDL on finishes without errors and you have not
issued the :meth:`exit() ` method, the garbage collector
kills the MAPDL instance and its job. This is intended to save resources.

If you prefer that the job not be killed, set the following attribute in the
:class:`Mapdl ` class:

.. code-block:: python

    mapdl.finish_job_on_exit = False


In this case, you should set a timeout in your job to avoid having it
run longer than needed.


Handling crashes on an HPC
^^^^^^^^^^^^^^^^^^^^^^^^^^

If MAPDL crashes while running on an HPC cluster, the job finishes right away.
In this case, PyMAPDL loses its connection to MAPDL.
PyMAPDL tries to reconnect to the MAPDL instance up to five times, waiting
up to five seconds.
If unsuccessful, you might get an error like this:

.. code-block:: text

    MAPDL server connection terminated unexpectedly while running:
      /INQUIRE,,DIRECTORY,,
    called by:
      _send_command

    Suggestions:
      MAPDL *might* have died because it executed a not-allowed command or ran out of memory.
      Check the MAPDL command output for more details.
      Open an issue on GitHub if you need assistance: https://github.com/ansys/pymapdl/issues
    Error:
      failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)
    Full error:
      <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"
        debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-10-24T08:25:04.054559811+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"}"
      >

The data of that job is still available at the location given by the
:attr:`directory ` attribute.
For this reason, you should set the run location explicitly using the ``run_location`` argument.

While handling this exception, PyMAPDL also cancels the job to avoid leaking resources.
Therefore, the only option is to start a new instance by launching a new job using
the :func:`launch_mapdl() ` function.

Use case on a SLURM cluster
---------------------------

Assume that a user wants to start a remote MAPDL instance on an HPC cluster
and interact with it.
The user wants to request 10 nodes with 1 task per node (to avoid clashes
between MAPDL instances) as well as 64 GB of RAM.
Because of administration logistics, the user must use the machines in
the ``supercluster01`` partition.
To make PyMAPDL launch an instance like that on SLURM, run the following code:

.. code-block:: python

    from ansys.mapdl.core import launch_mapdl
    from ansys.mapdl.core.examples import vmfiles

    scheduler_options = {
        "nodes": 10,
        "ntasks-per-node": 1,
        "partition": "supercluster01",
        "mem": "64G",  # sbatch expects the --mem option for memory requests
    }
    mapdl = launch_mapdl(launch_on_hpc=True, nproc=10, scheduler_options=scheduler_options)

    num_cpu = mapdl.get_value("ACTIVE", 0, "NUMCPU")  # It should be equal to 10

    mapdl.clear()  # Not strictly needed.
    mapdl.prep7()

    # Run an MAPDL script
    mapdl.input(vmfiles["vm1"])

    # Let's solve again to get the solve printout
    mapdl.solution()
    output = mapdl.solve()
    print(output)

    mapdl.exit()  # Kill the MAPDL instance


PyMAPDL automatically sets MAPDL to read the job configuration (including machines,
number of CPUs, and memory), which allows MAPDL to use all the resources allocated
to that job.
diff --git a/doc/source/user_guide/hpc/pymapdl.rst b/doc/source/user_guide/hpc/pymapdl.rst
index 7ce40eff53..9382cbf87b 100644
--- a/doc/source/user_guide/hpc/pymapdl.rst
+++ b/doc/source/user_guide/hpc/pymapdl.rst
@@ -19,36 +19,34 @@ on whether or not you run them both
on the HPC compute nodes.
Additionally, you might be able to interact with them (``interactive`` mode)
or not (``batch`` mode).

-For information on supported configurations, see :ref:`ref_pymapdl_batch_in_cluster_hpc`.
+PyMAPDL takes advantage of HPC clusters to launch MAPDL instances
+with increased resources.
+PyMAPDL automatically sets these MAPDL instances to read the
+scheduler job configuration (which includes machines, number
+of CPUs, and memory), which allows MAPDL to use all the resources
+allocated to that job.
+For more information, see :ref:`ref_tight_integration_hpc`.
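+
+For example, assuming your script runs inside a scheduler job (submitted with
+``sbatch`` as shown later in this section), the following minimal sketch confirms
+that MAPDL picked up the allocated resources by querying the number of active
+CPUs right after connecting:
+
+.. code-block:: python
+
+    from ansys.mapdl.core import launch_mapdl
+
+    # Launch MAPDL reusing the scheduler job configuration.
+    mapdl = launch_mapdl()
+
+    # The reported value should match the number of CPUs allocated to the job.
+    num_cpu = mapdl.get_value("ACTIVE", 0, "NUMCPU")
+    print(f"MAPDL is using {num_cpu} cores")
+
+    mapdl.exit()
+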
+The following configurations are supported:

-Since v0.68.5, PyMAPDL can take advantage of the tight integration
-between the scheduler and MAPDL to read the job configuration and
-launch an MAPDL instance that can use all the resources allocated
-to that job.
-For instance, if a SLURM job has allocated 8 nodes with 4 cores each,
-then PyMAPDL launches an MAPDL instance which uses 32 cores
-spawning across those 8 nodes.
-This behaviour can turn off if passing the
-:envvar:`PYMAPDL_RUNNING_ON_HPC` environment variable
-with ``'false'`` value or passing the `running_on_hpc=False` argument
-to :func:`launch_mapdl() `.
+* :ref:`ref_pymapdl_batch_in_cluster_hpc`
+* :ref:`ref_pymapdl_interactive_in_cluster_hpc_from_login`

.. _ref_pymapdl_batch_in_cluster_hpc:

-Submit a PyMAPDL batch job to the cluster from the entrypoint node
-==================================================================
+Batch job submission from the login node
+========================================

Many HPC clusters allow their users
to log into a machine using ``ssh``, ``vnc``, ``rdp``, or similar technologies and
then submit a job to the cluster from there.
-This entrypoint machine, sometimes known as the *head node* or *entrypoint node*,
+This login machine, sometimes known as the *head node* or *entrypoint node*,
might be a virtual machine (VDI/VM).

In such cases, once the Python virtual environment with PyMAPDL is already
set and is accessible to all the compute nodes, launching a
-PyMAPDL job from the entrypoint node is very easy to do using the ``sbatch`` command.
+PyMAPDL job from the login node is very easy to do using the ``sbatch`` command.
When the ``sbatch`` command is used, PyMAPDL runs and launches an MAPDL instance
on the compute nodes.
No changes are needed to a PyMAPDL script to run it on a SLURM cluster.
@@ -99,6 +97,8 @@ job by setting the :envvar:`PYMAPDL_NPROC` environment variable to the desired v

    (venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch main.py

+For other applicable environment variables, see :ref:`ref_environment_variables`.
+
You can also add ``sbatch`` options to the command:

.. code-block:: console
@@ -182,3 +182,30 @@ This bash script performs tasks such as creating environment variables,
moving files to different directories, and printing to ensure your
configuration is correct.
+
+.. include:: launch_mapdl_entrypoint.rst
+
+
+
+.. _ref_tight_integration_hpc:
+
+Tight integration between MAPDL and the HPC scheduler
+=====================================================
+
+Since v0.68.5, PyMAPDL can take advantage of the tight integration
+between the scheduler and MAPDL to read the job configuration and
+launch an MAPDL instance that can use all the resources allocated
+to that job.
+For instance, if a SLURM job has allocated 8 nodes with 4 cores each,
+then PyMAPDL launches an MAPDL instance that uses 32 cores
+spanning those 8 nodes.
+
+This behavior can be turned off by setting the
+:envvar:`PYMAPDL_RUNNING_ON_HPC` environment variable
+to ``'false'`` or passing the ``detect_hpc=False`` argument
+to the :func:`launch_mapdl() ` function.
+
+Alternatively, you can override these settings by either passing
+custom values in the :func:`launch_mapdl() `
+function's arguments or using specific environment variables.
+For more information, see :ref:`ref_environment_variables`.
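+
+For example, the following minimal sketch disables this detection from within a
+script, assuming the :envvar:`PYMAPDL_RUNNING_ON_HPC` environment variable is
+read when the :func:`launch_mapdl() ` function is called:
+
+.. code-block:: python
+
+    import os
+
+    # Ask PyMAPDL not to read the scheduler job configuration.
+    os.environ["PYMAPDL_RUNNING_ON_HPC"] = "false"
+
+    from ansys.mapdl.core import launch_mapdl
+
+    # MAPDL now starts with the settings given here instead of the job allocation.
+    mapdl = launch_mapdl(nproc=4)
+
+    mapdl.exit()
+
+Exporting the variable in your shell or job script before running Python has the
+same effect.
+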
diff --git a/doc/source/user_guide/mapdl.rst b/doc/source/user_guide/mapdl.rst index 2baa8f3aec..fceee2a4e3 100644 --- a/doc/source/user_guide/mapdl.rst +++ b/doc/source/user_guide/mapdl.rst @@ -1092,6 +1092,7 @@ are unsupported. | * ``LSWRITE`` | |:white_check_mark:| Available (Internally running in :attr:`Mapdl.non_interactive `) | |:white_check_mark:| Available | |:exclamation:| Only in :attr:`Mapdl.non_interactive ` | | +---------------+---------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ +.. _ref_environment_variables: Environment variables ===================== diff --git a/doc/styles/config/vocabularies/ANSYS/accept.txt b/doc/styles/config/vocabularies/ANSYS/accept.txt index 0d27d17331..583fb27fac 100644 --- a/doc/styles/config/vocabularies/ANSYS/accept.txt +++ b/doc/styles/config/vocabularies/ANSYS/accept.txt @@ -53,6 +53,7 @@ CentOS7 Chao ci container_layout +CPUs datas delet Dependabot