Commit
deploy: db47311

RandomDefaultUser committed Nov 26, 2024
1 parent 065f0b5 commit 61ae5e9
Showing 10 changed files with 259 additions and 41 deletions.
48 changes: 38 additions & 10 deletions _modules/mala/common/parameters.html
@@ -328,7 +328,6 @@ Source code for mala.common.parameters

    ----------
    nn_type : string
        Type of the neural network that will be used. Currently supported are

        - "feed_forward" (default)
        - "transformer"
        - "lstm"
@@ -382,12 +381,12 @@ Source code for mala.common.parameters

        self.layer_activations = ["Sigmoid"]
        self.loss_function_type = "mse"

        # for LSTM/Gru + Transformer
        self.num_hidden_layers = 1

        # for LSTM/Gru
        self.no_hidden_state = False
        self.bidirection = False

        # for LSTM/Gru + Transformer
        self.num_hidden_layers = 1

        # for transformer net
        self.dropout = 0.1
@@ -815,12 +814,15 @@ Source code for mala.common.parameters

        a "by snapshot" basis.

    checkpoints_each_epoch : int
        If not 0, checkpoint files will be saved after eac
        If not 0, checkpoint files will be saved after each
        checkpoints_each_epoch epoch.

    checkpoint_name : string
        Name used for the checkpoints. Using this, multiple runs
        can be performed in the same directory.

    run_name : string
        Name of the run used for logging.

<span class="sd"> logging_dir : string</span>
<span class="sd"> Name of the folder that logging files will be saved to.</span>
@@ -829,6 +831,34 @@ Source code for mala.common.parameters

        If True, then upon creating logging files, these will be saved
        in a subfolder of logging_dir labelled with the starting date
        of the logging, to avoid having to change input scripts often.

    logger : string
        Name of the logger to be used. Currently supported are:

        - "tensorboard": Tensorboard logger.
        - "wandb": Weights and Biases logger.

    validation_metrics : list
        List of metrics to be used for validation. Default is ["ldos"].
        Possible options are:

        - "ldos": MSE of the LDOS.
        - "band_energy": Band energy.
        - "band_energy_actual_fe": Band energy computed with ground truth Fermi energy.
        - "total_energy": Total energy.
        - "total_energy_actual_fe": Total energy computed with ground truth Fermi energy.
        - "fermi_energy": Fermi energy.
        - "density": Electron density.
        - "density_relative": Electron density (MAPE).
        - "dos": Density of states.
        - "dos_relative": Density of states (MAPE).

    validate_on_training_data : bool
        Whether to validate on the training data as well. Default is False.

    validate_every_n_epochs : int
        Determines how often validation is performed. Default is 1.

    inference_data_grid : list
        List holding the grid to be used for inference in the form of
@@ -843,19 +873,18 @@ Source code for mala.common.parameters

    profiler_range : list
        List with two entries determining with which batch/iteration number
        the CUDA profiler will start and stop profiling. Please note that
        this option only holds significance if the nsys profiler is used.
    """

<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">ParametersRunning</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="s2">&quot;Adam&quot;</span>
<span class="bp">self</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="mi">10</span> <span class="o">**</span> <span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="bp">self</span><span class="o">.</span><span class="n">learning_rate_embedding</span> <span class="o">=</span> <span class="mi">10</span> <span class="o">**</span> <span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">max_number_epochs</span> <span class="o">=</span> <span class="mi">100</span>
<span class="bp">self</span><span class="o">.</span><span class="n">verbosity</span> <span class="o">=</span> <span class="kc">True</span>
<span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span> <span class="o">=</span> <span class="mi">10</span>
<span class="bp">self</span><span class="o">.</span><span class="n">snapshots_per_epoch</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>

<span class="bp">self</span><span class="o">.</span><span class="n">l1_regularization</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">l2_regularization</span> <span class="o">=</span> <span class="mf">0.0</span>
@@ -874,7 +903,6 @@ Source code for mala.common.parameters

        self.num_workers = 0
        self.use_shuffling_for_samplers = True
        self.checkpoints_each_epoch = 0
        self.checkpoint_best_so_far = False
        self.checkpoint_name = "checkpoint_mala"
        self.run_name = ""
        self.logging_dir = "./mala_logging"
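Taken together, a checkpointing and logging setup using the fields shown in this diff might look like the following sketch; it assumes the usual "mala.Parameters" entry point with these attributes under "parameters.running", and all values are placeholders:

    import mala

    parameters = mala.Parameters()
    # Write a checkpoint every 5 epochs; a distinct name lets several
    # runs coexist in one directory.
    parameters.running.checkpoints_each_epoch = 5
    parameters.running.checkpoint_name = "checkpoint_run1"
    # Name under which this run appears in the logs.
    parameters.running.run_name = "ldos_ffn_test"
    parameters.running.logging_dir = "./mala_logging"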
61 changes: 52 additions & 9 deletions _sources/advanced_usage/trainingmodel.rst.txt
@@ -194,22 +194,64 @@ keyword, you can fine-tune the number of new snapshots being created.
By default, the same number of snapshots as had been provided will be created
(if possible).

Using tensorboard
******************

Training routines in MALA can be visualized via tensorboard, as also shown
in the file ``advanced/ex03_tensor_board``. Simply enable tensorboard
visualization prior to training via

.. code-block:: python

    # 0: No visualization, 1: loss and learning rate, 2: like 1,
    # but additionally weights and biases are saved
    parameters.running.logging = 1

Logging metrics during training
*******************************

Training progress in MALA can be visualized via tensorboard or wandb, as also shown
in the file ``advanced/ex03_tensor_board``. Simply select a logger prior to training as

.. code-block:: python

    parameters.running.logger = "tensorboard"
    parameters.running.logging_dir = "mala_vis"

or

.. code-block:: python

    import wandb
    wandb.init(
        project="mala_training",
        entity="your_wandb_entity"
    )
    parameters.running.logger = "wandb"
    parameters.running.logging_dir = "mala_vis"
where ``logging_dir`` specifies some directory in which to save the
MALA logging data. Afterwards, you can run the training without any
MALA logging data. You can also select which metrics to record via

.. code-block:: python

    parameters.running.validation_metrics = ["ldos", "dos", "density", "total_energy"]
Full list of available metrics:

- "ldos": MSE of the LDOS.
- "band_energy": Band energy.
- "band_energy_actual_fe": Band energy computed with ground truth Fermi energy.
- "total_energy": Total energy.
- "total_energy_actual_fe": Total energy computed with ground truth Fermi energy.
- "fermi_energy": Fermi energy.
- "density": Electron density.
- "density_relative": Electron density (Mean Absolute Percentage Error; see the sketch below).
- "dos": Density of states.
- "dos_relative": Density of states (Mean Absolute Percentage Error).

To save time and resources you can specify the logging interval via

.. code-block:: python

    parameters.running.validate_every_n_epochs = 10
If you want to monitor the degree to which the model overfits to the training data,
you can use the option

.. code-block:: python

    parameters.running.validate_on_training_data = True

MALA will evaluate the validation metrics on the training set as well as the validation set.
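Putting these options together, a complete logging configuration might look like the following sketch; parameter names are taken from the options above, shown under ``parameters.running`` on the assumption that all of these fields live on the running parameters, and the values are placeholders:

.. code-block:: python

    parameters.running.logger = "tensorboard"
    parameters.running.logging_dir = "mala_vis"
    # Record a subset of the available metrics.
    parameters.running.validation_metrics = ["ldos", "band_energy"]
    # Validate only every 10th epoch, on training and validation data.
    parameters.running.validate_every_n_epochs = 10
    parameters.running.validate_on_training_data = True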

Afterwards, you can run the training without any
other modifications. Once training is finished (or during training, in case
you want to use tensorboard to monitor progress), you can launch tensorboard
via
@@ -221,6 +263,7 @@
The full path for ``path_to_log_directory`` can be accessed via
``trainer.full_logging_path``.
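For example, one way to launch tensorboard from a script; a sketch, assuming a trained ``mala.Trainer`` instance named ``trainer`` and an installed tensorboard:

.. code-block:: python

    import subprocess

    # Point tensorboard at the directory MALA logged to for this run.
    subprocess.run(["tensorboard", "--logdir", trainer.full_logging_path])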

If you're using wandb, you can monitor the training progress on the wandb website.

Training in parallel
********************
Expand Down
65 changes: 55 additions & 10 deletions advanced_usage/trainingmodel.html
@@ -59,7 +59,7 @@
<li class="toctree-l3"><a class="reference internal" href="#advanced-training-metrics">Advanced training metrics</a></li>
<li class="toctree-l3"><a class="reference internal" href="#checkpointing-a-training-run">Checkpointing a training run</a></li>
<li class="toctree-l3"><a class="reference internal" href="#using-lazy-loading">Using lazy loading</a></li>
<li class="toctree-l3"><a class="reference internal" href="#using-tensorboard">Using tensorboard</a></li>
<li class="toctree-l3"><a class="reference internal" href="#logging-metrics-during-training">Logging metrics during training</a></li>
<li class="toctree-l3"><a class="reference internal" href="#training-in-parallel">Training in parallel</a></li>
</ul>
</li>
@@ -280,21 +280,65 @@ Using lazy loading

By default, the same number of snapshots as had been provided will be created
(if possible).
Using tensorboard

Training routines in MALA can be visualized via tensorboard, as also shown
in the file advanced/ex03_tensor_board. Simply enable tensorboard
visualization prior to training via

    # 0: No visualization, 1: loss and learning rate, 2: like 1,
    # but additionally weights and biases are saved
    parameters.running.logging = 1

Logging metrics during training

Training progress in MALA can be visualized via tensorboard or wandb, as also shown
in the file advanced/ex03_tensor_board. Simply select a logger prior to training as

    parameters.running.logger = "tensorboard"
    parameters.running.logging_dir = "mala_vis"

or

    import wandb
    wandb.init(
        project="mala_training",
        entity="your_wandb_entity"
    )
    parameters.running.logger = "wandb"
    parameters.running.logging_dir = "mala_vis"

where logging_dir specifies some directory in which to save the
MALA logging data. Afterwards, you can run the training without any
MALA logging data. You can also select which metrics to record via
    parameters.running.validation_metrics = ["ldos", "dos", "density", "total_energy"]
<dl class="simple">
<dt>Full list of available metrics:</dt><dd><ul class="simple">
<li><p>“ldos”: MSE of the LDOS.</p></li>
<li><p>“band_energy”: Band energy.</p></li>
<li><p>“band_energy_actual_fe”: Band energy computed with ground truth Fermi energy.</p></li>
<li><p>“total_energy”: Total energy.</p></li>
<li><p>“total_energy_actual_fe”: Total energy computed with ground truth Fermi energy.</p></li>
<li><p>“fermi_energy”: Fermi energy.</p></li>
<li><p>“density”: Electron density.</p></li>
<li><p>“density_relative”: Rlectron density (Mean Absolute Percentage Error).</p></li>
<li><p>“dos”: Density of states.</p></li>
<li><p>“dos_relative”: Density of states (Mean Absolute Percentage Error).</p></li>
</ul>
</dd>
</dl>
To save time and resources you can specify the logging interval via

    parameters.running.validate_every_n_epochs = 10
If you want to monitor the degree to which the model overfits to the training data,
you can use the option

    parameters.running.validate_on_training_data = True

MALA will evaluate the validation metrics on the training set as well as the validation set.
Afterwards, you can run the training without any
other modifications. Once training is finished (or during training, in case
you want to use tensorboard to monitor progress), you can launch tensorboard
via
@@ -305,6 +349,7 @@

The full path for path_to_log_directory can be accessed via
trainer.full_logging_path.

If you're using wandb, you can monitor the training progress on the wandb website.
<section id="training-in-parallel">
<h2>Training in parallel<a class="headerlink" href="#training-in-parallel" title="Link to this heading"></a></h2>
Expand Down