The architecture of PROTzilla

Henning Gärtner edited this page Jun 7, 2024 · 2 revisions

Run -

The Run class in the protzilla project describes a single run of the data processing pipeline together with its data. An instance holds all objects associated with that run: the steps attribute, which holds all steps, and a run-specific instance of DiskOperator. The class also encapsulates general error handling and is responsible for high-level operations such as exporting a workflow, saving the run state, adding or removing a step, and passing the inputs to the current step for calculation.

Key attributes of the Run class include:

  • run_name: The name of the run.
  • workflow_name: The name of the workflow associated with the run.
  • disk_operator: An instance of the DiskOperator class that handles reading and writing data to disk.
  • steps: An instance of the StepManager class that contains all the steps of the processing pipeline.

Key methods of the Run class include:

  • step_add: Adds a step to the run.
  • step_remove: Removes a step from the run.
  • step_calculate: Executes the calculation for the current step.
  • step_change_method: Changes the method of the current step.

The Run class also includes decorators for error handling and automatic saving of the run state. These decorators can be applied to methods to handle errors and save the run state automatically after the method is called.
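The decorator pattern described above can be sketched as follows. This is a minimal illustration, not the actual PROTzilla code: the decorator names error_handling and auto_save, the messages list, and the save_count counter are assumptions made for the example, and the steps attribute is a plain list standing in for the StepManager.

```python
import functools


def error_handling(method):
    """Catch exceptions raised by a Run method and record them as messages."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception as error:  # broad on purpose: top-level guard
            self.messages.append(f"{method.__name__} failed: {error}")
            return None

    return wrapper


def auto_save(method):
    """Save the run state after the wrapped method has returned."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        self.save()
        return result

    return wrapper


class Run:
    def __init__(self, run_name):
        self.run_name = run_name
        self.steps = []      # stand-in for the StepManager
        self.messages = []   # hypothetical error log
        self.save_count = 0  # only counts save() calls for illustration

    def save(self):
        # The real class would delegate to its DiskOperator here.
        self.save_count += 1

    @error_handling
    @auto_save
    def step_add(self, step):
        if step is None:
            raise ValueError("step must not be None")
        self.steps.append(step)


run = Run("demo")
run.step_add("imputation")  # succeeds and triggers a save
run.step_add(None)          # fails: the error is recorded, nothing is saved
```

With error_handling as the outer decorator, a failing call never reaches save(), so a broken state is not written to disk.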

StepManager -

The StepManager class in the protzilla project manages the execution of steps in a data processing pipeline. It keeps track of the current step, the previous steps, and the future steps. It also provides methods to navigate through the steps.

Key attributes of the StepManager class include:

  • df_mode: Represents the mode of data storage, either in memory or on disk.
  • disk_operator: An instance of the DiskOperator class that handles reading and writing data to disk.
  • current_step_index: The index of the current step in the workflow.
  • importing, data_preprocessing, data_analysis, data_integration: Lists of steps in each section of the pipeline.

Key methods of the StepManager class include:

  • get_step_output: Gets the output of a specific step.
  • get_step_input: Gets the input of a specific step.

In short, the StepManager ensures that the inputs and outputs of each step are validated and that the steps are executed in the correct order.
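To make the bookkeeping more concrete, here is a simplified sketch of how the section lists and current_step_index could fit together. It is an assumption-laden illustration: steps are plain dictionaries, the properties all_steps, current_step, previous_steps and future_steps are names chosen for this example, and get_step_output takes only an output key, which may differ from the real signature.

```python
class StepManager:
    SECTIONS = ("importing", "data_preprocessing", "data_analysis", "data_integration")

    def __init__(self):
        self.importing = []
        self.data_preprocessing = []
        self.data_analysis = []
        self.data_integration = []
        self.current_step_index = 0

    @property
    def all_steps(self):
        # Flatten the sections in pipeline order.
        steps = []
        for section in self.SECTIONS:
            steps.extend(getattr(self, section))
        return steps

    @property
    def current_step(self):
        return self.all_steps[self.current_step_index]

    @property
    def previous_steps(self):
        return self.all_steps[: self.current_step_index]

    @property
    def future_steps(self):
        return self.all_steps[self.current_step_index + 1 :]

    def get_step_output(self, output_key):
        """Return the most recent previous output stored under output_key."""
        for step in reversed(self.previous_steps):
            if output_key in step["output"]:
                return step["output"][output_key]
        return None


manager = StepManager()
manager.importing.append({"name": "import", "output": {"protein_df": "raw"}})
manager.data_preprocessing.append({"name": "filter", "output": {"protein_df": "filtered"}})
manager.data_analysis.append({"name": "t_test", "output": {}})
manager.current_step_index = 2
latest = manager.get_step_output("protein_df")  # taken from the "filter" step
```

Searching the previous steps in reverse order means a step always sees the most recent version of a dataframe, which is one reason to go through the StepManager instead of reading a step's output directly.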

Step -

The Step class in the protzilla project represents a single step in a data processing pipeline. Each step has a specific operation it performs, and it can have inputs and outputs.

Key attributes of the Step class include:

  • section: Represents the section of the pipeline this step belongs to.
  • display_name: The display name of the step.
  • operation: The operation this step performs.
  • input_keys: The keys for the input data this step requires.
  • output_keys: The keys for the output data this step produces.

Key methods of the Step class include:

  • calculate: The core calculation method for all steps. It receives the inputs from the front-end and calculates the output.
  • method: The main method that performs the operation of the step. This method must be implemented in a subclass.
  • handle_outputs: This method handles the outputs from the calculation method and creates an Output object from it.
  • validate_inputs: This method validates the inputs of the step.
  • validate_outputs: This method validates the outputs of the step.

The Step class is designed to be subclassed for specific types of steps. Each subclass must implement method, and may also need to override other methods depending on its specific requirements.
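The interplay of calculate, method and the validation hooks can be sketched like this. The base class below is a simplified stand-in written for illustration, not the real implementation: outputs are kept in a plain dictionary instead of an Output object, and ScaleIntensities is a hypothetical step.

```python
class Step:
    section = None
    display_name = None
    operation = None
    input_keys = []
    output_keys = []

    def __init__(self):
        self.inputs = {}
        self.output = {}

    def calculate(self, inputs):
        """Validate the inputs, run method, then store and validate the outputs."""
        self.inputs = dict(inputs)
        self.validate_inputs()
        result = self.method(self.inputs)
        self.handle_outputs(result)
        self.validate_outputs()

    def method(self, inputs):
        raise NotImplementedError("method must be implemented in a subclass")

    def validate_inputs(self):
        missing = [key for key in self.input_keys if key not in self.inputs]
        if missing:
            raise ValueError(f"missing input keys: {missing}")

    def handle_outputs(self, result):
        if not isinstance(result, dict):
            raise TypeError("method must return a dict")
        self.output = result

    def validate_outputs(self):
        missing = [key for key in self.output_keys if key not in self.output]
        if missing:
            raise ValueError(f"missing output keys: {missing}")


class ScaleIntensities(Step):  # hypothetical step, for illustration only
    section = "data_preprocessing"
    display_name = "Scale intensities"
    operation = "normalisation"
    input_keys = ["intensities", "factor"]
    output_keys = ["intensities"]

    def method(self, inputs):
        scaled = [value * inputs["factor"] for value in inputs["intensities"]]
        return {"intensities": scaled}


step = ScaleIntensities()
step.calculate({"intensities": [1.0, 2.0], "factor": 2})
```

The subclass only declares its metadata and implements method; the validation and output handling are inherited from the base class.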

DiskOperator -

The DiskOperator class in the protzilla project is responsible for reading data from and writing data to disk. It uses instances of YamlOperator and DataFrameOperator to handle YAML files and dataframes respectively.

Key attributes of the DiskOperator class include:

  • run_name: The name of the run.
  • workflow_name: The name of the workflow associated with the run.
  • yaml_operator: An instance of the YamlOperator class that handles reading and writing YAML files.
  • dataframe_operator: An instance of the DataFrameOperator class that handles reading and writing dataframes.

Key methods of the DiskOperator class include:

  • read_run: Reads a run from a YAML file and returns a StepManager object.
  • write_run: Writes a run to a YAML file.
  • read_workflow: Reads a workflow from a YAML file and returns a StepManager object.
  • export_workflow: Exports a workflow to a YAML file.
  • check_file_validity: Checks if a file is still needed or if it can be deleted.
  • clean_dataframes_dir: Deletes unnecessary dataframes from the directory.
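The clean-up behaviour of check_file_validity and clean_dataframes_dir can be illustrated with a small standalone sketch. The signatures here are assumptions: the functions are written as free functions and receive the set of file names that are still referenced by a step, whereas the real methods live on DiskOperator and may determine this differently.

```python
import tempfile
from pathlib import Path


def check_file_validity(file_path, referenced_names):
    """Return True if the file is still referenced by a step of the run."""
    return Path(file_path).name in referenced_names


def clean_dataframes_dir(dataframes_dir, referenced_names):
    """Delete every file in dataframes_dir that no step references any more."""
    removed = []
    for file_path in sorted(Path(dataframes_dir).iterdir()):
        if file_path.is_file() and not check_file_validity(file_path, referenced_names):
            file_path.unlink()
            removed.append(file_path.name)
    return removed


# Usage with a throwaway directory: only the unreferenced file is deleted.
with tempfile.TemporaryDirectory() as directory:
    for name in ("import.csv", "stale.csv"):
        (Path(directory) / name).write_text("placeholder")
    removed = clean_dataframes_dir(directory, referenced_names={"import.csv"})
    remaining = sorted(path.name for path in Path(directory).iterdir())
```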

How to add a new step

1. Implement the function itself

Determine the section the new function belongs to, e.g. protzilla/data_analysis/, and implement the function there, e.g. t_test(...).

2. Create the Method

Create a subclass of Step (or of an appropriate subclass of Step) in the correct file in protzilla/methods/. Determine the inputs required by the function you created in 1., as well as the outputs and the other metadata; take a look at other steps for reference. Important to note: if not all keys defined in `input_keys` are present, the input validation will fail. Also, all keys NOT mentioned in `input_keys` will be removed, to avoid passing too many parameters.

If the input cannot be passed directly from the frontend and/or information from previous steps is required for the calculation, implement the insert_dataframes(...) method. IMPORTANT: to get the outputs of previous steps, ALWAYS use the get_step_output() method of the StepManager.
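The two rules about input_keys, together with the role of insert_dataframes, can be sketched as follows. Everything here is a simplified assumption for illustration: filter_inputs and FakeStepManager are hypothetical helpers, the TTest class does not inherit from the real Step, and the (steps, inputs) parameters of insert_dataframes as well as the one-argument get_step_output may not match the actual signatures.

```python
def filter_inputs(inputs, input_keys):
    """Keep exactly the keys listed in input_keys; fail if one is missing."""
    missing = [key for key in input_keys if key not in inputs]
    if missing:
        raise ValueError(f"missing input keys: {missing}")
    return {key: inputs[key] for key in input_keys}


class FakeStepManager:
    """Stand-in that only offers the get_step_output lookup used below."""

    def __init__(self, outputs):
        self._outputs = outputs

    def get_step_output(self, output_key):
        return self._outputs[output_key]


class TTest:
    input_keys = ["protein_df", "alpha"]

    def insert_dataframes(self, steps, inputs):
        # Data from previous steps is fetched through the StepManager,
        # never by reaching into another step directly.
        inputs["protein_df"] = steps.get_step_output("protein_df")
        return inputs


steps = FakeStepManager({"protein_df": "table from a previous step"})
frontend_inputs = {"alpha": 0.05, "unrelated_field": "ignored"}
inputs = TTest().insert_dataframes(steps, frontend_inputs)
clean_inputs = filter_inputs(inputs, TTest.input_keys)
```

In this sketch, unrelated_field is dropped because it is not listed in input_keys, while a missing alpha would raise an error before the calculation starts.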

3. Creating the frontend form

Add a subclass of MethodForm to the corresponding file in ui/runs/forms/.

  • Parameters are added by creating an attribute in the class as a Field; for reference, take a look at other steps.
  • In the input dictionary that is passed to the backend, each key is the name of the attribute AS IT IS IN THE CODE ITSELF.

For example, the form

class TTestForm(MethodForm):
    alpha = CustomFloatField(
        label="Error rate (alpha)", min_value=0, max_value=1, initial=0.05
    )

will result in an input dictionary like this: {"alpha": 0.05}. Keep this in mind, as the keys need to match the parameter names of the function from 1. If this is not feasible, you can rename the keys in the insert_dataframes(...) method from 2.
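A quick way to see why the names have to match is to compare the keys of the input dictionary with the parameters of the function. The following sketch uses only the standard library; the t_test body is a placeholder, and unexpected_keys and rename_keys are hypothetical helpers that are not part of PROTzilla.

```python
import inspect


def t_test(protein_df, alpha):
    # Placeholder body: the real function would run the statistical test.
    return {"alpha": alpha, "n_rows": len(protein_df)}


def unexpected_keys(function, form_inputs):
    """Return the input keys that do not match any parameter of function."""
    parameters = inspect.signature(function).parameters
    return sorted(key for key in form_inputs if key not in parameters)


def rename_keys(inputs, renames):
    """Return a copy of inputs with keys renamed according to renames."""
    return {renames.get(key, key): value for key, value in inputs.items()}


matching = unexpected_keys(t_test, {"alpha": 0.05})          # no mismatch
mismatched = unexpected_keys(t_test, {"error_rate": 0.05})   # "error_rate" is unknown
fixed = rename_keys({"error_rate": 0.05}, {"error_rate": "alpha"})
```

A renaming step like rename_keys is the kind of adjustment that could be placed in insert_dataframes(...) when the form attribute cannot carry the same name as the function parameter.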

If you need to reference other steps, or want to give choices based on previous data of the run, or want to make the input fields dynamic in some way, implement the fill_form(...) method of the form. This will allow you to activate/deactivate fields, and fill the fields with data. provides some useful tools. For reference usage, take a look at other steps that implemented this method.

4. Linking it all together

In the ui/runs/ file, you need to link your Method of 2. to the Form of 3. After doing this, you should be able to add and use your new step in PROTzilla.