diff --git a/.nojekyll b/.nojekyll index 92a0ebd0..d38689ec 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -abf00c67 \ No newline at end of file +c0d3f531 \ No newline at end of file diff --git a/develop/01_RDM_intro.html b/develop/01_RDM_intro.html index 695bdc75..d90925e6 100644 --- a/develop/01_RDM_intro.html +++ b/develop/01_RDM_intro.html @@ -295,7 +295,7 @@

1. Introduction to RDM

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/02_DMP.html b/develop/02_DMP.html index 7b21d743..668f5142 100644 --- a/develop/02_DMP.html +++ b/develop/02_DMP.html @@ -261,7 +261,7 @@

2. Data Management Plan

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/03_DOD.html b/develop/03_DOD.html index 0506a6d6..d8d33bbe 100644 --- a/develop/03_DOD.html +++ b/develop/03_DOD.html @@ -311,7 +311,7 @@

3. Data organization and storage

Modified
-

April 22, 2024

+

April 25, 2024

@@ -955,23 +955,23 @@

Naming conventions

-
- diff --git a/develop/04_metadata.html b/develop/04_metadata.html index f0ac1c24..91075922 100644 --- a/develop/04_metadata.html +++ b/develop/04_metadata.html @@ -305,7 +305,7 @@

4. Documentation for biodata

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/05_VC.html b/develop/05_VC.html index c6f69427..2ed0ff9f 100644 --- a/develop/05_VC.html +++ b/develop/05_VC.html @@ -267,7 +267,7 @@

5. Data Analysis with Version Control

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/06_pipelines.html b/develop/06_pipelines.html index 8e7132ce..871d7283 100644 --- a/develop/06_pipelines.html +++ b/develop/06_pipelines.html @@ -246,7 +246,7 @@

6. Processing and analyzing biodata

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/07_repos.html b/develop/07_repos.html index 91e262e3..c0f490fe 100644 --- a/develop/07_repos.html +++ b/develop/07_repos.html @@ -260,7 +260,7 @@

7. Storing and sharing biodata

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/contributors.html b/develop/contributors.html index c393eef1..dfe90bf4 100644 --- a/develop/contributors.html +++ b/develop/contributors.html @@ -152,7 +152,7 @@

Practical material

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/examples/NGS_OS_FAIR.html b/develop/examples/NGS_OS_FAIR.html index 578193b3..05c16cce 100644 --- a/develop/examples/NGS_OS_FAIR.html +++ b/develop/examples/NGS_OS_FAIR.html @@ -244,7 +244,7 @@

Applied Open Science and FAIR principles to NGS

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/examples/NGS_management.html b/develop/examples/NGS_management.html index 146c14e6..78044004 100644 --- a/develop/examples/NGS_management.html +++ b/develop/examples/NGS_management.html @@ -272,7 +272,7 @@

NGS data strategies

Modified
-

April 22, 2024

+

April 25, 2024

diff --git a/develop/examples/NGS_metadata.html b/develop/examples/NGS_metadata.html index 2e557a3b..9a9fdb35 100644 --- a/develop/examples/NGS_metadata.html +++ b/develop/examples/NGS_metadata.html @@ -244,7 +244,7 @@

NGS Assay and Project metadata

Modified
-

April 22, 2024

+

April 25, 2024

@@ -279,23 +279,23 @@

Sample metadata fie
-
- @@ -864,23 +864,23 @@

Project metadata f
-
- @@ -1358,23 +1358,23 @@

Assay metadata field
-
- @@ -1962,23 +1962,23 @@

Assay metadata field
-
- diff --git a/develop/images/fork_repo_project.png b/develop/images/fork_repo_project.png index 94ab2629..3d5ec960 100644 Binary files a/develop/images/fork_repo_project.png and b/develop/images/fork_repo_project.png differ diff --git a/develop/practical_workshop.html b/develop/practical_workshop.html index 06debd4f..1296e34b 100644 --- a/develop/practical_workshop.html +++ b/develop/practical_workshop.html @@ -172,10 +172,8 @@

On this page

    -
  • 1. Organize and structure your NGS data and data analysis +
  • 1. Organize and structure your datasets and data analysis
  • 2. Metadata @@ -227,7 +225,7 @@

    Practical material

    Modified
    -

    April 22, 2024

    +

    April 25, 2024

    @@ -252,15 +250,15 @@

    Practical material

    💬 Learning Objectives:

    1. Organize and structure your data and data analysis with Cookiecutter templates
    2. -
    3. Establish metadata fields and collect metadata when creating a cookiecutter folder
    4. +
    5. Define metadata fields and collect metadata when creating a Cookiecutter folder
    6. Establish naming conventions for your data
    7. -
    8. Make a catalog of your data
    9. -
    10. Create GitHub repositories of your data analysis and display them as GitHub Pages
    11. +
    12. Create a catalog of your data
    13. +
    14. Use GitHub repositories of your data analysis and display them as GitHub Pages
    15. Archive GitHub repositories on Zenodo
-

This is a practical version of the full RDM on NGS data workshop. The main key points of the exercises shown here are to help you organize and structure your NGS datasets and your data analyses. We will see how to keep track of your experiments metadata and how to safely version control and archive your data analyses using GitHub repositories and Zenodo. We hope that through these practical exercises and step-by-step guidance, you’ll gain valuable skills in efficiently managing and sharing your research data, enhancing the reproducibility and impact of your work.

+

This practical version covers practical aspects of RDM applied to biodata. The exercises provided here aim to help you organize and structure your datasets and data analyses. You’ll learn how to manage your experimental metadata effectively and safely version control and archive your data analyses using GitHub repositories and Zenodo. Through these guided exercises and step-by-step instructions, we hope you will acquire essential skills for managing and sharing your research data efficiently, thereby enhancing the reproducibility and impact of your work.

@@ -271,17 +269,20 @@

Practical material

-

Ensure that all necessary tools and software are installed before proceeding with the practical exercises.

+

Ensure all necessary tools and software are installed before beginning the practical exercises:

    -
  • A GitHub account

  • -
  • Git

  • -
  • A Zenodo account

  • -
  • Python

  • -
  • pip

  • -
  • Cookicutter to create folder structure templates (pip install cookiecutter)

  • -
  • cruft to version control your templates (pip install cruft)

  • -
  • Option a. Install Quarto. We recommend Quarto as is easy to use and provides native support for notebooks (R Markdown and Jupyter Notebooks). No extensions or dependencies are needed.

  • -
  • Option b. Install MkDocs and MkDocs extensions using the command line.

  • +
  • A GitHub account for hosting and collaborating on projects
  • +
  • Git for version control of your projects
  • +
  • A Zenodo account for archiving and sharing your research outputs
  • +
  • Python
  • +
  • pip for managing Python packages
  • +
  • Cookiecutter for creating folder structure templates (pip install cookiecutter)
  • +
  • cruft to version control your templates (pip install cruft)
  • +
+

One of the following two tools is also required; choose the one you are familiar with, or go with the first option:

+
    +
  • Option a. Install Quarto. We recommend Quarto as it is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.
  • +
  • Option b. Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.
pip install mkdocs # create webpages
 pip install mkdocs-material # customize webpages
@@ -289,108 +290,208 @@ 

Practical material

pip install mkdocs-minify-plugin # Minimize html code
pip install mkdocs-git-revision-date-localized-plugin # display last updated date
pip install mkdocs-jupyter # include Jupyter notebooks
-pip install mkdocs-table-reader-plugin
-pip install mkdocs-bibtex # add references in your text (`.bib`)
-pip install neoteroi-mkdocs # create author cards
-pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
+pip install mkdocs-bibtex # add references in your text (`.bib`)
+pip install neoteroi-mkdocs # create author cards
+pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
-
-

1. Organize and structure your NGS data and data analysis

-

Applying a consistent file structure and naming conventions to your files will help you to efficiently manage your data. We will divide your NGS data and data analyses into two different types of folders:

+
+

1. Organize and structure your datasets and data analysis

+

Establishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:

    -
  1. Assay folders: These folders contain the raw and processed NGS datasets, as well as the pipeline/workflow used to generate the processed data, provenance of the raw data, and quality control reports of the data. This data should be locked and read-only to prevent unwanted modifications.
  2. -
  3. Project folders: These folders contain all the necessary files for a specific research project. A project may use several assays or results from other projects. The assay data should not be copied or duplicated, but linked from the source.
  4. +
  5. Data folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources; when you download external resources yourself, provide an MD5 checksum file so their integrity can be verified (see the sketch after this list).
  6. +
  7. Project folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated; instead, it should be linked directly from the source.
-
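A minimal sketch of the checksum step mentioned above, assuming GNU coreutils md5sum and gzipped fastq files as raw data (adapt the paths to your own layout):

# create checksums for all raw files in the dataset folder
md5sum raw/*.fastq.gz > CHECKSUMS
# verify integrity later, e.g. after a download or transfer
md5sum -c CHECKSUMS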

Projects and Assays are separated from each other because a project may use one or more assays to answer a scientific question, and assays may be reused several times in different projects. This could be, for example, all the data analysis related to a publication (an RNAseq and a ChIPseq experiment), or a comparison between a previous ATACseq experiment (which was used for a older project) with a new laboratory protocol.

-

You could also create Genomic resources folders things such as genome references (fasta files) and annotations (gtf files) for different species, as well as indexes for different alignment algorithms. If you want to know more, feel free to check the relevant full lesson

+

Data and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.

+
+ +
+
+
+
+

When organizing your data folders, separate assays from external resources and maintain a consistent structure. For example, organize genome references by species and further categorize them by versions. Make sure to include all relevant information, and refer to this lesson for additional tips on data organization.

This will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!
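One possible layout for such an external-resources folder, organized by species and genome version (the names below are only illustrative):

genomic_resources/
├── human/
│   └── GRCh38_release110/
│       ├── genome.fa
│       ├── annotation.gtf
│       └── indexes/
│           └── STAR/
└── mouse/
    └── GRCm39_release110/
        └── ...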

-
-

Assay folder

-

For each NGS experiment, there should be an Assay folder that will contain all experimental datasets, that is, an Assay (raw files and pipeline processed files). Raw files should not be modified at all, but you should probably lock modifications to the final results once you are done with preprocessing the data. This will help you prevent unwanted modifications to the data. Each Assay subfolder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. For example, you could name an NGS assay using an acronym for the type of NGS assay (RNAseq, ChIPseq, ATACseq), a keyword that represents a unique descriptive element of that assay, and the date. Like this:

-
<Assay-ID>_<keyword>_YYYYMMDD
-

For example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye. Next, let’s take a look at a possible folder structure and what kind of files you can find there.

-
CHIP_Oct4_20230101/
-├── README.md
-├── metadata.yml
-├── pipeline.md
-├── processed
-└── raw
-   ├── .fastq.gz
-   └── samplesheet.csv
+
+
+
+
+
+
+

Data folders

+

Whether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.

+
+ +
+
+
+
+

Use an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq), a keyword (2) that represents a unique element of that assay, and the date (3).

+
<Assay-ID>_<keyword>_YYYYMMDD
+

For example, CHIP_Oct4_20230101 is a ChIPseq assay performed on 1st January 2023 with the keyword Oct4, so it is easy to identify at a glance.

+
+
+
+
+
+

Let’s explore a potential folder structure and the types of files you might encounter within it.

+
<data_type>_<keyword>_YYYYMMDD/
+├── README.md 
+├── CHECKSUMS
+├── pipeline
+    ├── pipeline.md
+    ├── scripts/
+├── processed
+    ├── fastqc/
+    ├── multiqc/
+    ├── final_fastq/
+└── raw
+    ├── .fastq.gz 
+    └── samplesheet.csv
    -
  • README.md: Long description of the assay in markdown format. It should contain provenance of the raw NGS data (samples, laboratory protocols used, the aim of the assay, etc)
  • -
  • metadata.yml: metadata file for the assay describing different keys and important information regarding that assay (see this lesson).
  • -
  • pipeline.md: description of the pipeline used to process raw data, as well as the commands used to run the pipeline.
  • -
  • processed: folder with results of the preprocessing pipeline. Contents depend on the pipeline used.
  • -
  • raw: folder with the raw data. +
  • README.md: This file contains a detailed description of the dataset, commonly in markdown format. It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).
  • +
  • metadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.
  • +
  • pipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.
  • +
  • processed: This folder contains the results from the preprocessing pipeline. The contents vary depending on the specific pipeline used (create additional subdirectories as needed).
  • +
  • raw: This folder holds the raw data.
      -
    • .fastq.gz:In the case of NGS assays, there should be fastq files.
    • -
    • samplesheet.csv: file that contains metadata information for the samples. This file is used to run the nf-core pipelines. You can also add extra columns with info regarding the experimental variables and batches so it can be used for downstream analysis as well.
    • +
    • .fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.
    • +
    • samplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required (an illustrative example is shown after this list).
-
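An illustrative samplesheet.csv for a paired-end assay might look like the sketch below; the exact columns are an assumption here, since each pipeline (for example, the nf-core pipelines) documents the columns it expects:

sample,fastq_1,fastq_2,condition,batch
sample1,raw/sample1_R1.fastq.gz,raw/sample1_R2.fastq.gz,treatment,1
sample2,raw/sample2_R1.fastq.gz,raw/sample2_R2.fastq.gz,control,1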
-

Project folder

-

On the other hand, we have the other type of folder called Projects. In this folder, you will save a subfolder for each project that you (or your lab) work on. Each Project subfolder will contain project information and all the data analysis notebooks and scripts used in that project.

-

As like for an Assay folder, the Project folder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. For example, you could name it after the main author’s initials, a keyword that represents a unique descriptive element of that assay, and the date:

-
<author_initials>_<keyword>_YYYYMMDD
-

For example, JARH_Oct4_20230101, is a project about the gene Oct4 owned by Jose Alejandro Romero Herrera, created on the 1st of January of 2023.

-

Next, let’s take a look at a possible folder structure and what kind of files you can find there.

-
<author_initials>_<keyword>_YYYYMMDD
+
+

Project folders

+

On the other hand, we have another type of folder called Projects, which holds the data analyses tied to specific tasks, such as preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file listing the software and dependencies required for the project, including their versions. This helps ensure that the analyses can be easily replicated and shared with others.
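As a sketch, a minimal conda environment file (env.yaml) pinning the project's software could look like this; the package names and versions are only placeholders:

name: project_env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - r-base=4.3
  - pandas=2.1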

+

The Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:

+
<project>_<keyword>_YYYYMMDD
+
+ +
+
+
+
+
    +
  • RNASeq_Mouse_Brain_20230512: a project analyzing RNA sequencing data from a mouse brain experiment, created on May 12, 2023.
  • +
  • EHR_COVID19_Study_20230115: a project based on electronic health record data for a COVID-19 study, created on January 15, 2023.
  • +
+
+
+
+
+
+

Now, let’s explore an example of a folder structure and the types of files you might encounter within it.

+
<project>_<keyword>_YYYYMMDD
 ├── data
-  └── <Assay-ID>_<keyword>_YYYYMMDD/
+  └── <ID>_<keyword>_YYYYMMDD <- symbolic link
 ├── documents
-  └── Non-sensitive_NGS_research_project_template.docx
-├── notebooks
-  └── 01_data_analysis.rmd
-├── README.md
-├── reports
-  ├── figures
-  │  └── 01_data_analysis/
-  │   └── heatmap_sampleCor_20230102.png
-  └── 01_data_analysis.html
-├── requirements.txt
-├── results
-  └── 01_data_analysis/
-      └── DEA_treat-control_LFC1_p01.tsv
-├── scripts
-└── metadata.yml
+  └── research_project_template.docx
+├── metadata.yml
+├── notebooks
+  └── 01_data_processing.rmd
+  └── 02_data_analysis.rmd
+  └── 03_data_visualization.rmd
+├── README.md
+├── reports
+  └── 01_data_processing.html
+  └── 02_data_analysis.html
+  ├── 03_data_visualization.html
+  │   └── figures
+  │   └── tables
+├── requirements.txt // env.yaml
+├── results
+  ├── figures
+  │   └── 02_data_analysis/
+  │       └── heatmap_sampleCor_20230102.png
+  ├── tables
+  │   └── 02_data_analysis/
+  │       └── DEA_treat-control_LFC1_p01.tsv
+  │       └── SumStats_sampleCor_20230102.tsv
+├── pipeline
+  ├── rules // processes
+  │   └── step1_data_processing.smk
+  └── pipeline.md
+├── scratch
+└── scripts
    -
  • data: a folder that contains symlinks or shortcuts to where the data is, avoiding copying and modification of original files.
  • -
  • documents: a folder containing Word documents, slides, or PDFs related to the project, such as explanations of the data or project, papers, etc. It also contains your Data Management Plan. +
  • data: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered (see the symlink sketch after this list).
  • +
  • documents: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. It also includes the Data Management Plan.
      -
    • Non-sensitive_NGS_research_project_template.docx. This is a pre-filled Data Management Plan based on the Horizon Europe guidelines.
    • +
    • research_project_template.docx. If you download our template, you will find a pre-filled Data Management Plan based on the Horizon Europe guidelines, named ‘Non-sensitive_NGS_research_project_template.docx’.
  • -
  • notebooks: a folder containing Jupyter, R markdown, or Quarto notebooks with the actual data analysis.
  • -
  • README.md: detailed description of the project in markdown format.
  • -
  • reports: notebooks rendered as HTML/docx/pdf versions, ideal for sharing with colleagues and also as a formal report of the data analysis procedure. +
  • metadata.yml: metadata file describing various keys of the project or experiment (see this lesson).
  • +
  • notebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.
  • +
  • README.md: A detailed project description in markdown or plain-text format.
  • +
  • reports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.
    • figures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.
  • -
  • requirements.txt: file explaining what software and libraries/packages and their versions are necessary to reproduce the code.
  • -
  • results: results from the data analysis, such as tables with differentially expressed genes, enrichment results, etc.
  • -
  • scripts: folder containing helper scripts needed to run data analysis or reproduce the work of the folder
  • -
  • description.yml: a short description of the project.
  • -
  • metadata.yml: metadata file for the assay describing different keys (see this lesson).
  • +
  • requirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.
  • +
  • results: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.
  • +
  • pipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.
  • +
  • scratch: A folder designated for temporary files or workspace for experiments and development.
  • +
  • scripts: Folder for helper scripts needed to run data analysis or reproduce the work.
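A sketch of how the data folder can point at a dataset without copying it, assuming the dataset lives in a shared location (paths are illustrative):

# from inside the project folder
ln -s /shared/data/CHIP_Oct4_20230101 data/CHIP_Oct4_20230101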

Template engine

-

It is very easy to create a folder template using cookiecutter. Cookiecutter is a command-line utility that creates projects from cookiecutters (that is, a template), e.g. creating a Python package project from a Python package project template. Here you can find an example of a cookiecutter folder template-directed to NGS data, where we have applied the structures explained in the previous sections. You are very welcome to adapt it or modify it to your needs!

+

Creating a folder template is straightforward with cookiecutter, a command-line tool that generates projects from templates (called cookiecutters). For example, it can help you set up a Python package project based on a Python package project template.

+
+
+
+ +
+
+Cookiecutter templates +
+
+
+

Here are some templates that you can use to get started; adapt and modify them to your own needs:

+ +

Create your own template from scratch.

+
+

Quick tutorial on cookiecutter

-

Creating a Cookiecutter template from scratch involves defining a folder structure, creating a cookiecutter.json file, and specifying the placeholders (keywords) that will be replaced during project generation. Let’s walk through the process step by step:

+

Building a Cookiecutter template from scratch requires defining a folder structure, crafting a cookiecutter.json file, and outlining placeholders (keywords) that will be substituted when generating a new project. Here’s a step-by-step guide on how to proceed:

Step 1: Create a Folder Template
-

Start by creating a folder with the structure you want for your template. For example, let’s create a simple Python project template:

+

Begin by creating a folder structure that aligns with your desired template design. For instance, let's set up a simple Python project template:

my_template/
 |-- {{cookiecutter.project_name}}
 |   |-- main.py
 |-- tests
 |   |-- test_{{cookiecutter.project_name}}.py
 |-- README.md
-

In this example, {cookiecutter.project_name} is a placeholder that will be replaced with the actual project name when the template is used.

+

In this example, {{cookiecutter.project_name}} is a placeholder that will be replaced with the actual project name when the template is used. This directory contains a Python script (‘main.py’), a subdirectory (‘tests’) with a second Python script named after the project (‘test_{{cookiecutter.project_name}}.py’), and a ‘README.md’ file.

Step 2: Create cookiecutter.json
@@ -400,14 +501,15 @@
Step 2: Cr "author_name": "Your Name", "description": "A short description of your project" }
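For reference, a complete cookiecutter.json consistent with the placeholders used above might look like the following; the default values are only illustrative:

{
  "project_name": "my_project",
  "author_name": "Your Name",
  "description": "A short description of your project"
}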
-

These are the questions users will be asked when generating a project based on your template. The values provided here will be used to replace the corresponding placeholders in the template files.

-

In addition to replacing placeholders in files and directory names, Cookiecutter can also automatically fill in information within the contents of text files. This can be useful for providing default configurations or templates for code files. Let’s extend our previous example to include a placeholder inside a text file:

+

When users generate a project based on your template, they will be prompted with these questions. The provided values (“responses”) will be used to substitute the placeholders in your template files.

+

Beyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:

First, modify the my_template/main.py file to include a placeholder inside its contents:

# main.py
 
 def hello():
     print("Hello, {{cookiecutter.project_name}}!")
-

Now, the {cookiecutter.project_name} placeholder is inside the main.py file. When you run Cookiecutter, it will automatically replace the placeholders not only in file and directory names but also within the contents of text files. After running Cookiecutter, your generated main.py file might look like this:

+

The ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.

+

After running Cookiecutter, your generated ‘main.py’ file could appear as follows:

# main.py
 
 def hello():
@@ -415,15 +517,15 @@ 
Step 2: Cr
Step 3: Use Cookiecutter
-

Now that your template is set up, you can use Cookiecutter to generate a project based on it. Open a terminal and run:

+

Once your template is prepared, you can utilize Cookiecutter to create a project from it. Open a terminal and execute:

cookiecutter path/to/your/template
-

Cookiecutter will prompt you to fill in the values for project_name, author_name, and description. After you provide these values, Cookiecutter will replace the placeholders in your template files with the entered values.

+

Cookiecutter will prompt you to provide values for project_name, author_name, and description. Once you input these values, Cookiecutter will replace the placeholders in your template files with the entered values.

-
-
Step 4: Explore the Generated Project
-

Once the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will see a project structure with the placeholders replaced by the values you provided.

+
+
Step 4: Review the Generated Project
+

After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.

-
+
@@ -432,32 +534,84 @@
Step
-
+
-

Using Cookiecutter, create your own templates for your folders. You do not need to copy exactly our suggestions, adjust your template to your own needs!

-

Requirements:

-

Using Cookiecutter, create your own templates for your folders. You do not need to copy exactly our suggestions, adjust your template to your own needs! In order to create your cookiecutter template, you will need to install Python, cookiecutter, Git, and a GitHub account. If you do not have Git and a GitHub account, we suggest you do one as soon as possible. We will take a deeper look at Git and GitHub in the version control lesson.

-

We have prepared already two simple Cookiecutter templates in GitHub repositories.

-

Assay

-
    -
  1. First, fork our Assay folder template from the GitHub page into your own account/organization. fork_repo_example
  2. -
  3. Then, use git clone <your URL to the template> to put it on your computer.
  4. -
  5. Modify the contents of the repository so that it matches the Assay example above. You are welcome to make changes as you please!
  6. -
  7. Modify the cookiecutter.json file so that it will include the Assay name template
  8. -
  9. Git add, commit, and push your changes
  10. -
  11. Test your folder by using cookiecutter <URL to your GitHub repository for "assay-template>
  12. -
+

Use Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a GitHub repository (recommended for this workshop). Feel free to tailor the template to your specific requirements; you don’t have to follow our examples exactly.

+

Requirements: We assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.

Project

    -
  1. First, fork our Project folder template from the GitHub page into your own account/organization. fork_repo_example
  2. -
  3. Then, use git clone <your URL to the template> to put it on your computer.
  4. -
  5. Modify the contents of the repository so that it matches the Project example above. You are welcome to make changes as you please!
  6. -
  7. Modify the cookiecutter.json file so that it will include the Project name template
  8. -
  9. Git add, commit, and push your changes
  10. -
  11. Test your folder by using cookiecutter <URL to your GitHub repository for "project-template>
  12. +
  1. Go to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. fork_repo_example
+
    +
  2. Open a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):
  2. +
+
git clone <your URL to the template>
+

If you have GitHub Desktop, click Add and select “Clone repository” from the options.
  3. Open the repository and navigate through the different directories.
  4. Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folders’. For instance, this template is missing the ‘reports’ directory. Consider creating it, along with a subdirectory named ‘figures’. Here’s an example of how to do it:

+
cd \{\{\ cookiecutter.project_name\ \}\}/  
+mkdir -p reports/figures 
+touch requirements.txt
+
    +
  5. Modify the cookiecutter.json file. You could add new variables or change the default values:
+
# open a text editor
+ "author": "Alba Refoyo",
+
    +
  6. Commit and push changes when you are done with your modifications:
  2. +
+
    +
  • Stage the changes with ‘git add’
  • +
  • Commit the changes with a meaningful commit message ‘git commit -m “update cookiecutter template”’
  • +
  • Push the changes to your forked repository on GitHub ‘git push origin main’ (or the appropriate branch name)
  • +
+
    +
  7. Test your template by using cookiecutter <URL to your GitHub repository “cookiecutter-template”>. Fill in the variables and verify that the modified template looks as you would expect.
    +
  2. +
  8. Optional: You can customize or remove this prompt message (the __prompts__ section shown below) entirely, allowing you to tailor the text shown each time you use the template.
  4. +
+
"__prompts__": {
+    "project_name": "Project directory name [Example: project_short_description_202X]",
+    "author": "Author of the project",
+    "date": "Date of project creation, default is today's date",
+    "short_description": "Provide a detailed description of the project (context/content)"
+  },
+
+
+
+
+
+
+
+
+ +
+
+Optional Exercise 1, part B +
+
+
+
+
+
+
+

Create a template from scratch using this tutorial; it can be as basic as the one below, or modeled on the ‘Data folders’ structure above:

+
my_template/
+|-- {{cookiecutter.project_name}}
+|   |-- main.py
+|-- tests
+|   |-- test_{{cookiecutter.project_name}}.py
+|-- README.md
+
    +
  • Step 1: Create a directory for the template.
  • +
  • Step 2: Write a cookiecutter.json file with variables such as project_name and author.
  • +
  • Step 3: Set up the folder structure by creating subdirectories and files as needed.
  • +
  • Step 4: Incorporate cookiecutter variables in the names of files.
  • +
  • Step 5: Use cookiecutter variables within scripts, such as printing a message that includes the project name.
  • +
@@ -469,7 +623,7 @@
Step

2. Metadata

-

Metadata is the behind-the-scenes information that makes sense of data and gives context and structure. For NGS data, metadata includes information such as when and where the data was collected, what it represents, and how it was processed. Let’s check what kind of relevant metadata is available for NGS data and how to capture it in your Assay or Project folders. Both of these folders contain a metadata.yml file and a README.md file. In this section, we will check what kind of information you should collect in each of these files.

+

Metadata is the behind-the-scenes information that makes sense of data and gives context and structure. For biodata, metadata includes information such as when and where the data was collected, what it represents, and how it was processed. Let’s check what kind of relevant metadata is available for NGS data and how to capture it in your Assay or Project folders. Both of these folders contain a metadata.yml file and a README.md file. In this section, we will check what kind of information you should collect in each of these files.
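As a rough sketch, a metadata.yml could record fields like the ones below; the field names and values are only an example, and you should define the keys that matter for your lab:

project: RNA_humanSkin_20201030
author: Your Name
date: 2020-10-30
description: RNAseq assay of human skin samples
organism: Homo sapiens
technology: RNAseq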

@@ -487,31 +641,31 @@

2. Metadata

README.md file

The README.md file is a markdown file that allows you to write a long description of the data placed in a folder. Since it is a markdown file, you are able to write in rich text format (bold, italic, include links, etc) what is inside the folder, why it was created/collected, and how and when. If it is an Assay folder, you could include the laboratory protocol used to generate the samples, images explaining the experiment design, a summary of the results of the experiment, and any sort of comments that would help to understand the context of the experiment. On the other hand, a ‘Project’ README file may contain a description of the project, what are its aims, why is it important, what ‘Assays’ is it using, how to interpret the code notebooks, a summary of the results and, again, any sort of comments that would help to understand the project.

Here is an example of a README file for a Project folder:

-
# NGS Analysis Project: Exploring Gene Expression in Human Tissues
-
-## Aims
-
-This project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.
-
-## Why It's Important
-
-Understanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.
-
-## Datasets
-
-We have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.
-
-In addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.
-
-## Summary of Results
-
-Our analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.
-
-Furthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.
-
----
-
-For more details, refer to our [Jupyter Notebook](link-to-jupyter-notebook.ipynb) for the complete analysis pipeline and code.
+
# NGS Analysis Project: Exploring Gene Expression in Human Tissues
+
+## Aims
+
+This project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.
+
+## Why It's Important
+
+Understanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.
+
+## Datasets
+
+We have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.
+
+In addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.
+
+## Summary of Results
+
+Our analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.
+
+Furthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.
+
+---
+
+For more details, refer to our [Jupyter Notebook](link-to-jupyter-notebook.ipynb) for the complete analysis pipeline and code.

metadata.yml

@@ -545,23 +699,23 @@

Assay metadata field
-
- @@ -1151,23 +1305,23 @@

Project metadata f
-
- @@ -1648,7 +1802,7 @@

More info

  • Bionty: Biological ontologies for data scientists.
  • -
    +
    @@ -1657,7 +1811,7 @@

    More info

    -
    +
    @@ -1667,7 +1821,7 @@

    More info

  • Modify the cookiecutter.json file so that when you create a new folder template, all the metadata is filled accordingly.
  • - -
    +
    @@ -1695,7 +1849,7 @@

    More info

  • Modify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file.
  • - -
    +
    @@ -1761,23 +1915,23 @@

    Suggestions for N
    -
    - @@ -2340,7 +2494,7 @@

    Suggestions for N

    -
    +
    @@ -2349,7 +2503,7 @@

    Suggestions for N

    -
    +
    @@ -2365,7 +2519,7 @@

    Suggestions for N

    4. Create a catalog of your assay folder

The next step is to collect all the NGS datasets that you have created in the manner explained above. Since your folders should all contain the metadata.yml file in the same place with the same metadata, it is easy to iterate through all the folders and merge all the metadata.yml files into a single table. This table can then be browsed easily with Microsoft Excel, for example. If you are interested in making a Shiny app or Python Panel tool to interactively browse the catalog, check out this lesson.

    -
    +
    @@ -2374,7 +2528,7 @@

    4. C

    -
    +
    @@ -2385,34 +2539,34 @@

    4. C
  • Run the script below with R (or create your own with Python). Modify the folder_path variable so it matches the path to the folder Assays. The table will be written under the same folder_path.
  • Visualize your Assays table with Excel
  • -
    
    -library(yaml)
    -library(dplyr)
    -library(lubridate)
    -
    -# Function to recursively fetch metadata.yml files
    -get_metadata <- function(folder_path) {
    -    file_list <- list.files(path = folder_path, pattern = "metadata\\.yml$", recursive = TRUE, full.names = TRUE)
    -    metadata_list <- lapply(file_list, yaml::yaml.load_file)
    -    return(metadata_list)
    -    }
    -
    -# Specify the folder path
    -    folder_path <- "/path/to/your/folder"
    -
    -    # Fetch metadata from the specified folder
    -    metadata <- get_metadata(folder_path)
    -
    -    # Convert metadata to a data frame
    -    metadata_df <- data.frame(matrix(unlist(metadata), ncol = length(metadata), byrow = TRUE))
    -    colnames(metadata_df) <- names(metadata[[1]])
    -
    -    # Save the data frame as a TSV file
    -    output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv")
    -    write.table(metadata_df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
    -
    -    # Print confirmation message
    -    cat("Database saved as", output_file, "\n")
    +
    
    +library(yaml)
    +library(dplyr)
    +library(lubridate)
    +
    +# Function to recursively fetch metadata.yml files
    +get_metadata <- function(folder_path) {
    +    file_list <- list.files(path = folder_path, pattern = "metadata\\.yml$", recursive = TRUE, full.names = TRUE)
    +    metadata_list <- lapply(file_list, yaml::yaml.load_file)
    +    return(metadata_list)
    +    }
    +
    +# Specify the folder path
    +    folder_path <- "/path/to/your/folder"
    +
    +    # Fetch metadata from the specified folder
    +    metadata <- get_metadata(folder_path)
    +
    +    # Convert metadata to a data frame
+    metadata_df <- data.frame(matrix(unlist(metadata), nrow = length(metadata), byrow = TRUE)) # one row per metadata.yml file
    +    colnames(metadata_df) <- names(metadata[[1]])
    +
    +    # Save the data frame as a TSV file
    +    output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv")
    +    write.table(metadata_df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
    +
    +    # Print confirmation message
    +    cat("Database saved as", output_file, "\n")

    @@ -2461,7 +2615,7 @@

    GitHub Pages

    Once you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we really recommend that you follow the nice tutorial that GitHub has put for you. Nonetheless, we will see the main steps in the exercise below.

    There are many different ways to create your web pages. We recommend using Mkdocs and Mkdocs materials as a framework to create a nice webpage simply. The folder templates that we used as an example in the previous exercise already contain everything you need to start a webpage. Nonetheless, you will need to understand the basics of MkDocs and MkDocs materials to design a webpage to your liking. MkDocs is a static webpage generator that is very easy to use, while MkDocs materials is an extension of the tool that gives you many more options to customize your website. Check out their web pages to get started!
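If you go the MkDocs route, a minimal mkdocs.yml is enough to get a first site running; the site name and navigation entries below are placeholders to adapt to your repository:

site_name: My data analysis project
theme:
  name: material
nav:
  - Home: index.md
  - Reports: 01_data_analysis.md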

    -
    +
    @@ -2470,7 +2624,7 @@

    GitHub Pages

    -
    +
    @@ -2529,7 +2683,7 @@

    Zenodo

Zenodo (https://zenodo.org/) is an open-access digital repository designed to facilitate the archiving of scientific research outputs. It operates under the umbrella of the European Organization for Nuclear Research (CERN) and is supported by the European Commission. Zenodo accommodates a broad spectrum of research outputs, including datasets, papers, software, and multimedia files. This versatility makes it an invaluable resource for researchers across a wide array of domains, promoting transparency, collaboration, and the advancement of knowledge on a global scale.

    Operating on a user-friendly web platform, Zenodo allows researchers to easily upload, share, and preserve their research data and related materials. Upon deposit, each item is assigned a unique Digital Object Identifier (DOI), granting it a citable status and ensuring its long-term accessibility. Additionally, Zenodo provides robust metadata capabilities, enabling researchers to enrich their submissions with detailed contextual information. In addition, it allows you to link your GitHub account, providing a streamlined way to archive a specific release of your GitHub repository directly into Zenodo. This integration simplifies the process of preserving a snapshot of your project’s progress for long-term accessibility and citation.

    -
    +
    @@ -2538,7 +2692,7 @@

    Zenodo

    -
    +
    diff --git a/index.html b/index.html index fc2fa8a4..6746e252 100644 --- a/index.html +++ b/index.html @@ -164,7 +164,7 @@

    Computational Research Data Management

    Modified
    -

    April 22, 2024

    +

    April 25, 2024

    diff --git a/practical_workflows.html b/practical_workflows.html index f93e65d1..4c13ac2f 100644 --- a/practical_workflows.html +++ b/practical_workflows.html @@ -190,7 +190,7 @@
    Modified
    -

    April 22, 2024

    +

    April 25, 2024

    @@ -201,9 +201,51 @@ +
    +
    +
    + +
    +
    +Course Overview +
    +
    +
    +
      +
    • Total Time Estimation: X hours
      +
    • +
    • 📁 Supporting Materials:
      +
    • +
    • 👨‍💻 Target Audience: Ph.D., MSc, anyone interested in workflow management systems for High-Throughput data or other related fields within bioinformatics.
    • +
    • 👩‍🎓 Level: Advanced.
    • +
    • 🔒 License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
      +
    • +
    • 💰 Funding: This project was funded by the Novo Nordisk Fonden (NNF20OC0063268).
    • +
    +
    +
    +
    +
    +
    + +
    +
    +Course Goals +
    +
    +
    +
      +
    • Create analysis pipelines
    • +
    • Specify software and computational resource needs
    • +
    • Customise your pipeline to accept user-defined configurations (params)
    • +
    • Create reproducible analyses that can be adapted to new data with little effort
    • +
    +
    +

    Workflows

    -

    Data analyses usually entail the application of various tools, algorithms and scripts. Workflow management handles parallelization, resume, logging and data provenance. If you develop your own software make sure you follow FAIR principles. We highly endorse following these FAIR recommendations and to register your computational workflow here.

    +

    Data analysis typically involves the use of different tools, algorithms, and scripts. It often requires multiple steps to transform, filter, aggregate, and visualize data. The process can be time-consuming because each tool may demand specific inputs and parameter settings. As analyses become more complex, the importance of reproducible and scalable automated workflow management increases. Workflow management encompasses tasks such as parallelization, resumption, logging, and data provenance.

    +

If you develop your own software, make sure you follow the FAIR principles. We highly endorse following these FAIR recommendations and registering your computational workflow here.

    Using workflow managers, you ensure:

    • automation
    • @@ -213,10 +255,11 @@

      Workflows

    • scalability
    • readable
    -

    Some of the most popular workflow management systems are snakemake, nextflow and galaxy.

    +

    Popular workflow management systems such as Snakemake, Nextflow, and Galaxy can be scaled effortlessly across server, cluster, and cloud environments without altering the workflow definition. They also allow for specifying the necessary software, ensuring the workflows can be deployed in any setting.

    +

During this lesson, you will learn about:
- Syntax: understand the syntax of two workflow languages.
- Defining steps: how to define a step in each language (a rule in Snakemake, a process in Nextflow), including specifying inputs, outputs, and execution statements.
- Generalizing steps: explore how to generalise steps and create a chain of dependencies across multiple steps using wildcards (Snakemake) or parameters and channel operators (Nextflow).
- Advanced customisation: gain knowledge of advanced pipeline customisation using configuration files and custom-made functions.
- Scaling workflows: understand how to scale workflows to compute servers and clusters while adapting to hardware-specific constraints.

    -

    Snakemake

    -

    Text-based using python plus domain specific syntax. The workflow is decompose into rules that are define to obtain output files from input files. It infers dependencies and the execution order.

    +

    Snakemake

    +

It is a text-based tool using a Python-based language plus domain-specific syntax. The workflow is decomposed into rules that define how to obtain output files from input files. Snakemake infers the dependencies between rules and the execution order.
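A minimal sketch of a Snakemake rule, with illustrative file names and a simple shell command, showing how inputs, outputs, and the execution statement are declared:

rule count_lines:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.counts"
    shell:
        "wc -l {input} > {output}"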

    Basics

      @@ -298,7 +341,7 @@

      | - Snakefile

    Create conda environment, one per project!

    # create env
    -conda create -n myworklow --file requierments.txt
+conda create -n myworkflow --file requirements.txt
     # activate environment
     source activate myworkflow
     # then execute snakemake
    @@ -311,7 +354,8 @@

    Nextflow

    Sources

      -
    • https://bitbucket.org/johanneskoester/snakemake
    • +
    • Snakemake tutorial
    • +
  • Snakemake tutorial slides by Johannes Köster
    • https://bioconda.github.io
    • Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
    • Köster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014.
    • diff --git a/search.json b/search.json index 3e2c7753..5e797b26 100644 --- a/search.json +++ b/search.json @@ -51,21 +51,21 @@ "href": "develop/practical_workshop.html", "title": "Practical material", "section": "", - "text": "Course Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nOrganize and structure your data and data analysis with Cookiecutter templates\nEstablish metadata fields and collect metadata when creating a cookiecutter folder\nEstablish naming conventions for your data\nMake a catalog of your data\nCreate GitHub repositories of your data analysis and display them as GitHub Pages\nArchive GitHub repositories on Zenodo\nThis is a practical version of the full RDM on NGS data workshop. The main key points of the exercises shown here are to help you organize and structure your NGS datasets and your data analyses. We will see how to keep track of your experiments metadata and how to safely version control and archive your data analyses using GitHub repositories and Zenodo. We hope that through these practical exercises and step-by-step guidance, you’ll gain valuable skills in efficiently managing and sharing your research data, enhancing the reproducibility and impact of your work." + "text": "Course Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nOrganize and structure your data and data analysis with Cookiecutter templates\nDefine metadata fields and collect metadata when creating a Cookiecutter folder\nEstablish naming conventions for your data\nCreate a catalog of your data\nUse GitHub repositories of your data analysis and display them as GitHub Pages\nArchive GitHub repositories on Zenodo\nThis practical version covers practical aspects of RDM applied to biodata. The exercises provided here aim to help you organize and structure your datasets and data analyses. You’ll learn how to manage your experimental metadata effectively and safely version control and archive your data analyses using GitHub repositories and Zenodo. Through these guided exercises and step-by-step instructions, we hope you will acquire essential skills for managing and sharing your research data efficiently, thereby enhancing the reproducibility and impact of your work." }, { - "objectID": "develop/practical_workshop.html#organize-and-structure-your-ngs-data-and-data-analysis", - "href": "develop/practical_workshop.html#organize-and-structure-your-ngs-data-and-data-analysis", + "objectID": "develop/practical_workshop.html#organize-and-structure-your-datasets-and-data-analysis", + "href": "develop/practical_workshop.html#organize-and-structure-your-datasets-and-data-analysis", "title": "Practical material", - "section": "1. Organize and structure your NGS data and data analysis", - "text": "1. Organize and structure your NGS data and data analysis\nApplying a consistent file structure and naming conventions to your files will help you to efficiently manage your data. We will divide your NGS data and data analyses into two different types of folders:\n\nAssay folders: These folders contain the raw and processed NGS datasets, as well as the pipeline/workflow used to generate the processed data, provenance of the raw data, and quality control reports of the data. This data should be locked and read-only to prevent unwanted modifications.\nProject folders: These folders contain all the necessary files for a specific research project. A project may use several assays or results from other projects. 
The assay data should not be copied or duplicated, but linked from the source.\n\nProjects and Assays are separated from each other because a project may use one or more assays to answer a scientific question, and assays may be reused several times in different projects. This could be, for example, all the data analysis related to a publication (an RNAseq and a ChIPseq experiment), or a comparison between a previous ATACseq experiment (which was used for a older project) with a new laboratory protocol.\nYou could also create Genomic resources folders things such as genome references (fasta files) and annotations (gtf files) for different species, as well as indexes for different alignment algorithms. If you want to know more, feel free to check the relevant full lesson\nThis will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!\n\nAssay folder\nFor each NGS experiment, there should be an Assay folder that will contain all experimental datasets, that is, an Assay (raw files and pipeline processed files). Raw files should not be modified at all, but you should probably lock modifications to the final results once you are done with preprocessing the data. This will help you prevent unwanted modifications to the data. Each Assay subfolder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. For example, you could name an NGS assay using an acronym for the type of NGS assay (RNAseq, ChIPseq, ATACseq), a keyword that represents a unique descriptive element of that assay, and the date. Like this:\n<Assay-ID>_<keyword>_YYYYMMDD\nFor example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye. Next, let’s take a look at a possible folder structure and what kind of files you can find there.\nCHIP_Oct4_20230101/\n├── README.md\n├── metadata.yml\n├── pipeline.md\n├── processed\n└── raw\n ├── .fastq.gz\n └── samplesheet.csv\n\nREADME.md: Long description of the assay in markdown format. It should contain provenance of the raw NGS data (samples, laboratory protocols used, the aim of the assay, etc)\nmetadata.yml: metadata file for the assay describing different keys and important information regarding that assay (see this lesson).\npipeline.md: description of the pipeline used to process raw data, as well as the commands used to run the pipeline.\nprocessed: folder with results of the preprocessing pipeline. Contents depend on the pipeline used.\nraw: folder with the raw data.\n\n.fastq.gz:In the case of NGS assays, there should be fastq files.\nsamplesheet.csv: file that contains metadata information for the samples. This file is used to run the nf-core pipelines. You can also add extra columns with info regarding the experimental variables and batches so it can be used for downstream analysis as well.\n\n\n\n\nProject folder\nOn the other hand, we have the other type of folder called Projects. In this folder, you will save a subfolder for each project that you (or your lab) work on. Each Project subfolder will contain project information and all the data analysis notebooks and scripts used in that project.\nAs like for an Assay folder, the Project folder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. 
For example, you could name it after the main author’s initials, a keyword that represents a unique descriptive element of that assay, and the date:\n<author_initials>_<keyword>_YYYYMMDD\nFor example, JARH_Oct4_20230101, is a project about the gene Oct4 owned by Jose Alejandro Romero Herrera, created on the 1st of January of 2023.\nNext, let’s take a look at a possible folder structure and what kind of files you can find there.\n<author_initials>_<keyword>_YYYYMMDD\n├── data\n│ └── <Assay-ID>_<keyword>_YYYYMMDD/\n├── documents\n│ └── Non-sensitive_NGS_research_project_template.docx\n├── notebooks\n│ └── 01_data_analysis.rmd\n├── README.md\n├── reports\n│ ├── figures\n│ │ └── 01_data_analysis/\n│ │ └── heatmap_sampleCor_20230102.png\n│ └── 01_data_analysis.html\n├── requirements.txt\n├── results\n│ └── 01_data_analysis/\n│ └── DEA_treat-control_LFC1_p01.tsv\n├── scripts\n└── metadata.yml\n\ndata: a folder that contains symlinks or shortcuts to where the data is, avoiding copying and modification of original files.\ndocuments: a folder containing Word documents, slides, or PDFs related to the project, such as explanations of the data or project, papers, etc. It also contains your Data Management Plan.\n\nNon-sensitive_NGS_research_project_template.docx. This is a pre-filled Data Management Plan based on the Horizon Europe guidelines.\n\nnotebooks: a folder containing Jupyter, R markdown, or Quarto notebooks with the actual data analysis.\nREADME.md: detailed description of the project in markdown format.\nreports: notebooks rendered as HTML/docx/pdf versions, ideal for sharing with colleagues and also as a formal report of the data analysis procedure.\n\nfigures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.\n\nrequirements.txt: file explaining what software and libraries/packages and their versions are necessary to reproduce the code.\nresults: results from the data analysis, such as tables with differentially expressed genes, enrichment results, etc.\nscripts: folder containing helper scripts needed to run data analysis or reproduce the work of the folder\ndescription.yml: a short description of the project.\nmetadata.yml: metadata file for the assay describing different keys (see this lesson).\n\n\n\nTemplate engine\nIt is very easy to create a folder template using cookiecutter. Cookiecutter is a command-line utility that creates projects from cookiecutters (that is, a template), e.g. creating a Python package project from a Python package project template. Here you can find an example of a cookiecutter folder template-directed to NGS data, where we have applied the structures explained in the previous sections. You are very welcome to adapt it or modify it to your needs!\n\nQuick tutorial on cookiecutter\nCreating a Cookiecutter template from scratch involves defining a folder structure, creating a cookiecutter.json file, and specifying the placeholders (keywords) that will be replaced during project generation. Let’s walk through the process step by step:\n\nStep 1: Create a Folder Template\nStart by creating a folder with the structure you want for your template. 
For example, let’s create a simple Python project template:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\nIn this example, {cookiecutter.project_name} is a placeholder that will be replaced with the actual project name when the template is used.\n\n\nStep 2: Create cookiecutter.json\nIn the root of your template folder, create a file named cookiecutter.json. This file will define the variables (keywords) that users will be prompted to fill in. For our Python project template, it might look like this:\n{\n \"project_name\": \"MyProject\",\n \"author_name\": \"Your Name\",\n \"description\": \"A short description of your project\"\n}\nThese are the questions users will be asked when generating a project based on your template. The values provided here will be used to replace the corresponding placeholders in the template files.\nIn addition to replacing placeholders in files and directory names, Cookiecutter can also automatically fill in information within the contents of text files. This can be useful for providing default configurations or templates for code files. Let’s extend our previous example to include a placeholder inside a text file:\nFirst, modify the my_template/main.py file to include a placeholder inside its contents:\n# main.py\n\ndef hello():\n print(\"Hello, {{cookiecutter.project_name}}!\")\nNow, the {cookiecutter.project_name} placeholder is inside the main.py file. When you run Cookiecutter, it will automatically replace the placeholders not only in file and directory names but also within the contents of text files. After running Cookiecutter, your generated main.py file might look like this:\n# main.py\n\ndef hello():\n print(\"Hello, MyProject!\") # Assuming \"MyProject\" was entered as the project_name\n\n\nStep 3: Use Cookiecutter\nNow that your template is set up, you can use Cookiecutter to generate a project based on it. Open a terminal and run:\ncookiecutter path/to/your/template\nCookiecutter will prompt you to fill in the values for project_name, author_name, and description. After you provide these values, Cookiecutter will replace the placeholders in your template files with the entered values.\n\n\nStep 4: Explore the Generated Project\nOnce the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will see a project structure with the placeholders replaced by the values you provided.\n\n\n\n\n\n\nExercise 1: Create your own template\n\n\n\n\n\n\n\nUsing Cookiecutter, create your own templates for your folders. You do not need to copy exactly our suggestions, adjust your template to your own needs!\nRequirements:\nUsing Cookiecutter, create your own templates for your folders. You do not need to copy exactly our suggestions, adjust your template to your own needs! In order to create your cookiecutter template, you will need to install Python, cookiecutter, Git, and a GitHub account. If you do not have Git and a GitHub account, we suggest you do one as soon as possible. We will take a deeper look at Git and GitHub in the version control lesson.\nWe have prepared already two simple Cookiecutter templates in GitHub repositories.\nAssay\n\nFirst, fork our Assay folder template from the GitHub page into your own account/organization. \nThen, use git clone <your URL to the template> to put it on your computer.\nModify the contents of the repository so that it matches the Assay example above. 
You are welcome to make changes as you please!\nModify the cookiecutter.json file so that it will include the Assay name template\nGit add, commit, and push your changes\nTest your folder by using cookiecutter <URL to your GitHub repository for \"assay-template>\n\nProject\n\nFirst, fork our Project folder template from the GitHub page into your own account/organization. \nThen, use git clone <your URL to the template> to put it on your computer.\nModify the contents of the repository so that it matches the Project example above. You are welcome to make changes as you please!\nModify the cookiecutter.json file so that it will include the Project name template\nGit add, commit, and push your changes\nTest your folder by using cookiecutter <URL to your GitHub repository for \"project-template>" + "section": "1. Organize and structure your datasets and data analysis", + "text": "1. Organize and structure your datasets and data analysis\nEstablishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:\n\nData folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.\nProject folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.\n\nData and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nWhen organizing your data folders, separate assays from external resources and maintain a consistent structure. For example, organize genome references by species and further categorize them by versions. Make sure to include all relevant information, and refer to this lesson for additional tips on data organization.\nThis will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!\n\n\n\n\n\n\nData folders\nWhether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. 
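As a concrete illustration of the integrity checks and read-only locking described above, here is a minimal shell sketch (assuming a Unix-like system with GNU coreutils; the dataset and project folder names are illustrative only, reusing the CHIP_Oct4_20230101 example):

```bash
# Inside a data folder such as CHIP_Oct4_20230101/

# 1. Record checksums for the raw files and re-verify them after any download or transfer.
md5sum raw/*.fastq.gz raw/samplesheet.csv > CHECKSUMS
md5sum -c CHECKSUMS

# 2. Lock the raw data (and, once preprocessing is finished, the processed results) as read-only.
chmod -R a-w raw/ processed/

# 3. From a project folder (hypothetical name), link the dataset instead of copying it.
ln -s /path/to/data/CHIP_Oct4_20230101 /path/to/projects/oct4_perturbation_20230110/data/
```

The symbolic link keeps a single authoritative copy of the dataset while every project that uses it still sees the files under its own `data/` folder.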
Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nUse an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq) a keyword (2) that represents a unique element to that assay, and the date (3).\n<Assay-ID>_<keyword>_YYYYMMDD\nFor example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye.\n\n\n\n\n\nLet’s explore a potential folder structure and the types of files you might encounter within it.\n<data_type>_<keyword>_YYYYMMDD/\n├── README.md \n├── CHECKSUMS\n├── pipeline\n ├── pipeline.md\n ├── scripts/\n├── processed\n ├── fastqc/\n ├── multiqc/\n ├── final_fastq/\n└── raw\n ├── .fastq.gz \n └── samplesheet.csv\n\nREADME.md: This file contains a detailed description of the dataset commonly in markdown format. It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).\nmetadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.\npipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.\nprocessed: This folder contains the results from the preprocessing pipeline. The content vary depending on the specific pipeline used (create additional subdirectories as needed).\nraw: This folder holds the raw data.\n\n.fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.\nsamplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.\n\n\n\n\nProject folders\nOn the other hand, we have another type of folder called Projects which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. This helps ensure that the analyses can be easily replicated and shared with others.\nThe Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. 
For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:\n<project>_<keyword>_YYYYMMDD\n\n\n\n\n\n\nNaming examples\n\n\n\n\n\n\n\n\nRNASeq_Mouse_Brain_20230512: a project RNA sequencing data from a mouse brain experiment, created on May 12, 2023\nEHR_COVID19_Study_20230115: a project around electronic health records data for a COVID-19 study, created on January 15, 2023.\n\n\n\n\n\n\nNow, let’s explore an example of a folder structure and the types of files you might encounter within it.\n<project>_<keyword>_YYYYMMDD\n├── data\n│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link\n├── documents\n│ └── research_project_template.docx\n├── metadata.yml\n├── notebooks\n│ └── 01_data_processing.rmd\n│ └── 02_data_analysis.rmd\n│ └── 03_data_visualization.rmd\n├── README.md\n├── reports\n│ └── 01_data_processing.html\n│ └── 02_data_analysis.html\n│ ├── 03_data_visualization.html\n│ │ └── figures\n│ │ └── tables\n├── requirements.txt // env.yaml\n├── results\n│ ├── figures\n│ │ └── 02_data_analysis/\n│ │ └── heatmap_sampleCor_20230102.png\n│ ├── tables\n│ │ └── 02_data_analysis/\n│ │ └── DEA_treat-control_LFC1_p01.tsv\n│ │ └── SumStats_sampleCor_20230102.tsv\n├── pipeline\n│ ├── rules // processes \n│ │ └── step1_data_processing.smk\n│ └── pipeline.md\n├── scratch\n└── scripts\n\ndata: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.\ndocuments: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. It also includes the Data Management Plan.\n\nresearch_project_template.docx. If you download our template you will find a is a pre-filled Data Management Plan based on the Horizon Europe guidelines named ‘Non-sensitive_NGS_research_project_template.docx’.\n\nmetadata.yml: metadata file describing various keys of the project or experiment (see this lesson).\nnotebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.\nREADME.md: A detailed project description in markdown or plain-text format.\nreports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.\n\nfigures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.\n\nrequirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.\nresults: This folder contains analysis results, such as figures and tables. 
Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.\npipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.\nscratch: A folder designated for temporary files or workspace for experiments and development.\nscripts: Folder for helper scripts needed to run data analysis or reproduce the work.\n\n\n\nTemplate engine\nCreating a folder template is straightforward with cookiecutter a command-line tool that generates projects from templates (called cookiecutters). For example, it can help you set up a Python package project based on a Python package project template.\n\n\n\n\n\n\nCookiecutter templates\n\n\n\nHere are some template that you can use to get started, adapt and modify them to your own needs:\n\nPython package project\nSandbox test\nData science\nNGS data\n\nCreate your own template from scratch.\n\n\n\nQuick tutorial on cookiecutter\nBuilding a Cookiecutter template from scratch requires defining a folder structure, crafting a cookiecutter.json file, and outlining placeholders (keywords) that will be substituted when generating a new project. Here’s a step-by-step guide on how to proceed:\n\nStep 1: Create a Folder Template\nFirst, begin by creating a folder structure that aligns with your desired template design. For instance, let’s set up a simple Python project template:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\nIn this example, {cookiecutter.project_name} is a placeholder that will be replaced with the actual project name when the template is used. This directory contains a python script (‘main.py’), a subdirectory (‘tests’) with a second python script named after the project (‘test_{{cookiecutter.project_name}}.py’) and a ‘README.md’ file.\n\n\nStep 2: Create cookiecutter.json\nIn the root of your template folder, create a file named cookiecutter.json. This file will define the variables (keywords) that users will be prompted to fill in. For our Python project template, it might look like this:\n{\n \"project_name\": \"MyProject\",\n \"author_name\": \"Your Name\",\n \"description\": \"A short description of your project\"\n}\nWhen users generate a project based on your template, they will be prompted with these questions. The provided values (“responses”) will be used to substitute the placeholders in your template files.\nBeyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:\nFirst, modify the my_template/main.py file to include a placeholder inside its contents:\n# main.py\n\ndef hello():\n print(\"Hello, {{cookiecutter.project_name}}!\")\nThe ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.\nAfter running Cookiecutter, your generated ‘main.py’ file could appear as follows:\n# main.py\n\ndef hello():\n print(\"Hello, MyProject!\") # Assuming \"MyProject\" was entered as the project_name\n\n\nStep 3: Use Cookiecutter\nOnce your template is prepared, you can utilize Cookiecutter to create a project from it. 
Open a terminal and execute:\ncookiecutter path/to/your/template\nCookiecutter will prompt you to provide values for project_name, author_name, and description. Once you input these values, Cookiecutter will replace the placeholders in your template files with the entered values.\n\n\nStep 4: Review the Generated Project\nAfter the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.\n\n\n\n\n\n\nExercise 1: Create your own template\n\n\n\n\n\n\n\nUse Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a GitHub repository (recommended for this workshop). Feel free to tailor the template to your specific requirements; you don’t have to follow our examples exactly.\nRequirements\nWe assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.\nProject\n\nGo to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. \n\n\nOpen a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):\n\ngit clone <your URL to the template>\nIf you have GitHub Desktop, click Add and select “Clone repository” from the options.\n3. Open the repository and navigate through the different directories.\n4. Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory. Consider creating it, along with a subdirectory named ‘figures’. Here’s an example of how to do it:\ncd \\{\\{\\ cookiecutter.project_name\\ \\}\\}/ \nmkdir -p reports/figures \ntouch requirements.txt\n\nModify the cookiecutter.json file. 
You could add new variables or change the default values:\n\n# open a text editor\n \"author\": \"Alba Refoyo\",\n\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with ‘git add’\nCommit the changes with a meaningful commit message ‘git commit -m “update cookicutter template”’\nPush the changes to your forked repository on Github ‘git push origin main’ (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\"> Fill up the variables and verify that the modified template looks like you would expect.\n\nOptional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.\n\n\"__prompts__\": {\n \"project_name\": \"Project directory name [Example: project_short_description_202X]\",\n \"author\": \"Author of the project\",\n \"date\": \"Date of project creation, default is today's date\",\n \"short_description\": \"Provide a detailed description of the project (context/content)\"\n },\n\n\n\n\n\n\n\n\n\n\n\nOptional Exercise 1, part B\n\n\n\n\n\n\n\nCreate a template from scratch using this tutorial scratch, it can be as basic as this one below or ‘Data folder’:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\n\nStep 1: Create a directory for the template.\nStep 2: Write a cookiecutter.json file with variables such as project_name and author.\nStep 3: Set up the folder structure by creating subdirectories and files as needed.\nStep 4: Incorporate cookiecutter variables in the names of files.\nStep 5: Use cookiecutter variables within scripts, such as printing a message that includes the project name." }, { "objectID": "develop/practical_workshop.html#metadata", "href": "develop/practical_workshop.html#metadata", "title": "Practical material", "section": "2. Metadata", - "text": "2. Metadata\nMetadata is the behind-the-scenes information that makes sense of data and gives context and structure. For NGS data, metadata includes information such as when and where the data was collected, what it represents, and how it was processed. Let’s check what kind of relevant metadata is available for NGS data and how to capture it in your Assay or Project folders. Both of these folders contain a metadata.yml file and a README.md file. In this section, we will check what kind of information you should collect in each of these files.\n\n\n\n\n\n\nMetadata and controlled vocabularies\n\n\n\nIn order for metadata to be most useful, you should try to use controlled vocabularies for all your fields. For example, tissue could be described with the UBERON ontologies, species using the NCBI taxonomy, diseases using the Mondo database, etc. Unfortunately, implementing a systematic way of using these vocabularies is rather complex and outside the scope of this workshop, but you are very welcome to try to implement them on your own!\n\n\n\nREADME.md file\nThe README.md file is a markdown file that allows you to write a long description of the data placed in a folder. Since it is a markdown file, you are able to write in rich text format (bold, italic, include links, etc) what is inside the folder, why it was created/collected, and how and when. 
If it is an Assay folder, you could include the laboratory protocol used to generate the samples, images explaining the experiment design, a summary of the results of the experiment, and any sort of comments that would help to understand the context of the experiment. On the other hand, a ‘Project’ README file may contain a description of the project, what are its aims, why is it important, what ‘Assays’ is it using, how to interpret the code notebooks, a summary of the results and, again, any sort of comments that would help to understand the project.\nHere is an example of a README file for a Project folder:\n# NGS Analysis Project: Exploring Gene Expression in Human Tissues\n\n## Aims\n\nThis project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.\n\n## Why It's Important\n\nUnderstanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.\n\n## Datasets\n\nWe have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\n\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.\n\n## Summary of Results\n\nOur analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.\n\nFurthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.\n\n---\n\nFor more details, refer to our [Jupyter Notebook](link-to-jupyter-notebook.ipynb) for the complete analysis pipeline and code.\n\n\nmetadata.yml\nThe metadata file is a yml file, which is a text document that contains data formatted using a human-readable data format for data serialization.\n\n\n\nyaml file example\n\n\n\n\nMetadata fields\nThere is a ton of information you can collect regarding an NGS assay or a project. Some information fields are very general, such as author or date, while others are specific to the Assay or Project folder. 
Below, we will take a look at the minimal information you should collect in each of the folders.\n\nGeneral metadata fields\nHere you can find a list of suggestions for general metadata fields that can be used for both assays and project folders:\n\nTitle: A brief yet informative name for the dataset.\nAuthor(s): The individual(s) or organization responsible for creating the dataset. You can use your ORCID\nDate Created: The date when the dataset was originally generated or compiled. Use YYYY-MM-DD format!\nDescription: A short narrative explaining the content, purpose, and context.\nKeywords: A set of descriptive terms or phrases that capture the folder’s main topics and attributes.\nVersion: The version number or identifier for the folder, useful for tracking changes.\nLicense: The type of license or terms of use associated with the dataset/project.\n\n\n\nAssay metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nassay_ID\nIdentifier for the assay that is at least unique within the project\n<Assay-ID\\>_<keyword\\>_YYYYMMDD\nNA\nCHIP_Oct4_20200101\n\n\nassay_type\nThe type of experiment performed, eg ATAC-seq or seqFISH\nNA\nontology field- e.g. EFO or OBI\nChIPseq\n\n\nassay_subtype\nMore specific type or assay like bulk nascent RNAseq or single cell ATACseq\nNA\nontology field- e.g. EFO or OBI\nbulk ChIPseq\n\n\nowner\nOwner of the assay (who made the experiment?).\n<First Name\\> <Last Name\\>\nNA\nJose Romero\n\n\nplatform\nThe type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform\nNA\nontology field- e.g. EFO or OBI\nIllumina\n\n\nextraction_method\nTechnique used to extract the nucleic acid from the cell\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\nlibrary_method\nTechnique used to amplify a cDNA library\nNA\nontology field- e.g. 
EFO or OBI\nNA\n\n\nexternal_accessions\nAccession numbers from external resources to which assay or protocol information was submitted\nNA\neg protocols.io, AE, GEO accession number, etc\nGSEXXXXX\n\n\nkeyword\nKeyword for easy identification\nwordWord\ncamelCase\nOct4ChIP\n\n\ndate\nDate of assay creation\nYYYYMMDD\nNA\n20200101\n\n\nnsamples\nNumber of samples analyzed in this assay\n<integer\\>\nNA\n9\n\n\nis_paired\nPaired fastq files or not\n<single OR paired\\>\nNA\nsingle\n\n\npipeline\nPipeline used to process data and version\nNA\nNA\nnf-core/chipseq -r 1.0\n\n\nstrandedness\nThe strandedness of the cDNA library\n<+ OR - OR *\\>\nNA\n*\n\n\nprocessed_by\nWho processed the data\n<First Name\\> <Last Name\\>\nNA\nSarah Lundregan\n\n\norganism\nOrganism origin\n<Genus species\\>\nTaxonomy name\nMus musculus\n\n\norigin\nIs internal or external (from a public resources) data\n<internal OR external\\>\nNA\ninternal\n\n\npath\nPath to files\n</path/to/file\\>\nNA\nNA\n\n\nshort_desc\nShort description of the assay\nplain text\nNA\nOct4 ChIP after pERK activation\n\n\nELN_ID\nID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling\nplain text\nNA\nNA\n\n\n\n\n\n\n\n\n\n\nProject metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Project folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nproject\nProject ID\n<surname\\>_et_al_2023\nNA\nproks_et_al_2023\n\n\nauthor\nOwner of the project\n<First name\\> <Surname\\>\nNA\nMartin Proks\n\n\ndate\nDate of creation\nYYYYMMDD\nNA\n20230101\n\n\ndescription\nShort description of the project\nPlain text\nNA\nThis is a project describing the effect of Oct4 perturbation after pERK activation\n\n\n\n\n\n\n\n\n\n\n\nMore info\nThe information provided in this lesson is not at all exhaustive. There might be many more fields and controlled vocabularies that could be useful for your NGS data. We recommend that you take a look at the following sources for more information!\n\nTranscriptomics metadata standards and fields\nBionty: Biological ontologies for data scientists.\n\n\n\n\n\n\n\nExercise 2: modify the metadata.yml files in your Cookiecutter templates\n\n\n\n\n\n\n\nWe have seen some examples of metadata for NGS data. It is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!\n\nThink about what kind of metadata you would like to include.\nModify the cookiecutter.json file so that when you create a new folder template, all the metadata is filled accordingly.\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n\n\n\ncookiecutter_json_example\n\n\n\n\n\n\n\n\nModify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file.\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n\n\n\nassay_metadata_example\n\n\n\n\n\n\n\n\nModify the README.md file so that it includes the short description recorded by the cookiecutter.json file.\nGit add, commit, and push the changes to your template.\nTest your folders by using the command cookiecutter <URL to your cookiecutter repository in GitHub>" + "text": "2. Metadata\nMetadata is the behind-the-scenes information that makes sense of data and gives context and structure. For biodata, metadata includes information such as when and where the data was collected, what it represents, and how it was processed. 
Let’s check what kind of relevant metadata is available for NGS data and how to capture it in your Assay or Project folders. Both of these folders contain a metadata.yml file and a README.md file. In this section, we will check what kind of information you should collect in each of these files.\n\n\n\n\n\n\nMetadata and controlled vocabularies\n\n\n\nIn order for metadata to be most useful, you should try to use controlled vocabularies for all your fields. For example, tissue could be described with the UBERON ontologies, species using the NCBI taxonomy, diseases using the Mondo database, etc. Unfortunately, implementing a systematic way of using these vocabularies is rather complex and outside the scope of this workshop, but you are very welcome to try to implement them on your own!\n\n\n\nREADME.md file\nThe README.md file is a markdown file that allows you to write a long description of the data placed in a folder. Since it is a markdown file, you are able to write in rich text format (bold, italic, include links, etc) what is inside the folder, why it was created/collected, and how and when. If it is an Assay folder, you could include the laboratory protocol used to generate the samples, images explaining the experiment design, a summary of the results of the experiment, and any sort of comments that would help to understand the context of the experiment. On the other hand, a ‘Project’ README file may contain a description of the project, what are its aims, why is it important, what ‘Assays’ is it using, how to interpret the code notebooks, a summary of the results and, again, any sort of comments that would help to understand the project.\nHere is an example of a README file for a Project folder:\n# NGS Analysis Project: Exploring Gene Expression in Human Tissues\n\n## Aims\n\nThis project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.\n\n## Why It's Important\n\nUnderstanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.\n\n## Datasets\n\nWe have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\n\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.\n\n## Summary of Results\n\nOur analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. 
Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.\n\nFurthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.\n\n---\n\nFor more details, refer to our [Jupyter Notebook](link-to-jupyter-notebook.ipynb) for the complete analysis pipeline and code.\n\n\nmetadata.yml\nThe metadata file is a yml file, which is a text document that contains data formatted using a human-readable data format for data serialization.\n\n\n\nyaml file example\n\n\n\n\nMetadata fields\nThere is a ton of information you can collect regarding an NGS assay or a project. Some information fields are very general, such as author or date, while others are specific to the Assay or Project folder. Below, we will take a look at the minimal information you should collect in each of the folders.\n\nGeneral metadata fields\nHere you can find a list of suggestions for general metadata fields that can be used for both assays and project folders:\n\nTitle: A brief yet informative name for the dataset.\nAuthor(s): The individual(s) or organization responsible for creating the dataset. You can use your ORCID\nDate Created: The date when the dataset was originally generated or compiled. Use YYYY-MM-DD format!\nDescription: A short narrative explaining the content, purpose, and context.\nKeywords: A set of descriptive terms or phrases that capture the folder’s main topics and attributes.\nVersion: The version number or identifier for the folder, useful for tracking changes.\nLicense: The type of license or terms of use associated with the dataset/project.\n\n\n\nAssay metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nassay_ID\nIdentifier for the assay that is at least unique within the project\n<Assay-ID\\>_<keyword\\>_YYYYMMDD\nNA\nCHIP_Oct4_20200101\n\n\nassay_type\nThe type of experiment performed, eg ATAC-seq or seqFISH\nNA\nontology field- e.g. EFO or OBI\nChIPseq\n\n\nassay_subtype\nMore specific type or assay like bulk nascent RNAseq or single cell ATACseq\nNA\nontology field- e.g. EFO or OBI\nbulk ChIPseq\n\n\nowner\nOwner of the assay (who made the experiment?).\n<First Name\\> <Last Name\\>\nNA\nJose Romero\n\n\nplatform\nThe type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform\nNA\nontology field- e.g. EFO or OBI\nIllumina\n\n\nextraction_method\nTechnique used to extract the nucleic acid from the cell\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\nlibrary_method\nTechnique used to amplify a cDNA library\nNA\nontology field- e.g. 
EFO or OBI\nNA\n\n\nexternal_accessions\nAccession numbers from external resources to which assay or protocol information was submitted\nNA\neg protocols.io, AE, GEO accession number, etc\nGSEXXXXX\n\n\nkeyword\nKeyword for easy identification\nwordWord\ncamelCase\nOct4ChIP\n\n\ndate\nDate of assay creation\nYYYYMMDD\nNA\n20200101\n\n\nnsamples\nNumber of samples analyzed in this assay\n<integer\\>\nNA\n9\n\n\nis_paired\nPaired fastq files or not\n<single OR paired\\>\nNA\nsingle\n\n\npipeline\nPipeline used to process data and version\nNA\nNA\nnf-core/chipseq -r 1.0\n\n\nstrandedness\nThe strandedness of the cDNA library\n<+ OR - OR *\\>\nNA\n*\n\n\nprocessed_by\nWho processed the data\n<First Name\\> <Last Name\\>\nNA\nSarah Lundregan\n\n\norganism\nOrganism origin\n<Genus species\\>\nTaxonomy name\nMus musculus\n\n\norigin\nIs internal or external (from a public resources) data\n<internal OR external\\>\nNA\ninternal\n\n\npath\nPath to files\n</path/to/file\\>\nNA\nNA\n\n\nshort_desc\nShort description of the assay\nplain text\nNA\nOct4 ChIP after pERK activation\n\n\nELN_ID\nID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling\nplain text\nNA\nNA\n\n\n\n\n\n\n\n\n\n\nProject metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Project folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nproject\nProject ID\n<surname\\>_et_al_2023\nNA\nproks_et_al_2023\n\n\nauthor\nOwner of the project\n<First name\\> <Surname\\>\nNA\nMartin Proks\n\n\ndate\nDate of creation\nYYYYMMDD\nNA\n20230101\n\n\ndescription\nShort description of the project\nPlain text\nNA\nThis is a project describing the effect of Oct4 perturbation after pERK activation\n\n\n\n\n\n\n\n\n\n\n\nMore info\nThe information provided in this lesson is not at all exhaustive. There might be many more fields and controlled vocabularies that could be useful for your NGS data. We recommend that you take a look at the following sources for more information!\n\nTranscriptomics metadata standards and fields\nBionty: Biological ontologies for data scientists.\n\n\n\n\n\n\n\nExercise 2: modify the metadata.yml files in your Cookiecutter templates\n\n\n\n\n\n\n\nWe have seen some examples of metadata for NGS data. It is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!\n\nThink about what kind of metadata you would like to include.\nModify the cookiecutter.json file so that when you create a new folder template, all the metadata is filled accordingly.\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n\n\n\ncookiecutter_json_example\n\n\n\n\n\n\n\n\nModify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file.\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n\n\n\nassay_metadata_example\n\n\n\n\n\n\n\n\nModify the README.md file so that it includes the short description recorded by the cookiecutter.json file.\nGit add, commit, and push the changes to your template.\nTest your folders by using the command cookiecutter <URL to your cookiecutter repository in GitHub>" }, { "objectID": "develop/practical_workshop.html#naming-conventions", @@ -318,21 +318,28 @@ "href": "practical_workflows.html", "title": "Workflows", "section": "", - "text": "Data analyses usually entail the application of various tools, algorithms and scripts. Workflow management handles parallelization, resume, logging and data provenance. 
If you develop your own software make sure you follow FAIR principles. We highly endorse following these FAIR recommendations and to register your computational workflow here.\nUsing workflow managers, you ensure:\n\nautomation\nconvenience\nportability\nreproducibility\nscalability\nreadable\n\nSome of the most popular workflow management systems are snakemake, nextflow and galaxy.\n\n\nText-based using python plus domain specific syntax. The workflow is decompose into rules that are define to obtain output files from input files. It infers dependencies and the execution order.\n\n\n\nDefine rules\nGeneralise the rule: creating wildcards You can refer by index or by name\nDependencies are determined top-down\n\nFor a given target, a rule that can be applied to create it, is determined (a job) For the input files of the rule, go on recursively, If no target is specified, snakemake , tries to apply the first rule\n\nRule all: target rule that collects results\n\n\n\n\nA job is executed if and only if: - otuput file is target and does not exist - output file needed by another executed job and does not exist - input file newer than output file - input file will be updated by other job (eg. changes in rules) - execution is force (‘–force-all’)\nYou can plot the DAG (directed acyclic graph) of the jobs\n\n\n\n# dry-run (-n), print shell commands (-p)\nsnakemake -n -p\n# Snakefile named different in another location \nsnakemake --snakefile path/to/file.smoker\n# dry-run (-n), print execution reason for each job\nsnakemake -n -r\n# Visualise DAG of jobs using Graphviz dot command\nsnakemake --dag | dot -Tsvg > dag.svg\n\n\n\nrule myrule:\n resources: mem_mb= 100 #(100MB memory allocation)\n threads: X\n shell:\n \"command {threads}\"\nLet’s say you defined our rule myrule needs 4 works, if we execute the workflow with 8 cores as follows:\nsnakemake --cores 8\nThis means that 2 ‘myrule’ jobs, will be executed in parallel.\nThe jobs are schedules to maximize parallelization, high priority jobs will be scheduled first, all while satisfying resource constrains. This means:\nIf we allocate 100MB for the execution of ‘myrule’ and we call snakemake as follows:\nsnakemake --resources mem_mb=100 --cores 8\nOnly one ‘myrule’ job can be executed in parallel (you do not provide enough memory resources for 2). The memory resources is useful for jobs that are heavy memory demanding to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time your full workflow will take. We highly recommend doing so, get a subset of your dataset and give it a go! Log files will come very handy for the resource estimation. Of course, the execution of jobs is dependant on the free resources availability (eg. CPU cores).\nrule myrule:\n log: \"logs/myrule.log\"\n threads: X\n shell:\n \"command {threads}\"\nLog files need to define the same wildcards as the output files, otherwise, you will get an error.\n\n\n\nYou can also define values for wildcards or parameters in the config file. This is recommended when the pipeline might be used several times at different time points, to avoid unwanted modifications to the workflow. 
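For instance, a config-driven workflow might look like the sketch below (an illustrative Snakefile only, not taken from the course material; the config keys, sample names, file paths, and the bwa/samtools command are assumptions):

```python
# Snakefile -- minimal sketch of reading workflow parameters from a config file.
# Assumes a config.yaml next to it containing, e.g.:
#   genome: "refs/genome.fa"
#   samples: ["A", "B"]
configfile: "config.yaml"

# Target rule: collects the final outputs for all configured samples.
rule all:
    input:
        expand("results/{sample}.bam", sample=config["samples"])

rule align:
    input:
        fq="data/{sample}.fastq.gz",
        genome=config["genome"]          # parameter taken from the config file
    output:
        "results/{sample}.bam"
    threads: 4
    shell:
        "bwa mem -t {threads} {input.genome} {input.fq} | samtools view -b -o {output} -"
```

With this setup, switching genomes or sample sets only requires editing config.yaml rather than the workflow itself.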
parameterization is key for such cases.\n\n\n\nWhen working from cluster systems you can execute the workflow using -qsub submission command\nsnakemake --cluster qsub \n\n\n\n\nmodularization\nhandling temporary and protected files: very important for intermediate files that filled up our memory and are not used in the long run and can be deleted once the final output is generated. This is automatically done by snakemake if you defined them in your pipeline HTML5 reports\nrule parameters\ntracking tool versions and code changes: will force rerunning older jobs when code and software are modified/updated.\ndata provenance information per file\npython API for embedding snakemake in other tools\n\n\n\n\nBasic file structure\n| - config.yml\n| - requirements.txt (commonly also named environment.txt)\n| - rules/\n| | - myrules.smk\n| - scripts/\n| | - script1.py\n| - Snakefile\nCreate conda environment, one per project!\n# create env\nconda create -n myworklow --file requierments.txt\n# activate environment\nsource activate myworkflow\n# then execute snakemake\nUse git repositories to save your projects and pipelines!\n\n\n\n\n\n\n\n\nhttps://bitbucket.org/johanneskoester/snakemake\nhttps://bioconda.github.io\nKöster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.\nKöster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014." + "text": "Course Overview\n\n\n\n\n⏰ Total Time Estimation: X hours\n\n📁 Supporting Materials:\n\n👨‍💻 Target Audience: Ph.D., MSc, anyone interested in workflow management systems for High-Throughput data or other related fields within bioinformatics.\n👩‍🎓 Level: Advanced.\n🔒 License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.\n\n💰 Funding: This project was funded by the Novo Nordisk Fonden (NNF20OC0063268)." }, { "objectID": "practical_workflows.html#snakemake", "href": "practical_workflows.html#snakemake", "title": "Workflows", - "section": "", - "text": "Text-based using python plus domain specific syntax. The workflow is decompose into rules that are define to obtain output files from input files. It infers dependencies and the execution order.\n\n\n\nDefine rules\nGeneralise the rule: creating wildcards You can refer by index or by name\nDependencies are determined top-down\n\nFor a given target, a rule that can be applied to create it, is determined (a job) For the input files of the rule, go on recursively, If no target is specified, snakemake , tries to apply the first rule\n\nRule all: target rule that collects results\n\n\n\n\nA job is executed if and only if: - otuput file is target and does not exist - output file needed by another executed job and does not exist - input file newer than output file - input file will be updated by other job (eg. 
changes in rules) - execution is force (‘–force-all’)\nYou can plot the DAG (directed acyclic graph) of the jobs\n\n\n\n# dry-run (-n), print shell commands (-p)\nsnakemake -n -p\n# Snakefile named different in another location \nsnakemake --snakefile path/to/file.smoker\n# dry-run (-n), print execution reason for each job\nsnakemake -n -r\n# Visualise DAG of jobs using Graphviz dot command\nsnakemake --dag | dot -Tsvg > dag.svg\n\n\n\nrule myrule:\n resources: mem_mb= 100 #(100MB memory allocation)\n threads: X\n shell:\n \"command {threads}\"\nLet’s say you defined our rule myrule needs 4 works, if we execute the workflow with 8 cores as follows:\nsnakemake --cores 8\nThis means that 2 ‘myrule’ jobs, will be executed in parallel.\nThe jobs are schedules to maximize parallelization, high priority jobs will be scheduled first, all while satisfying resource constrains. This means:\nIf we allocate 100MB for the execution of ‘myrule’ and we call snakemake as follows:\nsnakemake --resources mem_mb=100 --cores 8\nOnly one ‘myrule’ job can be executed in parallel (you do not provide enough memory resources for 2). The memory resources is useful for jobs that are heavy memory demanding to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time your full workflow will take. We highly recommend doing so, get a subset of your dataset and give it a go! Log files will come very handy for the resource estimation. Of course, the execution of jobs is dependant on the free resources availability (eg. CPU cores).\nrule myrule:\n log: \"logs/myrule.log\"\n threads: X\n shell:\n \"command {threads}\"\nLog files need to define the same wildcards as the output files, otherwise, you will get an error.\n\n\n\nYou can also define values for wildcards or parameters in the config file. This is recommended when the pipeline might be used several times at different time points, to avoid unwanted modifications to the workflow. parameterization is key for such cases.\n\n\n\nWhen working from cluster systems you can execute the workflow using -qsub submission command\nsnakemake --cluster qsub \n\n\n\n\nmodularization\nhandling temporary and protected files: very important for intermediate files that filled up our memory and are not used in the long run and can be deleted once the final output is generated. This is automatically done by snakemake if you defined them in your pipeline HTML5 reports\nrule parameters\ntracking tool versions and code changes: will force rerunning older jobs when code and software are modified/updated.\ndata provenance information per file\npython API for embedding snakemake in other tools\n\n\n\n\nBasic file structure\n| - config.yml\n| - requirements.txt (commonly also named environment.txt)\n| - rules/\n| | - myrules.smk\n| - scripts/\n| | - script1.py\n| - Snakefile\nCreate conda environment, one per project!\n# create env\nconda create -n myworklow --file requierments.txt\n# activate environment\nsource activate myworkflow\n# then execute snakemake\nUse git repositories to save your projects and pipelines!" + "section": "Snakemake", + "text": "Snakemake\nIt is a text-based tool using python-based language plus domain specific syntax. The workflow is decompose into rules that are define to obtain output files from input files. 
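To make this concrete, a minimal Snakefile along these lines might look like the sketch below (purely illustrative; the sample names, paths, and the samtools command are assumptions, not part of the course material):

```python
# Snakefile -- a minimal sketch: a target rule that collects results,
# plus one generalised rule that uses a {sample} wildcard.
SAMPLES = ["A", "B"]

rule all:
    input:
        expand("results/{sample}.sorted.bam", sample=SAMPLES)

rule sort_bam:
    input:
        "data/{sample}.bam"
    output:
        "results/{sample}.sorted.bam"
    log:
        "logs/sort_{sample}.log"   # the log defines the same wildcard as the output
    threads: 4
    shell:
        "samtools sort -@ {threads} -o {output} {input} 2> {log}"
```

Running snakemake --cores 8 on this file would build everything requested by rule all, scheduling up to two sort_bam jobs in parallel since each declares 4 threads.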
\n\n\nJob execution\nA job is executed if and only if: - the output file is a target and does not exist - the output file is needed by another executed job and does not exist - the input file is newer than the output file - the input file will be updated by another job (e.g. changes in rules) - execution is forced (‘--forceall’)\nYou can plot the DAG (directed acyclic graph) of the jobs\n\n\nUseful command line interface\n# dry-run (-n), print shell commands (-p)\nsnakemake -n -p\n# Snakefile named differently or located elsewhere\nsnakemake --snakefile path/to/file.smk\n# dry-run (-n), print execution reason for each job\nsnakemake -n -r\n# Visualise the DAG of jobs using the Graphviz dot command\nsnakemake --dag | dot -Tsvg > dag.svg\n\n\nDefining resources\nrule myrule:\n resources: mem_mb=100 # (100MB memory allocation)\n threads: X\n shell:\n \"command {threads}\"\nLet’s say we defined that our rule myrule needs 4 threads. If we execute the workflow with 8 cores as follows:\nsnakemake --cores 8\nThis means that 2 ‘myrule’ jobs will be executed in parallel.\nThe jobs are scheduled to maximize parallelization, and high-priority jobs will be scheduled first, all while satisfying the resource constraints. This means:\nIf we allocate 100MB for the execution of ‘myrule’ and we call snakemake as follows:\nsnakemake --resources mem_mb=100 --cores 8\nOnly one ‘myrule’ job can be executed at a time (we do not provide enough memory resources for 2). The memory resource is useful for memory-demanding jobs, to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time your full workflow will take. We highly recommend doing so: take a subset of your dataset and give it a go! Log files will come in very handy for the resource estimation. Of course, the execution of jobs also depends on the resources that are actually free (e.g. CPU cores).\nrule myrule:\n log: \"logs/myrule.log\"\n threads: X\n shell:\n \"command {threads}\"\nLog files need to define the same wildcards as the output files; otherwise, you will get an error.\n\n\nConfig files\nYou can also define values for wildcards or parameters in the config file. This is recommended when the pipeline might be used several times at different time points, to avoid unwanted modifications to the workflow itself. Parameterization is key for such cases.
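\nAs a small sketch (the keys, values and paths are hypothetical), a config.yml and the matching lines of the Snakefile could look like this:\n# config.yml\nsamples: [\"A\", \"B\"]\noutdir: \"results\"\n# Snakefile\nconfigfile: \"config.yml\"\nrule all:\n  input:\n    expand(config[\"outdir\"] + \"/{sample}.txt\", sample=config[\"samples\"])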
\n\nCluster execution\nWhen working on cluster systems you can execute the workflow using the qsub submission command\nsnakemake --cluster qsub \n\n\nAdditional advanced features\n\nmodularization\nhandling temporary and protected files: very important for intermediate files that fill up our storage, are not used in the long run, and can be deleted once the final output is generated. This is done automatically by Snakemake if you define them as such in your pipeline\nHTML5 reports\nrule parameters\ntracking tool versions and code changes: this will force rerunning older jobs when code and software are modified/updated.\ndata provenance information per file\nPython API for embedding Snakemake in other tools\n\n\n\nCreate an isolated environment to install dependencies\nBasic file structure\n| - config.yml\n| - requirements.txt (commonly also named environment.txt)\n| - rules/\n| | - myrules.smk\n| - scripts/\n| | - script1.py\n| - Snakefile\nCreate a conda environment, one per project!\n# create env\nconda create -n myworkflow --file requirements.txt\n# activate environment\nconda activate myworkflow\n# then execute snakemake\nUse git repositories to save your projects and pipelines!" }, { "objectID": "practical_workflows.html#nextflow", "href": "practical_workflows.html#nextflow", "title": "Workflows", "section": "Nextflow", "text": "Nextflow" }, { "objectID": "practical_workflows.html#sources", "href": "practical_workflows.html#sources", "title": "Workflows", "section": "Sources", "text": "Sources\n\nSnakemake tutorial\nSnakemake tutorial slides by Johannes Köster\nhttps://bioconda.github.io\nKöster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.\nKöster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014."
}, { "objectID": "cards/AlbaMartinez.html", diff --git a/sitemap.xml b/sitemap.xml index 258d56c5..77972346 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,74 +2,74 @@ https://hds-sandbox.github.io/RDM_NGS_course/use_cases.html - 2024-04-22T07:46:43.288Z + 2024-04-25T07:09:28.032Z https://hds-sandbox.github.io/RDM_NGS_course/develop/06_pipelines.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/practical_workshop.html - 2024-04-22T07:46:43.288Z + 2024-04-25T07:09:28.032Z https://hds-sandbox.github.io/RDM_NGS_course/develop/04_metadata.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/05_VC.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/07_repos.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/examples/proteomics_metadata.html - 2024-04-22T07:46:43.260Z + 2024-04-25T07:09:28.008Z https://hds-sandbox.github.io/RDM_NGS_course/develop/examples/NGS_management.html - 2024-04-22T07:46:43.260Z + 2024-04-25T07:09:28.008Z https://hds-sandbox.github.io/RDM_NGS_course/cards/JARomero.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/practical_workflows.html - 2024-04-22T07:46:43.288Z + 2024-04-25T07:09:28.032Z https://hds-sandbox.github.io/RDM_NGS_course/cards/AlbaMartinez.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/examples/NGS_OS_FAIR.html - 2024-04-22T07:46:43.260Z + 2024-04-25T07:09:28.008Z https://hds-sandbox.github.io/RDM_NGS_course/develop/examples/NGS_metadata.html - 2024-04-22T07:46:43.260Z + 2024-04-25T07:09:28.008Z https://hds-sandbox.github.io/RDM_NGS_course/develop/contributors.html - 2024-04-22T07:46:43.260Z + 2024-04-25T07:09:28.008Z https://hds-sandbox.github.io/RDM_NGS_course/develop/03_DOD.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/01_RDM_intro.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/develop/02_DMP.html - 2024-04-22T07:46:43.240Z + 2024-04-25T07:09:27.992Z https://hds-sandbox.github.io/RDM_NGS_course/index.html - 2024-04-22T07:46:43.288Z + 2024-04-25T07:09:28.032Z diff --git a/use_cases.html b/use_cases.html index 662b1ec0..3c181f2c 100644 --- a/use_cases.html +++ b/use_cases.html @@ -236,7 +236,7 @@

      RDM use cases

      Modified
      -

      April 22, 2024

      +

      April 25, 2024