4. Documentation for biodata
5. Data Analysis with Version Control
6. Processing and analyzing biodata
7. Storing and sharing biodata
Practical material
Applied Open Science and FAIR principles to NGS
NGS data strategies
NGS Assay and Project metadata
diff --git a/develop/images/fork_repo_project.png b/develop/images/fork_repo_project.png
index 94ab2629..3d5ec960 100644
Binary files a/develop/images/fork_repo_project.png and b/develop/images/fork_repo_project.png differ
diff --git a/develop/practical_workshop.html b/develop/practical_workshop.html
index 06debd4f..1296e34b 100644
--- a/develop/practical_workshop.html
+++ b/develop/practical_workshop.html
@@ -172,10 +172,8 @@
On this page
- - 1. Organize and structure your NGS data and data analysis
+
- 1. Organize and structure your datasets and data analysis
- 2. Metadata
@@ -227,7 +225,7 @@
Practical material
@@ -252,15 +250,15 @@
Practical material
💬 Learning Objectives:
- Organize and structure your data and data analysis with Cookiecutter templates
-- Establish metadata fields and collect metadata when creating a cookiecutter folder
+- Define metadata fields and collect metadata when creating a Cookiecutter folder
- Establish naming conventions for your data
-- Make a catalog of your data
-- Create GitHub repositories of your data analysis and display them as GitHub Pages
+- Create a catalog of your data
+- Use GitHub repositories for your data analysis and display them as GitHub Pages
- Archive GitHub repositories on Zenodo
-This is a practical version of the full RDM on NGS data workshop. The main key points of the exercises shown here are to help you organize and structure your NGS datasets and your data analyses. We will see how to keep track of your experiments metadata and how to safely version control and archive your data analyses using GitHub repositories and Zenodo. We hope that through these practical exercises and step-by-step guidance, you’ll gain valuable skills in efficiently managing and sharing your research data, enhancing the reproducibility and impact of your work.
+This practical version of the workshop covers the hands-on aspects of RDM applied to biodata. The exercises provided here aim to help you organize and structure your datasets and data analyses. You’ll learn how to manage your experimental metadata effectively and how to safely version control and archive your data analyses using GitHub repositories and Zenodo. Through these guided exercises and step-by-step instructions, we hope you will acquire essential skills for managing and sharing your research data efficiently, enhancing the reproducibility and impact of your work.
-Ensure that all necessary tools and software are installed before proceeding with the practical exercises.
+Ensure all necessary tools and software are installed before beginning the practical exercises:
-Cookicutter to create folder structure templates (pip install cookiecutter)
-cruft to version control your templates (pip install cruft)
-Option a. Install Quarto. We recommend Quarto as is easy to use and provides native support for notebooks (R Markdown and Jupyter Notebooks). No extensions or dependencies are needed.
-Option b. Install MkDocs and MkDocs extensions using the command line.
+- A GitHub account for hosting and collaborating on projects
+- Git for version control of your projects
+- A Zenodo account for archiving and sharing your research outputs
+- Python
+- pip for managing Python packages
+- Cookiecutter for creating folder structure templates (`pip install cookiecutter`)
+- cruft to version control your templates (`pip install cruft`)
+
+One of the following two tools is also required; choose the one you are already familiar with, or default to the first option:
+
+- Option a. Install Quarto. We recommend Quarto as it is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.
+- Option b. Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.
pip install mkdocs # create webpages
pip install mkdocs-material # customize webpages
@@ -289,108 +290,208 @@ Practical material
pip install mkdocs-minify-plugin # Minimize html code
pip install mkdocs-git-revision-date-localized-plugin # display last updated date
pip install mkdocs-jupyter # include Jupyter notebooks
-pip install mkdocs-table-reader-plugin
-pip install mkdocs-bibtex # add references in your text (`.bib`)
-pip install neoteroi-mkdocs # create author cards
-pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
+pip install mkdocs-bibtex # add references in your text (`.bib`)
+pip install neoteroi-mkdocs # create author cards
+pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
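As a quick sanity check before starting, you can confirm the tools are available from the command line. This is a minimal sketch; adjust it to your own setup and skip the option you did not install:

git --version
python --version
pip --version
pip show cookiecutter cruft   # confirms both packages are installed
quarto --version              # if you chose option a
mkdocs --version              # if you chose option b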
-
-1. Organize and structure your NGS data and data analysis
-Applying a consistent file structure and naming conventions to your files will help you to efficiently manage your data. We will divide your NGS data and data analyses into two different types of folders:
+
+1. Organize and structure your datasets and data analysis
+Establishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:
-- Assay folders: These folders contain the raw and processed NGS datasets, as well as the pipeline/workflow used to generate the processed data, provenance of the raw data, and quality control reports of the data. This data should be locked and read-only to prevent unwanted modifications.
-- Project folders: These folders contain all the necessary files for a specific research project. A project may use several assays or results from other projects. The assay data should not be copied or duplicated, but linked from the source.
+- Data folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources; when you download external data yourself, include an MD5 checksum file so its integrity can be verified.
+- Project folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated; instead, it should be linked directly from the source.
-Projects and Assays are separated from each other because a project may use one or more assays to answer a scientific question, and assays may be reused several times in different projects. This could be, for example, all the data analysis related to a publication (an RNAseq and a ChIPseq experiment), or a comparison between a previous ATACseq experiment (which was used for a older project) with a new laboratory protocol.
-You could also create Genomic resources folders things such as genome references (fasta files) and annotations (gtf files) for different species, as well as indexes for different alignment algorithms. If you want to know more, feel free to check the relevant full lesson
+Data and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.
+
+
+Data folders
+Whether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.
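As a minimal sketch of the locking step mentioned above (assuming a Unix-like system; the folder name is hypothetical), you could remove write permissions once preprocessing is complete:

# hypothetical dataset folder; adjust to your own naming convention
chmod -R a-w rnaseq_oct4_20240101/raw rnaseq_oct4_20240101/processed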
+
+Let’s explore a potential folder structure and the types of files you might encounter within it.
+<data_type>_<keyword>_YYYYMMDD/
+├── README.md
+├── CHECKSUMS
+├── pipeline
+│   ├── pipeline.md
+│   └── scripts/
+├── processed
+│   ├── fastqc/
+│   ├── multiqc/
+│   └── final_fastq/
+└── raw
+    ├── .fastq.gz
+    └── samplesheet.csv
-- README.md: Long description of the assay in markdown format. It should contain provenance of the raw NGS data (samples, laboratory protocols used, the aim of the assay, etc)
-- metadata.yml: metadata file for the assay describing different keys and important information regarding that assay (see this lesson).
-- pipeline.md: description of the pipeline used to process raw data, as well as the commands used to run the pipeline.
-- processed: folder with results of the preprocessing pipeline. Contents depend on the pipeline used.
-- raw: folder with the raw data.
+
- README.md: This file contains a detailed description of the dataset, typically in markdown format. It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).
+- metadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.
+- pipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.
+- processed: This folder contains the results from the preprocessing pipeline. The contents vary depending on the specific pipeline used (create additional subdirectories as needed).
+- raw: This folder holds the raw data.
-- .fastq.gz:In the case of NGS assays, there should be fastq files.
-- samplesheet.csv: file that contains metadata information for the samples. This file is used to run the nf-core pipelines. You can also add extra columns with info regarding the experimental variables and batches so it can be used for downstream analysis as well.
+- .fastq.gz: For example, in NGS assays, the raw data will consist of ‘fastq’ files.
+- samplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.
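The CHECKSUMS file listed in the tree above can be generated and verified with standard command-line tools. A minimal sketch, assuming GNU coreutils and the layout shown above:

# run inside the dataset folder to record checksums of the raw files
md5sum raw/*.fastq.gz raw/samplesheet.csv > CHECKSUMS
# later, verify that nothing has changed or been corrupted
md5sum -c CHECKSUMS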
-
-Project folder
-On the other hand, we have the other type of folder called Projects
. In this folder, you will save a subfolder for each project that you (or your lab) work on. Each Project
subfolder will contain project information and all the data analysis notebooks and scripts used in that project.
-As like for an Assay folder, the Project folder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. For example, you could name it after the main author’s initials, a keyword that represents a unique descriptive element of that assay, and the date:
-<author_initials>_<keyword>_YYYYMMDD
-For example, JARH_Oct4_20230101
, is a project about the gene Oct4 owned by Jose Alejandro Romero Herrera, created on the 1st of January of 2023.
-Next, let’s take a look at a possible folder structure and what kind of files you can find there.
-<author_initials>_<keyword>_YYYYMMDD
+
+Project folders
+On the other hand, we have another type of folder called Projects, which refers to data analyses tied to particular tasks, such as preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file listing all the software and dependencies required for the project, including their versions. This helps ensure that the analyses can be easily replicated and shared with others.
+The Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:
+<project>_<keyword>_YYYYMMDD
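For instance, a hypothetical project analyzing an RNAseq dataset for an Oct4 study, started on 1 June 2024, could be named:
RNAseq_Oct4_20240601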
+
+Now, let’s explore an example of a folder structure and the types of files you might encounter within it.
+<project>_<keyword>_YYYYMMDD
 ├── data
-│   └── <Assay-ID>_<keyword>_YYYYMMDD/
+│   └── <ID>_<keyword>_YYYYMMDD  <- symbolic link
 ├── documents
-│   └── Non-sensitive_NGS_research_project_template.docx
-├── notebooks
-│   └── 01_data_analysis.rmd
-├── README.md
-├── reports
-│   ├── figures
-│   │   └── 01_data_analysis/
-│   │       └── heatmap_sampleCor_20230102.png
-│   └── 01_data_analysis.html
-├── requirements.txt
-├── results
-│   └── 01_data_analysis/
-│       └── DEA_treat-control_LFC1_p01.tsv
-├── scripts
-└── metadata.yml
+│   └── research_project_template.docx
+├── metadata.yml
+├── notebooks
+│   ├── 01_data_processing.rmd
+│   ├── 02_data_analysis.rmd
+│   └── 03_data_visualization.rmd
+├── README.md
+├── reports
+│   ├── 01_data_processing.html
+│   ├── 02_data_analysis.html
+│   ├── 03_data_visualization.html
+│   ├── figures
+│   └── tables
+├── requirements.txt // env.yaml
+├── results
+│   ├── figures
+│   │   └── 02_data_analysis/
+│   │       └── heatmap_sampleCor_20230102.png
+│   └── tables
+│       └── 02_data_analysis/
+│           ├── DEA_treat-control_LFC1_p01.tsv
+│           └── SumStats_sampleCor_20230102.tsv
+├── pipeline
+│   ├── rules // processes
+│   │   └── step1_data_processing.smk
+│   └── pipeline.md
+├── scratch
+└── scripts
-- data: a folder that contains symlinks or shortcuts to where the data is, avoiding copying and modification of original files.
-- documents: a folder containing Word documents, slides, or PDFs related to the project, such as explanations of the data or project, papers, etc. It also contains your Data Management Plan.
+
- data: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.
+- documents: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. It also includes the Data Management Plan.
-- Non-sensitive_NGS_research_project_template.docx. This is a pre-filled Data Management Plan based on the Horizon Europe guidelines.
+- research_project_template.docx: If you download our template, you will find a pre-filled Data Management Plan based on the Horizon Europe guidelines, named ‘Non-sensitive_NGS_research_project_template.docx’.
-- notebooks: a folder containing Jupyter, R markdown, or Quarto notebooks with the actual data analysis.
-- README.md: detailed description of the project in markdown format.
-- reports: notebooks rendered as HTML/docx/pdf versions, ideal for sharing with colleagues and also as a formal report of the data analysis procedure.
+
- metadata.yml: metadata file describing various keys of the project or experiment (see this lesson).
+- notebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.
+- README.md: A detailed project description in markdown or plain-text format.
+- reports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.
- figures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.
-- requirements.txt: file explaining what software and libraries/packages and their versions are necessary to reproduce the code.
-- results: results from the data analysis, such as tables with differentially expressed genes, enrichment results, etc.
-- scripts: folder containing helper scripts needed to run data analysis or reproduce the work of the folder
-- description.yml: a short description of the project.
-- metadata.yml: metadata file for the assay describing different keys (see this lesson).
+- requirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.
+- results: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.
+- pipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.
+- scratch: A folder designated for temporary files or workspace for experiments and development.
+- scripts: Folder for helper scripts needed to run data analysis or reproduce the work.
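Putting a few of these pieces together, setting up a new project folder might look like the following minimal sketch (paths and names are hypothetical; adjust them to your own conventions):

# link the dataset into the project instead of copying it
ln -s /path/to/datasets/rnaseq_oct4_20240101 data/rnaseq_oct4_20240101
# record the software environment
pip freeze > requirements.txt
# or, if you work with conda environments:
conda env export > env.yaml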
Template engine
-It is very easy to create a folder template using cookiecutter. Cookiecutter is a command-line utility that creates projects from cookiecutters (that is, a template), e.g. creating a Python package project from a Python package project template. Here you can find an example of a cookiecutter folder template-directed to NGS data, where we have applied the structures explained in the previous sections. You are very welcome to adapt it or modify it to your needs!
+Creating a folder template is straightforward with cookiecutter, a command-line tool that generates projects from templates (called cookiecutters). For example, it can help you set up a Python package project based on a Python package project template.
+
+
+
+Cookiecutter templates
+
+
+
+
+Here are some templates that you can use to get started; adapt and modify them to your own needs:
+
+Create your own template from scratch.
+
+
-
These are the questions users will be asked when generating a project based on your template. The values provided here will be used to replace the corresponding placeholders in the template files.
-In addition to replacing placeholders in files and directory names, Cookiecutter can also automatically fill in information within the contents of text files. This can be useful for providing default configurations or templates for code files. Let’s extend our previous example to include a placeholder inside a text file:
+When users generate a project based on your template, they will be prompted with these questions. The provided values (“responses”) will be used to substitute the placeholders in your template files.
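As an illustration, a minimal, hypothetical cookiecutter.json for the template above could look like this (the field names are examples, not the exact ones used in our templates):

my_template/cookiecutter.json
{
  "project_name": "My Project",
  "author_initials": "ABC",
  "date": "20240601"
}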
+Beyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:
First, modify the my_template/main.py
file to include a placeholder inside its contents:
# main.py
def hello():
print("Hello, {{cookiecutter.project_name}}!")
-Now, the {cookiecutter.project_name}
placeholder is inside the main.py
file. When you run Cookiecutter, it will automatically replace the placeholders not only in file and directory names but also within the contents of text files. After running Cookiecutter, your generated main.py
file might look like this:
+The ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.
+After running Cookiecutter, your generated ‘main.py’ file could appear as follows:
# main.py
def hello():
@@ -415,15 +517,15 @@ Step 2: Cr
-Step 4: Explore the Generated Project
-Once the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will see a project structure with the placeholders replaced by the values you provided.
+
+Step 4: Review the Generated Project
+After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.
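A quick way to review the result from the command line (a minimal sketch; my_project stands in for whatever project name you provided at the prompts):

tree my_project               # tree may need to be installed separately
find my_project -maxdepth 2   # alternative if tree is not available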
-
Project metadata f
-
-
@@ -1358,23 +1358,23 @@ Assay metadata field
-
-
@@ -1962,23 +1962,23 @@ Assay metadata field
-
-
diff --git a/develop/images/fork_repo_project.png b/develop/images/fork_repo_project.png
index 94ab2629..3d5ec960 100644
Binary files a/develop/images/fork_repo_project.png and b/develop/images/fork_repo_project.png differ
diff --git a/develop/practical_workshop.html b/develop/practical_workshop.html
index 06debd4f..1296e34b 100644
--- a/develop/practical_workshop.html
+++ b/develop/practical_workshop.html
@@ -172,10 +172,8 @@
On this page
- - 1. Organize and structure your NGS data and data analysis
+
- 1. Organize and structure your datasets and data analysis
- 2. Metadata
@@ -227,7 +225,7 @@
Practical material
@@ -252,15 +250,15 @@
Practical material
💬 Learning Objectives:
- Organize and structure your data and data analysis with Cookiecutter templates
-- Establish metadata fields and collect metadata when creating a cookiecutter folder
+- Define metadata fields and collect metadata when creating a Cookiecutter folder
- Establish naming conventions for your data
-- Make a catalog of your data
-- Create GitHub repositories of your data analysis and display them as GitHub Pages
+- Create a catalog of your data
+- Use GitHub repositories of your data analysis and display them as GitHub Pages
- Archive GitHub repositories on Zenodo
-This is a practical version of the full RDM on NGS data workshop. The main key points of the exercises shown here are to help you organize and structure your NGS datasets and your data analyses. We will see how to keep track of your experiments metadata and how to safely version control and archive your data analyses using GitHub repositories and Zenodo. We hope that through these practical exercises and step-by-step guidance, you’ll gain valuable skills in efficiently managing and sharing your research data, enhancing the reproducibility and impact of your work.
+This practical version covers practical aspects of RDM applied to biodata. The exercises provided here aim to help you organize and structure your datasets and data analyses. You’ll learn how to manage your experimental metadata effectively and safely version control and archive your data analyses using GitHub repositories and Zenodo. Through these guided exercises and step-by-step instructions, we hope you will acquire essential skills for managing and sharing your research data efficiently, thereby enhancing the reproducibility and impact of your work.
-Ensure that all necessary tools and software are installed before proceeding with the practical exercises.
+Ensure all necessary tools and software are installed before beginning the practical exercises:
-
-
-
-
-
-Cookicutter to create folder structure templates (pip install cookiecutter
)
-cruft to version control your templates (pip install cruft
)
-Option a. Install Quarto. We recommend Quarto as is easy to use and provides native support for notebooks (R Markdown and Jupyter Notebooks). No extensions or dependencies are needed.
-Option b. Install MkDocs and MkDocs extensions using the command line.
+- A GitHub account for hosting and collaborating on projects
+- Git for version control of your projects
+- A Zenodo account for archiving and sharing your research outputs
+- Python
+- pip for managing Python packages
+- Cookicutter for creating folder structure templates (
pip install cookiecutter
)
+- cruft to version control your templates (
pip install cruft
)
+
+Two more tools will be required, choose the one you are familiar with or the first option:
+
+- Option a. Install Quarto. We recommend Quarto as is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.
+- Option b. Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.
pip install mkdocs # create webpages
pip install mkdocs-material # customize webpages
@@ -289,108 +290,208 @@ Practical material
pip install mkdocs-minify-plugin # Minimize html code
pip install mkdocs-git-revision-date-localized-plugin # display last updated date
pip install mkdocs-jupyter # include Jupyter notebooks
-pip install mkdocs-table-reader-plugin
-pip install mkdocs-bibtex # add references in your text (`.bib`)
-pip install neoteroi-mkdocs # create author cards
-pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
+pip install mkdocs-bibtex # add references in your text (`.bib`)
+pip install neoteroi-mkdocs # create author cards
+pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
-
-1. Organize and structure your NGS data and data analysis
-Applying a consistent file structure and naming conventions to your files will help you to efficiently manage your data. We will divide your NGS data and data analyses into two different types of folders:
+
+1. Organize and structure your datasets and data analysis
+Establishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:
-- Assay folders: These folders contain the raw and processed NGS datasets, as well as the pipeline/workflow used to generate the processed data, provenance of the raw data, and quality control reports of the data. This data should be locked and read-only to prevent unwanted modifications.
-- Project folders: These folders contain all the necessary files for a specific research project. A project may use several assays or results from other projects. The assay data should not be copied or duplicated, but linked from the source.
+- Data folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.
+- Project folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.
-Projects and Assays are separated from each other because a project may use one or more assays to answer a scientific question, and assays may be reused several times in different projects. This could be, for example, all the data analysis related to a publication (an RNAseq and a ChIPseq experiment), or a comparison between a previous ATACseq experiment (which was used for a older project) with a new laboratory protocol.
-You could also create Genomic resources folders things such as genome references (fasta files) and annotations (gtf files) for different species, as well as indexes for different alignment algorithms. If you want to know more, feel free to check the relevant full lesson
+Data and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.
+
+
+Data folders
+Whether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.
+
+Let’s explore a potential folder structure and the types of files you might encounter within it.
+<data_type>_<keyword>_YYYYMMDD/
+├── README.md
+├── CHECKSUMS
+├── pipeline
+├── pipeline.md
+ ├── scripts/
+ ├── processed
+├── fastqc/
+ ├── multiqc/
+ ├── final_fastq/
+ └── raw
+├── .fastq.gz
+ └── samplesheet.csv
-- README.md: Long description of the assay in markdown format. It should contain provenance of the raw NGS data (samples, laboratory protocols used, the aim of the assay, etc)
-- metadata.yml: metadata file for the assay describing different keys and important information regarding that assay (see this lesson).
-- pipeline.md: description of the pipeline used to process raw data, as well as the commands used to run the pipeline.
-- processed: folder with results of the preprocessing pipeline. Contents depend on the pipeline used.
-- raw: folder with the raw data.
+
- README.md: This file contains a detailed description of the dataset commonly in markdown format. It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).
+- metadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.
+- pipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.
+- processed: This folder contains the results from the preprocessing pipeline. The content vary depending on the specific pipeline used (create additional subdirectories as needed).
+- raw: This folder holds the raw data.
-- .fastq.gz:In the case of NGS assays, there should be fastq files.
-- samplesheet.csv: file that contains metadata information for the samples. This file is used to run the nf-core pipelines. You can also add extra columns with info regarding the experimental variables and batches so it can be used for downstream analysis as well.
+- .fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.
+- samplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.
-
-Project folder
-On the other hand, we have the other type of folder called Projects
. In this folder, you will save a subfolder for each project that you (or your lab) work on. Each Project
subfolder will contain project information and all the data analysis notebooks and scripts used in that project.
-As like for an Assay folder, the Project folder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. For example, you could name it after the main author’s initials, a keyword that represents a unique descriptive element of that assay, and the date:
-<author_initials>_<keyword>_YYYYMMDD
-For example, JARH_Oct4_20230101
, is a project about the gene Oct4 owned by Jose Alejandro Romero Herrera, created on the 1st of January of 2023.
-Next, let’s take a look at a possible folder structure and what kind of files you can find there.
-<author_initials>_<keyword>_YYYYMMDD
+
+Project folders
+On the other hand, we have another type of folder called Projects which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. This helps ensure that the analyses can be easily replicated and shared with others.
+The Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:
+<project>_<keyword>_YYYYMMDD
+
+Now, let’s explore an example of a folder structure and the types of files you might encounter within it.
+<project>_<keyword>_YYYYMMDD
├── data
-│ └── <Assay-ID>_<keyword>_YYYYMMDD/
+│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link
├── documents
-│ └── Non-sensitive_NGS_research_project_template.docx
-├── notebooks
-│ └── 01_data_analysis.rmd
-├── README.md
-├── reports
-│ ├── figures
-│ │ └── 01_data_analysis/
-│ │ └── heatmap_sampleCor_20230102.png
-│ └── 01_data_analysis.html
-├── requirements.txt
-├── results
-│ └── 01_data_analysis/
-│ └── DEA_treat-control_LFC1_p01.tsv
-├── scripts
-└── metadata.yml
+│ └── research_project_template.docx
+├── metadata.yml
+├── notebooks
+│ └── 01_data_processing.rmd
+│ └── 02_data_analysis.rmd
+│ └── 03_data_visualization.rmd
+├── README.md
+├── reports
+│ └── 01_data_processing.html
+│ └── 02_data_analysis.html
+│ ├── 03_data_visualization.html
+│ │ └── figures
+│ │ └── tables
+├── requirements.txt // env.yaml
+├── results
+│ ├── figures
+│ │ └── 02_data_analysis/
+│ │ └── heatmap_sampleCor_20230102.png
+│ ├── tables
+│ │ └── 02_data_analysis/
+│ │ └── DEA_treat-control_LFC1_p01.tsv
+│ │ └── SumStats_sampleCor_20230102.tsv
+├── pipeline
+│ ├── rules // processes
+│ │ └── step1_data_processing.smk
+│ └── pipeline.md
+├── scratch
+└── scripts
-- data: a folder that contains symlinks or shortcuts to where the data is, avoiding copying and modification of original files.
-- documents: a folder containing Word documents, slides, or PDFs related to the project, such as explanations of the data or project, papers, etc. It also contains your Data Management Plan.
+
- data: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.
+- documents: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. It also includes the Data Management Plan.
-- Non-sensitive_NGS_research_project_template.docx. This is a pre-filled Data Management Plan based on the Horizon Europe guidelines.
+- research_project_template.docx. If you download our template you will find a is a pre-filled Data Management Plan based on the Horizon Europe guidelines named ‘Non-sensitive_NGS_research_project_template.docx’.
-- notebooks: a folder containing Jupyter, R markdown, or Quarto notebooks with the actual data analysis.
-- README.md: detailed description of the project in markdown format.
-- reports: notebooks rendered as HTML/docx/pdf versions, ideal for sharing with colleagues and also as a formal report of the data analysis procedure.
+
- metadata.yml: metadata file describing various keys of the project or experiment (see this lesson).
+- notebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.
+- README.md: A detailed project description in markdown or plain-text format.
+- reports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.
- figures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.
-- requirements.txt: file explaining what software and libraries/packages and their versions are necessary to reproduce the code.
-- results: results from the data analysis, such as tables with differentially expressed genes, enrichment results, etc.
-- scripts: folder containing helper scripts needed to run data analysis or reproduce the work of the folder
-- description.yml: a short description of the project.
-- metadata.yml: metadata file for the assay describing different keys (see this lesson).
+- requirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.
+- results: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.
+- pipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.
+- scratch: A folder designated for temporary files or workspace for experiments and development.
+- scripts: Folder for helper scripts needed to run data analysis or reproduce the work.
Template engine
-It is very easy to create a folder template using cookiecutter. Cookiecutter is a command-line utility that creates projects from cookiecutters (that is, a template), e.g. creating a Python package project from a Python package project template. Here you can find an example of a cookiecutter folder template-directed to NGS data, where we have applied the structures explained in the previous sections. You are very welcome to adapt it or modify it to your needs!
+Creating a folder template is straightforward with cookiecutter a command-line tool that generates projects from templates (called cookiecutters). For example, it can help you set up a Python package project based on a Python package project template.
+
+
+
+Cookiecutter templates
+
+
+
+
+Here are some template that you can use to get started, adapt and modify them to your own needs:
+
+Create your own template from scratch.
+
+
-
These are the questions users will be asked when generating a project based on your template. The values provided here will be used to replace the corresponding placeholders in the template files.
-In addition to replacing placeholders in files and directory names, Cookiecutter can also automatically fill in information within the contents of text files. This can be useful for providing default configurations or templates for code files. Let’s extend our previous example to include a placeholder inside a text file:
+When users generate a project based on your template, they will be prompted with these questions. The provided values (“responses”) will be used to substitute the placeholders in your template files.
+Beyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:
First, modify the my_template/main.py
file to include a placeholder inside its contents:
# main.py
def hello():
print("Hello, {{cookiecutter.project_name}}!")
-Now, the {cookiecutter.project_name}
placeholder is inside the main.py
file. When you run Cookiecutter, it will automatically replace the placeholders not only in file and directory names but also within the contents of text files. After running Cookiecutter, your generated main.py
file might look like this:
+The ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.
+After running Cookiecutter, your generated ‘main.py’ file could appear as follows:
# main.py
def hello():
@@ -415,15 +517,15 @@ Step 2: Cr
-Step 4: Explore the Generated Project
-Once the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will see a project structure with the placeholders replaced by the values you provided.
+
+Step 4: Review the Generated Project
+After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.
-
Assay metadata field
-
-
@@ -1962,23 +1962,23 @@ Assay metadata field
-
-
diff --git a/develop/images/fork_repo_project.png b/develop/images/fork_repo_project.png
index 94ab2629..3d5ec960 100644
Binary files a/develop/images/fork_repo_project.png and b/develop/images/fork_repo_project.png differ
diff --git a/develop/practical_workshop.html b/develop/practical_workshop.html
index 06debd4f..1296e34b 100644
--- a/develop/practical_workshop.html
+++ b/develop/practical_workshop.html
@@ -172,10 +172,8 @@
On this page
- - 1. Organize and structure your NGS data and data analysis
+
- 1. Organize and structure your datasets and data analysis
- 2. Metadata
@@ -227,7 +225,7 @@
Practical material
@@ -252,15 +250,15 @@
Practical material
💬 Learning Objectives:
- Organize and structure your data and data analysis with Cookiecutter templates
-- Establish metadata fields and collect metadata when creating a cookiecutter folder
+- Define metadata fields and collect metadata when creating a Cookiecutter folder
- Establish naming conventions for your data
-- Make a catalog of your data
-- Create GitHub repositories of your data analysis and display them as GitHub Pages
+- Create a catalog of your data
+- Use GitHub repositories of your data analysis and display them as GitHub Pages
- Archive GitHub repositories on Zenodo
-This is a practical version of the full RDM on NGS data workshop. The main key points of the exercises shown here are to help you organize and structure your NGS datasets and your data analyses. We will see how to keep track of your experiments metadata and how to safely version control and archive your data analyses using GitHub repositories and Zenodo. We hope that through these practical exercises and step-by-step guidance, you’ll gain valuable skills in efficiently managing and sharing your research data, enhancing the reproducibility and impact of your work.
+This practical version covers practical aspects of RDM applied to biodata. The exercises provided here aim to help you organize and structure your datasets and data analyses. You’ll learn how to manage your experimental metadata effectively and safely version control and archive your data analyses using GitHub repositories and Zenodo. Through these guided exercises and step-by-step instructions, we hope you will acquire essential skills for managing and sharing your research data efficiently, thereby enhancing the reproducibility and impact of your work.
-Ensure that all necessary tools and software are installed before proceeding with the practical exercises.
+Ensure all necessary tools and software are installed before beginning the practical exercises:
-
-
-
-
-
-Cookicutter to create folder structure templates (pip install cookiecutter
)
-cruft to version control your templates (pip install cruft
)
-Option a. Install Quarto. We recommend Quarto as is easy to use and provides native support for notebooks (R Markdown and Jupyter Notebooks). No extensions or dependencies are needed.
-Option b. Install MkDocs and MkDocs extensions using the command line.
+- A GitHub account for hosting and collaborating on projects
+- Git for version control of your projects
+- A Zenodo account for archiving and sharing your research outputs
+- Python
+- pip for managing Python packages
+- Cookicutter for creating folder structure templates (
pip install cookiecutter
)
+- cruft to version control your templates (
pip install cruft
)
+
+Two more tools will be required, choose the one you are familiar with or the first option:
+
+- Option a. Install Quarto. We recommend Quarto as is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.
+- Option b. Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.
pip install mkdocs # create webpages
pip install mkdocs-material # customize webpages
@@ -289,108 +290,208 @@ Practical material
pip install mkdocs-minify-plugin # Minimize html code
pip install mkdocs-git-revision-date-localized-plugin # display last updated date
pip install mkdocs-jupyter # include Jupyter notebooks
-pip install mkdocs-table-reader-plugin
-pip install mkdocs-bibtex # add references in your text (`.bib`)
-pip install neoteroi-mkdocs # create author cards
-pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
+pip install mkdocs-bibtex # add references in your text (`.bib`)
+pip install neoteroi-mkdocs # create author cards
+pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
-
-1. Organize and structure your NGS data and data analysis
-Applying a consistent file structure and naming conventions to your files will help you to efficiently manage your data. We will divide your NGS data and data analyses into two different types of folders:
+
+1. Organize and structure your datasets and data analysis
+Establishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:
-- Assay folders: These folders contain the raw and processed NGS datasets, as well as the pipeline/workflow used to generate the processed data, provenance of the raw data, and quality control reports of the data. This data should be locked and read-only to prevent unwanted modifications.
-- Project folders: These folders contain all the necessary files for a specific research project. A project may use several assays or results from other projects. The assay data should not be copied or duplicated, but linked from the source.
+- Data folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.
+- Project folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.
-Projects and Assays are separated from each other because a project may use one or more assays to answer a scientific question, and assays may be reused several times in different projects. This could be, for example, all the data analysis related to a publication (an RNAseq and a ChIPseq experiment), or a comparison between a previous ATACseq experiment (which was used for a older project) with a new laboratory protocol.
-You could also create Genomic resources folders things such as genome references (fasta files) and annotations (gtf files) for different species, as well as indexes for different alignment algorithms. If you want to know more, feel free to check the relevant full lesson
+Data and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.
+
+
+Data folders
+Whether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.
+
+Let’s explore a potential folder structure and the types of files you might encounter within it.
+<data_type>_<keyword>_YYYYMMDD/
+├── README.md
+├── CHECKSUMS
+├── pipeline
+├── pipeline.md
+ ├── scripts/
+ ├── processed
+├── fastqc/
+ ├── multiqc/
+ ├── final_fastq/
+ └── raw
+├── .fastq.gz
+ └── samplesheet.csv
-- README.md: Long description of the assay in markdown format. It should contain provenance of the raw NGS data (samples, laboratory protocols used, the aim of the assay, etc)
-- metadata.yml: metadata file for the assay describing different keys and important information regarding that assay (see this lesson).
-- pipeline.md: description of the pipeline used to process raw data, as well as the commands used to run the pipeline.
-- processed: folder with results of the preprocessing pipeline. Contents depend on the pipeline used.
-- raw: folder with the raw data.
+
- README.md: This file contains a detailed description of the dataset commonly in markdown format. It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).
+- metadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.
+- pipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.
+- processed: This folder contains the results from the preprocessing pipeline. The content vary depending on the specific pipeline used (create additional subdirectories as needed).
+- raw: This folder holds the raw data.
-- .fastq.gz:In the case of NGS assays, there should be fastq files.
-- samplesheet.csv: file that contains metadata information for the samples. This file is used to run the nf-core pipelines. You can also add extra columns with info regarding the experimental variables and batches so it can be used for downstream analysis as well.
+- .fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.
+- samplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.
-
-Project folder
-On the other hand, we have the other type of folder called Projects
. In this folder, you will save a subfolder for each project that you (or your lab) work on. Each Project
subfolder will contain project information and all the data analysis notebooks and scripts used in that project.
-As like for an Assay folder, the Project folder should be named in a way that is unique, easily readable, distinguishable, and understood at a glance. For example, you could name it after the main author’s initials, a keyword that represents a unique descriptive element of that assay, and the date:
-<author_initials>_<keyword>_YYYYMMDD
-For example, JARH_Oct4_20230101
, is a project about the gene Oct4 owned by Jose Alejandro Romero Herrera, created on the 1st of January of 2023.
-Next, let’s take a look at a possible folder structure and what kind of files you can find there.
-<author_initials>_<keyword>_YYYYMMDD
+
+Project folders
+On the other hand, we have another type of folder called Projects which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. This helps ensure that the analyses can be easily replicated and shared with others.
+The Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:
+<project>_<keyword>_YYYYMMDD
+
+Now, let’s explore an example of a folder structure and the types of files you might encounter within it.
+<project>_<keyword>_YYYYMMDD
├── data
-│ └── <Assay-ID>_<keyword>_YYYYMMDD/
+│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link
├── documents
-│ └── Non-sensitive_NGS_research_project_template.docx
-├── notebooks
-│ └── 01_data_analysis.rmd
-├── README.md
-├── reports
-│ ├── figures
-│ │ └── 01_data_analysis/
-│ │ └── heatmap_sampleCor_20230102.png
-│ └── 01_data_analysis.html
-├── requirements.txt
-├── results
-│ └── 01_data_analysis/
-│ └── DEA_treat-control_LFC1_p01.tsv
-├── scripts
-└── metadata.yml
+│ └── research_project_template.docx
+├── metadata.yml
+├── notebooks
+│ └── 01_data_processing.rmd
+│ └── 02_data_analysis.rmd
+│ └── 03_data_visualization.rmd
+├── README.md
+├── reports
+│ └── 01_data_processing.html
+│ └── 02_data_analysis.html
+│ ├── 03_data_visualization.html
+│ │ └── figures
+│ │ └── tables
+├── requirements.txt // env.yaml
+├── results
+│ ├── figures
+│ │ └── 02_data_analysis/
+│ │ └── heatmap_sampleCor_20230102.png
+│ ├── tables
+│ │ └── 02_data_analysis/
+│ │ └── DEA_treat-control_LFC1_p01.tsv
+│ │ └── SumStats_sampleCor_20230102.tsv
+├── pipeline
+│ ├── rules // processes
+│ │ └── step1_data_processing.smk
+│ └── pipeline.md
+├── scratch
+└── scripts
-- data: a folder that contains symlinks or shortcuts to where the data is, avoiding copying and modification of original files.
-- documents: a folder containing Word documents, slides, or PDFs related to the project, such as explanations of the data or project, papers, etc. It also contains your Data Management Plan.
+
- data: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.
+- documents: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. It also includes the Data Management Plan.
-- Non-sensitive_NGS_research_project_template.docx. This is a pre-filled Data Management Plan based on the Horizon Europe guidelines.
+- research_project_template.docx. If you download our template, you will find a pre-filled Data Management Plan based on the Horizon Europe guidelines, named ‘Non-sensitive_NGS_research_project_template.docx’.
-- notebooks: a folder containing Jupyter, R markdown, or Quarto notebooks with the actual data analysis.
-- README.md: detailed description of the project in markdown format.
-- reports: notebooks rendered as HTML/docx/pdf versions, ideal for sharing with colleagues and also as a formal report of the data analysis procedure.
+
- metadata.yml: Metadata file describing various keys of the project or experiment (see this lesson).
+- notebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.
+- README.md: A detailed project description in markdown or plain-text format.
+- reports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.
- figures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.
-- requirements.txt: file explaining what software and libraries/packages and their versions are necessary to reproduce the code.
-- results: results from the data analysis, such as tables with differentially expressed genes, enrichment results, etc.
-- scripts: folder containing helper scripts needed to run data analysis or reproduce the work of the folder
-- description.yml: a short description of the project.
-- metadata.yml: metadata file for the assay describing different keys (see this lesson).
+- requirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.
+- results: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.
+- pipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.
+- scratch: A folder designated for temporary files or workspace for experiments and development.
+- scripts: Folder for helper scripts needed to run data analysis or reproduce the work.
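+As a rough illustration of the data, requirements.txt and env.yaml items above, the following shell sketch links a dataset into the project folder and records the software environment instead of copying files; the dataset name and path are hypothetical:
+# run from inside the project folder, e.g. <project>_<keyword>_YYYYMMDD/
+ln -s /archive/assays/CHIP_Oct4_20230101 data/CHIP_Oct4_20230101   # link the dataset instead of copying it
+pip freeze > requirements.txt    # record Python packages and their versions
+conda env export > env.yaml      # alternative if you work with conda environments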
Template engine
-It is very easy to create a folder template using cookiecutter. Cookiecutter is a command-line utility that creates projects from cookiecutters (that is, a template), e.g. creating a Python package project from a Python package project template. Here you can find an example of a cookiecutter folder template-directed to NGS data, where we have applied the structures explained in the previous sections. You are very welcome to adapt it or modify it to your needs!
+Creating a folder template is straightforward with cookiecutter, a command-line tool that generates projects from templates (called cookiecutters). For example, it can help you set up a Python package project based on a Python package project template.
+
+
+
+Cookiecutter templates
+
+
+
+
+Here are some templates that you can use to get started; feel free to adapt and modify them to your own needs (see the command-line sketch after this list for how to generate a folder from one of them):
+
+Create your own template from scratch.
+
+
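+If you start from one of these templates, generating a project folder takes a single command. A minimal sketch, assuming the template lives in a Git repository (the URL is a placeholder, not a real template):
+cookiecutter https://github.com/<your-org>/<folder-template>   # generate a folder from the template
+cruft create https://github.com/<your-org>/<folder-template>   # same, but records the template version so you can update the folder later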
-
These are the questions users will be asked when generating a project based on your template. The values provided here will be used to replace the corresponding placeholders in the template files.
-In addition to replacing placeholders in files and directory names, Cookiecutter can also automatically fill in information within the contents of text files. This can be useful for providing default configurations or templates for code files. Let’s extend our previous example to include a placeholder inside a text file:
+When users generate a project based on your template, they will be prompted with these questions. The provided values (“responses”) will be used to substitute the placeholders in your template files.
+Beyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:
First, modify the my_template/main.py
file to include a placeholder inside its contents:
# main.py
def hello():
print("Hello, {{cookiecutter.project_name}}!")
-Now, the {cookiecutter.project_name}
placeholder is inside the main.py
file. When you run Cookiecutter, it will automatically replace the placeholders not only in file and directory names but also within the contents of text files. After running Cookiecutter, your generated main.py
file might look like this:
+The ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.
+After running Cookiecutter, your generated ‘main.py’ file could appear as follows:
# main.py
def hello():
@@ -415,15 +517,15 @@ Step 2: Cr
-Step 4: Explore the Generated Project
-Once the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will see a project structure with the placeholders replaced by the values you provided.
+
+Step 4: Review the Generated Project
+After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.
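+For instance, assuming you answered the project_name prompt with my_project (a hypothetical value used only for illustration), you could check that the placeholders were replaced like this:
+cd my_project
+cat main.py    # the {{cookiecutter.project_name}} placeholder now shows the value you entered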