diff --git a/_site/develop/01_RDM_intro.html b/_site/develop/01_RDM_intro.html index f0a871b0..7907b444 100644 --- a/_site/develop/01_RDM_intro.html +++ b/_site/develop/01_RDM_intro.html @@ -295,7 +295,7 @@

1. Introduction to RDM

Modified
-

April 17, 2024

+

April 25, 2024

@@ -343,7 +343,7 @@

FAIR Research Data Management and the Data Lifecycle

-

The definition of Management is “the practice of managing; handling, supervision, or control.”.

+

The definition of Management is “the practice of managing; handling, supervision, or control”.

In accordance with the UCPH Policy for Research Data Management, research data encompasses both physical material and digital information gathered, observed, produced, or formulated during research activities. This broad definition includes various types of data serving as the foundation for the research, such as specimens, notebooks, interviews, texts, literature, digital raw data, recordings, computer code, and meticulous documentation of these materials and data, forming the core of the analysis that underlies the research outcomes.

diff --git a/_site/develop/03_DOD.html b/_site/develop/03_DOD.html index 2498718b..17ee3c63 100644 --- a/_site/develop/03_DOD.html +++ b/_site/develop/03_DOD.html @@ -311,7 +311,7 @@

3. Data organization and storage

Modified
-

April 17, 2024

+

April 26, 2024

@@ -742,7 +742,7 @@

Quick tutor

Learn how to create your own template here.

-

We offer workshops on practical RDM for NGS data. Keep an eye on the upcoming events on the Sandbox website.

+

We offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the Sandbox website.

@@ -867,7 +867,7 @@

Manual Download

Naming conventions

-

Consistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, or datasets are labeled logically, facilitating easy location and comparison of similar data. For instance, in fields like genomics, uniform naming conventions for files associated with particular experiments or samples allow for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. Overall, promotes efficiency, collaboration, and the integrity of scientific work.

+

Consistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, or datasets are labeled logically, facilitating easy location and comparison of similar data. This matters across many fields: in genomics or health data science, for example, uniform naming conventions for files associated with particular experiments or samples allow for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. Overall, consistent naming promotes efficiency, collaboration, and the integrity of scientific work.

@@ -893,648 +893,33 @@

Naming conventions

  • Not all search tools may work well with spaces (messy to indicate paths)
  • If the length is a concern, use capital letters to delimit words (camelCase).
  • -
  • Sequential numbering: Use a two-‑digit format for single-digit numbers (0–9) to ensure correct numerical sequence order (for example, 01 and not 1)
  • +
  • Sequential numbering: Use a two-digit format for single-digit numbers (0–9) to ensure correct numerical sequence order (for example, 01 and not 1, if your sequence only goes up to 99)
  • Version control: Indicate the version (“V”) or revision (“R”) as the last element, using the two-digit format (e.g., v01, v02)
  • Write down your naming convention pattern and document it in the README file
  • -
    +
    -Define your file name conventions +Create your own naming conventions
    -
    +
    -

    Avoid long and complicated names and ensure your file names are both informative and easy to manage:

    -
      -
    1. For saving a new plot, a heatmap representing sample correlations
    2. -
    3. When naming the file for the document containing the Research Data Management Course Objectives (Version 2, 2nd May 2024) from the University of Copenhagen
    4. -
    5. Consider the most common file types you work with, such as visualizations, tables, etc., and create logical and clear file names
    6. -
    -
    - -
    -
    -
    -
    -
      -
    1. heatmap_sampleCor_20240101.png
    2. -
    3. KU_RDM-objectives_20240502_v02.doc or KU_RDMObj_20240502_v02.doc
    4. -
    -
    +

    Consider the most common types of files and folders you will be working with, such as visualizations, results tables, and processed files. Develop a logical and clear naming system for these files based on the tips provided above. Aim for concise and straightforward names to avoid complexity.
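    The tips above can be sketched as a small helper that assembles file names from their parts. This is only an illustration: the function name `build_file_name` and its parameters are hypothetical, and the pattern `<kind>_<label>_<YYYYMMDD>[_vNN].<ext>` is one possible convention, not a required one.

```python
from datetime import date
from typing import Optional


def build_file_name(kind: str, label: str, created: date,
                    version: Optional[int] = None, ext: str = "png") -> str:
    """Build <kind>_<label>_<YYYYMMDD>[_vNN].<ext> (hypothetical pattern)."""
    parts = [kind, label, created.strftime("%Y%m%d")]
    if version is not None:
        # Two-digit zero padding keeps v02 sorted before v10.
        parts.append(f"v{version:02d}")
    for part in parts:
        # Spaces make paths awkward to type and break some search tools.
        if " " in part:
            raise ValueError(f"spaces are not allowed in file names: {part!r}")
    return "_".join(parts) + "." + ext


print(build_file_name("heatmap", "sampleCor", date(2024, 1, 1)))
print(build_file_name("KU", "RDMObj", date(2024, 5, 2), version=2, ext="doc"))
```

    Whatever pattern you settle on, write it down in the README so collaborators can parse the names the same way.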

    -
    -
    -
    -
    -
    - -Additional file naming conventions - -

    -

    -
    -
    -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    name | description | naming_convention | file format | example
    .fastq | raw sequencing reads | nan | nan | sampleID_run_read1.fastq
    .fastqc | quality control from fastqc | nan | nan | sampleID_run_read1.fastqc
    .bam | aligned reads | nan | nan | sampleID_run_read1.bam
    GTF | sequence annotation | nan | nan | one of https://www.gencodegenes.org/
    GFF | sequence annotation | nan | nan | one of https://www.gencodegenes.org/
    .bed | genome locations | nan | nan | nan
    .bigwig | genome coverage | nan | nan | nan
    .fasta | sequence data (nucleotide/aminoacid) | nan | nan | one of https://www.gencodegenes.org/
    Multiqc report | QC aggregated report | <assayID>_YYYYMMDD.multiqc | multiqc | RNA_20200101.multiqc
    Count matrix | final count matrix | <assayID>_cm_aligner_YYYYMMDD.tsv | tsv | RNA_cm_salmon_20200101.tsv
    DEA | differential expression analysis results | DEA_<condition1-condition2>_LFC<absolute_threshold>_p<pvalue decimals>_YYYYMMDD.tsv | tsv | DEA_treat-untreat_LFC1_p01_20200101.tsv
    DBA | differential binding analysis results | DBA_<condition1-condition2>_LFC<absolute_threshold>_p<pvalue decimals>_YYYYMMDD.tsv | tsv | DBA_treat-untreat_LFC1_p01_20200101.tsv
    MAplot | MA plot | MAplot_<condition1-condition2>_YYYYMMDD.jpeg | jpeg | MAplot_treat-untreat_20200101.jpeg
    Heatmap plot | Heatmap plot of anything | heatmap_<type>_YYYYMMDD.jpeg | jpeg | heatmap_sampleCor_20200101.jpeg
    Volcano plot | Volcano plot | volcano_<condition1-condition2>_YYYYMMDD.jpeg | jpeg | volcano_treat-untreat_20200101.jpeg
    Venn diagram | Venn diagram | venn_<type>_YYYYMMDD.jpeg | jpeg | venn_consensus_20200101.jpeg
    Enrichment table | Enrichment results | nan | tsv | nan
    - -
    -
    -
    -
    -

    -
    +

    To learn more about naming conventions for NGS analysis and see additional examples, click here.

    Wrap up

    diff --git a/_site/develop/04_metadata.html b/_site/develop/04_metadata.html index a52af690..9449ee2d 100644 --- a/_site/develop/04_metadata.html +++ b/_site/develop/04_metadata.html @@ -261,7 +261,7 @@

    On this page

    - -
    +
    @@ -884,7 +925,7 @@

    Catalog browser

    -
    +
    @@ -893,7 +934,7 @@

    Catalog browser

    -
    +
    diff --git a/_site/develop/examples/NGS_management.html b/_site/develop/examples/NGS_management.html index 8ee49823..228de2e0 100644 --- a/_site/develop/examples/NGS_management.html +++ b/_site/develop/examples/NGS_management.html @@ -8,7 +8,7 @@ -RDM for NGS +RDM for biodata + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    name | description | naming_convention | file format | example
    .fastq | raw sequencing reads | nan | nan | sampleID_run_read1.fastq
    .fastqc | quality control from fastqc | nan | nan | sampleID_run_read1.fastqc
    .bam | aligned reads | nan | nan | sampleID_run_read1.bam
    GTF | sequence annotation | nan | nan | one of https://www.gencodegenes.org/
    GFF | sequence annotation | nan | nan | one of https://www.gencodegenes.org/
    .bed | genome locations | nan | nan | nan
    .bigwig | genome coverage | nan | nan | nan
    .fasta | sequence data (nucleotide/aminoacid) | nan | nan | one of https://www.gencodegenes.org/
    Multiqc report | QC aggregated report | <assayID>_YYYYMMDD.multiqc | multiqc | RNA_20200101.multiqc
    Count matrix | final count matrix | <assayID>_cm_aligner_YYYYMMDD.tsv | tsv | RNA_cm_salmon_20200101.tsv
    DEA | differential expression analysis results | DEA_<condition1-condition2>_LFC<absolute_threshold>_p<pvalue decimals>_YYYYMMDD.tsv | tsv | DEA_treat-untreat_LFC1_p01_20200101.tsv
    DBA | differential binding analysis results | DBA_<condition1-condition2>_LFC<absolute_threshold>_p<pvalue decimals>_YYYYMMDD.tsv | tsv | DBA_treat-untreat_LFC1_p01_20200101.tsv
    MAplot | MA plot | MAplot_<condition1-condition2>_YYYYMMDD.jpeg | jpeg | MAplot_treat-untreat_20200101.jpeg
    Heatmap plot | Heatmap plot of anything | heatmap_<type>_YYYYMMDD.jpeg | jpeg | heatmap_sampleCor_20200101.jpeg
    Volcano plot | Volcano plot | volcano_<condition1-condition2>_YYYYMMDD.jpeg | jpeg | volcano_treat-untreat_20200101.jpeg
    Venn diagram | Venn diagram | venn_<type>_YYYYMMDD.jpeg | jpeg | venn_consensus_20200101.jpeg
    Enrichment table | Enrichment results | nan | tsv | nan
    + +
    +
    +
    +

    Click below to access a list of the most common file formats used when working with NGS data.

    @@ -508,8 +1108,6 @@

    4. Pipelin

    Explore more data types at the UCSC webpage. Check out this tutorial for more detailed explanations.

    -
    -

    Wrap up

    In this lesson, we have taken a look at the vast and diverse landscape of bioinformatics data.

    diff --git a/_site/develop/examples/NGS_metadata.html b/_site/develop/examples/NGS_metadata.html index 4c9a6fe7..fe5710cd 100644 --- a/_site/develop/examples/NGS_metadata.html +++ b/_site/develop/examples/NGS_metadata.html @@ -8,7 +8,7 @@ -RDM for NGS - NGS Assay and Project metadata +RDM for biodata - NGS Assay and Project metadata @@ -713,129 +727,38 @@

    Sample metadata fie Metadata field Definition Format -Ontology +Ontology Example -sample -Name of the sample -NA -NA -control_rep1, treat_rep1 - - -fastq_1 -Path to fastq file 1 -NA -NA -AEG588A1_S1_L002_R1_001.fastq.gz - - -fastq_2 -Path to paired fastq file, if it is a paired experiment -NA -NA -AEG588A1_S1_L002_R2_001.fastq.gz - - -strandedness -The strandedness of the cDNA library -<unstranded OR forward OR reverse \> -NA -unstranded - - -condition -Variable of interest of the experiment, such as "control", "treatment", etc -wordWord -camelCase -control, treat1, treat2 - - -cell_type -The cell type(s) known or selected to be present in the sample -NA -ontology field- e.g. EFO or OBI -NA - - -tissue -The tissue from which the sample was taken -NA -Uberon -NA - - -sex -The biological/genetic sex of the sample -NA -ontology field- e.g. EFO or OBI -NA - - -cell_line -Cell line of the sample -NA -ontology field- e.g. EFO or OBI -NA - - -organism -Organism origin of the sample -<Genus species> -Taxonomy -Mus musculus - - -replicate -Replicate number -<integer\> -NA -1 - - -batch -Batch information -wordWord -camelCase -1 - - -disease -Any diseases that may affect the sample -NA -Disease Ontology or MONDO -NA +project +Project ID +<surname\>_et_al_2023 +NA +proks_et_al_2023 -developmental_stage -The developmental stage of the sample -NA -NA -NA +author +Owner of the project +<First name\> <Surname\> +NA +Martin Proks -sample_type -The type of the collected specimen, eg tissue biopsy, blood draw or throat swab -NA -NA -NA +date +Date of creation +YYYYMMDD +NA +20230101 -strain -Strain of the species from which the sample was collected, if applicable -NA -ontology field - e.g. 
NCBITaxonomy -NA - - -genetic variation -Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, eg abnormal chromosome counts, major translocations or indels -NA -NA -NA +description +Short description of the project +Plain text +NA +This is a project describing the effect of Oct4 perturbation after pERK activation @@ -845,29 +768,29 @@

    Sample metadata fie

    -
    -

    Project metadata fields

    -

    Here you will find a table with possible metadata fields that you can use to annotate and track your Project folders:

    +
    +

    Sample metadata fields

    +

    Some details might be specific to your samples. For example, which samples are treated, which are controls, which tissue they come from, which cell type, the age, etc. Here is a list of possible metadata fields that you can use:

    -
    - @@ -1298,38 +1221,129 @@

    Project metadata f Metadata field Definition Format -Ontology +Ontology Example -project -Project ID -<surname\>_et_al_2023 -NA -proks_et_al_2023 +sample +Name of the sample +NA +NA +control_rep1, treat_rep1 -author -Owner of the project -<First name\> <Surname\> -NA -Martin Proks +fastq_1 +Path to fastq file 1 +NA +NA +AEG588A1_S1_L002_R1_001.fastq.gz -date -Date of creation -YYYYMMDD -NA -20230101 +fastq_2 +Path to paired fastq file, if it is a paired experiment +NA +NA +AEG588A1_S1_L002_R2_001.fastq.gz -description -Short description of the project -Plain text -NA -This is a project describing the effect of Oct4 perturbation after pERK activation +strandedness +The strandedness of the cDNA library +<unstranded OR forward OR reverse \> +NA +unstranded + + +condition +Variable of interest of the experiment, such as "control", "treatment", etc +wordWord +camelCase +control, treat1, treat2 + + +cell_type +The cell type(s) known or selected to be present in the sample +NA +ontology field- e.g. EFO or OBI +NA + + +tissue +The tissue from which the sample was taken +NA +Uberon +NA + + +sex +The biological/genetic sex of the sample +NA +ontology field- e.g. EFO or OBI +NA + + +cell_line +Cell line of the sample +NA +ontology field- e.g. EFO or OBI +NA + + +organism +Organism origin of the sample +<Genus species> +Taxonomy +Mus musculus + + +replicate +Replicate number +<integer\> +NA +1 + + +batch +Batch information +wordWord +camelCase +1 + + +disease +Any diseases that may affect the sample +NA +Disease Ontology or MONDO +NA + + +developmental_stage +The developmental stage of the sample +NA +NA +NA + + +sample_type +The type of the collected specimen, eg tissue biopsy, blood draw or throat swab +NA +NA +NA + + +strain +Strain of the species from which the sample was collected, if applicable +NA +ontology field - e.g. 
NCBITaxonomy +NA + + +genetic variation +Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, eg abnormal chromosome counts, major translocations or indels +NA +NA +NA @@ -1345,23 +1359,23 @@

    Assay metadata field
    -
    - @@ -1949,23 +1963,23 @@

    Assay metadata field
    -
    - @@ -2533,6 +2547,13 @@

    Assay metadata field

    +

    +
    +

    Sources

    +
    diff --git a/_site/develop/practical_workshop.html b/_site/develop/practical_workshop.html index aef81417..78a07f45 100644 --- a/_site/develop/practical_workshop.html +++ b/_site/develop/practical_workshop.html @@ -176,19 +176,13 @@

    On this page

    -
  • 2. Metadata +
  • 2. Data documentation
  • -
  • 3. Naming conventions -
  • -
  • 4. Create a catalog of your assay folder
  • +
  • 3. Naming conventions
  • +
  • 4. Create a catalog of your data folder
  • 5. Version control of your data analysis using Git and GitHub
    • Creating a git repo online and copying your project folder
    • @@ -225,7 +219,7 @@

      Practical material

      Modified
      -

      April 24, 2024

      +

      April 26, 2024

      @@ -539,46 +533,35 @@
      Step 4

      Use Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a GitHub repository (recommended for this workshop). Feel free to tailor the template to your specific requirements; you don’t have to follow our examples exactly.

      -

      Requirements We assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.

      +

      Requirements

      +

      We assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.

      Project

        -
      1. Go to our Cookicutter template and click on the **Fork*
      2. -
      -
        -
      • button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. fork_repo_example
      • -
      -
        -
      1. Open a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):
      2. -
      +
    • Go to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. fork_repo_example

    • +
    • Open a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):

      git clone <your URL to the template>
      -

      If you have a GitHub Desktop, click Add and select “Clone repository” from the options 3. Open the repository and navigate through the different directories 4. Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones. remove existing one or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory. Consider creating it, along with a subdirectory named ‘figures’. Here’s an example of how to do it:

      -
      cd \{\{\ cookiecutter.project_name\ \}\}/  
      -mkdir reports 
      -touch requirements.txt
      -
        -
      1. Modify the cookiecutter.json file. You could add new variables or change the default values:
      2. -
      -
      # open a text editor
      - "author": "Alba Refoyo",
      -
        -
      1. Commit and push changes when you are done with your modifications
      2. +

        If you use GitHub Desktop, click Add and select “Clone repository” from the options.

        +
      3. Open the repository and navigate through the different directories

      4. +
      5. Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and the ‘requirements.txt’ file. Consider creating them, along with a subdirectory named ‘reports/figures’.

        +
        ├── reports/
        +│   ├── figures/
        +├── requirements.txt
        +

        Here’s an example of how to do it:

        +
        # Open your terminal and navigate to your template directory. Then: 
        +cd \{\{\ cookiecutter.project_name\ \}\}/  
        +mkdir -p reports/figures
        +touch requirements.txt
      6. +
      7. Commit and push changes when you are done with your modifications

        -
      • Stage the changes with ‘git add’
      • -
      • Commit the changes with a meaningful commit message ‘git commit -m “update cookicutter template”’
      • -
      • Push the changes to your forked repository on Github ‘git push origin main’ (or the appropriate branch name)
      • +
      • Stage the changes with git add
      • +
      • Commit the changes with a meaningful commit message: git commit -m "update cookiecutter template"
      • +
      • Push the changes to your forked repository on GitHub: git push origin main (or the appropriate branch name)
        -
      1. Test your template by using cookiecutter <URL to your GitHub repository "cookicutter-template"> Fill up the variables and verify that the modified template looks like you would expect.
        -
      2. -
      3. Optional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.
      4. +
      5. Test your template by using cookiecutter <URL to your GitHub repository "cookiecutter-template">

        +

        Fill in the variables and verify that the new structure and folders look as you expect. Have any new folders been added, or have some been removed?
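        If you prefer to check the generated project programmatically rather than by eye, a minimal sketch could compare the folder contents against a list of paths you expect. The function name `compare_structure` and the expected paths below are assumptions for illustration; adjust them to your own template. The demo builds a throwaway folder instead of running Cookiecutter itself.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def compare_structure(project_root: Path, expected: set[str]) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) paths relative to project_root."""
    # Collect every file and folder below the root as a POSIX-style relative path.
    actual = {p.relative_to(project_root).as_posix() for p in project_root.rglob("*")}
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    return missing, unexpected


# Demo on a temporary folder standing in for a freshly generated project.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "reports" / "figures").mkdir(parents=True)
    (root / "requirements.txt").touch()
    missing, unexpected = compare_structure(
        root, {"reports", "reports/figures", "requirements.txt", "data"}
    )
    print("missing:", missing)
    print("unexpected:", unexpected)
```

        In practice you would point `project_root` at the folder Cookiecutter created and list the directories and files you added or removed in your fork.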

      -
      "__prompts__": {
      -    "project_name": "Project directory name [Example: project_short_description_202X]",
      -    "author": "Author of the project",
      -    "date": "Date of project creation, default is today's date",
      -    "short_description": "Provide a detailed description of the project (context/content)"
      -  },
    • @@ -621,1905 +604,339 @@
      Step 4
  • -
    -

    2. Metadata

    -

    Metadata is the behind-the-scenes information that makes sense of data and gives context and structure. For biodata, metadata includes information such as when and where the data was collected, what it represents, and how it was processed. Let’s check what kind of relevant metadata is available for NGS data and how to capture it in your Assay or Project folders. Both of these folders contain a metadata.yml file and a README.md file. In this section, we will check what kind of information you should collect in each of these files.

    -
    +
    +

    2. Data documentation

    +

    Data documentation involves organizing, describing, and providing context for datasets and projects. While metadata concentrates on the data itself, README files provide a broader perspective on the overall project or resource.

    +
    +

    Metadata

    +
    -Metadata and controlled vocabularies +metadata.yml
    -

    In order for metadata to be most useful, you should try to use controlled vocabularies for all your fields. For example, tissue could be described with the UBERON ontologies, species using the NCBI taxonomy, diseases using the Mondo database, etc. Unfortunately, implementing a systematic way of using these vocabularies is rather complex and outside the scope of this workshop, but you are very welcome to try to implement them on your own!

    -
    -
    -
    -

    README.md file

    -

    The README.md file is a markdown file that allows you to write a long description of the data placed in a folder. Since it is a markdown file, you are able to write in rich text format (bold, italic, include links, etc) what is inside the folder, why it was created/collected, and how and when. If it is an Assay folder, you could include the laboratory protocol used to generate the samples, images explaining the experiment design, a summary of the results of the experiment, and any sort of comments that would help to understand the context of the experiment. On the other hand, a ‘Project’ README file may contain a description of the project, what are its aims, why is it important, what ‘Assays’ is it using, how to interpret the code notebooks, a summary of the results and, again, any sort of comments that would help to understand the project.

    -

    Here is an example of a README file for a Project folder:

    -
    # NGS Analysis Project: Exploring Gene Expression in Human Tissues
    -
    -## Aims
    -
    -This project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.
    -
    -## Why It's Important
    -
    -Understanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.
    -
    -## Datasets
    -
    -We have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.
    -
    -In addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.
    -
    -## Summary of Results
    -
    -Our analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.
    -
    -Furthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.
    -
    ----
    -
    -For more details, refer to our [Jupyter Notebook](link-to-jupyter-notebook.ipynb) for the complete analysis pipeline and code.
    -
    -
    -

    metadata.yml

    -

    The metadata file is a yml file, which is a text document that contains data formatted using a human-readable data format for data serialization.

    -
    -
    -

    -
    yaml file example
    -
    +

    Choose the format that best suits the project’s needs. In this workshop, we will focus on YAML, as it is widely used for configuration files (e.g., in conda or pipelines).
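    As a rough illustration of what a metadata.yml could contain, the sketch below renders a flat set of project fields as YAML text using only the Python standard library. The helper `to_simple_yaml` is hypothetical and handles flat key/value pairs only; for nested metadata, a dedicated YAML library such as PyYAML would be the usual choice. The field names and example values are taken from the project metadata examples used in this material.

```python
import json


def to_simple_yaml(fields: dict) -> str:
    """Render a flat mapping as YAML 'key: value' lines (hypothetical helper)."""
    lines = []
    for key, value in fields.items():
        # json.dumps yields a double-quoted string, which YAML also accepts;
        # quoting keeps values like 20230101 from being read as integers.
        lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    return "\n".join(lines) + "\n"


project_metadata = {
    "project": "proks_et_al_2023",
    "author": "Martin Proks",
    "date": "20230101",
    "description": "This is a project describing the effect of Oct4 perturbation after pERK activation",
}

print(to_simple_yaml(project_metadata), end="")
```

    The resulting text can be saved as metadata.yml in the project folder and extended with further fields as your documentation needs grow.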

    +
    +
    -
    -

    Metadata fields

    -

    There is a ton of information you can collect regarding an NGS assay or a project. Some information fields are very general, such as author or date, while others are specific to the Assay or Project folder. Below, we will take a look at the minimal information you should collect in each of the folders.

    -
    -

    General metadata fields

    -

    Here you can find a list of suggestions for general metadata fields that can be used for both assays and project folders:

    +
    +File formats +
    +
    +
    +
    +
    +
    +
      -
    • Title: A brief yet informative name for the dataset.
    • -
    • Author(s): The individual(s) or organization responsible for creating the dataset. You can use your ORCID
    • -
    • Date Created: The date when the dataset was originally generated or compiled. Use YYYY-MM-DD format!
    • -
    • Description: A short narrative explaining the content, purpose, and context.
    • -
    • Keywords: A set of descriptive terms or phrases that capture the folder’s main topics and attributes.
    • -
    • Version: The version number or identifier for the folder, useful for tracking changes.
    • -
    • License: The type of license or terms of use associated with the dataset/project.
    • +
    • XML (eXtensible Markup Language): uses custom tags to describe data and allows for a hierarchical structure.
    • +
    • JSON (JavaScript Object Notation): lightweight and human-readable format that is easy to parse and generate.
    • +
    • CSV (Comma-Separated Values) or TSV (Tab-Separated Values): simple and widely supported for representing tabular data. Easy to manipulate using software or programming languages. It is often used for sample metadata.
    • +
    • YAML (YAML Ain’t Markup Language): human-readable data serialization format, commonly used as project configuration files.
    -
    -
    -

    Assay metadata fields

    -

    Here you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:

    -
    -
    -
    -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Metadata field | Definition | Format | Ontology | Example
    assay_ID | Identifier for the assay that is at least unique within the project | <Assay-ID>_<keyword>_YYYYMMDD | NA | CHIP_Oct4_20200101
    assay_type | The type of experiment performed, eg ATAC-seq or seqFISH | NA | ontology field- e.g. EFO or OBI | ChIPseq
    assay_subtype | More specific type or assay like bulk nascent RNAseq or single cell ATACseq | NA | ontology field- e.g. EFO or OBI | bulk ChIPseq
    owner | Owner of the assay (who made the experiment?). | <First Name> <Last Name> | NA | Jose Romero
    platform | The type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform | NA | ontology field- e.g. EFO or OBI | Illumina
    extraction_method | Technique used to extract the nucleic acid from the cell | NA | ontology field- e.g. EFO or OBI | NA
    library_method | Technique used to amplify a cDNA library | NA | ontology field- e.g. EFO or OBI | NA
    external_accessions | Accession numbers from external resources to which assay or protocol information was submitted | NA | eg protocols.io, AE, GEO accession number, etc | GSEXXXXX
    keyword | Keyword for easy identification | wordWord | camelCase | Oct4ChIP
    date | Date of assay creation | YYYYMMDD | NA | 20200101
    nsamples | Number of samples analyzed in this assay | <integer> | NA | 9
    is_paired | Paired fastq files or not | <single OR paired> | NA | single
    pipeline | Pipeline used to process data and version | NA | NA | nf-core/chipseq -r 1.0
    strandedness | The strandedness of the cDNA library | <+ OR - OR *> | NA | *
    processed_by | Who processed the data | <First Name> <Last Name> | NA | Sarah Lundregan
    organism | Organism origin | <Genus species> | Taxonomy name | Mus musculus
    origin | Is internal or external (from a public resources) data | <internal OR external> | NA | internal
    path | Path to files | </path/to/file> | NA | NA
    short_desc | Short description of the assay | plain text | NA | Oct4 ChIP after pERK activation
    ELN_ID | ID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling | plain text | NA | NA
    - +

    Other formats include RDF and HDF5.
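    Since tabular formats are often used for sample metadata, here is a minimal sketch of reading a TSV sample sheet with Python's standard `csv` module. The column names (sample, condition, replicate) and values follow the sample metadata examples in this material; the in-memory string `SAMPLE_SHEET` and the function `read_sample_sheet` are stand-ins for a real file and your own loading code.

```python
from __future__ import annotations

import csv
import io

# A tiny tab-separated sample sheet kept in memory for illustration.
SAMPLE_SHEET = (
    "sample\tcondition\treplicate\n"
    "control_rep1\tcontrol\t1\n"
    "treat_rep1\ttreat1\t1\n"
)


def read_sample_sheet(text: str) -> list[dict]:
    """Parse tab-separated text into one dictionary per sample row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_sample_sheet(SAMPLE_SHEET)
for row in rows:
    print(row["sample"], row["condition"], row["replicate"])
```

    Note that `csv` returns every value as a string, so numeric fields such as the replicate number need to be converted explicitly if you want to compute with them.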

    -
    -
    -

    Project metadata fields

    -

    Here you will find a table with possible metadata fields that you can use to annotate and track your Project folders:

    -
    -
    -
    -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Metadata field | Definition | Format | Ontology | Example
    project | Project ID | <surname>_et_al_2023 | NA | proks_et_al_2023
    author | Owner of the project | <First name> <Surname> | NA | Martin Proks
    date | Date of creation | YYYYMMDD | NA | 20230101
    description | Short description of the project | Plain text | NA | This is a project describing the effect of Oct4 perturbation after pERK activation
    -
    +

    Link to the file format database.

    +

    Metadata in biological datasets refers to the information that describes the data and provides context for how the data was collected, processed, and analyzed. Metadata is crucial for understanding, interpreting, and using biological datasets effectively. It also ensures that datasets are reusable, reproducible and understandable by other researchers. Some of the components may differ depending on the type of project, but there are general concepts that will always be shared across different projects:

    +
      +
    • Sample information and collection details
    • +
    • Biological context (such as experimental conditions, if applicable)
    • +
    • Data description
    • +
    • Data processing steps applied to the raw data
    • +
    • Annotation and Ontology terms
    • +
    • File metadata (file type, file format, etc.)
    • +
    • Ethical and Legal Compliance (ownership, access, provenance)
    • +
    +
    +
    +
    + +
    +
    +Metadata and controlled vocabularies +
    +
    +
    +

    To maximize the usefulness of metadata, aim to use controlled vocabularies across all fields. Read more about data documentation and find examples of ontology services in lesson 4. We encourage you to begin implementing them systematically on your own (under the “sources” section, you will find some helpful links to guide you in putting them into practice).

    +

    If you work with NGS data, check out these recommendations and examples of metadata for samples, projects and datasets.

    +
    - -
    -

    More info

    -

    The information provided in this lesson is not at all exhaustive. There might be many more fields and controlled vocabularies that could be useful for your NGS data. We recommend that you take a look at the following sources for more information!

    +
    +

    README file

    +
    +
    +
    + +
    +
    +README.md +
    +
    +
    +

    Choose the format that best suits the project’s needs. In this workshop, we will focus on Markdown, as it is the most widely used format thanks to its balance of simplicity and expressive formatting options.

    +
    + +
    +
    +
    +
      -
    • Transcriptomics metadata standards and fields
    • -
    • Bionty: Biological ontologies for data scientists.
    • +
    • Markdown (.md): commonly used because it is easy to read and write and is compatible across platforms (e.g., GitHub, GitLab). Supports formatting like headings, lists, links, images, and code blocks.
    • +
    • Plain Text (.txt): simple and straightforward format without any rich formatting; great for basic instructions, but it lacks the ability to structure content effectively.
    • +
    • ReStructuredText (.rst): commonly used for Python projects. Supports advanced formatting (tables, links, images, and code blocks).
    -
    -
    +

    Others such as HTML, YAML and Notebooks.

    +
    +
    +
    +
    +
    +

    Link to the file format database

    +
    +
    +

    The README.md file is a markdown file that provides a comprehensive description of the data within a folder. Its rich text format (including bold, italic, links, etc.) allows you to explain the contents of the folder, as well as the reasons and methods behind its creation or collection. The content will vary depending on what it describes (data or assays, project, software…).

    +

    Here is an example of a README file for a bioinformatics project:

    +
    +
    -Exercise 2: modify the metadata.yml files in your Cookiecutter templates +README
    -
    -
    -
    -

    We have seen some examples of metadata for NGS data. It is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!

    -
      -
    1. Think about what kind of metadata you would like to include.
    2. -
    3. Modify the cookiecutter.json file so that when you create a new folder template, all the metadata is filled accordingly.
    4. -
    -
    -
    -
    -
    -

    3. Naming conventions

    -

    Using consistent naming conventions is important in scientific research as it helps with the organization and retrieval of data or results. By adopting standardized naming conventions, researchers ensure that files, experiments, or data sets are labeled in a clear, logical manner. This makes it easier to locate and compare similar types of data or results, even when dealing with large datasets or multiple experiments. For instance, in genomics, employing uniform naming conventions for files related to specific experiments or samples allows for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. This practice promotes efficiency, collaboration, and the integrity of scientific work.

    -
    -

    General tips

    -

    Below you will find a small list of general tips to follow when you name a folder or a file:

    -
      -
    • Use only alphanumeric characters to write a word: a to z and 0 to 9
    • -
    • Avoid special characters: ~!@#$%^&*()`“|
    • -
    • Date format: use YYYYMMDD format. For example: 20230101.
    • -
    • Authors: use initials. For example: JARH
    • -
    • Don’t use spaces! Computers get very confused when you need to point a path to a file and it contains spaces! Instead: +
      + +
      +
      +
      +
      +

      It is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!

      +
        +
      1. Consider changing variables (add/remove) in the metadata.yml file from the cookiecutter template.

      2. +
      3. Modify the cookiecutter.json file. You could add new variables or change the default key and/or values:

        +
        {
        +"project_name": "myProject",
        +"project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}",
        +"author": "myName",
        +"date": "{% now 'utc', '%Y%m%d' %}",
        +"short_description": "",
        +"version": "0.1.0"
        +}
        +

        The metadata file will be filled accordingly.

      4. +
      5. Optional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.

        +
        "__prompts__": {
        +    "project_name": "Project directory name [Example: project_short_description_202X]",
        +    "author": "Author of the project",
        +    "date": "Date of project creation, default is today's date",
        +    "short_description": "Provide a detailed description of the project (context/content)"
        +},
      6. +
      7. Modify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file. Hint below:

        +
        project: {{ cookiecutter.project_name }}
        +author: {{ cookiecutter.author }}
        +date: {{ cookiecutter.date }}
        +description: {{ cookiecutter.short_description }}
      8. +
      9. Modify the README.md file so that it includes the short description recorded by the cookiecutter.json file and the metadata at the top of the markdown file (between the lines of dashes).

        +
        ---
        +title: {{ cookiecutter.project_name }}
        +date: "{{ cookiecutter.date }}"
        +author: {{ cookiecutter.author }}
        +version: {{ cookiecutter.version }}
        +---
        +
        +Project description
        +----
        +
        +{{ cookiecutter.short_description }}
      10. +
      11. Commit and push changes when you are done with your modifications

      12. +
        -
      • Separate field sections are separated by underscores _.
      • -
      • Words in each section are written in camelCase. It would look then like this: field1_word1Word2.txt. For example: heatmap_sampleCor_20230101.png. The first field indicates what this file is, i.e., a heatmap. The second field is what is being plotted, i.e., sample correlations; since the field contains two words, they are written in camelCase. The third field is the date when the image was created.
      • -
    • -
    • Use as short fields as possible. You can try to use understandable abbreviations, like LFC for LogFoldChange, Cor for correlations, Dist for distances, etc.
    • -
    • Avoid long names as much as you can, be concise!
    • -
    • Avoid creating many sublevels of folders.
    • -
    • Write down your naming convention pattern and document it in the README file
    • -
    • When using a sequential numbering system, use leading zeros to make sure files are sorted in sequential order. Use 01 instead of just 1 if your sequence only goes up to 99.
    • -
    • Versions should be used as the last element, and use at least two digits with a leading 0 (e.g. v01, v02)
    • +
    • Stage the changes with git add
    • +
    • Commit the changes with a meaningful commit message: git commit -m "update cookiecutter template"
    • +
    • Push the changes to your forked repository on GitHub: git push origin main (or the appropriate branch name)
    -
    -
    -

    Suggestions for NGS data

    -

    More info on naming conventions for different types of files and analysis is in development.

    -
    -
    -
    -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    namedescriptionnaming_conventionfile formatexample
    .fastqraw sequencing readsnannansampleID_run_read1.fastq
    .fastqcquality control from fastqcnannansampleID_run_read1.fastqc
    .bamaligned readsnannansampleID_run_read1.bam
    GTFsequence annotationnannanone of https://www.gencodegenes.org/
    GFFsequence annotationnannanone of https://www.gencodegenes.org/
    .bedgenome locationsnannannan
    .bigwiggenome coveragenannannan
    .fastasequence data (nucleotide/aminoacid)nannanone of https://www.gencodegenes.org/
    Multiqc reportQC aggregated report<assayID\>_YYYYMMDD.multiqcmultiqcRNA_20200101.multiqc
    Count matrixfinal count matrix<assayID\>_cm_aligner_YYYYMMDD.tsvtsvRNA_cm_salmon_20200101.tsv
    DEAdifferential expression analysis resultsDEA_<condition1-condition2\>_LFC<absolute_threshold\>_p<pvalue decimals\>_YYYYMMDD.tsvtsvDEA_treat-untreat_LFC1_p01_20200101.tsv
    DBAdifferential binding analysis resultsDBA_<condition1-condition2\>_LFC<absolute_threshold\>_p<pvalue decimals\>_YYYYMMDD.tsvtsvDBA_treat-untreat_LFC1_p01_20200101.tsv
    MAplotMA plotMAplot_<condition1-condition2\>_YYYYMMDD.jpegjpegMAplot_treat-untreat_20200101.jpeg
    Heatmap plotHeatmap plot of anythingheatmap_<type\>_YYYYMMDD.jpegjpegheatmap_sampleCor_20200101.jpeg
    Volcano plotVolcano plotvolcano_<condition1-condition2\>_YYYYMMDD.jpegjpegvolcano_treat-untreat_20200101.jpeg
    Venn diagramVenn diagramvenn_<type\>_YYYYMMDD.jpegjpegvenn_consensus_20200101.jpeg
    Enrichment tableEnrichment resultsnantsvnan
    - +
      +
    1. Test your template by using cookiecutter <URL to your GitHub repository "cookicutter-template">

      +

      Fill in the variables and verify that the modified information looks as you would expect.

    2. +
    +
    +
    +
    +
    +

    3. Naming conventions

    +

    As discussed in lesson 3, consistent naming conventions are key for interpreting, comparing, and reproducing findings in scientific research. Standardized naming helps organize and retrieve data or results, allowing researchers to locate and compare similar types of data within or across large datasets.

    -
    +
    -Exercise 3: Create your own naming conventions +Exercise 3: Define your file name conventions
    -
    +
    -

    Think about the most common types of files and folders you will be working on, such as visualizations, results tables, processed files, etc. Then come up with a logical and clear way of naming those files using the tips suggested above. Remember to avoid making long and complicated names!

    +

    Avoid long and complicated names and ensure your file names are both informative and easy to manage:

    +
      +
    1. For saving a new plot, a heatmap representing sample correlations
    2. +
    3. When naming the file for the document containing the Research Data Management Course Objectives (Version 2, 2nd May 2024) from the University of Copenhagen
    4. +
    5. Consider the most common file types you work with, such as visualizations, figures, tables, etc., and create logical and clear file names
    6. +
    +
    + +
    +
    +
    +
    +
      +
    1. heatmap_sampleCor_20240101.png
    2. +
    3. KU_RDM-objectives_20240502_v02.doc or KU_RDMObj_20240502_v02.doc
    4. +
    +
    +
    +
    +
    +
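The naming pattern used in the hint (fields separated by underscores, multi-word fields in camelCase, a YYYYMMDD date, and an optional two-digit version as the last element) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `camel_case` and `build_name` are hypothetical and not part of the workshop material, and the first word of each field is kept exactly as typed so that abbreviations such as KU stay uppercase.

```python
import re
from datetime import date


def camel_case(field):
    """Join the words of one field in camelCase, keeping the first word as typed."""
    words = field.split()
    if not words:
        raise ValueError("empty field")
    first, *rest = words
    return first + "".join(word[:1].upper() + word[1:] for word in rest)


def build_name(fields, ext, when=None, version=None):
    """Build a file name: field1_field2[_YYYYMMDD][_vNN].ext (hypothetical helper)."""
    parts = [camel_case(field) for field in fields]
    if when is not None:
        parts.append(when.strftime("%Y%m%d"))
    if version is not None:
        # Versions go last, with at least two digits and a leading zero.
        parts.append(f"v{version:02d}")
    name = "_".join(parts) + "." + ext.lstrip(".")
    # Only alphanumeric characters and underscores are allowed before the extension.
    if not re.fullmatch(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+", name):
        raise ValueError(f"name contains disallowed characters: {name}")
    return name


print(build_name(["heatmap", "sample cor"], "png", date(2024, 1, 1)))
print(build_name(["KU", "RDM obj"], "doc", date(2024, 5, 2), version=2))
```

Generating names from a single helper like this, rather than typing them by hand, is one way to keep a convention consistent across a project; the final check rejects spaces and special characters before a file is ever written.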
    - -
    -

    4. Create a catalog of your assay folder

    +
    +

    4. Create a catalog of your data folder

    The next step is to collect all the NGS datasets that you have created in the manner explained above. Since your folders should all contain the metadata.yml file in the same place with the same metadata fields, it should be easy to iterate through all the folders and merge all the metadata.yml files into a single table. This table can then be browsed easily with Microsoft Excel, for example. If you are interested in making a Shiny app or Python Panel tool to interactively browse the catalog, check out this lesson.

    -
    +
    @@ -2528,45 +945,55 @@

    4. C

    -
    +

    We will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your Assays folder), fetches all the metadata.yml files, and merges them. Finally, it will write a TSV file as an output.

      -
    1. Create a folder called Assays
    2. -
    3. Under that folder, make three new Assay folders from your cookiecutter template
    4. -
    5. Run the script below with R (or create your own with Python). Modify the folder_path variable so it matches the path to the folder Assays. The table will be written under the same folder_path.
    6. -
    7. Visualize your Assays table with Excel
    8. +
    9. Create a folder called dataset and change directory cd dataset
    10. +
    11. Fork this repository: a Cookiecutter template designed for NGS datasets. While you are welcome to create your own template from scratch, we recommend using this one to save time.
    12. +
    13. Run the cookiecutter cc-data-template command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created). Execute the script below using R (or create your own script in Python). Adjust the folder_path variable so that it matches the path to the dataset folder. The resulting table will be saved in the same folder_path.
    14. +
    15. Open your database_YYYYMMDD.tsv table in a text editor from the command-line, or view it in Excel for better visualization.
    -
    
    -library(yaml)
    -library(dplyr)
    -library(lubridate)
    -
    -# Function to recursively fetch metadata.yml files
    -get_metadata <- function(folder_path) {
    -    file_list <- list.files(path = folder_path, pattern = "metadata\\.yml$", recursive = TRUE, full.names = TRUE)
    -    metadata_list <- lapply(file_list, yaml::yaml.load_file)
    -    return(metadata_list)
    -    }
    -
    -# Specify the folder path
    -    folder_path <- "/path/to/your/folder"
    -
    -    # Fetch metadata from the specified folder
    -    metadata <- get_metadata(folder_path)
    -
    -    # Convert metadata to a data frame
    -    metadata_df <- data.frame(matrix(unlist(metadata), ncol = length(metadata), byrow = TRUE))
    -    colnames(metadata_df) <- names(metadata[[1]])
    -
    -    # Save the data frame as a TSV file
    -    output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv")
    -    write.table(metadata_df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
    -
    -    # Print confirmation message
    -    cat("Database saved as", output_file, "\n")
    +
    
    +library(yaml)
    +library(dplyr)
    +library(lubridate)
    +
    +# Function to read a YAML file and transform it into a dataframe format.
    +read_yaml <- function(file_path) {
    +  # Read the YAML file and convert it to a data frame
    +  df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE)
    +  
    +  # Return the data frame
    +  return(df)
    +}
    +
    +# Function to recursively fetch metadata.yml files
    +get_metadata <- function(folder_path) {
    +  file_list <- list.files(path = folder_path, pattern = "metadata\\.yml$", recursive = TRUE, full.names = TRUE)
    +
    +  metadata_list <- lapply(file_list, read_yaml)
    +  
    +  # Combine the list of data frames into a single data frame using dplyr::bind_rows()
    +  combined_metadata <- bind_rows(metadata_list)
    +
    +  return(combined_metadata)
    +}
    +
    +# Specify the folder path
    +folder_path <- "/path/to/your/folder"
    +
    +# Fetch metadata from the specified folder
    +metadata <- get_metadata(folder_path)
    +
    +# Save the data frame as a TSV file
    +output_file <- file.path(folder_path, paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv"))
    +write.table(metadata, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
    +
    +# Print confirmation message
    +cat("Database saved as", output_file, "\n")
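For readers who prefer Python, the same catalog step might look like the sketch below. It is a minimal illustration, not a definitive implementation: it uses only the standard library and assumes each metadata.yml is a flat list of `key: value` lines, as in the template hint above (nested YAML would need a real parser, for example PyYAML's `yaml.safe_load`). The function names `read_flat_yaml` and `build_catalog` are hypothetical.

```python
import csv
from datetime import date
from pathlib import Path


def read_flat_yaml(path):
    """Parse a flat 'key: value' file into a dict (assumes no nesting or lists)."""
    record = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without a key (e.g. '---').
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip().strip("'\"")
    return record


def build_catalog(folder_path):
    """Merge every metadata.yml found under folder_path into one TSV file."""
    folder = Path(folder_path)
    records = [read_flat_yaml(p) for p in sorted(folder.rglob("metadata.yml"))]
    # Union of all keys in first-seen order, so files with different fields still fit.
    columns = list(dict.fromkeys(key for record in records for key in record))
    output_file = folder / f"database_{date.today():%Y%m%d}.tsv"
    with output_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=columns,
            delimiter="\t",
            restval="NA",  # fill fields missing from a given file
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(records)
    return output_file


if __name__ == "__main__":
    folder_path = Path("/path/to/your/folder")  # placeholder: adjust to your dataset folder
    if folder_path.is_dir():
        print("Database saved as", build_catalog(folder_path))
```

Fields that are missing from a given metadata.yml are written as NA, so folders created from slightly different template versions can still share one table.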
    @@ -2615,7 +1042,7 @@

    GitHub Pages

    Once you have created your repository (and put it on GitHub), you can add the data analysis reports that you created, whether Jupyter Notebooks, Rmarkdowns, or HTML reports, to a GitHub Pages website. Creating a GitHub page is very simple, and we recommend that you follow the tutorial that GitHub has put together for you. Nonetheless, we will see the main steps in the exercise below.

    There are many different ways to create your web pages. We recommend using MkDocs and MkDocs materials as a framework to create a nice webpage with little effort. The folder templates that we used as an example in the previous exercise already contain everything you need to start a webpage. Nonetheless, you will need to understand the basics of MkDocs and MkDocs materials to design a webpage to your liking. MkDocs is a static webpage generator that is very easy to use, while MkDocs materials is an extension of the tool that gives you many more options to customize your website. Check out their web pages to get started!

    -
    +
    @@ -2624,7 +1051,7 @@

    GitHub Pages

    -
    +
    @@ -2683,7 +1110,7 @@

    Zenodo

    Zenodo (https://zenodo.org/) is an open-access digital repository designed to facilitate the archiving of scientific research outputs. It operates under the umbrella of the European Organization for Nuclear Research (CERN) and is supported by the European Commission. Zenodo accommodates a broad spectrum of research outputs, including datasets, papers, software, and multimedia files. This versatility makes it an invaluable resource for researchers across a wide array of domains, promoting transparency, collaboration, and the advancement of knowledge on a global scale.

    Operating on a user-friendly web platform, Zenodo allows researchers to easily upload, share, and preserve their research data and related materials. Upon deposit, each item is assigned a unique Digital Object Identifier (DOI), granting it a citable status and ensuring its long-term accessibility. Additionally, Zenodo provides robust metadata capabilities, enabling researchers to enrich their submissions with detailed contextual information. In addition, it allows you to link your GitHub account, providing a streamlined way to archive a specific release of your GitHub repository directly into Zenodo. This integration simplifies the process of preserving a snapshot of your project’s progress for long-term accessibility and citation.

    -
    +
    @@ -2692,7 +1119,7 @@

    Zenodo

    -
    +
    diff --git a/_site/search.json b/_site/search.json index 2b9b26b9..5c2e2185 100644 --- a/_site/search.json +++ b/_site/search.json @@ -79,7 +79,7 @@ "href": "develop/practical_workshop.html#naming-conventions", "title": "Practical material", "section": "3. Naming conventions", - "text": "3. Naming conventions\nUsing consistent naming conventions is important in scientific research as it helps with the organization and retrieval of data or results. By adopting standardized naming conventions, researchers ensure that files, experiments, or data sets are labeled in a clear, logical manner. This makes it easier to locate and compare similar types of data or results, even when dealing with large datasets or multiple experiments. For instance, in genomics, employing uniform naming conventions for files related to specific experiments or samples allows for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. This practice promotes efficiency, collaboration, and the integrity of scientific work.\n\nGeneral tips\nBelow you will find a small list of general tips to follow when you name a folder or a file:\n\nUse only alphanumeric characters to write a word: a to z and 0 to 9\nAvoid special characters: ~!@#$%^&*()`“|\nDate format: use YYYYMMDD format. For example: 20230101.\nAuthors: use initials. For example: JARH\nDon’t use spaces! Computers get very confused when you need to point a path to a file and it contains spaces! Instead:\n\nSeparate field sections are separated by underscores _.\nWords in each section are written in camelCase. It would look then like this: field1_word1Word2.txt. For example: heatmap_sampleCor_20230101.png. The first field indicates what this file is, i.e., a heatmap. The second field is what is being plotted, i.e., sample correlations; since the field contains two words, they are written in camelCase. 
The third field is the date when the image was created.\n\nUse as short fields as possible. You can try to use understandable abbreviations, like LFC for LogFoldChange, Cor for correlations, Dist for distances, etc.\nAvoid long names as much as you can, be concise!\nAvoid creating many sublevels of folders.\nWrite down your naming convention pattern and document it in the README file\nWhen using a sequential numbering system, use leading zeros to make sure files are sorted in sequential order. Use 01 instead of just 1 if your sequence only goes up to 99.\nVersions should be used as the last element, and use at least two digits with a leading 0 (e.g. v01, v02)\n\n\n\nSuggestions for NGS data\nMore info on naming conventions for different types of files and analysis is in development.\n\n\n\n\n\n\n\n\n\nname\ndescription\nnaming_convention\nfile format\nexample\n\n\n\n\n.fastq\nraw sequencing reads\nnan\nnan\nsampleID_run_read1.fastq\n\n\n.fastqc\nquality control from fastqc\nnan\nnan\nsampleID_run_read1.fastqc\n\n\n.bam\naligned reads\nnan\nnan\nsampleID_run_read1.bam\n\n\nGTF\nsequence annotation\nnan\nnan\none of https://www.gencodegenes.org/\n\n\nGFF\nsequence annotation\nnan\nnan\none of https://www.gencodegenes.org/\n\n\n.bed\ngenome locations\nnan\nnan\nnan\n\n\n.bigwig\ngenome coverage\nnan\nnan\nnan\n\n\n.fasta\nsequence data (nucleotide/aminoacid)\nnan\nnan\none of https://www.gencodegenes.org/\n\n\nMultiqc report\nQC aggregated report\n<assayID\\>_YYYYMMDD.multiqc\nmultiqc\nRNA_20200101.multiqc\n\n\nCount matrix\nfinal count matrix\n<assayID\\>_cm_aligner_YYYYMMDD.tsv\ntsv\nRNA_cm_salmon_20200101.tsv\n\n\nDEA\ndifferential expression analysis results\nDEA_<condition1-condition2\\>_LFC<absolute_threshold\\>_p<pvalue decimals\\>_YYYYMMDD.tsv\ntsv\nDEA_treat-untreat_LFC1_p01_20200101.tsv\n\n\nDBA\ndifferential binding analysis results\nDBA_<condition1-condition2\\>_LFC<absolute_threshold\\>_p<pvalue 
decimals\\>_YYYYMMDD.tsv\ntsv\nDBA_treat-untreat_LFC1_p01_20200101.tsv\n\n\nMAplot\nMA plot\nMAplot_<condition1-condition2\\>_YYYYMMDD.jpeg\njpeg\nMAplot_treat-untreat_20200101.jpeg\n\n\nHeatmap plot\nHeatmap plot of anything\nheatmap_<type\\>_YYYYMMDD.jpeg\njpeg\nheatmap_sampleCor_20200101.jpeg\n\n\nVolcano plot\nVolcano plot\nvolcano_<condition1-condition2\\>_YYYYMMDD.jpeg\njpeg\nvolcano_treat-untreat_20200101.jpeg\n\n\nVenn diagram\nVenn diagram\nvenn_<type\\>_YYYYMMDD.jpeg\njpeg\nvenn_consensus_20200101.jpeg\n\n\nEnrichment table\nEnrichment results\nnan\ntsv\nnan\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nExercise 3: Create your own naming conventions\n\n\n\n\n\n\n\nThink about the most common types of files and folders you will be working on, such as visualizations, results tables, processed files, etc. Then come up with a logical and clear way of naming those files using the tips suggested above. Remember to avoid making long and complicated names!" + "text": "3. Naming conventions\nAs discussed in lesson 3, consistent naming conventions are key for interpreting, comparing, and reproducing findings in scientific research. 
Standardized naming helps organize and retrieve data or results, allowing researchers to locate and compare similar types of data within or across large datasets.\n\n\n\n\n\n\nExercise 3: Define your file name conventions\n\n\n\n\n\n\n\nAvoid long and complicated names and ensure your file names are both informative and easy to manage:\n\nFor saving a new plot, a heatmap representing sample correlations\nWhen naming the file for the document containing the Research Data Management Course Objectives (Version 2, 2nd May 2024) from the University of Copenhagen\nConsider the most common file types you work with, such as visualizations, figures, tables, etc., and create logical and clear file names\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n\nheatmap_sampleCor_20240101.png\nKU_RDM-objectives_20240502_v02.doc or KU_RDMObj_20240502_v02.doc" }, { "objectID": "develop/practical_workshop.html#create-a-catalog-of-your-assays-folder", @@ -682,9 +682,9 @@ { "objectID": "develop/practical_workshop.html#datasets", "href": "develop/practical_workshop.html#datasets", - "title": "DTU workshop 2023", - "section": "", - "text": "We have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis." + "title": "Practical material", + "section": "Datasets", + "text": "Datasets\nDescribe the data,, including its sources, format, and how to access it. 
If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nWe have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis." }, { "objectID": "develop/practical_workshop.html#summary-of-results", @@ -962,37 +962,37 @@ { "objectID": "develop/examples/NGS_management.html", "href": "develop/examples/NGS_management.html", - "title": "NGS data strategies", + "title": "Effective RDM Practices in NGS Analysis", "section": "", - "text": "Section Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nNext Generation Sequencing data types and metadata\nBest practices for software and code management\nPipelines and workflows\n\n\n\nIn the data life cycle for Next Generation Sequencing (NGS) technology data, processing, and analyzing are critical phases that involve transforming raw sequencing data into meaningful biological insights. Researchers apply computational methods and bioinformatics tools to extract valuable information from the vast amount of sequencing data generated in NGS experiments. We’ll first explore the primary data types generated pre- and post-processing and the importance of detailed documentation. We will then focus on good practices used when performing data analysis and software development.\n\n\n\n\n\n\nNext Generation Sequencing\n\n\n\n\n\n\n\nNext Generation Sequencing (NGS), or high-throughput sequencing, has revolutionized genomics research. It encompasses advanced techniques for rapid and cost-effective analysis of DNA or RNA molecules. 
Unlike traditional methods, NGS can analyze millions of DNA fragments simultaneously, enhancing the speed, efficiency, and scale of sequencing and becoming integral to modern genomics and biomedical studies. As NGS technologies continue to advance and become more accessible, they will remain at the front of cutting-edge genomics research, driving innovations that contribute to our understanding of complex genetic interactions and their implications for human health and biology.\nApplications\nIt is widely utilized in various applications, including genomic sequencing, transcriptome analysis (RNA-Seq), epigenetic profiling (ChIP-Seq), metagenomics, and targeted sequencing. In addition, it plays a crucial role in fields such as oncology, infectious disease research, and personalized medicine.\nData production\nNGS workflows involve key steps, from sample preparation to data analysis. Samples undergo extraction and fragmentation, followed by the addition of unique identifiers, known as library preparation, for multiplexed sequencing. Then, fragments are amplified and sequences in parallel sequencing using state-of-the-art NGS platforms. Subsequent data analysis processes reconstruct the original sequence and identify genetic variations, structural changes, or functional elements. The unique identifiers are specific adapter sequences that allow future identification of individual samples within a multiplexed sequencing run.\n\n\n\n\n\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\n\nDo you ensure that all the data you collect or generate is accompanied by metadata? Have you ever encountered missing information when reading a provided file?\nDo you utilize specific databases or repositories for storing and accessing your research data?\nWhat are the typical data formats you encounter during data processing? 
As outputs of your analysis, what are the common data formats you encounter for visualization or further analysis?\nDo you document and track the workflows you use for data processing and analysis, including the software employed? How do you ensure reproducibility?\n\n\n\n\n\n\n\n\n\n\nThoroughly document your datasets and the experimental setup to ensure reproducibility. Adhering to standards will ensure interoperability. Data types’ examples:\n\nElectronic Laboratory Notebook (ELN): digital description of the experimental design, and measurement devices. ELNs offer features like data entry, text editing, file attachments, collaboration tools, and search capabilities.\nLaboratory protocols: methodologies to prepare and manage samples.\nSamples: refers to the biological material (extraction of DNA, RNA, or proteins). Specification of sample identifier, sample type, source organism, etc.\nSequencing: details on the platform (e.g., Illumina, Oxford Nanopore), library preparation method, coverage, quality control metrics (e.g., Phred score)…\nRaw sequencing data: sequences and quality scores (e.g., FASTQ files)\n\n\n\n\n\n\n\nNote\n\n\n\nA metadata file is crucial during data analysis as it contains information about the experimental conditions (such as sequencing details, treatment, sample type, time points, tissue…).\n\n\n\n\n\nExamples of data types generated during processing:\n\nQuality control metrics: to filter out potential artifacts and ensure the reliability of downstream analyses (e.g., bioinformatics tool like FastQC or MultiQC for results’ aggregation)\nData alignments: in genomics to determine the location of the read in the genome and in transcriptomics to identify gene expression levels.\nDNA analysis results: such as variant calling, genome annotation, functional genomics, phylogenetics, metagenomics, etc. 
Results are usually presented in tabular format.\nRNA Expression analysis results: from differential gene expression, gene ontology (GO) enrichment, alternative splicing, pathway analysis, etc. Results are usually presented in tabular format.\nEpigenetic profiling outputs: to assess gene regulation and chromatin structure (e.g., ChIP-Seq). Usually presented in BED format.\n\nThe interpretation of NGS data relies heavily on the results of data analysis, which are pivotal for understanding the biological significance of the findings and formulating hypotheses for further exploration. Clear and effective visualization methods are crucial for communicating and interpreting the vast amount of information generated by NGS experiments.\n\n\n\n\n\n\nOther types of data: databases and visualizations\n\n\n\n\n\n\n\n\nKnowledge databases\nA knowledge database is a structured repository of biological information that categorizes and annotates genes, proteins, and their functions, facilitating comprehensive understanding and analysis of biological systems. 
Here are five examples of knowledge databases:\n\n\nGene Ontology (GO): A comprehensive resource that classifies gene functions into defined terms, allowing for standardized annotation and comparison of genes across different organisms.\nDisease Ontology: A database that provides structured, standardized terminology for various diseases and their relationships, aiding in the systematic analysis of disease-related data.\nKEGG Pathways: A collection of manually curated pathway maps representing molecular interactions and reaction networks within cells, enabling the interpretation of high-throughput data in the context of biological systems.\nReactome: An open-access database that offers curated descriptions of biological processes, including pathways, reactions, and molecular events, facilitating the interpretation of large-scale biological data.\nUniProt: An extensive protein knowledgebase that provides detailed information about proteins, including their sequences, functions, and related annotations, supporting a wide range of biological research endeavors.\n\n\nVisualizations\n\n\nHeatmaps: frequently used to visualize gene expression patterns, epigenetic modifications, or microbial abundances across samples/conditions.\nVolcano Plots: commonly used in differential gene expression analysis\nGenome Browser Snapshots: display alignments and genomic features in genomic regions (e.g., gene annotations, ChIP-Seq peaks)\nNetwork Visualizations:utilized to explore gene regulatory networks or protein-protein interaction\nGenomic Annotations: to annotate genetic variations (functional impact on genes, genomic regions, or regulatory element)\n\n\n\n\n\n\n\n\n\nBest practices for software and code management (don’t forget to read about FAIR software):\n\nCommenting your code: to enhance readability and comprehension\nMake your source code accessible using a repository (GitHub, GitLab, Bitbucket, SourceForge, etc.) that provides version control (VC) solutions. 
This step is one of the most important ones as version control systems (Git or SVN) track changes in your code over time and enable collaboration and easy version management. Most Danish institutions provide courses on Git/GitHub, check yours! We also highly recommend reading this paper (Perez-Riverol et al. 2016).\nREADME file: with comprehensive information about the project including installation instructions, usage examples or tutorials, licensing details, citation information, etc.\nRegister your code in a research software registry and include a clear and accessible software usage license: enabling other researchers to discover and reuse software packages (alongside metadata). More recommendations here.\nUse domain-relevant community standards to ensure consistency and interoperability (e.g., CodeMeta).\n\n\n\n\n\n\n\nGit and Github courses and other resources\n\n\n\n\n\n\n\n\nUniversity of Copenhagen\nAarhus University\nAalborg University\nDTU Git guidelines Find more resources on the Berkeley Library website\n\n\n\n\n\n\n\n\n\nYou might use standard workflows or generate new ones during data processing and data analysis steps.\n\nCode notebooks: tools for data documentation (e.g. Jupyter Notebook, Rmarkdown) enabling the combination of code with descriptive text and visualizations.\n\nIntegrated development environments (knitr or MLflow).\nPipeline frameworks or workflow management systems: designed to streamline and automate various steps involved in data analysis (data extraction, transformation, validation, visualization, and more). Additionally, they contribute to ensuring interoperability by facilitating seamless integration and interaction between different components or stages. There are two very popular systems, Nextflow and Snakemake.\n\nA great example of community-curated workflows is the nf-core community. 
Nf-core is a collaborative and open-source initiative comprising bioinformaticians and researchers dedicated to developing and maintaining a collection of curated and reproducible Nextflow-based pipelines for NGS data analysis, ensuring standardized and efficient data processing workflows.\n\n\n\n\n\n\nCourse on pipelines and workflows\n\n\n\nTake our course on Reproducible Research Practices LINK\n\n\nClick below to access a list of the most common file formats used when working with NGS data.\n\n\nData types summary\n\n\nSelect appropriate file formats that balance data accessibility, storage efficiency, and compatibility with downstream analysis tools. Standardized file formats facilitate data sharing and collaboration among researchers in the scientific community.\n\nBAM/SAM: stores the alignment information (binary and text-based respectively)\nFASTA: store nucleotide or amino acid sequence, commonly used for reference sequences or assembled contigs.\nGene Transfer Format (GTF) and General Feature Format (GFF): annotates genomic features such as genes, exons, and transcripts.\nAlignment indexes: data structures for efficient and rapid mapping of sequencing reads to a reference.\nVariant Call Format (VCF): stores genetic variation such as single nucleotide variants (SNVs), insertions, deletions, and structural variants (and their position, quality score, etc.)\nCount matrix: quantifies the abundance of RNA transcripts or genomic features across samples\nBED/BEDGraph: represent genomic intervals or coverage information (e.g., peak calling identifies regions of enriched signal intensity)\nWIG/BigWig: store genome-wide data\n\nGeneral formats\n\nTabular formats: File formats like CSV, TSV, and XLSX are used to store data in rows and columns for easy data analysis and sharing\nImage formats: File formats such as PNG and SVG are used to store graphical visualizations, making them easily viewable and shareable\nBinary formats: File formats like NPZ and H5 are used to 
store large datasets, ensuring efficient data access and storage\nJSON: A lightweight data-interchange format for storing hierarchical data structures, commonly used in bioinformatics tools\nHTML: A format used to create interactive reports that include both visualizations and textual descriptions of analysis results\nCode notebooks: Interactive documents combining code, visualizations, and explanatory text, aiding in data analysis reproducibility and documentation\nScripts: Text files containing sets of commands or code instructions for automating data processing and analysis tasks\n\nExplore more data types at the UCSC webpage. Check out this tutorial for more detailed explanations.\n\n\n\n\n\n\nIn this lesson, we have taken a look at the vast and diverse landscape of bioinformatics data.", + "text": "Section Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nNGS data strategies\nFile naming conventions examples", "crumbs": [ "Use cases", "NGS data", - "NGS data strategies" + "Effective RDM Practices in NGS Analysis" ] }, { "objectID": "develop/examples/NGS_management.html#practical-tips-for-computational-research", "href": "develop/examples/NGS_management.html#practical-tips-for-computational-research", - "title": "NGS data strategies", - "section": "", - "text": "Thoroughly document your datasets and the experimental setup to ensure reproducibility. Adhering to standards will ensure interoperability. Data types’ examples:\n\nElectronic Laboratory Notebook (ELN): digital description of the experimental design, and measurement devices. ELNs offer features like data entry, text editing, file attachments, collaboration tools, and search capabilities.\nLaboratory protocols: methodologies to prepare and manage samples.\nSamples: refers to the biological material (extraction of DNA, RNA, or proteins). 
Specification of sample identifier, sample type, source organism, etc.\nSequencing: details on the platform (e.g., Illumina, Oxford Nanopore), library preparation method, coverage, quality control metrics (e.g., Phred score)…\nRaw sequencing data: sequences and quality scores (e.g., FASTQ files)\n\n\n\n\n\n\n\nNote\n\n\n\nA metadata file is crucial during data analysis as it contains information about the experimental conditions (such as sequencing details, treatment, sample type, time points, tissue…).\n\n\n\n\n\nExamples of data types generated during processing:\n\nQuality control metrics: to filter out potential artifacts and ensure the reliability of downstream analyses (e.g., bioinformatics tool like FastQC or MultiQC for results’ aggregation)\nData alignments: in genomics to determine the location of the read in the genome and in transcriptomics to identify gene expression levels.\nDNA analysis results: such as variant calling, genome annotation, functional genomics, phylogenetics, metagenomics, etc. Results are usually presented in tabular format.\nRNA Expression analysis results: from differential gene expression, gene ontology (GO) enrichment, alternative splicing, pathway analysis, etc. Results are usually presented in tabular format.\nEpigenetic profiling outputs: to assess gene regulation and chromatin structure (e.g., ChIP-Seq). Usually presented in BED format.\n\nThe interpretation of NGS data relies heavily on the results of data analysis, which are pivotal for understanding the biological significance of the findings and formulating hypotheses for further exploration. 
Clear and effective visualization methods are crucial for communicating and interpreting the vast amount of information generated by NGS experiments.\n\n\n\n\n\n\nOther types of data: databases and visualizations\n\n\n\n\n\n\n\n\nKnowledge databases\nA knowledge database is a structured repository of biological information that categorizes and annotates genes, proteins, and their functions, facilitating comprehensive understanding and analysis of biological systems. Here are five examples of knowledge databases:\n\n\nGene Ontology (GO): A comprehensive resource that classifies gene functions into defined terms, allowing for standardized annotation and comparison of genes across different organisms.\nDisease Ontology: A database that provides structured, standardized terminology for various diseases and their relationships, aiding in the systematic analysis of disease-related data.\nKEGG Pathways: A collection of manually curated pathway maps representing molecular interactions and reaction networks within cells, enabling the interpretation of high-throughput data in the context of biological systems.\nReactome: An open-access database that offers curated descriptions of biological processes, including pathways, reactions, and molecular events, facilitating the interpretation of large-scale biological data.\nUniProt: An extensive protein knowledgebase that provides detailed information about proteins, including their sequences, functions, and related annotations, supporting a wide range of biological research endeavors.\n\n\nVisualizations\n\n\nHeatmaps: frequently used to visualize gene expression patterns, epigenetic modifications, or microbial abundances across samples/conditions.\nVolcano Plots: commonly used in differential gene expression analysis\nGenome Browser Snapshots: display alignments and genomic features in genomic regions (e.g., gene annotations, ChIP-Seq peaks)\nNetwork Visualizations:utilized to explore gene regulatory networks or protein-protein 
interaction\nGenomic Annotations: to annotate genetic variations (functional impact on genes, genomic regions, or regulatory element)\n\n\n\n\n\n\n\n\n\nBest practices for software and code management (don’t forget to read about FAIR software):\n\nCommenting your code: to enhance readability and comprehension\nMake your source code accessible using a repository (GitHub, GitLab, Bitbucket, SourceForge, etc.) that provides version control (VC) solutions. This step is one of the most important ones as version control systems (Git or SVN) track changes in your code over time and enable collaboration and easy version management. Most Danish institutions provide courses on Git/GitHub, check yours! We also highly recommend reading this paper (Perez-Riverol et al. 2016).\nREADME file: with comprehensive information about the project including installation instructions, usage examples or tutorials, licensing details, citation information, etc.\nRegister your code in a research software registry and include a clear and accessible software usage license: enabling other researchers to discover and reuse software packages (alongside metadata). More recommendations here.\nUse domain-relevant community standards to ensure consistency and interoperability (e.g., CodeMeta).\n\n\n\n\n\n\n\nGit and Github courses and other resources\n\n\n\n\n\n\n\n\nUniversity of Copenhagen\nAarhus University\nAalborg University\nDTU Git guidelines Find more resources on the Berkeley Library website\n\n\n\n\n\n\n\n\n\nYou might use standard workflows or generate new ones during data processing and data analysis steps.\n\nCode notebooks: tools for data documentation (e.g. 
Jupyter Notebook, Rmarkdown) enabling the combination of code with descriptive text and visualizations.\n\nIntegrated development environments (knitr or MLflow).\nPipeline frameworks or workflow management systems: designed to streamline and automate various steps involved in data analysis (data extraction, transformation, validation, visualization, and more). Additionally, they contribute to ensuring interoperability by facilitating seamless integration and interaction between different components or stages. There are two very popular systems, Nextflow and Snakemake.\n\nA great example of community-curated workflows is the nf-core community. Nf-core is a collaborative and open-source initiative comprising bioinformaticians and researchers dedicated to developing and maintaining a collection of curated and reproducible Nextflow-based pipelines for NGS data analysis, ensuring standardized and efficient data processing workflows.\n\n\n\n\n\n\nCourse on pipelines and workflows\n\n\n\nTake our course on Reproducible Research Practices LINK\n\n\nClick below to access a list of the most common file formats used when working with NGS data.\n\n\nData types summary\n\n\nSelect appropriate file formats that balance data accessibility, storage efficiency, and compatibility with downstream analysis tools. 
Standardized file formats facilitate data sharing and collaboration among researchers in the scientific community.\n\nBAM/SAM: stores the alignment information (binary and text-based respectively)\nFASTA: store nucleotide or amino acid sequence, commonly used for reference sequences or assembled contigs.\nGene Transfer Format (GTF) and General Feature Format (GFF): annotates genomic features such as genes, exons, and transcripts.\nAlignment indexes: data structures for efficient and rapid mapping of sequencing reads to a reference.\nVariant Call Format (VCF): stores genetic variation such as single nucleotide variants (SNVs), insertions, deletions, and structural variants (and their position, quality score, etc.)\nCount matrix: quantifies the abundance of RNA transcripts or genomic features across samples\nBED/BEDGraph: represent genomic intervals or coverage information (e.g., peak calling identifies regions of enriched signal intensity)\nWIG/BigWig: store genome-wide data\n\nGeneral formats\n\nTabular formats: File formats like CSV, TSV, and XLSX are used to store data in rows and columns for easy data analysis and sharing\nImage formats: File formats such as PNG and SVG are used to store graphical visualizations, making them easily viewable and shareable\nBinary formats: File formats like NPZ and H5 are used to store large datasets, ensuring efficient data access and storage\nJSON: A lightweight data-interchange format for storing hierarchical data structures, commonly used in bioinformatics tools\nHTML: A format used to create interactive reports that include both visualizations and textual descriptions of analysis results\nCode notebooks: Interactive documents combining code, visualizations, and explanatory text, aiding in data analysis reproducibility and documentation\nScripts: Text files containing sets of commands or code instructions for automating data processing and analysis tasks\n\nExplore more data types at the UCSC webpage. 
Check out this tutorial for more detailed explanations.", + "title": "Effective RDM Practices in NGS Analysis", + "section": "Practical tips for computational research", + "text": "Practical tips for computational research\n\n1. Experiments / raw data\nThoroughly document your datasets and the experimental setup to ensure reproducibility. Adhering to standards will ensure interoperability. Data types’ examples:\n\nElectronic Laboratory Notebook (ELN): digital description of the experimental design, and measurement devices. ELNs offer features like data entry, text editing, file attachments, collaboration tools, and search capabilities.\nLaboratory protocols: methodologies to prepare and manage samples.\nSamples: refers to the biological material (extraction of DNA, RNA, or proteins). Specification of sample identifier, sample type, source organism, etc.\nSequencing: details on the platform (e.g., Illumina, Oxford Nanopore), library preparation method, coverage, quality control metrics (e.g., Phred score)…\nRaw sequencing data: sequences and quality scores (e.g., FASTQ files)\n\n\n\n\n\n\n\nNote\n\n\n\nA metadata file is crucial during data analysis as it contains information about the experimental conditions (such as sequencing details, treatment, sample type, time points, tissue…).\n\n\n\n\n2. Input / Pre- and post-processing data\nExamples of data types generated during processing:\n\nQuality control metrics: to filter out potential artifacts and ensure the reliability of downstream analyses (e.g., bioinformatics tool like FastQC or MultiQC for results’ aggregation)\nData alignments: in genomics to determine the location of the read in the genome and in transcriptomics to identify gene expression levels.\nDNA analysis results: such as variant calling, genome annotation, functional genomics, phylogenetics, metagenomics, etc. 
Results are usually presented in tabular format.\nRNA Expression analysis results: from differential gene expression, gene ontology (GO) enrichment, alternative splicing, pathway analysis, etc. Results are usually presented in tabular format.\nEpigenetic profiling outputs: to assess gene regulation and chromatin structure (e.g., ChIP-Seq). Usually presented in BED format.\n\nThe interpretation of NGS data relies heavily on the results of data analysis, which are pivotal for understanding the biological significance of the findings and formulating hypotheses for further exploration. Clear and effective visualization methods are crucial for communicating and interpreting the vast amount of information generated by NGS experiments.\n\n\n\n\n\n\nOther types of data: databases and visualizations\n\n\n\n\n\n\n\n\nKnowledge databases\nA knowledge database is a structured repository of biological information that categorizes and annotates genes, proteins, and their functions, facilitating comprehensive understanding and analysis of biological systems. 
Here are five examples of knowledge databases:\n\n\nGene Ontology (GO): A comprehensive resource that classifies gene functions into defined terms, allowing for standardized annotation and comparison of genes across different organisms.\nDisease Ontology: A database that provides structured, standardized terminology for various diseases and their relationships, aiding in the systematic analysis of disease-related data.\nKEGG Pathways: A collection of manually curated pathway maps representing molecular interactions and reaction networks within cells, enabling the interpretation of high-throughput data in the context of biological systems.\nReactome: An open-access database that offers curated descriptions of biological processes, including pathways, reactions, and molecular events, facilitating the interpretation of large-scale biological data.\nUniProt: An extensive protein knowledgebase that provides detailed information about proteins, including their sequences, functions, and related annotations, supporting a wide range of biological research endeavors.\n\n\nVisualizations\n\n\nHeatmaps: frequently used to visualize gene expression patterns, epigenetic modifications, or microbial abundances across samples/conditions.\nVolcano Plots: commonly used in differential gene expression analysis\nGenome Browser Snapshots: display alignments and genomic features in genomic regions (e.g., gene annotations, ChIP-Seq peaks)\nNetwork Visualizations:utilized to explore gene regulatory networks or protein-protein interaction\nGenomic Annotations: to annotate genetic variations (functional impact on genes, genomic regions, or regulatory element)\n\n\n\n\n\n\n\n\n3. Software and code:\nBest practices for software and code management (don’t forget to read about FAIR software):\n\nCommenting your code: to enhance readability and comprehension\nMake your source code accessible using a repository (GitHub, GitLab, Bitbucket, SourceForge, etc.) 
that provides version control (VC) solutions. This step is one of the most important ones as version control systems (Git or SVN) track changes in your code over time and enable collaboration and easy version management. Most Danish institutions provide courses on Git/GitHub, check yours! We also highly recommend reading this paper (Perez-Riverol et al. 2016).\nREADME file: with comprehensive information about the project including installation instructions, usage examples or tutorials, licensing details, citation information, etc.\nRegister your code in a research software registry and include a clear and accessible software usage license: enabling other researchers to discover and reuse software packages (alongside metadata). More recommendations here.\nUse domain-relevant community standards to ensure consistency and interoperability (e.g., CodeMeta).\n\n\n\n\n\n\n\nGit and Github courses and other resources\n\n\n\n\n\n\n\n\nUniversity of Copenhagen\nAarhus University\nAalborg University\nDTU Git guidelines Find more resources on the Berkeley Library website\n\n\n\n\n\n\n\n\n4. Pipelines and workflows\nYou might use standard workflows or generate new ones during data processing and data analysis steps.\n\nCode notebooks: tools for data documentation (e.g. Jupyter Notebook, Rmarkdown) enabling the combination of code with descriptive text and visualizations.\n\nIntegrated development environments (knitr or MLflow).\nPipeline frameworks or workflow management systems: designed to streamline and automate various steps involved in data analysis (data extraction, transformation, validation, visualization, and more). Additionally, they contribute to ensuring interoperability by facilitating seamless integration and interaction between different components or stages. There are two very popular systems, Nextflow and Snakemake.\n\nA great example of community-curated workflows is the nf-core community. 
Nf-core is a collaborative and open-source initiative comprising bioinformaticians and researchers dedicated to developing and maintaining a collection of curated and reproducible Nextflow-based pipelines for NGS data analysis, ensuring standardized and efficient data processing workflows.\n\n\n\n\n\n\nCourse on pipelines and workflows\n\n\n\nTake our course on Reproducible Research Practices LINK", "crumbs": [ "Use cases", "NGS data", - "NGS data strategies" + "Effective RDM Practices in NGS Analysis" ] }, { "objectID": "develop/examples/NGS_management.html#wrap-up", "href": "develop/examples/NGS_management.html#wrap-up", - "title": "NGS data strategies", - "section": "", - "text": "In this lesson, we have taken a look at the vast and diverse landscape of bioinformatics data.", + "title": "Effective RDM Practices in NGS Analysis", + "section": "Wrap up", + "text": "Wrap up\nIn this lesson, we have taken a look at the vast and diverse landscape of bioinformatics data.", "crumbs": [ "Use cases", "NGS data", - "NGS data strategies" + "Effective RDM Practices in NGS Analysis" ] }, { @@ -1055,7 +1055,7 @@ "href": "develop/03_DOD.html#template-engine", "title": "3. Data organization and storage", "section": "Template engine", - "text": "Template engine\nSetting up folder structures manually for each new project can be time-consuming. Thankfully, tools like Cookiecutter offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using cruft alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).\n\n\n\n\n\n\nCookiecutter templates\n\n\n\n\nCookiecutter template for Data science projects\nBrickmanlab template for NGS data: similar to the folder structures in the examples above. 
You can download and modify it to suit your needs.\n\n\n\n\nQuick tutorial on cookiecutter\n\n\n\n\n\n\nSandbox Tutorial\n\n\n\nLearn how to create your own template here.\nWe offer workshops on practical RDM for NGS data. Keep an eye on the upcoming events on the Sandbox website.", + "text": "Template engine\nSetting up folder structures manually for each new project can be time-consuming. Thankfully, tools like Cookiecutter offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using cruft alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).\n\n\n\n\n\n\nCookiecutter templates\n\n\n\n\nCookiecutter template for Data science projects\nBrickmanlab template for NGS data: similar to the folder structures in the examples above. You can download and modify it to suit your needs.\n\n\n\n\nQuick tutorial on cookiecutter\n\n\n\n\n\n\nSandbox Tutorial\n\n\n\nLearn how to create your own template here.\nWe offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the Sandbox website.", "crumbs": [ "Course material", "Key practices", @@ -1079,7 +1079,7 @@ "href": "develop/03_DOD.html#naming-conventions", "title": "3. Data organization and storage", "section": "Naming conventions", - "text": "Naming conventions\nConsistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, or datasets are labeled logically, facilitating easy location and comparison of similar data. 
For instance, in fields like genomics, uniform naming conventions for files associated with particular experiments or samples allow for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. Overall, consistent naming promotes efficiency, collaboration, and the integrity of scientific work.\n\n\n\n\n\n\nGeneral tips for file and folder naming\n\n\n\nRemember to keep the folder structure simple.\n\nKeep it short and meaningful (use understandable abbreviations only, e.g., Cor for correlations or LFC for Log Fold Change)\nConsider including one of these elements: project name, category, descriptor, content, author…\n\nAuthor-based: use initials\n\nUse alphanumeric characters: letters (A-Z) and numbers (0-9)\nAvoid special characters: ~!@#$%^&*()`“|\nDate-based format: use YYYYMMDD format (year/month/day format helps with sorting and listing files in chronological order)\nUse underscores and hyphens as delimiters and avoid spaces.\n\nNot all search tools may work well with spaces (messy to indicate paths)\nIf the length is a concern, use capital letters to delimit words (camelCase).\n\nSequential numbering: Use a two-digit format for single-digit numbers (0–9) to ensure correct numerical sequence order (for example, 01 and not 1)\nVersion control: Indicate the version (“V”) or revision (“R”) as the last element, using the two-digit format (e.g., v01, v02)\nWrite down your naming convention pattern and document it in the README file\n\n\n\n\n\n\n\n\n\nDefine your file name conventions\n\n\n\n\n\n\n\nAvoid long and complicated names and ensure your file names are both informative and easy to manage:\n\nFor saving a new plot, a heatmap representing sample correlations\nWhen naming the file for the document containing the Research Data Management Course Objectives (Version 2, 2nd May 2024) from the University of Copenhagen\nConsider the most common file types you work with, such as visualizations, tables, 
etc., and create logical and clear file names\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n\nheatmap_sampleCor_20240101.png\nKU_RDM-objectives_20240502_v02.doc or KU_RDMObj_20240502_v02.doc\n\n\n\n\n\n\n\n\n\n\n\n\n\nAdditional file naming conventions\n\n\n\n\n\n\n\n\n\n\n\nname\ndescription\nnaming_convention\nfile format\nexample\n\n\n\n\n.fastq\nraw sequencing reads\nnan\nnan\nsampleID_run_read1.fastq\n\n\n.fastqc\nquality control from fastqc\nnan\nnan\nsampleID_run_read1.fastqc\n\n\n.bam\naligned reads\nnan\nnan\nsampleID_run_read1.bam\n\n\nGTF\nsequence annotation\nnan\nnan\none of https://www.gencodegenes.org/\n\n\nGFF\nsequence annotation\nnan\nnan\none of https://www.gencodegenes.org/\n\n\n.bed\ngenome locations\nnan\nnan\nnan\n\n\n.bigwig\ngenome coverage\nnan\nnan\nnan\n\n\n.fasta\nsequence data (nucleotide/aminoacid)\nnan\nnan\none of https://www.gencodegenes.org/\n\n\nMultiqc report\nQC aggregated report\n<assayID\\>_YYYYMMDD.multiqc\nmultiqc\nRNA_20200101.multiqc\n\n\nCount matrix\nfinal count matrix\n<assayID\\>_cm_aligner_YYYYMMDD.tsv\ntsv\nRNA_cm_salmon_20200101.tsv\n\n\nDEA\ndifferential expression analysis results\nDEA_<condition1-condition2\\>_LFC<absolute_threshold\\>_p<pvalue decimals\\>_YYYYMMDD.tsv\ntsv\nDEA_treat-untreat_LFC1_p01_20200101.tsv\n\n\nDBA\ndifferential binding analysis results\nDBA_<condition1-condition2\\>_LFC<absolute_threshold\\>_p<pvalue decimals\\>_YYYYMMDD.tsv\ntsv\nDBA_treat-untreat_LFC1_p01_20200101.tsv\n\n\nMAplot\nMA plot\nMAplot_<condition1-condition2\\>_YYYYMMDD.jpeg\njpeg\nMAplot_treat-untreat_20200101.jpeg\n\n\nHeatmap plot\nHeatmap plot of anything\nheatmap_<type\\>_YYYYMMDD.jpeg\njpeg\nheatmap_sampleCor_20200101.jpeg\n\n\nVolcano plot\nVolcano plot\nvolcano_<condition1-condition2\\>_YYYYMMDD.jpeg\njpeg\nvolcano_treat-untreat_20200101.jpeg\n\n\nVenn diagram\nVenn diagram\nvenn_<type\\>_YYYYMMDD.jpeg\njpeg\nvenn_consensus_20200101.jpeg\n\n\nEnrichment table\nEnrichment results\nnan\ntsv\nnan", + "text": "Naming 
conventions\nConsistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, or datasets are labeled logically, facilitating easy location and comparison of similar data. The importance of uniform naming conventions extends to various fields: in genomics or health data science, for example, uniform naming conventions for files associated with particular experiments or samples allow for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. Overall, consistent naming promotes efficiency, collaboration, and the integrity of scientific work.\n\n\n\n\n\n\nGeneral tips for file and folder naming\n\n\n\nRemember to keep the folder structure simple.\n\nKeep it short and meaningful (use understandable abbreviations only, e.g., Cor for correlations or LFC for Log Fold Change)\nConsider including one of these elements: project name, category, descriptor, content, author…\n\nAuthor-based: use initials\n\nUse alphanumeric characters: letters (A-Z) and numbers (0-9)\nAvoid special characters: ~!@#$%^&*()`“|\nDate-based format: use YYYYMMDD format (year/month/day format helps with sorting and listing files in chronological order)\nUse underscores and hyphens as delimiters and avoid spaces.\n\nNot all search tools may work well with spaces (messy to indicate paths)\nIf the length is a concern, use capital letters to delimit words (camelCase).\n\nSequential numbering: Use a two-digit format for single-digit numbers (0–9) to ensure correct numerical sequence order (for example, 01 and not 1, if your sequence only goes up to 99)\nVersion control: Indicate the version (“V”) or revision (“R”) as the last element, using the two-digit format (e.g., v01, v02)\nWrite down your naming convention pattern and document it in the README file\n\n\n\n\n\n\n\n\n\nCreate your own naming 
conventions\n\n\n\n\n\n\n\nConsider the most common types of files and folders you will be working with, such as visualizations, results tables, and processed files. Develop a logical and clear naming system for these files based on the tips provided above. Aim for concise and straightforward names to avoid complexity.\n\n\n\n\n\nTo learn more about naming conventions for NGS analysis and see additional examples, click here.", "crumbs": [ "Course material", "Key practices", @@ -1434,7 +1434,7 @@ "href": "develop/04_metadata.html#documentation-and-metadata", "title": "4. Documentation for biodata", "section": "Documentation and metadata", - "text": "Documentation and metadata\nEssential documentation comes in different forms and flavors, serving various purposes in research. Examples include protocols outlining experimental procedures, detailed lab journals recording experimental conditions and observations, codebooks explaining concepts, variables, and abbreviations used in the analysis, information about the structure and content of a dataset, software installation, and usage manual, code explanation within files or methodological information outlining data processing steps.\n From ontotext.com\nMetadata provides essential context and structure to (primary) data, enabling researchers to understand its significance and facilitate efficient data management. Some common elements found in metadata for bioinformatics data include:\n\nSample information and collection details\nExperimental conditions\nData processing steps applied to the raw data\nAnnotation and Ontology terms\nFile metadata (file type, file format, etc.)\nEthical and Legal Compliance\n\nMetadata serves as a crucial guide in navigating the complex landscape of data, akin to a cheat sheet for piecing together the puzzle of information. 
Much like identifying puzzle pieces, metadata provides essential details about data origin, structure, and context, such as sample collection details, experimental procedures, and equipment used. Metadata enables data exploration, interpretation, and future accessibility, promoting effective management and facilitating data usability and reuse.\n\n\n\n\n\n\nBenefits of collecting proper metadata\n\n\n\n\nData Context and Interpretation: Aiding in understanding experimental conditions, sample origins, and processing methods, is crucial for accurate results interpretation.\nData Discovery and Access: Metadata enables easy locating and accessing of specific datasets by quickly identifying relevant data through sample identifiers, experimental parameters, and timestamps.\nReproducibility and Collaboration: Metadata facilitates experiment replication and validation by enabling colleagues to reproduce analyses, compare results, and collaborate effectively, enhancing the integrity of scientific findings.\nQuality Control and Validation: Metadata supports data quality assessment by tracking the origin and handling of NGS data, allowing the identification of errors or biases to validate analysis accuracy and reliability.\nLong-Term Data Preservation: metadata ensures preservation over time, facilitating future understanding and utilization of archived datasets for continued scientific impact as research progresses.\n\n\n\n\nStreamlining Metadata Collection\nData and project directories should both include metadata and a README file.\n\n\n\n\n\n\nPractical tips\n\n\n\n\nImplement a logical structure with clear and descriptive file names.\nUse of controlled vocabularies and ontologies to ensure consistency and efficient data management and interpretation.\nUse a repository and a versioning system\nMake it Machine-readable, -actionable, and -interpretable.\nDevelop standards further within your research environment FAIRsharing standards.\nInclude all information for others to 
comprehend and effectively utilize the data.\n\n\n\n\n\nREADME.md\nThe README.md file, written in markdown format, provides a detailed description of the folder’s content. It includes information such as the purpose of the data, collection methods, and relevant details. The content might differ based on the purpose of the data.\n\n\n\n\n\n\nExercise 1: Identify README.md key components.\n\n\n\n\n\n\n\nSelect one of the examples below and reflect on how effectively the README communicates important information about the project. Please note that some of the links lead to README files describing databases, while others pertain to software and tools.\n\n1000 Genomes Project. You will find several readme files here.\n\nHomo Sapiens, fasta GRCh38\nIPD-IMGT/HLA Database\nDocker\nPython pandas\n\n\n\n\n\n\nStructure for bioinformatics projects.\n\nDescription of the project\nObjectives and aims\nDatasets and software requirements\nInstruction for data interpretation\nSummary of results\nContributions\nAdditional comments or notes\n\n\n\nmetadata.yml\nMetadata can be written in many file formats (commonly used: YAML, TXT, JSON, and CSV). We recommend YAML format, which is a text document that contains data formatted using a human-readable data format for data serialization. The content will be specific to the type of project.\nmetadata:\n project: \"Title\"\n author: \"Name\"\n date: \"YYYYMMDD\"\n description: \"Project short description\"\n version: \"1.0\"\n analysis:\n tool: \"software\"\n version: \"1.1.1\"\nSome general metadata fields used across different disciplines:\n\nProject Title: A concise and informative name for the dataset.\nAuthor(s): The individual(s) or organization responsible for creating the dataset. 
Include ORCID for identification.\nDate Created: The date when the dataset was originally generated or compiled, in YYYY-MM-DD format.\nDate Modified: The date when the dataset was last updated or modified (YYYY-MM-DD).\nObject ID: The project or assay ID for tracking and reference purposes.\nDescription: A short narrative explaining the content, purpose, and context of the project.\nKeywords: Descriptive terms or phrases capturing the main topics and attributes.\nEthical and Legal Considerations: Information about ethical approvals, consent, and any legal restrictions.\nVersion: The version number or identifier, useful for tracking changes.\nRelated Publications: Links or references to scientific publications associated with the folder. Always add the DOI.\nFunding Source: Details about the funding agency or source that supported the research or data generation.\nLicense: The type of license or terms of use associated with the dataset/project.\nContact Information: Contact details for individuals who can provide further information about the dataset/project.\n\n\n\n\n\n\n\nTip\n\n\n\nThere is an exercise in the practical material to streamline the creation of metadata files using Cookiecutter, a template-based scaffolding tool.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nCreate a metadata file with the following description fields: name, date, description, version, authors, keywords, license. Fill it up at the start of the project, when you generate the file structure.", + "text": "Documentation and metadata\nEssential documentation comes in different forms and flavors, serving various purposes in research. 
Examples include protocols outlining experimental procedures, detailed lab journals recording experimental conditions and observations, codebooks explaining concepts, variables, and abbreviations used in the analysis, information about the structure and content of a dataset, software installation, and usage manual, code explanation within files or methodological information outlining data processing steps.\n From ontotext.com\nMetadata provides essential context and structure to (primary) data, enabling researchers to understand its significance and facilitate efficient data management. Some common elements found in metadata for bioinformatics data include:\n\nSample information and collection details\nExperimental conditions\nData processing steps applied to the raw data\nAnnotation and Ontology terms\nFile metadata (file type, file format, etc.)\nEthical and Legal Compliance\n\nMetadata serves as a crucial guide in navigating the complex landscape of data, akin to a cheat sheet for piecing together the puzzle of information. Much like identifying puzzle pieces, metadata provides essential details about data origin, structure, and context, such as sample collection details, experimental procedures, and equipment used. 
Metadata enables data exploration, interpretation, and future accessibility, promoting effective management and facilitating data usability and reuse.\n\n\n\n\n\n\nBenefits of collecting proper metadata\n\n\n\n\nData Context and Interpretation: Aiding in understanding experimental conditions, sample origins, and processing methods, is crucial for accurate results interpretation.\nData Discovery and Access: Metadata enables easy locating and accessing of specific datasets by quickly identifying relevant data through sample identifiers, experimental parameters, and timestamps.\nReproducibility and Collaboration: Metadata facilitates experiment replication and validation by enabling colleagues to reproduce analyses, compare results, and collaborate effectively, enhancing the integrity of scientific findings.\nQuality Control and Validation: Metadata supports data quality assessment by tracking the origin and handling of NGS data, allowing the identification of errors or biases to validate analysis accuracy and reliability.\nLong-Term Data Preservation: metadata ensures preservation over time, facilitating future understanding and utilization of archived datasets for continued scientific impact as research progresses.\n\n\n\n\nStreamlining Metadata Collection\nData and project directories should both include metadata and a README file.\n\n\n\n\n\n\nPractical tips\n\n\n\n\nImplement a logical structure with clear and descriptive file names.\nUse of controlled vocabularies and ontologies to ensure consistency and efficient data management and interpretation.\nUse a repository and a versioning system\nMake it Machine-readable, -actionable, and -interpretable.\nDevelop standards further within your research environment FAIRsharing standards.\nInclude all information for others to comprehend and effectively utilize the data.\n\n\n\n\n\nREADME.md\nThe README.md file, written in markdown format, provides a detailed description of the folder’s content. 
It includes information such as the purpose of the data, collection methods, and relevant details. The content might differ based on the purpose of the data.\n\n\n\n\n\n\nExercise 1: Identify README.md key components.\n\n\n\n\n\n\n\nSelect one of the examples below and reflect on how effectively the README communicates important information about the project. Please note that some of the links lead to README files describing databases, while others pertain to software and tools.\n\n1000 Genomes Project. You will find several readme files here.\n\nHomo sapiens, fasta GRCh38\nIPD-IMGT/HLA Database\nDocker\nPython pandas\n\n\n\n\n\n\nStructure for bioinformatics projects.\n\nDescription and relevance of the project\nObjectives and aims\nDatasets and software requirements\nInstructions for data interpretation\nSummary of results\nContributions\nAdditional comments or notes\n\n\n\nmetadata.yml\nMetadata can be written in many file formats (commonly used: YAML, TXT, JSON, and CSV). We recommend YAML, a human-readable data serialization format stored as plain text. The content will be specific to the type of project.\nmetadata:\n project: \"Title\"\n author: \"Name\"\n date: \"YYYYMMDD\"\n description: \"Project short description\"\n version: \"1.0\"\n analysis:\n tool: \"software\"\n version: \"1.1.1\"\nSome general metadata fields used across different disciplines:\n\nProject Title: A concise and informative name for the dataset.\nAuthor(s): The individual(s) or organization responsible for creating the dataset. 
Include ORCID for identification.\nDate Created: The date when the dataset was originally generated or compiled, in YYYY-MM-DD format.\nDate Modified: The date when the dataset was last updated or modified (YYYY-MM-DD).\nObject ID: The project or assay ID for tracking and reference purposes.\nDescription: A short narrative explaining the content, purpose, and context of the project.\nKeywords: Descriptive terms or phrases capturing the main topics and attributes.\nEthical and Legal Considerations: Information about ethical approvals, consent, and any legal restrictions.\nVersion: The version number or identifier, useful for tracking changes.\nRelated Publications: Links or references to scientific publications associated with the folder. Always add the DOI.\nFunding Source: Details about the funding agency or source that supported the research or data generation.\nLicense: The type of license or terms of use associated with the dataset/project.\nContact Information: Contact details for individuals who can provide further information about the dataset/project.\n\n\n\n\n\n\n\nTip\n\n\n\nThere is an exercise in the practical material to streamline the creation of metadata files using Cookiecutter, a template-based scaffolding tool.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nCreate a metadata file with the following description fields: name, date, description, version, authors, keywords, license. Fill it up at the start of the project, when you generate the file structure.", "crumbs": [ "Course material", "Key practices", @@ -1446,7 +1446,7 @@ "href": "develop/04_metadata.html#controlled-vocabularies-and-ontologies", "title": "4. Documentation for biodata", "section": "Controlled vocabularies and ontologies", - "text": "Controlled vocabularies and ontologies\nResearchers encountering inconsistent and non-standardized terms (e.g., gene names, disease names, cell types, protein domains, etc.) across datasets may face challenges in data integration. 
Thus, requiring additional curation time to enable meaningful comparisons. Standardized vocabularies streamline integration, improving consistency and comparability in analysis. Leveraging widely accepted ontologies in the documentation ensures consistent capture of experiment details in metadata fields, aiding data interpretation.\n\n\n\n\n\n\nExamples of ontology services\n\n\n\n\nUberon anatomy ontology\nGene ontology\nEnsembl gene IDs.\nMedical Subject Headings (MeSH)\nChemical Entities of Biological Interest\nMicroarray Gene Expression Society Ontology (MGED)\n\n\n\n\n\n\n\n\n\nOntology definition\n\n\n\n\n\n\n\nAn ontology is a structured framework representing concepts, attributes, and relationships within a specific domain, aiding knowledge organization and integration. Employing standardized vocabularies, it facilitates effective communication and reasoning between humans and computers. Ontologies are crucial for knowledge representation, data integration, and semantic interoperability, enhancing understanding and collaboration across complex domains.\n\n\n\n\n\nStandardization improves data discoverability and interoperability, enabling robust analysis, accelerating knowledge sharing, and facilitating cross-study comparisons. Ontologies act as universal translators, fostering harmonious data interpretation and collaboration across scientific disciplines.\nYou can find three examples of metadata tailored for different purposes NGS data examples: sample metadata, project metadata, and experimental metadata. We suggest exploring controlled vocabularies and metadata standards within your field and seeking additional specialized sources. You will find a few sources at the end of the page.", + "text": "Controlled vocabularies and ontologies\nResearchers encountering inconsistent and non-standardized terms (e.g., gene names, disease names, cell types, protein domains, etc.) across datasets may face challenges in data integration. 
This requires additional curation time to enable meaningful comparisons. Standardized vocabularies streamline integration, improving consistency and comparability in analysis. Leveraging widely accepted ontologies in the documentation ensures consistent capture of experiment details in metadata fields, aiding data interpretation.\n\n\n\n\n\n\nExamples of ontology services\n\n\n\n\nUberon anatomy ontology\nGene ontology\nEnsembl gene IDs\nMedical Subject Headings (MeSH)\nChemical Entities of Biological Interest\nMicroarray Gene Expression Society Ontology (MGED)\nNCBI taxonomy\nMondo disease database\n\n\n\n\n\n\n\n\n\nOntology definition\n\n\n\n\n\n\n\nAn ontology is a structured framework representing concepts, attributes, and relationships within a specific domain, aiding knowledge organization and integration. Employing standardized vocabularies, it facilitates effective communication and reasoning between humans and computers. Ontologies are crucial for knowledge representation, data integration, and semantic interoperability, enhancing understanding and collaboration across complex domains.\n\n\n\n\n\nStandardization improves data discoverability and interoperability, enabling robust analysis, accelerating knowledge sharing, and facilitating cross-study comparisons. Ontologies act as universal translators, fostering harmonious data interpretation and collaboration across scientific disciplines.\nYou can find three examples of metadata tailored for different purposes in the NGS data examples: sample metadata, project metadata, and experimental metadata. We suggest exploring controlled vocabularies and metadata standards within your field and seeking additional specialized sources. 
You will find a few sources at the end of the page.", "crumbs": [ "Course material", "Key practices", @@ -1458,7 +1458,7 @@ "href": "develop/examples/NGS_metadata.html", "title": "NGS Assay and Project metadata", "section": "", - "text": "Section Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nDevelop your metadata\n\n\n\nYou should consider revisiting these examples after completing lesson 4 in the course material. Please review these three tables containing pre-filled data fields for metadata, each serving distinct purposes: sample metadata, project metadata, and experimental metadata.\n\nSample metadata fields\nSome details might be specific to your samples. For example, which samples are treated, which are controlled, which tissue they come from, which cell type, the age, etc. Here is a list of possible metadata fields that you can use:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nsample\nName of the sample\nNA\nNA\ncontrol_rep1, treat_rep1\n\n\nfastq_1\nPath to fastq file 1\nNA\nNA\nAEG588A1_S1_L002_R1_001.fastq.gz\n\n\nfastq_2\nPath to paired fastq file, if it is a paired experiment\nNA\nNA\nAEG588A1_S1_L002_R2_001.fastq.gz\n\n\nstrandedness\nThe strandedness of the cDNA library\n<unstranded OR forward OR reverse \\>\nNA\nunstranded\n\n\ncondition\nVariable of interest of the experiment, such as \"control\", \"treatment\", etc\nwordWord\ncamelCase\ncontrol, treat1, treat2\n\n\ncell_type\nThe cell type(s) known or selected to be present in the sample\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\ntissue\nThe tissue from which the sample was taken\nNA\nUberon\nNA\n\n\nsex\nThe biological/genetic sex of the sample\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\ncell_line\nCell line of the sample\nNA\nontology field- e.g. 
EFO or OBI\nNA\n\n\norganism\nOrganism origin of the sample\n<Genus species>\nTaxonomy\nMus musculus\n\n\nreplicate\nReplicate number\n<integer\\>\nNA\n1\n\n\nbatch\nBatch information\nwordWord\ncamelCase\n1\n\n\ndisease\nAny diseases that may affect the sample\nNA\nDisease Ontology or MONDO\nNA\n\n\ndevelopmental_stage\nThe developmental stage of the sample\nNA\nNA\nNA\n\n\nsample_type\nThe type of the collected specimen, eg tissue biopsy, blood draw or throat swab\nNA\nNA\nNA\n\n\nstrain\nStrain of the species from which the sample was collected, if applicable\nNA\nontology field - e.g. NCBITaxonomy\nNA\n\n\ngenetic variation\nAny relevant genetic differences from the specimen or sample to the expected genomic information for this species, eg abnormal chromosome counts, major translocations or indels\nNA\nNA\nNA\n\n\n\n\n\n\n\n\n\n\nProject metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Project folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nproject\nProject ID\n<surname\\>_et_al_2023\nNA\nproks_et_al_2023\n\n\nauthor\nOwner of the project\n<First name\\> <Surname\\>\nNA\nMartin Proks\n\n\ndate\nDate of creation\nYYYYMMDD\nNA\n20230101\n\n\ndescription\nShort description of the project\nPlain text\nNA\nThis is a project describing the effect of Oct4 perturbation after pERK activation\n\n\n\n\n\n\n\n\n\n\nAssay metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nassay_ID\nIdentifier for the assay that is at least unique within the project\n<Assay-ID\\>_<keyword\\>_YYYYMMDD\nNA\nCHIP_Oct4_20200101\n\n\nassay_type\nThe type of experiment performed, eg ATAC-seq or seqFISH\nNA\nontology field- e.g. 
EFO or OBI\nChIPseq\n\n\nassay_subtype\nMore specific type or assay like bulk nascent RNAseq or single cell ATACseq\nNA\nontology field- e.g. EFO or OBI\nbulk ChIPseq\n\n\nowner\nOwner of the assay (who made the experiment?).\n<First Name\\> <Last Name\\>\nNA\nJose Romero\n\n\nplatform\nThe type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform\nNA\nontology field- e.g. EFO or OBI\nIllumina\n\n\nextraction_method\nTechnique used to extract the nucleic acid from the cell\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\nlibrary_method\nTechnique used to amplify a cDNA library\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\nexternal_accessions\nAccession numbers from external resources to which assay or protocol information was submitted\nNA\neg protocols.io, AE, GEO accession number, etc\nGSEXXXXX\n\n\nkeyword\nKeyword for easy identification\nwordWord\ncamelCase\nOct4ChIP\n\n\ndate\nDate of assay creation\nYYYYMMDD\nNA\n20200101\n\n\nnsamples\nNumber of samples analyzed in this assay\n<integer\\>\nNA\n9\n\n\nis_paired\nPaired fastq files or not\n<single OR paired\\>\nNA\nsingle\n\n\npipeline\nPipeline used to process data and version\nNA\nNA\nnf-core/chipseq -r 1.0\n\n\nstrandedness\nThe strandedness of the cDNA library\n<+ OR - OR *\\>\nNA\n*\n\n\nprocessed_by\nWho processed the data\n<First Name\\> <Last Name\\>\nNA\nSarah Lundregan\n\n\norganism\nOrganism origin\n<Genus species\\>\nTaxonomy name\nMus musculus\n\n\norigin\nIs internal or external (from a public resources) data\n<internal OR external\\>\nNA\ninternal\n\n\npath\nPath to files\n</path/to/file\\>\nNA\nNA\n\n\nshort_desc\nShort description of the assay\nplain text\nNA\nOct4 ChIP after pERK activation\n\n\nELN_ID\nID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling\nplain text\nNA\nNA\n\n\n\n\n\n\n\n\nThe metadata must include key details such as the project’s short description, author information, creation date, 
experimental protocol, assay ID, assay type, platform utilized, library details, keywords, sample count, paired-end status, processor information, organism studied, sample origin, and file path.\nIf you would create a database from the metadata files, your table should look like this (each row corresponding to one project):\n\n\n\n\n\n\n\n\n\nassay_ID\nassay_type\nassay_subtype\nowner\nplatform\nextraction_method\nlibrary_method\nexternal_accessions\nkeyword\ndate\nnsamples\nis_paired\npipeline\nstrandedness\nprocessed_by\norganism\norigin\npath\nshort_desc\nELN_ID\n\n\n\n\nRNA_oct4_20200101\nRNAseq\nbulk RNAseq\nSarah Lundregan\nNextSeq 2000\nNA\nNA\nNA\noct4\n20200101\n9\npaired\nnf-core/chipseq 2.3.1\n*\nSL\nMus musculus\ninternal\nNA\nBulk RNAseq of Oct4 knockout\n234\n\n\nCHIP_oct4_20200101\nChIPseq\nbulk ChIPseq\nJose Romero\nNextSeq 2000\nNA\nNA\nNA\noct4\n20200101\n9\nsingle\nnf-core/rnaseq 3.12.0\n*\nJARH\nMus musculus\ninternal\nNA\nBulk ChIPseq of Oct4 overexpression\n123\n\n\nCHIP_med1_20190204\nChIPseq\nbulk ChIPseq\nMartin Proks\nNextSeq 2000\nNA\nNA\nNA\nmed1\n20190204\n12\nsingle\nnf-core/rnaseq 3.12.0\n*\nMP\nMus musculus\ninternal\nNA\nBulk ChIPseq of Med1 overexpression\n345\n\n\nSCR_humanSkin_20210302\nRNAseq\nsingle cell RNAseq\nJose Romero\nNextSeq 2000\nNA\nNA\nNA\nhumanSkin\n20210302\n23123\npaired\nnf-core/scrnaseq 1.8.2\n*\nJARH\nHomo sapiens\nexternal\nNA\nscRNAseq analysis of human skin development\nNA\n\n\nSCR_humanBrain_20220610\nRNAseq\nsingle cell RNAseq\nMartin Proks\nNextSeq 2000\nNA\nNA\nNA\nhumanBrain\n20220610\n1234\npaired\ncustom\n*\nMP\nHomo sapiens\nexternal\nNA\nscRNAseq analysis of human brain development\nNA\n\n\n\n\n\n\n\n\n\n\n\n\nCopyrightCC-BY-SA 4.0 license", + "text": "Section Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nDevelop your metadata\n\n\n\nYou should consider revisiting these examples after completing lesson 4 in the course material. 
Please review these three tables containing pre-filled data fields for metadata, each serving distinct purposes: sample metadata, project metadata, and experimental metadata.\n\nProject metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Project folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nproject\nProject ID\n<surname\\>_et_al_2023\nNA\nproks_et_al_2023\n\n\nauthor\nOwner of the project\n<First name\\> <Surname\\>\nNA\nMartin Proks\n\n\ndate\nDate of creation\nYYYYMMDD\nNA\n20230101\n\n\ndescription\nShort description of the project\nPlain text\nNA\nThis is a project describing the effect of Oct4 perturbation after pERK activation\n\n\n\n\n\n\n\n\n\n\nSample metadata fields\nSome details might be specific to your samples. For example, which samples are treated, which are controlled, which tissue they come from, which cell type, the age, etc. Here is a list of possible metadata fields that you can use:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nsample\nName of the sample\nNA\nNA\ncontrol_rep1, treat_rep1\n\n\nfastq_1\nPath to fastq file 1\nNA\nNA\nAEG588A1_S1_L002_R1_001.fastq.gz\n\n\nfastq_2\nPath to paired fastq file, if it is a paired experiment\nNA\nNA\nAEG588A1_S1_L002_R2_001.fastq.gz\n\n\nstrandedness\nThe strandedness of the cDNA library\n<unstranded OR forward OR reverse \\>\nNA\nunstranded\n\n\ncondition\nVariable of interest of the experiment, such as \"control\", \"treatment\", etc\nwordWord\ncamelCase\ncontrol, treat1, treat2\n\n\ncell_type\nThe cell type(s) known or selected to be present in the sample\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\ntissue\nThe tissue from which the sample was taken\nNA\nUberon\nNA\n\n\nsex\nThe biological/genetic sex of the sample\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\ncell_line\nCell line of the sample\nNA\nontology field- e.g. 
EFO or OBI\nNA\n\n\norganism\nOrganism origin of the sample\n<Genus species>\nTaxonomy\nMus musculus\n\n\nreplicate\nReplicate number\n<integer\\>\nNA\n1\n\n\nbatch\nBatch information\nwordWord\ncamelCase\n1\n\n\ndisease\nAny diseases that may affect the sample\nNA\nDisease Ontology or MONDO\nNA\n\n\ndevelopmental_stage\nThe developmental stage of the sample\nNA\nNA\nNA\n\n\nsample_type\nThe type of the collected specimen, eg tissue biopsy, blood draw or throat swab\nNA\nNA\nNA\n\n\nstrain\nStrain of the species from which the sample was collected, if applicable\nNA\nontology field - e.g. NCBITaxonomy\nNA\n\n\ngenetic variation\nAny relevant genetic differences from the specimen or sample to the expected genomic information for this species, eg abnormal chromosome counts, major translocations or indels\nNA\nNA\nNA\n\n\n\n\n\n\n\n\n\n\nAssay metadata fields\nHere you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:\n\n\n\n\n\n\n\n\n\nMetadata field\nDefinition\nFormat\nOntology\nExample\n\n\n\n\nassay_ID\nIdentifier for the assay that is at least unique within the project\n<Assay-ID\\>_<keyword\\>_YYYYMMDD\nNA\nCHIP_Oct4_20200101\n\n\nassay_type\nThe type of experiment performed, eg ATAC-seq or seqFISH\nNA\nontology field- e.g. EFO or OBI\nChIPseq\n\n\nassay_subtype\nMore specific type or assay like bulk nascent RNAseq or single cell ATACseq\nNA\nontology field- e.g. EFO or OBI\nbulk ChIPseq\n\n\nowner\nOwner of the assay (who made the experiment?).\n<First Name\\> <Last Name\\>\nNA\nJose Romero\n\n\nplatform\nThe type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform\nNA\nontology field- e.g. EFO or OBI\nIllumina\n\n\nextraction_method\nTechnique used to extract the nucleic acid from the cell\nNA\nontology field- e.g. EFO or OBI\nNA\n\n\nlibrary_method\nTechnique used to amplify a cDNA library\nNA\nontology field- e.g. 
EFO or OBI\nNA\n\n\nexternal_accessions\nAccession numbers from external resources to which assay or protocol information was submitted\nNA\neg protocols.io, AE, GEO accession number, etc\nGSEXXXXX\n\n\nkeyword\nKeyword for easy identification\nwordWord\ncamelCase\nOct4ChIP\n\n\ndate\nDate of assay creation\nYYYYMMDD\nNA\n20200101\n\n\nnsamples\nNumber of samples analyzed in this assay\n<integer\\>\nNA\n9\n\n\nis_paired\nPaired fastq files or not\n<single OR paired\\>\nNA\nsingle\n\n\npipeline\nPipeline used to process data and version\nNA\nNA\nnf-core/chipseq -r 1.0\n\n\nstrandedness\nThe strandedness of the cDNA library\n<+ OR - OR *\\>\nNA\n*\n\n\nprocessed_by\nWho processed the data\n<First Name\\> <Last Name\\>\nNA\nSarah Lundregan\n\n\norganism\nOrganism origin\n<Genus species\\>\nTaxonomy name\nMus musculus\n\n\norigin\nIs internal or external (from a public resources) data\n<internal OR external\\>\nNA\ninternal\n\n\npath\nPath to files\n</path/to/file\\>\nNA\nNA\n\n\nshort_desc\nShort description of the assay\nplain text\nNA\nOct4 ChIP after pERK activation\n\n\nELN_ID\nID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling\nplain text\nNA\nNA\n\n\n\n\n\n\n\n\nThe metadata must include key details such as the project’s short description, author information, creation date, experimental protocol, assay ID, assay type, platform utilized, library details, keywords, sample count, paired-end status, processor information, organism studied, sample origin, and file path.\nIf you would create a database from the metadata files, your table should look like this (each row corresponding to one project):\n\n\n\n\n\n\n\n\n\nassay_ID\nassay_type\nassay_subtype\nowner\nplatform\nextraction_method\nlibrary_method\nexternal_accessions\nkeyword\ndate\nnsamples\nis_paired\npipeline\nstrandedness\nprocessed_by\norganism\norigin\npath\nshort_desc\nELN_ID\n\n\n\n\nRNA_oct4_20200101\nRNAseq\nbulk RNAseq\nSarah Lundregan\nNextSeq 
2000\nNA\nNA\nNA\noct4\n20200101\n9\npaired\nnf-core/chipseq 2.3.1\n*\nSL\nMus musculus\ninternal\nNA\nBulk RNAseq of Oct4 knockout\n234\n\n\nCHIP_oct4_20200101\nChIPseq\nbulk ChIPseq\nJose Romero\nNextSeq 2000\nNA\nNA\nNA\noct4\n20200101\n9\nsingle\nnf-core/rnaseq 3.12.0\n*\nJARH\nMus musculus\ninternal\nNA\nBulk ChIPseq of Oct4 overexpression\n123\n\n\nCHIP_med1_20190204\nChIPseq\nbulk ChIPseq\nMartin Proks\nNextSeq 2000\nNA\nNA\nNA\nmed1\n20190204\n12\nsingle\nnf-core/rnaseq 3.12.0\n*\nMP\nMus musculus\ninternal\nNA\nBulk ChIPseq of Med1 overexpression\n345\n\n\nSCR_humanSkin_20210302\nRNAseq\nsingle cell RNAseq\nJose Romero\nNextSeq 2000\nNA\nNA\nNA\nhumanSkin\n20210302\n23123\npaired\nnf-core/scrnaseq 1.8.2\n*\nJARH\nHomo sapiens\nexternal\nNA\nscRNAseq analysis of human skin development\nNA\n\n\nSCR_humanBrain_20220610\nRNAseq\nsingle cell RNAseq\nMartin Proks\nNextSeq 2000\nNA\nNA\nNA\nhumanBrain\n20220610\n1234\npaired\ncustom\n*\nMP\nHomo sapiens\nexternal\nNA\nscRNAseq analysis of human brain development\nNA\n\n\n\n\n\n\n\n\n\n\nSources\n\nTranscriptomics metadata standards and fields\nBiological ontologies for data scientists,Bionty\n\n\n\n\n\nCopyrightCC-BY-SA 4.0 license", "crumbs": [ "Use cases", "NGS data", @@ -1601,6 +1601,67 @@ "href": "develop/practical_workshop.html#organize-and-structure-your-datasets-and-data-analysis", "title": "Practical material", "section": "1. Organize and structure your datasets and data analysis", - "text": "1. Organize and structure your datasets and data analysis\nEstablishing a consistent file structure and naming conventions will help you efficiently manage your data. 
We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:\n\nData folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.\nProject folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.\n\nData and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nWhen organizing your data folders, separate assays from external resources and maintain a consistent structure. For example, organize genome references by species and further categorize them by versions. 
Make sure to include all relevant information, and refer to this lesson for additional tips on data organization.\nThis will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!\n\n\n\n\n\n\nData folders\nWhether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nUse an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq) a keyword (2) that represents a unique element to that assay, and the date (3).\n<Assay-ID>_<keyword>_YYYYMMDD\nFor example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye.\n\n\n\n\n\nLet’s explore a potential folder structure and the types of files you might encounter within it.\n<data_type>_<keyword>_YYYYMMDD/\n├── README.md \n├── CHECKSUMS\n├── pipeline\n ├── pipeline.md\n ├── scripts/\n├── processed\n ├── fastqc/\n ├── multiqc/\n ├── final_fastq/\n└── raw\n ├── .fastq.gz \n └── samplesheet.csv\n\nREADME.md: This file contains a detailed description of the dataset commonly in markdown format. 
It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).\nmetadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.\npipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.\nprocessed: This folder contains the results from the preprocessing pipeline. The content vary depending on the specific pipeline used (create additional subdirectories as needed).\nraw: This folder holds the raw data.\n\n.fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.\nsamplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.\n\n\n\n\nProject folders\nOn the other hand, we have another type of folder called Projects which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. 
This helps ensure that the analyses can be easily replicated and shared with others.\nThe Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:\n<project>_<keyword>_YYYYMMDD\n\n\n\n\n\n\nNaming examples\n\n\n\n\n\n\n\n\nRNASeq_Mouse_Brain_20230512: a project RNA sequencing data from a mouse brain experiment, created on May 12, 2023\nEHR_COVID19_Study_20230115: a project around electronic health records data for a COVID-19 study, created on January 15, 2023.\n\n\n\n\n\n\nNow, let’s explore an example of a folder structure and the types of files you might encounter within it.\n<project>_<keyword>_YYYYMMDD\n├── data\n│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link\n├── documents\n│ └── research_project_template.docx\n├── metadata.yml\n├── notebooks\n│ └── 01_data_processing.rmd\n│ └── 02_data_analysis.rmd\n│ └── 03_data_visualization.rmd\n├── README.md\n├── reports\n│ └── 01_data_processing.html\n│ └── 02_data_analysis.html\n│ ├── 03_data_visualization.html\n│ │ └── figures\n│ │ └── tables\n├── requirements.txt // env.yaml\n├── results\n│ ├── figures\n│ │ └── 02_data_analysis/\n│ │ └── heatmap_sampleCor_20230102.png\n│ ├── tables\n│ │ └── 02_data_analysis/\n│ │ └── DEA_treat-control_LFC1_p01.tsv\n│ │ └── SumStats_sampleCor_20230102.tsv\n├── pipeline\n│ ├── rules // processes \n│ │ └── step1_data_processing.smk\n│ └── pipeline.md\n├── scratch\n└── scripts\n\ndata: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.\ndocuments: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. 
It also includes the Data Management Plan.\n\nresearch_project_template.docx. If you download our template you will find a is a pre-filled Data Management Plan based on the Horizon Europe guidelines named ‘Non-sensitive_NGS_research_project_template.docx’.\n\nmetadata.yml: metadata file describing various keys of the project or experiment (see this lesson).\nnotebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.\nREADME.md: A detailed project description in markdown or plain-text format.\nreports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.\n\nfigures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.\n\nrequirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.\nresults: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.\npipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.\nscratch: A folder designated for temporary files or workspace for experiments and development.\nscripts: Folder for helper scripts needed to run data analysis or reproduce the work.\n\n\n\nTemplate engine\nCreating a folder template is straightforward with cookiecutter a command-line tool that generates projects from templates (called cookiecutters). 
For example, it can help you set up a Python package project based on a Python package project template.\n\n\n\n\n\n\nCookiecutter templates\n\n\n\nHere are some template that you can use to get started, adapt and modify them to your own needs:\n\nPython package project\nSandbox test\nData science\nNGS data\n\nCreate your own template from scratch.\n\n\n\nQuick tutorial on cookiecutter\nBuilding a Cookiecutter template from scratch requires defining a folder structure, crafting a cookiecutter.json file, and outlining placeholders (keywords) that will be substituted when generating a new project. Here’s a step-by-step guide on how to proceed:\n\nStep 1: Create a Folder Template\nFirst, begin by creating a folder structure that aligns with your desired template design. For instance, let’s set up a simple Python project template:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\nIn this example, {cookiecutter.project_name} is a placeholder that will be replaced with the actual project name when the template is used. This directory contains a python script (‘main.py’), a subdirectory (‘tests’) with a second python script named after the project (‘test_{{cookiecutter.project_name}}.py’) and a ‘README.md’ file.\n\n\nStep 2: Create cookiecutter.json\nIn the root of your template folder, create a file named cookiecutter.json. This file will define the variables (keywords) that users will be prompted to fill in. For our Python project template, it might look like this:\n{\n \"project_name\": \"MyProject\",\n \"author_name\": \"Your Name\",\n \"description\": \"A short description of your project\"\n}\nWhen users generate a project based on your template, they will be prompted with these questions. 
The provided values (“responses”) will be used to substitute the placeholders in your template files.\nBeyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:\nFirst, modify the my_template/main.py file to include a placeholder inside its contents:\n# main.py\n\ndef hello():\n print(\"Hello, {{cookiecutter.project_name}}!\")\nThe ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.\nAfter running Cookiecutter, your generated ‘main.py’ file could appear as follows:\n# main.py\n\ndef hello():\n print(\"Hello, MyProject!\") # Assuming \"MyProject\" was entered as the project_name\n\n\nStep 3: Use Cookiecutter\nOnce your template is prepared, you can utilize Cookiecutter to create a project from it. Open a terminal and execute:\ncookiecutter path/to/your/template\nCookiecutter will prompt you to provide values for project_name, author_name, and description. Once you input these values, Cookiecutter will replace the placeholders in your template files with the entered values.\n\n\nStep 4: Review the Generated Project\nAfter the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.\n\n\n\n\n\n\nExercise 1: Create your own template\n\n\n\n\n\n\n\nUse Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a Github repository (recommended for this workshop). 
Feel free to tailor the template to your specific requirements—you don’t have to follow our examples exactly.\nRequirements We assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.\nProject\n\nGo to our Cookicutter template and click on the **Fork*\n\n\nbutton at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. \n\n\nOpen a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):\n\ngit clone <your URL to the template>\nIf you have a GitHub Desktop, click Add and select “Clone repository” from the options 3. Open the repository and navigate through the different directories 4. Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones. remove existing one or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory. Consider creating it, along with a subdirectory named ‘figures’. Here’s an example of how to do it:\ncd \\{\\{\\ cookiecutter.project_name\\ \\}\\}/ \nmkdir reports \ntouch requirements.txt\n\nModify the cookiecutter.json file. 
You could add new variables or change the default values:\n\n# open a text editor\n \"author\": \"Alba Refoyo\",\n\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with ‘git add’\nCommit the changes with a meaningful commit message ‘git commit -m “update cookicutter template”’\nPush the changes to your forked repository on Github ‘git push origin main’ (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\"> Fill up the variables and verify that the modified template looks like you would expect.\n\nOptional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.\n\n\"__prompts__\": {\n \"project_name\": \"Project directory name [Example: project_short_description_202X]\",\n \"author\": \"Author of the project\",\n \"date\": \"Date of project creation, default is today's date\",\n \"short_description\": \"Provide a detailed description of the project (context/content)\"\n },\n\n\n\n\n\n\n\n\n\n\n\nOptional Exercise 1, part B\n\n\n\n\n\n\n\nCreate a template from scratch using this tutorial scratch, it can be as basic as this one below or ‘Data folder’:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\n\nStep 1: Create a directory for the template.\nStep 2: Write a cookiecutter.json file with variables such as project_name and author.\nStep 3: Set up the folder structure by creating subdirectories and files as needed.\nStep 4: Incorporate cookiecutter variables in the names of files.\nStep 5: Use cookiecutter variables within scripts, such as printing a message that includes the project name." + "text": "1. 
Organize and structure your datasets and data analysis\nEstablishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:\n\nData folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.\nProject folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.\n\nData and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nWhen organizing your data folders, separate assays from external resources and maintain a consistent structure. For example, organize genome references by species and further categorize them by versions. 
Make sure to include all relevant information, and refer to this lesson for additional tips on data organization.\nThis will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!\n\n\n\n\n\n\nData folders\nWhether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nUse an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq) a keyword (2) that represents a unique element to that assay, and the date (3).\n<Assay-ID>_<keyword>_YYYYMMDD\nFor example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye.\n\n\n\n\n\nLet’s explore a potential folder structure and the types of files you might encounter within it.\n<data_type>_<keyword>_YYYYMMDD/\n├── README.md \n├── CHECKSUMS\n├── pipeline\n ├── pipeline.md\n ├── scripts/\n├── processed\n ├── fastqc/\n ├── multiqc/\n ├── final_fastq/\n└── raw\n ├── .fastq.gz \n └── samplesheet.csv\n\nREADME.md: This file contains a detailed description of the dataset commonly in markdown format. 
It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).\nmetadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.\npipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.\nprocessed: This folder contains the results from the preprocessing pipeline. The content varies depending on the specific pipeline used (create additional subdirectories as needed).\nraw: This folder holds the raw data.\n\n.fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.\nsamplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.\n\n\n\n\nProject folders\nOn the other hand, we have another type of folder called Projects, which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. 
This helps ensure that the analyses can be easily replicated and shared with others.\nThe Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:\n<project>_<keyword>_YYYYMMDD\n\n\n\n\n\n\nNaming examples\n\n\n\n\n\n\n\n\nRNASeq_Mouse_Brain_20230512: a project RNA sequencing data from a mouse brain experiment, created on May 12, 2023\nEHR_COVID19_Study_20230115: a project around electronic health records data for a COVID-19 study, created on January 15, 2023.\n\n\n\n\n\n\nNow, let’s explore an example of a folder structure and the types of files you might encounter within it.\n<project>_<keyword>_YYYYMMDD\n├── data\n│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link\n├── documents\n│ └── research_project_template.docx\n├── metadata.yml\n├── notebooks\n│ └── 01_data_processing.rmd\n│ └── 02_data_analysis.rmd\n│ └── 03_data_visualization.rmd\n├── README.md\n├── reports\n│ └── 01_data_processing.html\n│ └── 02_data_analysis.html\n│ ├── 03_data_visualization.html\n│ │ └── figures\n│ │ └── tables\n├── requirements.txt // env.yaml\n├── results\n│ ├── figures\n│ │ └── 02_data_analysis/\n│ │ └── heatmap_sampleCor_20230102.png\n│ ├── tables\n│ │ └── 02_data_analysis/\n│ │ └── DEA_treat-control_LFC1_p01.tsv\n│ │ └── SumStats_sampleCor_20230102.tsv\n├── pipeline\n│ ├── rules // processes \n│ │ └── step1_data_processing.smk\n│ └── pipeline.md\n├── scratch\n└── scripts\n\ndata: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.\ndocuments: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. 
It also includes the Data Management Plan.\n\nresearch_project_template.docx. If you download our template you will find a pre-filled Data Management Plan based on the Horizon Europe guidelines named ‘Non-sensitive_NGS_research_project_template.docx’.\n\nmetadata.yml: metadata file describing various keys of the project or experiment (see this lesson).\nnotebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.\nREADME.md: A detailed project description in markdown or plain-text format.\nreports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.\n\nfigures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.\n\nrequirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.\nresults: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.\npipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.\nscratch: A folder designated for temporary files or workspace for experiments and development.\nscripts: Folder for helper scripts needed to run data analysis or reproduce the work.\n\n\n\nTemplate engine\nCreating a folder template is straightforward with cookiecutter, a command-line tool that generates projects from templates (called cookiecutters). 
For example, it can help you set up a Python package project based on a Python package project template.\n\n\n\n\n\n\nCookiecutter templates\n\n\n\nHere are some templates that you can use to get started; adapt and modify them to your own needs:\n\nPython package project\nSandbox test\nData science\nNGS data\n\nCreate your own template from scratch.\n\n\n\nQuick tutorial on cookiecutter\nBuilding a Cookiecutter template from scratch requires defining a folder structure, crafting a cookiecutter.json file, and outlining placeholders (keywords) that will be substituted when generating a new project. Here’s a step-by-step guide on how to proceed:\n\nStep 1: Create a Folder Template\nFirst, begin by creating a folder structure that aligns with your desired template design. For instance, let’s set up a simple Python project template:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\nIn this example, {{cookiecutter.project_name}} is a placeholder that will be replaced with the actual project name when the template is used. This directory contains a Python script (‘main.py’), a subdirectory (‘tests’) with a second Python script named after the project (‘test_{{cookiecutter.project_name}}.py’) and a ‘README.md’ file.\n\n\nStep 2: Create cookiecutter.json\nIn the root of your template folder, create a file named cookiecutter.json. This file will define the variables (keywords) that users will be prompted to fill in. For our Python project template, it might look like this:\n{\n \"project_name\": \"MyProject\",\n \"author_name\": \"Your Name\",\n \"description\": \"A short description of your project\"\n}\nWhen users generate a project based on your template, they will be prompted with these questions. 
The provided values (“responses”) will be used to substitute the placeholders in your template files.\nBeyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:\nFirst, modify the my_template/main.py file to include a placeholder inside its contents:\n# main.py\n\ndef hello():\n print(\"Hello, {{cookiecutter.project_name}}!\")\nThe ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.\nAfter running Cookiecutter, your generated ‘main.py’ file could appear as follows:\n# main.py\n\ndef hello():\n print(\"Hello, MyProject!\") # Assuming \"MyProject\" was entered as the project_name\n\n\nStep 3: Use Cookiecutter\nOnce your template is prepared, you can utilize Cookiecutter to create a project from it. Open a terminal and execute:\ncookiecutter path/to/your/template\nCookiecutter will prompt you to provide values for project_name, author_name, and description. Once you input these values, Cookiecutter will replace the placeholders in your template files with the entered values.\n\n\nStep 4: Review the Generated Project\nAfter the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.\n\n\n\n\n\n\nExercise 1: Create your own template\n\n\n\n\n\n\n\nUse Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a Github repository (recommended for this workshop). 
Feel free to tailor the template to your specific requirements; you don’t have to follow our examples exactly.\nRequirements\nWe assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.\nProject\n\nGo to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. \nOpen a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):\ngit clone <your URL to the template>\nIf you use GitHub Desktop, click Add and select “Clone repository” from the options\nOpen the repository and navigate through the different directories\nModify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and the ‘requirements.txt’ file. Consider creating them, along with a subdirectory named ‘reports/figures’.\n├── reports/\n│ ├── figures/\n├── requirements.txt\nHere’s an example of how to do it:\n# Open your terminal and navigate to your template directory. 
Then: \ncd \\{\\{\\ cookiecutter.project_name\\ \\}\\}/ \nmkdir -p reports/figures \ntouch requirements.txt\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with git add\nCommit the changes with a meaningful commit message git commit -m \"update cookiecutter template\"\nPush the changes to your forked repository on GitHub git push origin main (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookiecutter-template\">\nFill in the variables and verify that the new structure (and folders) looks like you would expect. Have any new folders been added, or have some been removed?\n\n\n\n\n\n\n\n\n\n\n\n\nOptional Exercise 1, part B\n\n\n\n\n\n\n\nCreate a template from scratch using this tutorial; it can be as basic as the one below or follow the ‘Data folder’ structure:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\n\nStep 1: Create a directory for the template.\nStep 2: Write a cookiecutter.json file with variables such as project_name and author.\nStep 3: Set up the folder structure by creating subdirectories and files as needed.\nStep 4: Incorporate cookiecutter variables in the names of files.\nStep 5: Use cookiecutter variables within scripts, such as printing a message that includes the project name." + }, + { + "objectID": "develop/practical_workshop.html#data-documentation", + "href": "develop/practical_workshop.html#data-documentation", + "title": "Practical material", + "section": "2. Data documentation", + "text": "2. Data documentation\nData documentation involves organizing, describing, and providing context for datasets and projects. While metadata concentrates on the data itself, README files provide a broader perspective on the overall project or resource.\n\nMetadata\n\n\n\n\n\n\nmetadata.yml\n\n\n\nChoose the format that best suits the project’s needs. 
In this workshop, we will focus on YAML as it is widely used for configuration files (e.g., in conda or pipelines).\n\n\n\n\n\n\nFile formats\n\n\n\n\n\n\n\n\nXML (eXtensible Markup Language): uses custom tags to describe data and allows for a hierarchical structure.\nJSON (JavaScript Object Notation): lightweight and human-readable format that is easy to parse and generate.\nCSV (Comma-Separated Values) or TSV (Tab-Separated Values): simple and widely supported for representing tabular data. Easy to manipulate using software or programming languages. It is often used for sample metadata.\nYAML (YAML Ain’t Markup Language): human-readable data serialization format, commonly used as project configuration files.\n\nOthers such as RDF or HDF5.\n\n\n\n\n\nLink to the file format database.\n\n\nMetadata in biological datasets refers to the information that describes the data and provides context for how the data was collected, processed, and analyzed. Metadata is crucial for understanding, interpreting, and using biological datasets effectively. It also ensures that datasets are reusable, reproducible and understandable by other researchers. Some of the components may differ depending on the type of project, but there are general concepts that will always be shared across different projects:\n\nSample information and collection details\nBiological context (such as experimental conditions, if applicable)\nData description\nData processing steps applied to the raw data\nAnnotation and Ontology terms\nFile metadata (file type, file format, etc.)\nEthical and Legal Compliance (ownership, access, provenance)\n\n\n\n\n\n\n\nMetadata and controlled vocabularies\n\n\n\nTo maximize the usefulness of metadata, aim to use controlled vocabularies across all fields. Read more about data documentation and find examples of ontology services in lesson 4. 
We encourage you to begin implementing them systematically on your own (under the “sources” section, you will find some helpful links to guide you in putting them into practice).\nIf you work with NGS data, check out these recommendations and examples of metadata for samples, projects and datasets.\n\n\n\n\nREADME file\n\n\n\n\n\n\nREADME.md\n\n\n\nChoose the format that best suits the project’s needs. In this workshop, we will focus on Markdown as it is the most widely used format due to its balance of simplicity and expressive formatting options.\n\n\n\n\n\n\nFile formats\n\n\n\n\n\n\n\n\nMarkdown (.md): commonly used because it is easy to read and write and is compatible across platforms (e.g., GitHub, GitLab). Supports formatting like headings, lists, links, images, and code blocks.\nPlain Text (.txt): Simple and straightforward format without any rich formatting and great for basic instructions. Lacks the ability to structure content effectively.\nReStructuredText (.rst): commonly used for Python projects. Supports advanced formatting (tables, links, images and code blocks).\n\nOthers such as HTML, YAML and Notebooks.\n\n\n\n\n\nLink to the file format database\n\n\nThe README.md file is a markdown file that provides a comprehensive description of the data within a folder. Its rich text format (including bold, italic, links, etc.) allows you to explain the contents of the folder, as well as the reasons and methods behind its creation or collection. The content will vary depending on what it describes (data or assays, project, software…).\nHere is an example of a README file for a bioinformatics project:\n\n\n\n\n\n\nREADME\n\n\n\n\n\n# TITLE\nClear and descriptive.\n# OVERVIEW\nIntroduction to the project including its aims and significance. 
Describe the main purpose and the biological questions being addressed.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nThis project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.\nUnderstanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.\n\n\n\n\n\n# TABLE OF CONTENTS (optional but helpful for others to navigate to different sections)\n# INSTALLATION AND SETUP\nList all prerequisites, software, dependencies, and system requirements needed for others to reproduce the project. If available, you may link to a Docker image, Conda YAML file, or requirements.txt file.\n# USAGE\nInclude command-line examples for various functionalities or steps and the path for running a pipeline, if applicable.\n# DATASETS\nDescribe the data, including its sources, format, and how to access it. If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nWe have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. 
These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.\n\n\n\n\n\n# RESULTS\nSummarize the results and key findings or outputs.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nOur analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.\nFurthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.\n\n\n\n\n\n# CONTRIBUTIONS AND CONTACT INFO\n# LICENSE\n\n\n\n\n\n\n\n\n\n\nExercise 2: modify the metadata.yml file in your Cookiecutter template\n\n\n\n\n\n\n\nIt is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!\n\nConsider changing variables (add/remove) in the metadata.yml file from the cookicutter template.\nModify the cookiecutter.json file. 
You could add new variables or change the default keys and/or values:\n{\n\"project_name\": \"myProject\",\n\"project_slug\": \"{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}\",\n\"authors\": \"myName\",\n\"start_date\": \"{% now 'utc', '%Y%m%d' %}\",\n\"short_desc\": \"\",\n\"version\": \"0.1.0\"\n}\nThe metadata file will be filled accordingly.\nOptional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.\n\"__prompts__\": {\n \"project_name\": \"Project directory name [Example: project_short_description_202X]\",\n \"author\": \"Author of the project\",\n \"date\": \"Date of project creation, default is today's date\",\n \"short_description\": \"Provide a detailed description of the project (context/content)\"\n},\nModify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file. Hint below:\nproject: {{ cookiecutter.project_name }}\nauthor: {{ cookiecutter.author }}\ndate: {{ cookiecutter.date }}\ndescription: {{ cookiecutter.short_description }}\nModify the README.md file so that it includes the short description recorded by the cookiecutter.json file and the metadata at the top of the markdown file (at the top, between the lines of dashes).\n---\ntitle: {{ cookiecutter.project_name }}\ndate: \"{{ cookiecutter.date }}\"\nauthor: {{ cookiecutter.author }}\nversion: {{ cookiecutter.version }}\n---\n\nProject description\n----\n\n{{ cookiecutter.short_description }}\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with git add\nCommit the changes with a meaningful commit message git commit -m \"update cookiecutter template\"\nPush the changes to your forked repository on GitHub git push origin main (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\">\nFill in the variables and 
verify that the modified information looks like you would expect." + }, + { + "objectID": "develop/practical_workshop.html#overview", + "href": "develop/practical_workshop.html#overview", + "title": "Practical material", + "section": "Overview", + "text": "Overview\nIntroduction to the project including its aims, and its significance. Describe the main purpose and the biological questions being addressed\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nThis project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.\nUnderstanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research." 
+ }, + { + "objectID": "develop/practical_workshop.html#table-of-contents-optional-but-helpful-for-others-to-navigate-to-different-sections", + "href": "develop/practical_workshop.html#table-of-contents-optional-but-helpful-for-others-to-navigate-to-different-sections", + "title": "Practical material", + "section": "Table of Contents (optional but helpful for others to navigate to different sections)", + "text": "Table of Contents (optional but helpful for others to navigate to different sections)" + }, + { + "objectID": "develop/practical_workshop.html#installation-and-setup", + "href": "develop/practical_workshop.html#installation-and-setup", + "title": "Practical material", + "section": "Installation and setup", + "text": "Installation and setup\nList all prerequisites, software, dependencies, and system requirements needed for others to reproduce the project. If available, you may link to a Docker image, Conda YAML file, or requirements.txt file." + }, + { + "objectID": "develop/practical_workshop.html#usage", + "href": "develop/practical_workshop.html#usage", + "title": "Practical material", + "section": "Usage", + "text": "Usage\nInclude command-line examples for various functionalities or steps and path for running a pipeline, if applicable." + }, + { + "objectID": "develop/practical_workshop.html#results", + "href": "develop/practical_workshop.html#results", + "title": "Practical material", + "section": "Results", + "text": "Results\nOur analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. 
Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.\nFurthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.\n\nFor more details, refer to our Jupyter Notebook for the complete analysis pipeline and code." + }, + { + "objectID": "develop/04_metadata.html#data-documentation", + "href": "develop/04_metadata.html#data-documentation", + "title": "4. Documentation for biodata", + "section": "Data documentation", + "text": "Data documentation\nEssential documentation comes in different forms and flavors, serving various purposes in research. Examples include protocols outlining experimental procedures, detailed lab journals recording experimental conditions and observations, codebooks explaining concepts, variables, and abbreviations used in the analysis, information about the structure and content of a dataset, software installation, and usage manual, code explanation within files or methodological information outlining data processing steps.\n From ontotext.com\nData documentation provides essential context and structure to (primary) data, enabling researchers to understand its significance and facilitate efficient data management. 
Some common elements found in metadata for bioinformatics data include:\n\nData collection information: source (e.g., organism, tissue or location), date (YYYY-MM-DD format) and time, collection methods employed or experimental conditions.\nData processing information: data content, data format, data cleaning and transformation such as filtering and normalizations techniques, and software and tools used.\nData description: variables and attributes, and data types (e.g., categorical, numerical, or textual).\nBiological context: experimental design, biological purpose and relevance and implications in the broader context.\nData ownership and access: authorship, licensing of the data and details on accessing and sharing.\nProvenance and tracking: version control information over time and citations, such as links to publications or studies that reference the data.\n\nData documentation also serves as a crucial guide in navigating the complex landscape of data, akin to a cheat sheet for piecing together the puzzle of information. Much like identifying puzzle pieces, metadata provides essential details about data origin, structure, and context, such as sample collection details, experimental procedures, and equipment used. 
Metadata enables data exploration, interpretation, and future accessibility, promoting effective management and facilitating data usability and reuse.\n\n\n\n\n\n\nBenefits of collecting proper documentation\n\n\n\n\nData Context and Interpretation: Aiding in understanding experimental conditions, sample origins, and processing methods, is crucial for accurate results interpretation.\nData Discovery and Access: Documentation enables easy locating and accessing of specific datasets by quickly identifying relevant data through sample identifiers, experimental parameters, and timestamps.\nReproducibility and Collaboration: Documentation facilitates experiment replication and validation by enabling colleagues to reproduce analyses, compare results, and collaborate effectively, enhancing the integrity of scientific findings.\nQuality Control and Validation: Documentation supports data quality assessment by tracking the origin and handling of NGS data, allowing the identification of errors or biases to validate analysis accuracy and reliability.\nLong-Term Data Preservation: Documentation ensures preservation over time, facilitating future understanding and utilization of archived datasets for continued scientific impact as research progresses.\n\n\n\n\nStreamlining Metadata Collection\nData and project directories should both include metadata and a README file. Metadata delivers descriptive information about a dataset or project, offering insights for interpreting, using, and sharing the data effectively. README files offer an overview and purpose of the project or dataset, providing instructions and guidance for setting up, running, and using the data or tools. 
While metadata concentrates on the data itself, README files provide a broader perspective on the overall project or resource.\n\n\n\n\n\n\nPractical tips\n\n\n\n\nImplement a logical structure with clear and descriptive file names.\nUse controlled vocabularies and ontologies to ensure consistency and efficient data management and interpretation.\nUse a repository and a versioning system.\nMake it machine-readable, -actionable, and -interpretable.\nDevelop standards further within your research environment (see FAIRsharing standards).\nInclude all information for others to comprehend and effectively utilize the data.\n\n\n\n\n\nREADME.md\n\n\n\n\n\n\nFile formats\n\n\n\nLink to the file format database\n\nMarkdown (.md): commonly used because it is easy to read and write and is compatible across platforms (e.g., GitHub, GitLab). Supports formatting like headings, lists, links, images, and code blocks.\nPlain Text (.txt): Simple and straightforward format without any rich formatting and great for basic instructions. Lacks the ability to structure content effectively.\nReStructuredText (.rst): commonly used for Python projects. Supports advanced formatting (tables, links, images and code blocks).\n\nOthers such as HTML, YAML and Notebooks.\n\n\nThe README.md file, written in markdown format, provides a detailed description of the folder’s content. It includes information such as the purpose of the data, collection methods, and relevant details. The content might differ based on the purpose of the data.\n\n\n\n\n\n\nExercise 1: Identify README.md key components.\n\n\n\n\n\n\n\nSelect one of the examples below and reflect on how effectively the README communicates important information about the project. Please note that some of the links lead to README files describing databases, while others pertain to software and tools.\n\n1000 Genomes Project. 
You will find several readme files here.\n\nHomo sapiens, fasta GRCh38\nIPD-IMGT/HLA Database\nDocker\nPython pandas\n\n\n\n\n\n\nStructure for bioinformatics projects.\n\nDescription and relevance of the project\nObjectives and aims\nDatasets and software requirements\nInstructions for data interpretation\nSummary of results\nContributions\nAdditional comments or notes\n\n\n\nmetadata.yml\n\n\n\n\n\n\nFile formats\n\n\n\n\nXML (eXtensible Markup Language): uses custom tags to describe data and allows for a hierarchical structure.\nJSON (JavaScript Object Notation): lightweight and human-readable format that is easy to parse and generate.\nCSV (Comma-Separated Values) or TSV (Tab-Separated Values): simple and widely supported for representing tabular data. Easy to manipulate using software or programming languages. It is often used for sample metadata.\nYAML (YAML Ain’t Markup Language): human-readable data serialization format, commonly used as project configuration files.\n\nOthers such as RDF or HDF5.\n\n\nLink to the file format database.\nMetadata can be written in many file formats (commonly used: YAML, TXT, JSON, and CSV). We recommend YAML, a human-readable, plain-text data serialization format. However, choose the format that best suits the project’s needs. The content will be specific to the type of project.\nmetadata:\n project: \"Title\"\n author: \"Name\"\n date: \"YYYYMMDD\"\n description: \"Project short description\"\n version: \"1.0\"\n analysis:\n tool: \"software\"\n version: \"1.1.1\"\nSome general metadata fields used across different disciplines:\n\nProject Title: A concise and informative name for the dataset.\nAuthor(s): The individual(s) or organization responsible for creating the dataset. 
Include ORCID for identification.\nDate Created: The date when the dataset was originally generated or compiled, in YYYY-MM-DD format.\nDate Modified: The date when the dataset was last updated or modified (YYYY-MM-DD).\nObject ID: The project or assay ID for tracking and reference purposes.\nDescription: A short narrative explaining the content, purpose, and context of the project.\nKeywords: Descriptive terms or phrases capturing the main topics and attributes.\nEthical and Legal Considerations: Information about ethical approvals, consent, and any legal restrictions.\nVersion: The version number or identifier, useful for tracking changes.\nRelated Publications: Links or references to scientific publications associated with the folder. Always add the DOI.\nFunding Source: Details about the funding agency or source that supported the research or data generation.\nLicense: The type of license or terms of use associated with the dataset/project.\nContact Information: Contact details for individuals who can provide further information about the dataset/project.\n\n\n\n\n\n\n\nTip\n\n\n\nThere is an exercise in the practical material to streamline the creation of metadata files using Cookiecutter, a template-based scaffolding tool.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nCreate a metadata file with the following description fields: name, date, description, version, authors, keywords, license. Fill it up at the start of the project, when you generate the file structure.", + "crumbs": [ + "Course material", + "Key practices", + "4. Documentation for biodata" + ] + }, + { + "objectID": "develop/practical_workshop.html#create-a-catalog-of-your-data-folder", + "href": "develop/practical_workshop.html#create-a-catalog-of-your-data-folder", + "title": "Practical material", + "section": "4. Create a catalog of your data folder", + "text": "4. Create a catalog of your data folder\nThe next step is to collect all the NGS datasets that you have created in the manner explained above. 
Since all your folders should contain the metadata.yml file in the same place with the same metadata fields, it should be easy to iterate through all the folders and merge all the metadata.yml files into a single table. This table can then be browsed easily with Microsoft Excel, for example. If you are interested in making a Shiny app or Python Panel tool to interactively browse the catalog, check out this lesson.\n\n\n\n\n\n\nExercise 4: create a metadata.tsv catalog\n\n\n\n\n\n\n\nWe will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your Assays folder), fetches all the metadata.yml files, and merges them. Finally, it will write a TSV file as an output.\n\nCreate a folder called dataset and change into it with cd dataset\nFork this repository: a Cookiecutter template designed for NGS datasets. While you are welcome to create your own template from scratch, we recommend using this one to save time.\nRun the cookiecutter cc-data-template command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created). Execute the script below using R (or create your own script in Python). Adjust the folder_path variable so that it matches the path to the Assays folder. 
The resulting table will be saved in the same folder_path.\nOpen your database_YYYYMMDD.tsv table in a text editor from the command-line, or view it in Excel for better visualization.\n\n\nlibrary(yaml)\nlibrary(dplyr)\nlibrary(lubridate)\n\n# Function to read a YAML file and transform it into a dataframe format.\nread_yaml <- function(file_path) {\n # Read the YAML file and convert it to a data frame\n df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE)\n \n # Return the data frame\n return(df)\n}\n\n# Function to recursively fetch metadata.yml files\nget_metadata <- function(folder_path) {\n file_list <- list.files(path = folder_path, pattern = \"metadata\\\\.yml$\", recursive = TRUE, full.names = TRUE)\n\n metadata_list <- lapply(file_list, read_yaml)\n \n # Combine the list of data frames into a single data frame using dplyr::bind_rows()\n combined_metadata <- bind_rows(metadata_list)\n\n return(combined_metadata)\n}\n\n# Specify the folder path\nfolder_path <- \"/path/to/your/folder\"\n\n# Fetch metadata from the specified folder\nmetadata <- get_metadata(folder_path)\n\n# Save the data frame as a TSV file\noutput_file <- paste0(\"database_\", format(Sys.Date(), \"%Y%m%d\"), \".tsv\")\nwrite.table(metadata, file = output_file, sep = \"\\t\", quote = FALSE, row.names = FALSE)\n\n# Print confirmation message\ncat(\"Database saved as\", output_file, \"\\n\")" } ] \ No newline at end of file diff --git a/_site/site_libs/bootstrap/bootstrap.min.css b/_site/site_libs/bootstrap/bootstrap.min.css index db99189a..0b053cee 100644 --- a/_site/site_libs/bootstrap/bootstrap.min.css +++ b/_site/site_libs/bootstrap/bootstrap.min.css @@ -1,4 +1,4 @@ -@import"https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css";@import"https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&display=swap";details>summary{list-style:none}details>summary::before{content:"> ";font-size:1.5em;margin:-5px 7px 0 
0;color:#000;font-weight:bold}div.callout-exercise{border-left-color:#3eab1f !important}div.callout-exercise .callout-header{background-color:#9df980 !important;height:30px}.callout-exercise>.callout-header::before{font-family:"Font Awesome 5 Free";content:"";margin-right:10px}div.callout-exercise.callout-style-default div.callout-body{padding-bottom:0em !important;margin-bottom:-1.5em !important}.callout-hint>.callout-header::before{font-family:"Font Awesome 5 Free";content:"";margin-right:10px;color:#0c0b0c}div.callout-hint.callout-style-default>.callout-header{background-color:#f3f3f6 !important;height:25px}.callout-definition>.callout-header::before{font-family:"Font Awesome 5 Free";content:"";margin-right:10px;color:#bebcbc}div.callout-definition.callout-style-default>.callout-header{background-color:#fff !important;height:30px}/*! +@import"https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css";@import"https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&display=swap";details>summary{list-style:none}details>summary::before{content:"> ";font-size:1.5em;margin:-5px 7px 0 0;color:#000;font-weight:bold}div.callout-exercise{border-left-color:#3eab1f !important}div.callout-exercise .callout-header{background-color:#9df980 !important;height:30px}.callout-exercise>.callout-header::before{font-family:"Font Awesome 5 Free";content:"";margin-right:10px}div.callout-exercise.callout-style-default div.callout-body{padding-bottom:0em !important;margin-bottom:-1.5em !important}.callout-hint>.callout-header::before{font-family:"Font Awesome 5 Free";content:"";margin-right:10px;color:#0c0b0c}div.callout-hint.callout-style-default>.callout-header{background-color:#f3f3f6 !important;height:25px}.callout-definition>.callout-header::before{font-family:"Font Awesome 5 Free";content:"";margin-right:10px;color:#bebcbc}div.callout-definition.callout-style-default>.callout-header{background-color:#fff 
!important;height:26px}div.callout-definition{font-size:15px}.callout-readme>.callout-header::before{font-family:"Courier New",Courier,monospace;margin-right:10px;color:#606060}div.callout-readme.callout-style-default>.callout-header{background-color:#e3e3e3 !important;font-family:"Courier New",Courier,monospace;font-size:22px;height:30px}div.callout-readme p{font-family:"Courier New",Courier,monospace;font-size:14px;margin-bottom:.5em !important}/*! * Bootstrap v5.3.1 (https://getbootstrap.com/) * Copyright 2011-2023 The Bootstrap Authors * Licensed under MIT (https://github.com/twbs/bootstrap/blob/main/LICENSE) diff --git a/_site/use_cases.html b/_site/use_cases.html index 16d8ce22..86d7d3fe 100644 --- a/_site/use_cases.html +++ b/_site/use_cases.html @@ -8,7 +8,7 @@ -RDM for NGS - RDM use cases +RDM for biodata - RDM use cases