From 9cb00bba0d2c0c55df5ac97f3bc7bbec246cea3d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alba=20Refoyo=20Mart=C3=ADnez?= <44649699+albarema@users.noreply.github.com> Date: Wed, 3 Apr 2024 16:07:31 +0200 Subject: [PATCH] Update 03_DOD.qmd --- develop/03_DOD.qmd | 58 +++++++++++++++++++++++----------------------- 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/develop/03_DOD.qmd b/develop/03_DOD.qmd index 0b4d6f74..6c742eec 100644 --- a/develop/03_DOD.qmd +++ b/develop/03_DOD.qmd @@ -4,7 +4,7 @@ format: html date-modified: last-modified date-format: long date: 2023-11-30 -summary: In this lesson we discuss about how to organize your files and follow some naming recommendations. +summary: In this lesson, we discuss how to organize your files and follow some naming recommendations. --- :::{.callout-note title="Course Overview"} @@ -17,7 +17,7 @@ summary: In this lesson we discuss about how to organize your files and follow s 3. Define rules for naming results and figures accurately ::: -So far, we have covered how to adhere to FAIR and Open Science standards, which primarily focus on data sharing post-project completion. However, effective data management is essential while actively working on the project. Organizing data folders, raw and processed data, analysis scripts and pipelines, and results ensures long-term project success. Without a clear structure, future access and understanding of data become challenging, even more for collaborators, leading to potential chaos down the line. +So far, we have covered how to adhere to FAIR and Open Science standards, which primarily focus on data sharing post-project completion. However, effective data management is essential while actively working on the project. Organizing data folders, raw and processed data, analysis scripts and pipelines, and results ensures long-term project success.
Without a clear structure, future access and understanding of data become challenging, even more so for collaborators, leading to potential chaos down the line. :::{.callout-exercise} - Have you ever had trouble finding data, results, figures, or specific scripts? @@ -32,7 +32,7 @@ On the other hand, applying a consistent file structure and naming conventions t - **Subfolders**: enhance the organization using subfolders to further categorize data based on their contents, such as workflows, scripts, results, reports, etc. - **File naming conventions**: implement a standardized file naming convention to maintain consistency and clarity. Use descriptive and informative names (e.g., specify data type: plots, results tables, etc.) -In this lesson we will see a practical example on how you could organize your own files and folders. +In this lesson, we will see a practical example of how you could organize your own files and folders. ## Folder organization @@ -45,7 +45,7 @@ Here we suggest the use of three main folders: - Data in these folders should be locked and set to **read-only** to prevent unauthorized ("unwanted") modifications. 2. **Individual project folders**: - This directory typically belongs to the researcher conducting bioinformatics analyses and encompasses all essential files for a specific research project (data, scripts, software, workflows, results, etc.). -- A project may utilize data from various assays or results obtained from other projects. It's important to avoid duplicating datasets; instead, link it from the original source to maintain data integrity and avoid redundancy. +- A project may utilize data from various assays or results obtained from other projects. It's important to avoid duplicating datasets; instead, link them from the original source to maintain data integrity and avoid redundancy. 3. 
**Resources and databases folders**: - This (commonly) shared directory contains common repositories or curated databases that facilitate research (genomics, clinical data, imaging data, and more!). For instance, in genomics, it includes genome references (fasta files), annotations (gtf files) for different species, and indexes for various alignment algorithms. - Each folder corresponds to a unique reference or database version, allowing for multiple references from the same organism or different species. @@ -64,7 +64,7 @@ A database is a structured repository for storing, managing, and retrieving info ::: :::{.callout-tip title="Create shortcuts to public datasets and assays!"} -The use of symbolic links, also referred as softlinks, is a key practice in large labs where data might used for different purposes and by multiple people. +The use of symbolic links, also referred to as softlinks, is a key practice in large labs where data might be used for different purposes and by multiple people. - They act as pointers, containing the path to the location of the target files/directories. - They avoid duplication and they are flexible and lightweight (do not occupy much disk space). @@ -74,7 +74,7 @@ The use of symbolic links, also referred as softlinks, is a key practice in larg :::{.callout-exercise} # Exercise: create a softlink -Open your terminal and create a softlink using the following command. The first path is the target (directory or file) and the second one where the symbolic link will be created. +Open your terminal and create a softlink using the following command. The first path is the target (directory or file) and the second one is where the symbolic link will be created. ```{.bash} ln -s path/to/dataset/ /path/to/user//data/ ``` @@ -112,14 +112,14 @@ Let's focus on the shared folders containing experimental datasets generated in- ### Naming Shared Folders Effectively -Create a folder for all your NGS experiments, for instance named `Assay`.
Each subfolder, denoted by a unique `Assay-ID`, should be named clearly and comprehensibly. Assay-ID comprises raw files, processed files, and the pipeline used to generate them. Raw files should remain unchanged, while modifications to processed files should be restricted post-preprocessing (e.g., after quality control) to prevent unintended alterations. Check the exercise for efficient naming of Assay-ID: +Create a folder for all your NGS experiments, for instance, named `Assay`. Each subfolder, denoted by a unique `Assay-ID`, should be named clearly and comprehensibly. Assay-ID comprises raw files, processed files, and the pipeline used to generate them. Raw files should remain unchanged, while modifications to processed files should be restricted post-preprocessing (e.g., after quality control) to prevent unintended alterations. Check the exercise for efficient naming of Assay-ID: :::{.callout-exercise} # Exercise: name your `Assay-ID` - How would you ensure its name is unique and understood at a glance? :::{.callout-hint} -Use an acronym (1) that describes type of NGS assay (RNAseq, ChIPseq, ATACseq) a keyword (2) that represents a unique element to that assay, and the date (3). +Use an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq), a keyword (2) that represents a unique element of that assay, and the date (3). ```{.bash} __YYYYMMDD ``` @@ -161,7 +161,7 @@ The provided folder structure is designed to be intuitive for NGS data. The desc - **README.md**: This file contains general information about the project or experiment, usually in markdown or plain text format. It includes details such as the origin of the raw NGS data (including sample information, laboratory protocols used, and the assay's objectives). Sometimes, it also outlines the basic directory structure and file naming conventions. - **metadata.yml**: This serves as the metadata file for the project ([see this lesson](./04_metadata.qmd)).
-- **pipeline.md**: This document describes the pipeline employed to process the raw data, along with the specific commands used to execute the pipeline. The specific format can very depending on the workflow system employed (e.g., bash, Snakemake, Nextflow, Jupyter Notebooks etc.) ([see this lesson](./06_pipelines.qmd)). Employing a standardized pipeline ensures a consistent file organization system (and the corresponding documentation) +- **pipeline.md**: This document describes the pipeline employed to process the raw data, along with the specific commands used to execute the pipeline. The specific format can vary depending on the workflow system employed (e.g., bash, Snakemake, Nextflow, Jupyter Notebooks, etc.) ([see this lesson](./06_pipelines.qmd)). Employing a standardized pipeline ensures a consistent file organization system (and the corresponding documentation). - **processed_data**: folder with results of the preprocessing pipeline. The contents may vary depending on the pipeline utilized. For example, - **fastqc**: quality control results of the raw fastq files. - **multiqc**: aggregated quality control results across all samples @@ -175,7 +175,7 @@ The provided folder structure is designed to be intuitive for NGS data. The desc In the `Projects` folder, usually private to the individual performing the data analysis, each project has its own subfolder containing project information, data analysis scripts and pipelines, and results. It's advisable to maintain folders for individual projects, separate from shared data folders, as project-specific files typically aren't reused across multiple projects, and more than one dataset might be needed to answer a specific scientific question.
### Naming Project Folders Effectively -The Project folder should have a unique, easily readable, distinguishable, and instantly understandable name.For instance, consider naming it using the main author's initials, a descriptive keyword, and the date: +The Project folder should have a unique, easily readable, distinguishable, and instantly understandable name. For instance, consider naming it using the main author's initials, a descriptive keyword, and the date: ```{.bash} __YYYYMMDD ``` @@ -210,13 +210,13 @@ Next, let's take a look at a possible folder structure and what kind of files yo ``` - **data**: contains symlinks or shortcuts to where the data is (raw, processed, external, etc.), avoiding duplication and modification of original files. -- **docs**: folder containing word documents, slides or pdfs related to the project. It also contains your [Data Management Plan](./02_DMP.qmd). -- **notebooks or pipelines**: folder containing notebooks (Jupyter, R markdown, Quarto notebooks) or workflows (Snakemake or Nextflow) with the actual data analysis. Tip: labeled them numerically indicating the sequential order. +- **docs**: a folder containing Word documents, slides, or PDFs related to the project. It also contains your [Data Management Plan](./02_DMP.qmd). +- **notebooks or pipelines**: a folder containing notebooks (Jupyter, R Markdown, Quarto notebooks) or workflows (Snakemake or Nextflow) with the actual data analysis. Tip: Label them numerically to indicate the sequential order. - **README.md**: detailed description of the project in markdown format. - logs: log files. - tmp: store temporary or intermediate files. -- **environment**: files for reproducing the analysis environment to reproduce the results, such as a Dockerfile, conda yaml file, or a text file ([See 6th lesson](./06_pipelines.qmd) for more tips on making your pipelines reproducible). It includes software, libraries/packages and dependencies (and their versions!).
-- **scripts**: folder containing helper scripts to run data analysis or source code +- **environment**: files for recreating the analysis environment needed to reproduce the results, such as a Dockerfile, conda yaml file, or a text file ([See 6th lesson](./06_pipelines.qmd) for more tips on making your pipelines reproducible). It includes software, libraries/packages, and dependencies (and their versions!). +- **scripts**: a folder containing helper scripts and source code to run the data analysis. - **reports**: Generated analysis as HTML, PDF, LaTeX, etc. Great for sharing with colleagues and creating formal reports of the data analysis procedure. - *figures*: figures produced upon rendering notebooks. Tip: save the figures under a subfolder named after the notebook/pipeline that created them (you will appreciate this organization when you need to rerun analysis and know which script created each figure!). - **results**: results from the data analysis, such as tables and figures, etc. Tip: Create a subfolder named after the notebook or pipeline for storing the results generated by that specific notebook or pipeline. @@ -232,7 +232,7 @@ Next, let's take a look at a possible folder structure and what kind of files yo Setting up folder structures manually for each new project can be time-consuming. Thankfully, tools like [Cookiecutter](https://github.com/cookiecutter/cookiecutter) offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using [cruft](https://github.com/cruft/cruft) alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).
:::{.callout-note title="Cookiecutter templates"} -- Coockiecutter template for [Data science projects](https://github.com/drivendata/cookiecutter-data-science) +- Cookiecutter template for [Data science projects](https://github.com/drivendata/cookiecutter-data-science) - Brickmanlab template for [NGS data](https://github.com/brickmanlab/ngs-template): similar to the folder structures in the examples above. You can download and modify it to suit your needs. ::: @@ -255,22 +255,22 @@ How can you keep track of your resources? Name the folder using the version, or use a reference genome manager such as [refgenie](http://refgenie.databio.org/en/latest/). #### Refgenie -It manages storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to download pre-built reference genome "assets", like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another. Check this [tutorial](http://refgenie.databio.org/en/latest/tutorial/) to get started. +It manages the storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to download pre-built reference genome "assets", like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another. Check this [tutorial](http://refgenie.databio.org/en/latest/tutorial/) to get started. 
::: -#### Manual download +#### Manual Download -Best practices for downloading data from the source while ensuring the preservation of information about the version and other metadata includes: +Best practices for downloading data from the source while ensuring the preservation of information about the version and other metadata include: -- Organizing data structure: Create data structure that allows storing all versions in the same parent directory, and ensure that all lab members follow these practices. +- Organizing data structure: Create a data structure that allows storing all versions in the same parent directory, and ensure that all lab members follow these practices. - Documentation and metadata preservation: Before downloading, carefully review the documentation provided by the database. Download files containing the data version and any associated metadata. - README.md: Record the version of the data in the README.md file. - Checksums: Check for and use checksums provided by the database to verify the integrity of the downloaded data, ensuring that it hasn't been corrupted during transfer. Do the exercise below. -- Verify File size: Check the file size provided by the source. It is not as secure as checksum verification but discrepancy could indicate corruption. +- Verify File size: Check the file size provided by the source. It is not as secure as checksum verification but discrepancies could indicate corruption. - Automated Processes: whenever possible, automate the download process to reduce the likelihood of errors and ensure consistency (e.g. use bash script or pipeline). :::{.callout-note title="Optional: Exercise on CHECKSUMS" collapse="true"} -We recommend the use of md5sum to verify data integrity, specially if you are downloading large datasets. In this example, we use data from the [HLA FTP Directory](ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/). 
+We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets. In this example, we use data from the [HLA FTP Directory](ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/). 1. Install md5sum (from coreutils package) ```{.bash} @@ -285,9 +285,9 @@ brew install coreutils #!/bin/bash # Important: go through the README before downloading! Check if a checksums file is included. -# 1. Create or change directory to the resources dir. +# 1. Create or change the directory to the resources dir. -# Check for checksums (e.g.: md5checksum.txt), download and modify it so that it only contains the checksums of the target files. The file will look like: +# Check for checksums (e.g.: md5checksum.txt), download, and modify it so that it only contains the checksums of the target files. The file will look like this: 1a3d12e4e6cc089388d88e3509e41cb3 hla_gen.fasta # Finally, save it: md5file="md5checksum.txt" @@ -314,7 +314,7 @@ genomic_resources/ │ └── indexes/ └── dw_resources.sh ``` -4. Create an md5sum file and share it with collaborators before sharing the data. This allows others to check the integrity of the files. +4. Create a md5sum file and share it with collaborators before sharing the data. This allows others to check the integrity of the files. ```{.bash} md5sum @@ -327,7 +327,7 @@ Download a file using md5sums. Choose a file from your favorite dataset or selec ## Naming conventions -Consistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, or datasets are labeled logically, facilitating easy location and comparison of similar data. 
For instance, in fields like genomics, uniform naming conventions for files associated with particular experiments or samples allows for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. Overall, promotes efficiency, collaboration, and the integrity of scientific work. +Consistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, or datasets are labeled logically, facilitating easy location and comparison of similar data. For instance, in fields like genomics, uniform naming conventions for files associated with particular experiments or samples allow for swift identification and comparison of relevant data, streamlining the research process and contributing to the reproducibility of findings. Overall, this promotes efficiency, collaboration, and the integrity of scientific work. :::{.callout-tip title="General tips for file and folder naming"} @@ -341,9 +341,9 @@ Remember to keep the folder structure simple. - **Date-based format**: use `YYYYMMDD` format (year/month/day format helps with sorting and listing files in chronological order) - Use **underscores and hyphens** as delimiters and **avoid spaces**. - Not all search tools may work well with spaces (messy to indicate paths) - If the length is a concern, use capital letter to delimit words [camelCase](https://en.wikipedia.org/wiki/Camel_case). + - If the length is a concern, use capital letters to delimit words [camelCase](https://en.wikipedia.org/wiki/Camel_case).
- **Sequential numbering**: Use a two-digit format for single-digit numbers (0–9) to ensure correct numerical sequence order (for example, 01 and not 1) -- **Version control**: Indicate the version ("V") or revision ("R") as the last element, using the two--digit format (e.g., v01, v02) +- **Version control**: Indicate the version ("V") or revision ("R") as the last element, using the two-digit format (e.g., v01, v02) - Write down your naming convention pattern and document it in the README file ::: :::{.callout-exercise} # Define your file name conventions Avoid long and complicated names and ensure your file names are both informative and easy to manage: -1. For saving a new plot, heatmap representing sample correlations +1. For saving a new plot, a heatmap representing sample correlations 2. When naming the file for the document containing the Research Data Management Course Objectives (Version 2, 2nd May 2024) from the University of Copenhagen 3. Consider the most common file types you work with, such as visualizations, tables, etc., and create logical and clear file names @@ -393,4 +393,4 @@ In this lesson, we have learned some practical tips and examples about how to or - UK Data Service: - Oakland University: - Cessda guidelines: . -- RDMkit Elixir Europe: \ No newline at end of file +- RDMkit Elixir Europe: