diff --git a/_quarto.yml b/_quarto.yml
index 03108dcf..f99557b5 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -42,6 +42,9 @@ website:
     - title: "Use cases"
       contents:
         - use_cases.qmd
+        - section: "General"
+          contents:
+            - develop/examples/mkdocs_pages.qmd
         - section: "NGS data"
           contents:
             - develop/examples/NGS_OS_FAIR.qmd
diff --git a/_site/develop/03_DOD.html b/_site/develop/03_DOD.html
index 17ee3c63..0ac0ee6d 100644
--- a/_site/develop/03_DOD.html
+++ b/_site/develop/03_DOD.html
@@ -311,7 +311,7 @@

3. Data organization and storage

Modified
-

April 26, 2024

+

April 29, 2024

@@ -724,6 +724,8 @@

Template engine

@@ -742,7 +744,6 @@

Quick tutor

Learn how to create your own template here.

-

We offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the Sandbox website.

diff --git a/_site/develop/04_metadata.html b/_site/develop/04_metadata.html
index 9449ee2d..ae625f14 100644
--- a/_site/develop/04_metadata.html
+++ b/_site/develop/04_metadata.html
@@ -305,7 +305,7 @@

4. Documentation for biodata

Modified
-

April 26, 2024

+

May 3, 2024

@@ -552,14 +552,26 @@

Con
@@ -613,6 +625,7 @@

Tables as databasesYAML metadata files associated with each project

Click on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration.

+

You can find a thorough guided exercise in the practical material - Exercise 4.

  • Commit and push changes when you are done with your modifications

    • @@ -934,9 +960,9 @@

      3. Naming conventions

      4. Create a catalog of your data folder

      -

      The next step is to collect all the NGS datasets that you have created in the manner explained above. Since your folders all should contain the metadata.yml file in the same place with the same metadata, it should be very easy to iteratively go through all the folders and merge all the metadata.yml files into a one single table. This table can be then browsed easily with Microsoft Excel, for example. If you are interested in making a Shiny app or Python Panel tool to interactively browse the catalog, check out this lesson.

      +

      The next step is to collect all the datasets that you have created in the manner explained above. Since all your folders should contain the metadata.yml file in the same place with the same metadata fields, it is straightforward to iterate through the folders and merge all the metadata.yml files into a single table. The table can then be viewed in your terminal or in Microsoft Excel.

      -
      +
      @@ -945,65 +971,357 @@

      4. Cr

      -
      +
      -

      We will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your Assays folder), fetches all the metadata.yml files, and merges them. Finally, it will write a TSV file as an output.

      +

      We will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your Assays folder), fetches all the metadata.yml files, merges them and writes a TSV file as an output.

      1. Create a folder called dataset and change directory cd dataset
      2. -
      3. Fork this repository: a Cookiecutter template designed for NGS datasets. While you are welcome to create your own template from scratch, we recommend using this one to save time.
      4. -
      5. Run the cookiecutter cc-data-template command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created). Execute the script below using R (or create your own script in Python). Adjust the folder_path variable so that it matches the path to the Assays folder. The resulting table will be saved in the same folder_path.
      6. +
      7. Fork this repository: a Cookiecutter template designed for NGS datasets. While you are welcome to create your own template from scratch, we recommend using this one to save time.
      8. +
      9. Run the cookiecutter cc-data-template command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created).
      10. +
      11. Execute the script below using R (or create your own script in Python). Adjust the folder_path variable so that it matches the path to the Assays folder. The resulting table will be saved in the same folder_path.
      12. Open your database_YYYYMMDD.tsv table in a text editor from the command-line, or view it in Excel for better visualization.
      -
      
      -library(yaml)
      -library(dplyr)
      -library(lubridate)
      -
      -# Function to read a YAML file and transform it into a dataframe format.
      -read_yaml <- function(file_path) {
      -  # Read the YAML file and convert it to a data frame
      -  df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE)
      -  
      -  # Return the data frame
      -  return(df)
      +
        +
      • Solution A. From a TSV
      • +
      +
      + +
      +
      +
      +
      +
      # R version 4.3.2
      +# RScript to read all yaml files in directory and save the metadata into a dataframe
      +quiet <- function(package_name) {
      +  # Suppress warnings and messages while checking and installing the package
      +  suppressMessages(suppressWarnings({
      +    # Check if the package is available and load it
      +    if (!requireNamespace(package_name, quietly = TRUE)) {
      +      install.packages(package_name)
      +    }
      +    # Load the package
      +    library(package_name, character.only = TRUE)
      +  }))
       }
       
      -# Function to recursively fetch metadata.yml files
      -get_metadata <- function(folder_path) {
      -  file_list <- list.files(path = folder_path, pattern = "metadata\\.yml$", recursive = TRUE, full.names = TRUE)
      -
      -  metadata_list <- lapply(file_list, read_yaml)
      -  
      -  # Combine the list of data frames into a single data frame using dplyr::bind_rows()
      -  combined_metadata <- bind_rows(metadata_list)
      -
      -  return(combined_metadata)
      -}
      -
      -# Specify the folder path
      -folder_path <- "/path/to/your/folder"
      -
      -# Fetch metadata from the specified folder
      -metadata <- get_metadata(folder_path)
      +# Check and install necessary libraries
      +quiet("yaml")
      +quiet("dplyr")
      +quiet("lubridate")
      +
      +
      +read_yaml <- function(file_path) {
      +  # Read the YAML file and convert it to a data frame
      +  df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE)
      +  
      +  # Return the data frame
      +  return(df)
      +}
      +
      +# Function to recursively fetch metadata.yml files
      +get_metadata <- function(folder_path) {
      +  file_list <- list.files(path = folder_path, pattern = "metadata\\.yml$", recursive = TRUE, full.names = TRUE)
       
      -# Save the data frame as a TSV file
      -output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv")
      -write.table(metadata, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
      -
      -# Print confirmation message
      -cat("Database saved as", output_file, "\n")
      +  metadata_list <- lapply(file_list, read_yaml)
      +
      +  # Combine the list of data frames into a single data frame using dplyr::bind_rows()
      +  combined_metadata <- bind_rows(metadata_list)
      +
      +  return(combined_metadata)
      +}
      +
      +# Specify the folder path
      +folder_path <- "./" #/path/to/your/folder
      +
      +# Fetch metadata from the specified folder
      +df <- get_metadata(folder_path)
      +
      +# Save the data frame as a TSV file
      +output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv")
      +write.table(df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
      +
      +# Print confirmation message
      +cat("Database saved as", output_file, "\n")
      +
      +
      +
      +
      +
      +
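The exercise notes that the merge script can also be written in Python. As a minimal sketch of the same steps (find every metadata.yml under a folder, read each one, write a single tab-separated table), the version below uses the standard library plus PyYAML when it happens to be installed. The function names `read_metadata`, `collect_metadata` and `write_tsv` are illustrative and not part of the course material, and the fallback parser is an assumption that only handles flat `key: value` files.

```python
import csv
from pathlib import Path

try:
    import yaml  # PyYAML, optional; used when available
except ImportError:
    yaml = None


def read_metadata(path):
    """Read one metadata.yml file and return its fields as a dict."""
    text = Path(path).read_text(encoding="utf-8")
    if yaml is not None:
        return yaml.safe_load(text) or {}
    # Fallback (assumption): the file contains only flat "key: value" lines.
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip().strip("\"'")
    return record


def collect_metadata(folder_path):
    """Recursively find metadata.yml files and return one dict per file."""
    files = sorted(Path(folder_path).rglob("metadata.yml"))
    return [read_metadata(f) for f in files]


def write_tsv(records, output_file):
    """Write the records as a TSV table; fields missing in a record stay empty."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t",
            restval="", lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(records)
    return columns
```

Calling `write_tsv(collect_metadata("./"), "database.tsv")` from the Assays folder would produce a table comparable to the one written by the R script. Nested YAML values are written as their string representation, so flat metadata files work best with this sketch.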

      Exercise 4, option B: create a SQLite database

      +

      Alternatively, create a SQLite database from the metadata. If you choose this option, you must still complete the first three steps outlined above. Read more in the RSQLite documentation.

      +
        +
      • Solution B. SQLite database
      • +
      +
      + +
      +
      +
      +
      +
      print("Assuming the libraries from Exercise 4 are already loaded and a dataframe has been generated from the YAML files...")
      +
      +# Reuse quiet() from Exercise 4 to install (if needed) and load the other packages.
      +quiet("DBI")
      +quiet("RSQLite")
      +
      +# Create (or open) a dated SQLite database file and copy the data.frame into it
      +
      +db_file_path <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".sqlite")
      +con <- dbConnect(RSQLite::SQLite(), db_file_path)
      +
      +dbWriteTable(con, "metadata", df,  overwrite=TRUE) #row.names = FALSE,append =
      +
      +# Print confirmation message
      +cat("Database saved as", db_file_path, "\n")
      +
      +# Close the database connection
      +dbDisconnect(con)
      +
      +
      +
      +
      +
      +
      +
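For readers following the exercise in Python, the same idea can be sketched with the standard-library `sqlite3` module. This is a simplified illustration rather than an equivalent of `dbWriteTable`: the helper names are hypothetical, every field is stored as TEXT (an assumption made to keep the sketch short), and an existing table of the same name is replaced, mirroring `overwrite=TRUE` in the R solution.

```python
import sqlite3


def quote_identifier(name):
    """Quote a table or column name for use in an SQLite statement."""
    return '"' + str(name).replace('"', '""') + '"'


def write_sqlite(records, db_path, table="metadata"):
    """Write a list of metadata dicts to a table, replacing it if it exists."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    if not columns:
        raise ValueError("no metadata fields found; nothing to write")

    table_sql = quote_identifier(table)
    column_sql = ", ".join(quote_identifier(c) + " TEXT" for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    # Fields missing from a record become NULL; everything else is stored as text.
    rows = [
        [None if record.get(c) is None else str(record[c]) for c in columns]
        for record in records
    ]

    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(f"DROP TABLE IF EXISTS {table_sql}")
            con.execute(f"CREATE TABLE {table_sql} ({column_sql})")
            con.executemany(
                f"INSERT INTO {table_sql} VALUES ({placeholders})", rows
            )
    finally:
        con.close()
    return len(rows)
```

The function returns the number of rows written, which is handy for a confirmation message like the one printed by the R code.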

    + + + +
    +

    Shiny apps

    +

    To get the most out of your metadata file and those of your colleagues, you can combine them and explore them in an interactive catalog browser. You can create interactive web apps straight from R or Python. Whether you have generated a tabular file or a SQLite database, you can browse the metadata using Shiny. Shiny apps suit researchers well because they let you create interactive visualizations and dashboards with dynamic data inputs and outputs without extensive web development knowledge. Shiny provides user interface components such as forms, tables, graphs, and maps to organize and present your data, and it lets you filter, sort, and segment data for deeper insights.

    +
    +
    +
    + +
    +
    +Tip +
    +
    +
    +
      +
    • For R Enthusiasts
    • +
    +

    Explore demos from the R Shiny community to kickstart your projects or for inspiration.

    +
      +
    • For Python Enthusiasts
    • +
    +

    Shiny for Python provides live, interactive code throughout its entire tutorial. Additionally, it offers a great tool called Playground, where you can code and test your own app to explore how different features render.

    +
    +
    +
    +
    +
    + +
    +
    +Exercise 5: Skill Booster, build an interactive catalog browser +
    +
    +
    +
    +
    +
    +
    +

    Build an interactive web app straight from R or Python. Below, you will find an example of an R shiny app. In either case, you will need to define a user interface (UI) and a server function. The UI specifies the layout and appearance of the app, including input controls and output displays. The server function contains the app’s logic, handling data manipulation, and responding to user interactions. Once you set up the UI and server, you can launch the app!

    +

    Here’s the UI and server function structure for an R Shiny app:

    +
    # Don't forget to load shiny and DT libraries!
    +
    +# Specify the layout
    +ui <- fluidPage(
    +    titlePanel(...),
    +    # Define the appearance of the app
    +    sidebarLayout(
    +        sidebarPanel(...),
    +        mainPanel(...)
    +    )
    +)
    +
    +server <- function(input, output, session) {
    +    # Define a reactive expression for data based on user inputs
    +    data <- reactive({
    +        req(input$dataInput)  # Ensure data input is available
    +        # Load or manipulate data here
    +    })
    +
    +    # Define an output table based on data
    +    output$dataTable <- renderTable({
    +        data()  # Render the data as a table
    +    })
    +
    +    # Observe a button click event and perform an action
    +    observeEvent(input$actionButton, {
    +        # Perform an action when the button is clicked
    +    })
    +
    +    # Define cleanup tasks when the app stops
    +    onStop(function() {
    +        # Close connections or save state if necessary
    +    })
    +}
    +# Run the app
    +shinyApp(ui, server)
    +

    If you need more assistance, take a look at the code below (Hint).

    +
    + +
    +
    +
    +
    +
    # R version 4.3.2
    +print("Assuming the libraries from Exercise 4 are already loaded and a dataframe has been generated from the YAML files...")
    +
    +# Reuse quiet() from Exercise 4 to install (if needed) and load the packages.
    +quiet("shiny")
    +quiet("DT")
    +
    +# UI
    +ui <- fluidPage(
    +  titlePanel("TSV File Viewer"),
    +  
    +  sidebarLayout(
    +    sidebarPanel(
    +      fileInput("file", "Choose a TSV file", accept = c(".tsv")),
    +      selectInput("filter_column", "Filter by Column:", choices = c("n_samples", "technology"), selected = "technology"),
    +      textInput("filter_value", "Filter Value:", value = ""),
    +      # if only numbers, numericInput()
    +      radioButtons("sort_order", "Sort Order:", choices = c("Ascending", "Descending"), selected = "Ascending")
    +    ),
    +    
    +    mainPanel(
    +      DTOutput("table")
    +    )
    +  )
    +)
    +
    +# Server
    +server <- function(input, output) {
    +  
    +  data <- reactive({
    +    req(input$file)
    +    df <- read.delim(input$file$datapath, sep = "\t")
    +    str(df)  # print the structure of the loaded table (str() prints by itself)
    +
    +    # Filter the DataFrame based on user input
    +    if (input$filter_column != "" && input$filter_value != "") {
    +      # Check if the column is numeric, and filter for value
    +      if (is.numeric(df[[input$filter_column]])) {
    +        df <- df[df[[input$filter_column]] >= as.numeric(input$filter_value), ]
    +      }
    +      # Check if the column is a string
    +      else if (is.character(df[[input$filter_column]])) {
    +        df <- df[df[[input$filter_column]] == input$filter_value, ]
    +      }
    +    }
    +    
    +    # Sort the DataFrame based on user input
    +    sort_order <- if (input$sort_order == "Ascending") TRUE else FALSE
    +    df <- df[order(df[[input$filter_column]], decreasing = !sort_order), ]
    +    df
    +  })
    +  
    +  output$table <- renderDT({
    +    datatable(data())
    +  })
    +}
    +
    +# Run the app
    +shinyApp(ui, server)
    +
    +
    +
    +
    +
    +
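The filter-and-sort rule inside the server function above (numeric columns keep rows at or above the entered value, text columns keep exact matches, then rows are ordered by the chosen column) can be isolated as a plain function and checked outside of Shiny. The following is a minimal Python sketch of that rule only, assuming each row is a dict; `filter_and_sort` is a hypothetical helper name, not part of Shiny or of the course material.

```python
def filter_and_sort(rows, column, value="", ascending=True):
    """Filter rows on one column, then sort them by that same column.

    Numeric columns keep rows whose value is >= the filter value;
    other columns keep rows whose value equals the filter text.
    """
    result = list(rows)
    if column and value != "":
        numeric = bool(result) and all(
            isinstance(row[column], (int, float)) for row in result
        )
        if numeric:
            threshold = float(value)  # raises ValueError for non-numeric text
            result = [row for row in result if row[column] >= threshold]
        else:
            result = [row for row in result if str(row[column]) == value]
    return sorted(result, key=lambda row: row[column], reverse=not ascending)


catalog = [
    {"technology": "RNAseq", "n_samples": 3},
    {"technology": "ChIPseq", "n_samples": 10},
    {"technology": "RNAseq", "n_samples": 7},
]
large_assays = filter_and_sort(catalog, "n_samples", "5")
```

Keeping this logic separate from the UI code makes it easier to test the catalog browser's behaviour before wiring it into an app.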

    In the optional exercise below, you’ll find a code example for using an SQLite database as input instead of a tabulated file.

    +
    +
    +
    +
    +
    +
    +
    +
    + +
    +
    +Exercise (optional) +
    +
    +
    +
    +
    +
    +
    +

    Once you’ve finished the previous exercise, consider implementing these additional ideas to maximize the utility of your catalog browser.

    +
      +
    • Use SQLite databases as input
    • +
    • Add a functionality to only select certain columns uiOutput("column_select")
    • +
    • Filter columns by value using column_filter_select()
    • +
    • Add multiple tabs using tabsetPanel()
    • +
    • Add buttons to order numeric columns ascending or descending using radioButtons()
    • +
    • Use SQL aggregation functions (e.g., SUM, COUNT, AVG) to perform custom data summaries and calculations.
    • +
    • Add a tab tabPanel() to create a project directory interactively (and fill up the metadata fields), tips: dir.create(), data.frame(), write.table()
    • +
    • Modify existing entries
    • +
    • Visualize results using Cirrocumulus, an interactive visualization tool for large-scale single-cell genomics data.
    • +
    +
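As an illustration of the SQL aggregation idea in the list above, the sketch below summarizes a metadata table per technology with COUNT, SUM and AVG, using Python's standard `sqlite3` module. The table layout (a `metadata` table with `technology` and `n_samples` columns) follows the column names used in the Shiny example; the demo rows and the function names are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up metadata rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE metadata (technology TEXT, n_samples INTEGER)")
    con.executemany(
        "INSERT INTO metadata VALUES (?, ?)",
        [("RNAseq", 3), ("RNAseq", 7), ("ChIPseq", 10)],
    )
    con.commit()
    return con


def summarize_by_technology(con):
    """Return (technology, n_datasets, total_samples, mean_samples) tuples."""
    query = """
        SELECT technology,
               COUNT(*)       AS n_datasets,
               SUM(n_samples) AS total_samples,
               AVG(n_samples) AS mean_samples
        FROM metadata
        GROUP BY technology
        ORDER BY technology
    """
    return con.execute(query).fetchall()


con = build_demo_db()
summary = summarize_by_technology(con)
con.close()
```

In a Shiny app, a summary query like this could feed a second tab, so that the raw catalog and its aggregated view sit side by side.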

    If you need some assistance, take a look at the code below (Hint).

    +
    + +
    +
    +
    +
    +

    Explore an example with advanced features such as a two-tab layout, filtering by numeric values and matching strings, and a color-customized dashboard here.

    +
    +
    +
    +
    +
    -
    -

    5. Version control of your data analysis using Git and GitHub

    -

    Version control is a systematic approach to tracking changes made to a project over time. It provides a structured means of documenting alterations, allowing you to revisit and understand the evolution of your work. In research data management and data analytics, version control is very important and gives you a lot of advantages.

    -

    Git is a distributed version control system that enables developers and researchers to efficiently manage their project’s history, collaborate seamlessly, and ensure data integrity. At its core, Git operates through the following principles and mechanisms: On the other hand, GitHub is a web-based platform that enhances Git’s capabilities by providing a collaborative and centralized hub for hosting Git repositories. It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects.

    +
    +
    +

    5. Version control using Git and GitHub

    +

    Version control involves systematically tracking changes to a project over time, offering a structured way to document revisions and understand the progression of your work. In research data management and data analytics, it plays a critical role and provides numerous benefits.

    +

    Git is a distributed version control system that helps developers and researchers efficiently manage project history, collaborate seamlessly, and maintain data integrity. On the other hand, GitHub is a web-based platform that builds on Git’s functionality by providing a centralized, collaborative hub for hosting Git repositories. It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects.

    @@ -1014,15 +1332,40 @@

    -

    GitHub allows users to create organizations and teams that will collaborate or create repositories under the same umbrella organization. If you would like to create an educational organization in GitHub, you can do so for free! For example, you could create a GitHub account for your lab.

    -

    To create a GitHub organization, follow these instructions

    -

    After you have created the GitHub organization, make sure that you create your repositories under the organization space and not your user!

    +

    GitHub users can create organizations, which let groups collaborate and keep their repositories under one umbrella. You can create an educational organization on GitHub for free; for example, you could create one for your lab.

    +

    Follow these instructions to create a GitHub organization.

    +

    Once you’ve established your GitHub organization, be sure to create your repositories within the organization’s space rather than under your personal user account. This keeps your projects centralized and accessible to the entire group. Best practices for managing an organization on GitHub include setting clear access permissions, regularly reviewing roles and memberships, and organizing repositories effectively to keep your projects structured and easy to navigate.

    +

    +
    +
    +

    Setting up a GitHub repository for your project folder

    +

    Version controlling your data analysis folders becomes straightforward once you’ve established your Cookiecutter templates. After you’ve created several folder structures and metadata using your Cookiecutter template, you can manage version control by either converting those folders into Git repositories or copying a folder into an existing Git repository. Both approaches are explained in Lesson 5.

    +
    +
    +
    + +
    +
    +Exercise 6: initialize a repository from an existing folder: +
    +
    +
    +
    +
    +
    +
    +
      +
    1. Initialize the repository: Begin by running the command git init in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. See (git init) for more details.
    2. +
    3. Create a remote repository: Once the local repository is initialized, create a new, empty repository on GitHub (via the website or GitHub Desktop).
    4. +
    5. Connect the remote repository: Add the GitHub repository URL to your local repository using the command git remote add origin <URL>. This associates the remote repository with the name “origin.”
    6. +
    7. Commit changes: If you have files you want to add to your repository, stage them using git add ., then create a commit to save a snapshot of your changes with git commit -m "add local folder".
    8. +
    9. Push to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using git push -u origin main.
    10. +
    +
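The five steps above map onto five git commands. As a rough sketch, they can be collected in a small Python helper that builds the command list and, optionally, runs it with `subprocess`. The helper names and the example URL are hypothetical, and the sketch assumes that git is installed and that your local branch is called `main` (older git versions may create `master` by default).

```python
import subprocess


def git_setup_commands(remote_url, message="add local folder", branch="main"):
    """Return the git commands from the exercise, in order, as argument lists."""
    return [
        ["git", "init"],                                  # 1. initialize the repository
        ["git", "remote", "add", "origin", remote_url],   # 3. connect the remote
        ["git", "add", "."],                              # 4. stage all files
        ["git", "commit", "-m", message],                 # 4. commit a snapshot
        ["git", "push", "-u", "origin", branch],          # 5. push and set upstream
    ]


def run_commands(commands, cwd="."):
    """Run each command in the project directory, stopping at the first failure."""
    for command in commands:
        print("$", " ".join(command))
        subprocess.run(command, cwd=cwd, check=True)


# Hypothetical URL; replace it with the repository you created in step 2.
commands = git_setup_commands("https://github.com/username/project.git")
```

Calling `run_commands(commands, cwd="path/to/project")` would execute the sequence. Step 2 (creating the empty remote repository) still has to be done on GitHub beforehand.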
    +
    +
    -
    -

    Creating a git repo online and copying your project folder

    -

    Version controlling your data analysis folders, a.k.a. Project folder, is very easy once you have set up your Cookiecutter templates. The simplest way of doing this is to first create a remote GitHub repository from the webpage (or from the Desktop app, if you are using it) with a proper project name. Then git clone that repository you just made into your Projects main folder. Then, use cookiecutter to create a project folder template and copy-paste the contents of the folder template to your cloned repo. Remember to fill up your metadata and description files! If you wish, you could already git add, commit, and push the first changes to the folders and continue from there on.

    -

    Go back to the course material lesson 5 and read the differences between converting folders to git repositories and cloning a folder to an existing git repository.

    @@ -1039,48 +1382,80 @@

    GitHub Pages

    -

    Once you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we really recommend that you follow the nice tutorial that GitHub has put for you. Nonetheless, we will see the main steps in the exercise below.

    -

    There are many different ways to create your web pages. We recommend using Mkdocs and Mkdocs materials as a framework to create a nice webpage simply. The folder templates that we used as an example in the previous exercise already contain everything you need to start a webpage. Nonetheless, you will need to understand the basics of MkDocs and MkDocs materials to design a webpage to your liking. MkDocs is a static webpage generator that is very easy to use, while MkDocs materials is an extension of the tool that gives you many more options to customize your website. Check out their web pages to get started!

    +

    After creating your repository and hosting it on GitHub, you can publish your data analysis reports (such as Jupyter Notebooks, R Markdown files, or HTML reports) on a GitHub Pages website. Setting up GitHub Pages is straightforward, and we recommend following GitHub’s tutorial; the exercise below goes through the key steps. There are several ways to create your web pages. We suggest Quarto as a framework for building a professional-looking website. The folder templates from the previous exercise already contain the necessary elements to launch a webpage, and familiarity with the basics of Quarto will help you design a webpage that suits your preferences. MkDocs is another common option; if you want to use MkDocs instead, click here and follow the instructions.

    +
    +
    +
    + +
    +
    +Tip +
    +
    +
    +

    Here are some useful links to get started with GitHub Pages:

    + +
    +
    -
    +
    -Exercise 5: make a project folder and publish a data analysis webpage +Exercise 7: Create a GitHub Page using Quarto
    -
    +
      -
    1. Configure your main GitHub Page and its repo

      -

      The first step is to set up the main GitHub Page site and the repository that will host it. This is very simple, as you will only need to follow these steps. In a Markdown document, outline the primary objectives of the organization and provide an overview of ongoing research projects. After you have created the organization/usernamegithub.io, it is time to configure your Project repository webpage using MkDocs!

    2. -
    3. Start a new project from Cookiecutter or use one from the previous exercise.

      -

      If you use a Project repo from the first exercise, go to the next paragraph. Using Cookiecutter, create a new data analysis project. Remember to fill up your metadata and description files! After you have created the folder, it would be best to initialize a Git repo following the instructions from the previous section.

      -

      Next, link your data of interest (or create a small fake dataset) and make an example of a data analysis notebook/report (this could be just a scatter plot of a random matrix of values). Depending on your setup, you might be using Jupyter Notebooks or Rmarkdowns. The extensions that we have installed using pip allow you to directly add a Jupyter Notebook file to the mkdocs.yml navigation section. On the other hand, if you are using Rmarkdown, you will have to knit your document into either an HTML page or a GitHub document.

      -

      For the purposes of this exercise, we have already included a basic index.md markdown file that can serve as the intro page of your repo, and a jupyter_example.ipynb with some code in it. You are welcome to modify them further to test them out!

    4. -
    5. Use MkDocs to create your webpage

      -

      When you are happy with your files and are ready to publish them, make sure to add, commit, and push the changes to the remote. Then, build up your webpage using MkDocs and the mkdocs gh-deploy command from the same directory where the mkdocs.yml file is. For example, if your mkdocs.yml for your Project folder is in /Users/JARH/Projects/project1_JARH_20231010/mkdocs.yml, do cd /Users/JARH/Projects/project1_JARH_20231010/ and then mkdocs gh-deploy. This requires a couple of changes in your GitHub organization settings.

      -

      Remember to make sure that your markdowns, images, reports, etc., are included in the docs folder and properly set up in the navigation section of your mkdocs.yml file.

      -

      Finally, we only need to set up the GitHub Project repo settings.

    6. -
    7. Publishing your GitHub Page

      -

      Go to your GitHub repo settings and configure the Page section. Since you are using the mkdocs gh-deploy command to publish your site in the gh-pages branch (as explained the the mkdocs documentation), we need to change where GitHub is fetching the website. You will need to configure the settings of this repository in GitHub so that the Page is taken from the gh-pages branch and the root folder.

      -
      -
      -

      -
      GitHub Pages setup
      -
      -
      -
        -
      • Branch should be gh-pages
      • -
      • Folder should be root
      • -
      -

      After a couple of minutes, your webpage should be ready! You should be able to see your webpage through the link provided in the Page section!

    8. +
    9. Head over to GitHub and create a new public repository named username.github.io, where username is your username (or organization name) on GitHub. If the first part of the repository name does not exactly match your username, it will not work, so make sure to get it right.

    10. +
    11. Go to the folder where you want to store your project, and clone the new repository: git clone https://github.com/username/username.github.io (or use GitHub Desktop)

    12. +
    13. Create a new file named _quarto.yml

      +
      +
      +
      _quarto.yml
      +
      +
      project:
      +    type: website
      +
    14. +
    15. Open the terminal

      ```{.bash filename="Terminal"}
      # Add a .nojekyll file to the root of the repository so that GitHub Pages
      # does no additional processing of your published site
      touch .nojekyll  # on Windows: copy NUL .nojekyll

      # Render and push it to GitHub
      quarto render
      git commit -m "Publish site to docs/"
      git push
      ```

    16. +
    17. If you do not have a gh-pages branch, you can create one as follows

      +
      +
      +
      Terminal
      +
      +
      git checkout --orphan gh-pages
      +git reset --hard # make sure all changes are committed before running this!
      +git commit --allow-empty -m "Initialising gh-pages branch"
      +git push origin gh-pages
      +
    18. +
    19. Before attempting to publish, ensure that the source branch for your repository is gh-pages and that the site directory is set to the repository root (/).

      +

    20. +
    21. Do not check your _site directory into version control. Add the output directory of your project to .gitignore:

      +
      +
      +
      .gitignore
      +
      +
      /.quarto/
      +/_site/
      +
    22. +
    23. Now it is time to publish your website

      +
      +
      +
      Terminal
      +
      +
      quarto publish gh-pages
      +
    24. +
    25. Once you’ve completed a local publish, add a publish.yml GitHub Action to your project by creating this YAML file and saving it to .github/workflows/publish.yml. Read how to do it here
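
The files created by hand in the steps above (_quarto.yml, .nojekyll and the .gitignore entries) can also be generated with a short script. The sketch below is one possible way to do it in Python, using the file contents shown in the exercise; `scaffold_quarto_site` is a hypothetical helper, not a Quarto command, and rendering and publishing are still done with the quarto commands in the terminal.

```python
from pathlib import Path

# File contents taken from the exercise steps above.
QUARTO_CONFIG = "project:\n  type: website\n"
IGNORED_PATHS = ["/.quarto/", "/_site/"]


def scaffold_quarto_site(repo_dir):
    """Create _quarto.yml, .nojekyll and .gitignore entries in repo_dir."""
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)

    # Keep an existing configuration untouched.
    config = repo / "_quarto.yml"
    if not config.exists():
        config.write_text(QUARTO_CONFIG, encoding="utf-8")

    # An empty .nojekyll file tells GitHub Pages to skip extra processing.
    (repo / ".nojekyll").touch()

    # Append only the ignore entries that are not already listed.
    gitignore = repo / ".gitignore"
    existing = (
        gitignore.read_text(encoding="utf-8").splitlines()
        if gitignore.exists()
        else []
    )
    missing = [entry for entry in IGNORED_PATHS if entry not in existing]
    if missing:
        gitignore.write_text("\n".join(existing + missing) + "\n", encoding="utf-8")

    return sorted(path.name for path in repo.iterdir())
```

Running the function a second time leaves the files unchanged, so it is safe to call it on a repository that is already partly set up.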

    -

    Now it is also possible to include this repository webpage in your main webpage organizationgithub.io by including the link of the repo website (https://organizationgithub.io/repo-name) in the navigation section of the mkdocs.yml file in the main organizationgithub.io repo.

    @@ -1110,7 +1485,7 @@

    Zenodo

    Zenodo (https://zenodo.org/) is an open-access digital repository designed to facilitate the archiving of scientific research outputs. It operates under the umbrella of the European Organization for Nuclear Research (CERN) and is supported by the European Commission. Zenodo accommodates a broad spectrum of research outputs, including datasets, papers, software, and multimedia files. This versatility makes it an invaluable resource for researchers across a wide array of domains, promoting transparency, collaboration, and the advancement of knowledge on a global scale.

    Operating on a user-friendly web platform, Zenodo allows researchers to easily upload, share, and preserve their research data and related materials. Upon deposit, each item is assigned a unique Digital Object Identifier (DOI), granting it a citable status and ensuring its long-term accessibility. Additionally, Zenodo provides robust metadata capabilities, enabling researchers to enrich their submissions with detailed contextual information. In addition, it allows you to link your GitHub account, providing a streamlined way to archive a specific release of your GitHub repository directly into Zenodo. This integration simplifies the process of preserving a snapshot of your project’s progress for long-term accessibility and citation.

    -
    +
    @@ -1119,7 +1494,7 @@

    Zenodo

    -
    +
diff --git a/_site/develop/scripts/shiny_sqlite_advanced.r b/_site/develop/scripts/shiny_sqlite_advanced.r
new file mode 100644
index 00000000..99946538
--- /dev/null
+++ b/_site/develop/scripts/shiny_sqlite_advanced.r
@@ -0,0 +1,169 @@
+#!/usr/bin/env Rscript
+
+# Author: Alba Refoyo Martinez
+# Copyright: Copyright 2024, University of Copenhagen
+# Email: gsd818@ku.dk
+# License: MIT
+# R version: 4.3.2
+
+# Define the UI
+ui <- fluidPage(
+  titlePanel("SQLite R Shiny App"),
+
+  # Use tabsetPanel to add multiple tabs
+  tabsetPanel(
+    # Existing tab for browsing the SQLite database
+    tabPanel("Browse Database",
+      sidebarLayout(
+        sidebarPanel(
+          fileInput("db_file", "Select SQLite Database File", accept = c(".sqlite")),
+          uiOutput("table_select"),
+          uiOutput("column_filter_select"),
+          textInput("filter_value", "Find by value", ""),
+          actionButton("refresh", "Refresh Tables")
+          # UI output for selecting columns (populated based on the selected table)
+          # uiOutput("column_select"),
+        ),
+        mainPanel(
+          DTOutput("tableData")
+        )
+      )),
+
+    # New tab for creating a project directory and filling metadata fields
+    tabPanel("Create Project Directory",
+      sidebarLayout(
+        sidebarPanel(
+          textInput("project_name", "Project Name:", value = "MyProject"),
+          textInput("metadata_field1", "Metadata Field 1:", value = ""),
+          textInput("metadata_field2", "Metadata Field 2:", value = ""),
+          actionButton("create_project", "Create Project")
+        ),
+        mainPanel(
+          textOutput("message") # To display feedback messages
+        )
+      )
+    )
+  )
+)
+
+# Define the server
+server <- function(input, output, session) {
+  # Reactive value to hold the database connection
+  db_conn <- reactiveVal(NULL)
+
+  # Observe changes in the file input
+  observeEvent(input$db_file, {
+    # Check if a file is uploaded
+    if (!is.null(input$db_file)) {
+      # Get the path to the uploaded file
+      db_path <- input$db_file$datapath
+
+      # Disconnect any existing connection
+      if (!is.null(db_conn())) {
+        dbDisconnect(db_conn())
+      }
+
+      # Establish a new connection to the SQLite database
+      conn <- dbConnect(RSQLite::SQLite(), dbname = db_path)
+      db_conn(conn)
+
+      # Update the list of tables
+      updateTableChoices()
+    }
+  })
+
+  # Function to update the list of tables in the database
+  updateTableChoices <- function() {
+    # Ensure there's a database connection
+    if (!is.null(db_conn())) {
+      # Retrieve the list of tables in the database
+      tables <- dbListTables(db_conn())
+      # Update the choices in the select input
+      updateSelectInput(session, "table", choices = tables)
+    }
+  }
+
+  # Observe the refresh button
+  observeEvent(input$refresh, {
+    updateTableChoices()
+  })
+
+  # Render the select input for tables
+  output$table_select <- renderUI({
+    selectInput("table", "Select a table", choices = character(0))
+  })
+
+  # Render the select input for columns (choices populated based on the selected table)
+  # output$column_select <- renderUI({
+  #   req(input$table)
+  #   # Read data from the selected table
+  #   data <- dbReadTable(db_conn(), input$table)
+  #   # Get the column names from the data
+  #   columns <- names(data)
+  #   # Create a select input for the column choices
+  #   selectInput("columns", "Select columns", choices = columns, multiple = TRUE)
+  # })
+
+
+  # Render the select input for columns to filter by
+  output$column_filter_select <- renderUI({
+    req(input$table)
+    data <- dbReadTable(db_conn(), input$table)
+    columns <- names(data)
+    selectInput("column_filter", "Filter by column", choices = columns)
+  })
+
+  # Display data from the selected table
+  output$tableData <- renderDT({
+    req(input$table)
+    data <- dbReadTable(db_conn(), input$table)
+
+    # If filtering by columns, ensure they are selected
+    # req(input$columns)
+    #filtered_data <- data[, input$columns, drop = FALSE]
+
+    if (!is.null(input$column_filter) && input$filter_value != "") {
+      filtered_data <- data[data[[input$column_filter]] == input$filter_value, ]
+    } else {
+      filtered_data <- data
+    }
+
+    datatable(filtered_data)
+  })
+ + # Observe the create_project button + observeEvent(input$create_project, { + project_name <- input$project_name + metadata_field1 <- input$metadata_field1 + metadata_field2 <- input$metadata_field2 + + # Define the project directory path + project_dir <- file.path(getwd(), project_name) + + # Check if the directory already exists + if (dir.exists(project_dir)) { + output$message <- renderText("Directory already exists. Please choose a different project name.") + } else { + # Create the directory + dir.create(project_dir) + + # Save the metadata fields to a TSV file in the project directory + metadata <- data.frame(Field1 = metadata_field1, Field2 = metadata_field2) + metadata_file <- file.path(project_dir, "metadata.tsv") + write.table(metadata, metadata_file, sep = "\t", row.names = FALSE, col.names = TRUE) + + # Provide feedback to the user + output$message <- renderText(paste("Project created successfully in", project_dir)) + } + }) + + # Close the database connection when the app is stopped + onStop(function() { + if (!is.null(db_conn())) { + dbDisconnect(db_conn()) + } + }) +} + +# Run the app +shinyApp(ui = ui, server = server) \ No newline at end of file diff --git a/_site/index.html b/_site/index.html index 81afc5b4..f7820231 100644 --- a/_site/index.html +++ b/_site/index.html @@ -164,7 +164,7 @@

    Computational Research Data Management

    Modified
    -

    April 17, 2024

    +

    April 29, 2024

    @@ -180,6 +180,19 @@

    Computational Research Data Management

    # You should hide the navigation if there are no subsections # You should hide the Table of Contents if there are no important titles --> +
    +
    +
    + +
    +
    +Practical RDM workshop +
    +
    +
    +

    We offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the Sandbox website.

    +
    +

    Research Data Management for biological data

    The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best-practice guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s NGS research landscape, as well as in other fields related to health and bioinformatics.

    diff --git a/_site/practical_workflows.html b/_site/practical_workflows.html index 0dbf8cf6..14c35da4 100644 --- a/_site/practical_workflows.html +++ b/_site/practical_workflows.html @@ -190,7 +190,7 @@
    Modified
    -

    April 22, 2024

    +

    April 29, 2024

    @@ -242,8 +242,8 @@
    -
    -

    Workflows

    +
    +

    FAIR Workflows

    Data analysis typically involves the use of different tools, algorithms, and scripts. It often requires multiple steps to transform, filter, aggregate, and visualize data. The process can be time-consuming because each tool may demand specific inputs and parameter settings. As analyses become more complex, the importance of reproducible and scalable automated workflow management increases. Workflow management encompasses tasks such as parallelization, resumption, logging, and data provenance.

    If you develop your own software, make sure you follow FAIR principles. We strongly encourage you to follow these FAIR recommendations and to register your computational workflow here.

    Using workflow managers, you ensure:

    @@ -351,6 +351,9 @@

    Nextflow

    +
    +
    +

    FAIR environments

    Sources

      @@ -359,6 +362,7 @@

      Sources

    • https://bioconda.github.io
    • Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
    • Köster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014.
    • +
    • faircookbook workflows
    diff --git a/_site/search.json b/_site/search.json index 5c2e2185..821ce494 100644 --- a/_site/search.json +++ b/_site/search.json @@ -93,7 +93,7 @@ "href": "develop/practical_workshop.html#version-control-of-your-data-analysis-using-git-and-github", "title": "Practical material", "section": "5. Version control of your data analysis using Git and GitHub", - "text": "5. Version control of your data analysis using Git and GitHub\nVersion control is a systematic approach to tracking changes made to a project over time. It provides a structured means of documenting alterations, allowing you to revisit and understand the evolution of your work. In research data management and data analytics, version control is very important and gives you a lot of advantages.\nGit is a distributed version control system that enables developers and researchers to efficiently manage their project’s history, collaborate seamlessly, and ensure data integrity. At its core, Git operates through the following principles and mechanisms: On the other hand, GitHub is a web-based platform that enhances Git’s capabilities by providing a collaborative and centralized hub for hosting Git repositories. It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects.\n\n\n\n\n\n\nCreate a GitHub organization for your lab or department\n\n\n\nGitHub allows users to create organizations and teams that will collaborate or create repositories under the same umbrella organization. If you would like to create an educational organization in GitHub, you can do so for free! 
For example, you could create a GitHub account for your lab.\nTo create a GitHub organization, follow these instructions\nAfter you have created the GitHub organization, make sure that you create your repositories under the organization space and not your user!\n\n\n\nCreating a git repo online and copying your project folder\nVersion controlling your data analysis folders, a.k.a. Project folder, is very easy once you have set up your Cookiecutter templates. The simplest way of doing this is to first create a remote GitHub repository from the webpage (or from the Desktop app, if you are using it) with a proper project name. Then git clone that repository you just made into your Projects main folder. Then, use cookiecutter to create a project folder template and copy-paste the contents of the folder template to your cloned repo. Remember to fill up your metadata and description files! If you wish, you could already git add, commit, and push the first changes to the folders and continue from there on.\nGo back to the course material lesson 5 and read the differences between converting folders to git repositories and cloning a folder to an existing git repository.\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nIf you would like to know more about Git commits and the best way to make clear Git messages, check out this post!\n\n\n\n\nGitHub Pages\nOnce you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we really recommend that you follow the nice tutorial that GitHub has put for you. Nonetheless, we will see the main steps in the exercise below.\nThere are many different ways to create your web pages. We recommend using Mkdocs and Mkdocs materials as a framework to create a nice webpage simply. 
The folder templates that we used as an example in the previous exercise already contain everything you need to start a webpage. Nonetheless, you will need to understand the basics of MkDocs and MkDocs materials to design a webpage to your liking. MkDocs is a static webpage generator that is very easy to use, while MkDocs materials is an extension of the tool that gives you many more options to customize your website. Check out their web pages to get started!\n\n\n\n\n\n\nExercise 5: make a project folder and publish a data analysis webpage\n\n\n\n\n\n\n\n\nConfigure your main GitHub Page and its repo\nThe first step is to set up the main GitHub Page site and the repository that will host it. This is very simple, as you will only need to follow these steps. In a Markdown document, outline the primary objectives of the organization and provide an overview of ongoing research projects. After you have created the organization/usernamegithub.io, it is time to configure your Project repository webpage using MkDocs!\nStart a new project from Cookiecutter or use one from the previous exercise.\nIf you use a Project repo from the first exercise, go to the next paragraph. Using Cookiecutter, create a new data analysis project. Remember to fill up your metadata and description files! After you have created the folder, it would be best to initialize a Git repo following the instructions from the previous section.\nNext, link your data of interest (or create a small fake dataset) and make an example of a data analysis notebook/report (this could be just a scatter plot of a random matrix of values). Depending on your setup, you might be using Jupyter Notebooks or Rmarkdowns. The extensions that we have installed using pip allow you to directly add a Jupyter Notebook file to the mkdocs.yml navigation section. 
On the other hand, if you are using Rmarkdown, you will have to knit your document into either an HTML page or a GitHub document.\nFor the purposes of this exercise, we have already included a basic index.md markdown file that can serve as the intro page of your repo, and a jupyter_example.ipynb with some code in it. You are welcome to modify them further to test them out!\nUse MkDocs to create your webpage\nWhen you are happy with your files and are ready to publish them, make sure to add, commit, and push the changes to the remote. Then, build up your webpage using MkDocs and the mkdocs gh-deploy command from the same directory where the mkdocs.yml file is. For example, if your mkdocs.yml for your Project folder is in /Users/JARH/Projects/project1_JARH_20231010/mkdocs.yml, do cd /Users/JARH/Projects/project1_JARH_20231010/ and then mkdocs gh-deploy. This requires a couple of changes in your GitHub organization settings.\nRemember to make sure that your markdowns, images, reports, etc., are included in the docs folder and properly set up in the navigation section of your mkdocs.yml file.\nFinally, we only need to set up the GitHub Project repo settings.\nPublishing your GitHub Page\nGo to your GitHub repo settings and configure the Page section. Since you are using the mkdocs gh-deploy command to publish your site in the gh-pages branch (as explained the the mkdocs documentation), we need to change where GitHub is fetching the website. You will need to configure the settings of this repository in GitHub so that the Page is taken from the gh-pages branch and the root folder.\n\n\n\nGitHub Pages setup\n\n\n\nBranch should be gh-pages\nFolder should be root\n\nAfter a couple of minutes, your webpage should be ready! 
You should be able to see your webpage through the link provided in the Page section!\n\nNow it is also possible to include this repository webpage in your main webpage organizationgithub.io by including the link of the repo website (https://organizationgithub.io/repo-name) in the navigation section of the mkdocs.yml file in the main organizationgithub.io repo." + "text": "5. Version control of your data analysis using Git and GitHub\nVersion control involves systematically tracking changes to a project over time, offering a structured way to document revisions and understand the progression of your work. In research data management and data analytics, it plays a critical role and provides numerous benefits.\nGit is a distributed version control system that helps developers and researchers efficiently manage project history, collaborate seamlessly, and maintain data integrity. On the other hand, GitHub is a web-based platform that builds on Git’s functionality by providing a centralized, collaborative hub for hosting Git repositories. It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects.\n\n\n\n\n\n\nCreate a GitHub organization for your lab or department\n\n\n\nGitHub users can create organizations, allowing groups to collaborate or create repositories under the same organization umbrella. You can create an educational organization on Github for free, by setting up a Github account for your lab.\nFollow these instructions to create a GitHub organization.\nOnce you’ve established your GitHub organization, be sure to create your repositories within the organization’s space rather than under your personal user account. This keeps your projects centralized and accessible to the entire group. 
Best practices for managing an organization on GitHub include setting clear access permissions, regularly reviewing roles and memberships, and organizing repositories effectively to keep your projects structured and easy to navigate.\n\n\n\nSetting up a GitHub repository for your project folder\nVersion controlling your data analysis folders becomes straightforward once you’ve established your Cookiecutter templates. After you’ve created several folder structures and metadata using your Cookiecutter template, you can manage version control by either converting those folders into Git repositories or copying a folder into an existing Git repository. Both approaches are explained in Lesson 5.\n\n\n\n\n\n\nExercise 6: initialize a repository from an existing folder:\n\n\n\n\n\n\n\n\nInitialize the repository: Begin by running the command git init in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. See (git init) for more details.\nCreate a remote repository: Once the local repository is initialized, create an empty new repository on GitHub (website or Github Desktop).\nConnect the remote repository: Add the GitHub repository URL to your local repository using the command git remote add origin <URL>. 
This associates the remote repository with the name “origin.”\nCommit changes: If you have files you want to add to your repository, stage them using git add ., then create a commit to save a snapshot of your changes with git commit -m \"add local folder\".\nPush to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using git push -u origin main.\n\n\n\n\n\n\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nIf you would like to know more about Git commits and the best way to make clear Git messages, check out this post!\n\n\n\n\nGitHub Pages\nOnce you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, R Markdown files, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we really recommend that you follow the nice tutorial that GitHub has put for you. Nonetheless, we will see the main steps in the exercise below.\nThere are many different ways to create your web pages. We recommend using Mkdocs and Mkdocs materials as a framework to create a nice webpage simply. The folder templates that we used as an example in the previous exercise already contain everything you need to start a webpage. Nonetheless, you will need to understand the basics of MkDocs and MkDocs materials to design a webpage to your liking. MkDocs is a static webpage generator that is very easy to use, while MkDocs materials is an extension of the tool that gives you many more options to customize your website. Check out their web pages to get started!\n\n\n\n\n\n\nExercise 5: make a project folder and publish a data analysis webpage\n\n\n\n\n\n\n\n\nConfigure your main GitHub Page and its repo\nThe first step is to set up the main GitHub Page site and the repository that will host it. This is very simple, as you will only need to follow these steps. 
In a Markdown document, outline the primary objectives of the organization and provide an overview of ongoing research projects. After you have created the organization/usernamegithub.io, it is time to configure your Project repository webpage using MkDocs!\nStart a new project from Cookiecutter or use one from the previous exercise.\nIf you use a Project repo from the first exercise, go to the next paragraph. Using Cookiecutter, create a new data analysis project. Remember to fill up your metadata and description files! After you have created the folder, it would be best to initialize a Git repo following the instructions from the previous section.\nNext, link your data of interest (or create a small fake dataset) and make an example of a data analysis notebook/report (this could be just a scatter plot of a random matrix of values). Depending on your setup, you might be using Jupyter Notebooks or Rmarkdowns. The extensions that we have installed using pip allow you to directly add a Jupyter Notebook file to the mkdocs.yml navigation section. On the other hand, if you are using Rmarkdown, you will have to knit your document into either an HTML page or a GitHub document.\nFor the purposes of this exercise, we have already included a basic index.md markdown file that can serve as the intro page of your repo, and a jupyter_example.ipynb with some code in it. You are welcome to modify them further to test them out!\nUse MkDocs to create your webpage\nWhen you are happy with your files and are ready to publish them, make sure to add, commit, and push the changes to the remote. Then, build up your webpage using MkDocs and the mkdocs gh-deploy command from the same directory where the mkdocs.yml file is. For example, if your mkdocs.yml for your Project folder is in /Users/JARH/Projects/project1_JARH_20231010/mkdocs.yml, do cd /Users/JARH/Projects/project1_JARH_20231010/ and then mkdocs gh-deploy. 
This requires a couple of changes in your GitHub organization settings.\nRemember to make sure that your markdowns, images, reports, etc., are included in the docs folder and properly set up in the navigation section of your mkdocs.yml file.\nFinally, we only need to set up the GitHub Project repo settings.\nPublishing your GitHub Page\nGo to your GitHub repo settings and configure the Page section. Since you are using the mkdocs gh-deploy command to publish your site in the gh-pages branch (as explained in the MkDocs documentation), we need to change where GitHub is fetching the website. You will need to configure the settings of this repository in GitHub so that the Page is taken from the gh-pages branch and the root folder.\n\n\n\nGitHub Pages setup\n\n\n\nBranch should be gh-pages\nFolder should be root\n\nAfter a couple of minutes, your webpage should be ready! You should be able to see your webpage through the link provided in the Page section!\n\nNow it is also possible to include this repository webpage in your main webpage organizationgithub.io by including the link of the repo website (https://organizationgithub.io/repo-name) in the navigation section of the mkdocs.yml file in the main organizationgithub.io repo." }, { "objectID": "develop/practical_workshop.html#archive-github-repositories-on-zenodo", @@ -459,7 +459,7 @@ "href": "index.html", "title": "Computational Research Data Management", "section": "", - "text": "The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best practices guidelines in data organization, metadata annotation, version control, and data preservation. 
These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s NGS research landscape, as well as in other related fields to health and bioinformatics.\n\n\n\n\n\n\nCourse Overview\n\n\n\n\n📖 Syllabus:\n\n\nData Lifecycle Management\nData Management Plans (DMPs)\nData Organization and storage\nDocumentation standards for biodata\nVersion Control and Collaboration\nProcessing and analyzing biodata\nStoring and sharing biodata\n\n\n⏰ Total Time Estimation: X hours\n\n📁 Supporting Materials:\n\n👨‍💻 Target Audience: Ph.D., MSc, anyone interested in RDM for NGS data or other related fields within bioinformatics.\n👩‍🎓 Level: Beginner.\n🔒 License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.\n\n💰 Funding: This project was funded by the Novo Nordisk Fonden (NNF20OC0063268).\n\n\n\n\n\n\n\n\n\n\nCourse Requirements\n\n\n\n\nBasic understanding Next Generation Sequencing data and formats.\nCommand Line experience\nBasic programming experience\nQuarto or Mkdocs tools\n\n\n\nThis course offers participants with an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on practical understanding of RDM principles and the importance of efficient handling of large datasets. 
In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability.\nParticipants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Special attention is given to the development of Data Management Plans (DMPs) with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity. Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility.\nThe course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.\n\n\n\n\n\n\nCourse Goals\n\n\n\nBy the end of this workshop, you should be able to apply the following concepts in the context of Next Generation Sequencing data:\n\nUnderstand the Importance of Research Data Management (RDM)\nFamiliarize Yourself with FAIR and Open Science Principles\nDraft a Data Management Plan for your own Data\nEstablish File and Folder Naming Conventions\nEnhance Data with Descriptive Metadata\nImplement Version Control for Data Analysis\nSelect an Appropriate Repository for Data Archiving\nMake your data analysis and workflows reproducible and FAIR\n\n\n\n\n\n\n\n\n\nWarning\n\n\n\nThis is a computational workshop that focuses primarily on the digital aspect of our data. 
While wet lab Research Data Management (RDM) involving protocols, instruments, reagents, ELM or LIMS systems is integral to the entire RDM process, it won’t be covered in this course.\nAs part of effective data management, it’s crucial to prioritize strategies that ensure security and privacy. While these aspects are important, please note that they won’t be covered in our course. However, we highly recommend enrolling in the GDPR course offered by Center for Health Data Science, specially if you’re working with sensitive data. This course specifically focuses on GDPR compliance and will provide you with valuable insights and skills in managing data privacy and security.\n\n\n\n\n\nUniversity of Copenhagen\nUniversity Library of Southern Denmark\nTechnical University of Denmark\nAalborg University\nAarhus University\n\n\n\n\n\nRDMkit, ELIXIR (2021) Research Data Management Kit. A deliverable from the EU-funded ELIXIR-CONVERGE project (grant agreement 871075).\nUniversity of Copenhagen Research Data Management Team.\nMartin Proks and Sarah Lundregan, Brickman Lab, NNF Center for Stem Cell Biology (reNEW), University of Copenhagen.\nRichard Dennis, Data Steward, NNF Center for Stem Cell Biology (reNEW), University of Copenhagen.\nNBISweden." + "text": "Practical RDM workshop\n\n\n\nWe offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the Sandbox website." 
}, { "objectID": "develop/06_file_structure.html", @@ -917,7 +917,7 @@ { "objectID": "practical_workflows.html", "href": "practical_workflows.html", - "title": "Workflows", + "title": "FAIR Workflows", "section": "", "text": "Course Overview\n\n\n\n\n⏰ Total Time Estimation: X hours\n\n📁 Supporting Materials:\n\n👨‍💻 Target Audience: Ph.D., MSc, anyone interested in workflow management systems for High-Throughput data or other related fields within bioinformatics.\n👩‍🎓 Level: Advanced.\n🔒 License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.\n\n💰 Funding: This project was funded by the Novo Nordisk Fonden (NNF20OC0063268)." }, @@ -1055,7 +1055,7 @@ "href": "develop/03_DOD.html#template-engine", "title": "3. Data organization and storage", "section": "Template engine", - "text": "Template engine\nSetting up folder structures manually for each new project can be time-consuming. Thankfully, tools like Cookiecutter offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using cruft alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).\n\n\n\n\n\n\nCookiecutter templates\n\n\n\n\nCookiecutter template for Data science projects\nBrickmanlab template for NGS data: similar to the folder structures in the examples above. You can download and modify it to suit your needs.\n\n\n\n\nQuick tutorial on cookiecutter\n\n\n\n\n\n\nSandbox Tutorial\n\n\n\nLearn how to create your own template here.\nWe offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the Sandbox website.", + "text": "Template engine\nSetting up folder structures manually for each new project can be time-consuming. Thankfully, tools like Cookiecutter offer a solution by allowing users to create project templates easily. 
These templates can ensure consistency across projects and save time. Additionally, using cruft alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).\n\n\n\n\n\n\nCookiecutter templates\n\n\n\n\nSandbox Project/Data analysis template\nSandbox Data/Assay template\nCookiecutter template for Data science projects\nBrickmanlab template for NGS data: similar to the folder structures in the examples above. You can download and modify it to suit your needs.\n\n\n\n\nQuick tutorial on cookiecutter\n\n\n\n\n\n\nSandbox Tutorial\n\n\n\nLearn how to create your own template here.", "crumbs": [ "Course material", "Key practices", @@ -1175,7 +1175,7 @@ "href": "develop/04_metadata.html#wrap-up", "title": "4. Documentation for biodata", "section": "Wrap up", - "text": "Wrap up\nIn this lesson, we’ve covered the importance of attaching metadata to your data for future reusability and comprehension. We briefly introduced various controlled vocabularies and provided several sources for inspiration. Implementing ontologies is optional, as their usage complexity varies.\nOptionally, if you’ve gone through the lesson, you’ve learned how to use the metadata YAML files to create a database and a catalog browser using Shiny apps. This makes it easy to manage all assays together.\n\nSources\n\nRMDKit: https://rdmkit.elixir-europe.org/data_brokering#collecting-and-processing-the-metadata-and-data\nFAIRsharing.org: provide a searchable database of metadata standards for a wide variety of disciplines\n\nOther sources:\n\nJohns Hopkins Sheridan libraries, RDM. 
They provide a list of medical metadata standards resources.\n\nKU Leuven Guidance: https://www.kuleuven.be/rdm/en/guidance/documentation-metadata\nTranscriptomics metadata standards and fields\nBiological ontologies for data scientists,Bionty\nNIH standardizing data collection\nObservational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model\n\n\n\nTools and software\n\nRightfield: open source tool facilitates the integration of ontology terms into Excel spreadsheet.\nOwlready2: Python package, enables the loading of ontologies as Python objects. This versatile tool allows users to manipulate and store ontology classes, instances, and properties as needed.", + "text": "Wrap up\nIn this lesson, we’ve covered the importance of attaching metadata to your data for future reusability and comprehension. We briefly introduced various controlled vocabularies and provided several sources for inspiration. Implementing ontologies is optional, as their usage complexity varies.\nOptionally, if you’ve gone through the lesson, you’ve learned how to use the metadata YAML files to create a database and a catalog browser using Shiny apps. This makes it easy to manage all assays together.\n\nSources\n\nRMDKit: https://rdmkit.elixir-europe.org/data_brokering#collecting-and-processing-the-metadata-and-data\nFAIRsharing.org: provide a searchable database of metadata standards for a wide variety of disciplines\n\nOther sources:\n\nJohns Hopkins Sheridan libraries, RDM. 
They provide a list of medical metadata standards resources.\n\nKU Leuven Guidance: https://www.kuleuven.be/rdm/en/guidance/documentation-metadata\nTranscriptomics metadata standards and fields\nNIH standardizing data collection\nObservational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model\n\n\n\nTools and software\n\nRightfield: open source tool facilitates the integration of ontology terms into Excel spreadsheet.\nOwlready2: Python package, enables the loading of ontologies as Python objects. This versatile tool allows users to manipulate and store ontology classes, instances, and properties as needed.\nShiny Apps: easy interactive web apps for data science", "crumbs": [ "Course material", "Key practices", @@ -1307,7 +1307,7 @@ "href": "develop/05_VC.html", "title": "5. Data Analysis with Version Control", "section": "", - "text": "Course Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nVersion control essentials and practices\nGit and Github repositories\nCreate repositories\nGitHub page to showcase your data analysis reports\n\n\n\nThis lesson introduces version control with Git and Github and its significance in research. You will gain the ability to create Git repositories, and skills to build GitHub pages for showcasing data analysis.\n\n\nVersion control systematically tracks project changes, documenting alterations for understanding project evolution. 
It holds significant importance in research data management, software development, and data analysis, offering numerous advantages.\n\n\n\n\n\n\nAdvantages of using version control\n\n\n\n\nDocument Progress: Detailed change history aids understanding of project development and modifications.\nEnsure Data Integrity: Prevents accidental data loss or corruption, with each change tracked for easy recovery.\nFacilitate Collaboration: Enables seamless collaboration among team members, allowing multiple individuals to work concurrently without conflicts.\nReproducibility: Preserves project state for accurate validation and analysis.\nBranching and Experimentation: Allows the creation of alternative project versions for experimentation, without altering the main branch.\nGlobal Accessibility: Platforms like GitHub provide visibility for sharing, feedback, and contribution to open science.\n\n\n\n\n\n\n\n\n\nTake our course on Git & Github\n\n\n\nif you’re interested in delving deeper, explore our course on Git and GitHub.\nAlternatively, here are some examples and online resources to expand your understanding:\n\nGit and GitHub online resources\nGitHub documentation\nGit documentation\n\n\n\n\n\nGit is a widely adopted version control system that empowers developers and researchers to efficiently manage their project’s history, collaborate seamlessly, track changes, and ensure data integrity. 
Git operates on core principles and mechanisms:\n\nLocal Repository: Each user maintains a local repository on their computer, storing the complete project history for independent work.\nSnapshots, Not Files: Git captures snapshots of the entire project at different points instead of tracking individual file changes, ensuring data consistency.\nCommits: Users create ‘commits’ as snapshots of the project at specific moments, recording changes made to files along with explanatory commit messages.\nBranching: Git supports branching, enabling users to create separate lines of development for new features or bug fixes without affecting the main branch.\nMerging: Changes from one branch can be merged into another, facilitating the incorporation of new features or bug fixes back into the main project with a smooth merging process.\nDistributed Architecture: Git’s distributed nature means each user’s local repository is a complete copy of the project, enabling offline work and ensuring data redundancy.\nRemote Repositories: Users can connect and synchronize their local repositories with remote repositories hosted on platforms like GitHub, facilitating collaboration and project sharing.\nPush and Pull: Users ‘push’ their local changes to a remote repository to share with others and ‘pull’ changes made by others into their local repository to stay updated.\nConflict Resolution: Git provides tools to resolve conflicts manually in cases of conflicting changes, ensuring data integrity during collaboration.\nVersioning and Tagging: Git offers versioning and tagging capabilities, allowing users to mark specific points in history such as major releases or significant milestones.\n\n\n\n\nIn addition to exploring Git, we will also explore GitHub, a collaborative platform for hosting Git repositories. GitHub enhances Git’s capabilities by offering features like issue tracking, security measures to protect repositories, and GitHub Pages for creating project websites. 
Additionally, GitHub provides the option to set repositories as private until you are ready to share your work publicly.\n\n\n\n\n\n\nAlternatives flows for collaborative projects\n\n\n\n\nGitLab\nBitBucket\n\nWe will focus on GitHub for the remainder of this lesson due to its widespread usage and compatibility.\n\n\n\n\n\n\n\n\nWarning\n\n\n\nWe will discuss repositories for archiving experimental or large datasets in lesson 7.\n\n\n\n\nMoving from Git to GitHub involves transitioning from a local version control setup to a remote hosting platform. You will need a GitHub account for the exercise in this section.\n\n\n\n\n\n\nCreate a GitHub account\n\n\n\n\nIf you don’t have a GitHub account yet, click here.\nInstall Git from Git webpage\n\n\n\nYou have two options when it comes to creating a repository for your project. First, you can start from scratch by creating a new repository and adding files to it as your project progresses. Alternatively, if you already have an existing folder structure for your project, you can initialize a repository directly from that folder. It is crucial to initiate version control in the early stages of a project to facilitate easy tracking of changes and effective management of the project’s version history from the beginning.\n\n\nIf you completed all the exercises in lesson 3, you should have a project data structure prepared. Otherwise, consider using one of your existing projects or creating a small toy example for practice using cookiecutter (see practical_workshop).\n\n\n\n\n\n\nGithub documentation link\n\n\n\n\nAdding locally hosted code to Github\n\n\n\n\n\n\n\n\n\nExercise 1: initialize a repository from an existing folder:\n\n\n\n\n\n\n\n\nFirst, initialize the repository using the command git init. 
This command is run only once, even in collaborative projects (git init).\nOnce the repository is initialized, create a remote repository on GitHub.\nAdd the remote URL to your local git repository using git remote add origin <URL>`. This associates the remote URL with the name “origin”.\nEnsure you have at least one commit in your history by staging existing files with git add and then creating a snapshot, known as committing, with git commit.\nFinally, push your local commits to the remote repository and establish a tracking relationship using git push -u origin master.\n\n\n\n\n\n\n\n\n\nAlternatively to converting folders to repositories, you can create a new repository remotely, and then clone (git clone) it locally. Here, git init is not needed. You can move the files into the repository locally (git add, git commit, and git push). If you are creating a collaborative repository, you can now share it with your colleagues.\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nWrite useful and clear Git commits. Check out this post for tips.\n\n\n\n\n\n\n\nAfter setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, Rmarkdowns, or HTML reports, you can showcase them on a GitHub Page.\nOnce you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you.\nFor simplicity, we recommend using Quarto or MkDocs. Visit their websites and follow the instructions to get started.\n\n\n\n\n\n\nTutorial links\n\n\n\n\nGet started in quarto: https://quarto.org/docs/get-started/. 
We recommend using the VS code tool, if you do, follow this tutorial.\nMkDocs materials to further customize MkDocs websites.\n\n\n\n\n\n\nWe provide an example of setting up Git, MkDocs, and a GitHub account, enabling you to replicate the process independently! (see Exercise 5 in the practical material)\n\n\n\n\nIn this lesson, we explored version control and utilized Git and GitHub to establish data analysis repositories from our Project folders. Additionally, we delved into creating a GitHub organization and leveraging GitHub Pages to showcase data analysis scripts and notebooks publicly. Remember to complete the corresponding exercise from the practical workshop to reinforce your knowledge.\n\n\n\nVersion Control and Code Repository Link.", + "text": "Course Overview\n\n\n\n⏰ Time Estimation: X minutes\n💬 Learning Objectives:\n\nVersion control essentials and practices\nGit and Github repositories\nCreate repositories\nGitHub page to showcase your data analysis reports\n\n\n\nThis lesson introduces version control with Git and Github and its significance in research. You will gain the ability to create Git repositories, and skills to build GitHub pages for showcasing data analysis.\n\n\nVersion control systematically tracks project changes, documenting alterations for understanding project evolution. 
It holds significant importance in research data management, software development, and data analysis, offering numerous advantages.\n\n\n\n\n\n\nAdvantages of using version control\n\n\n\n\nDocument Progress: Detailed change history aids understanding of project development and modifications.\nEnsure Data Integrity: Prevents accidental data loss or corruption, with each change tracked for easy recovery.\nFacilitate Collaboration: Enables seamless collaboration among team members, allowing multiple individuals to work concurrently without conflicts.\nReproducibility: Preserves project state for accurate validation and analysis.\nBranching and Experimentation: Allows the creation of alternative project versions for experimentation, without altering the main branch.\nGlobal Accessibility: Platforms like GitHub provide visibility for sharing, feedback, and contribution to open science.\n\n\n\n\n\n\n\n\n\nTake our course on Git & Github\n\n\n\nif you’re interested in delving deeper, explore our course on Git and GitHub.\nAlternatively, here are some examples and online resources to expand your understanding:\n\nGit and GitHub online resources\nGitHub documentation\nGit documentation\n\n\n\n\n\nGit is a widely adopted version control system that empowers developers and researchers to efficiently manage their project’s history, collaborate seamlessly, track changes, and ensure data integrity. 
Git operates on core principles and mechanisms:\n\nLocal Repository: Each user maintains a local repository on their computer, storing the complete project history for independent work.\nSnapshots, Not Files: Git captures snapshots of the entire project at different points instead of tracking individual file changes, ensuring data consistency.\nCommits: Users create ‘commits’ as snapshots of the project at specific moments, recording changes made to files along with explanatory commit messages.\nBranching: Git supports branching, enabling users to create separate lines of development for new features or bug fixes without affecting the main branch.\nMerging: Changes from one branch can be merged into another, facilitating the incorporation of new features or bug fixes back into the main project with a smooth merging process.\nDistributed Architecture: Git’s distributed nature means each user’s local repository is a complete copy of the project, enabling offline work and ensuring data redundancy.\nRemote Repositories: Users can connect and synchronize their local repositories with remote repositories hosted on platforms like GitHub, facilitating collaboration and project sharing.\nPush and Pull: Users ‘push’ their local changes to a remote repository to share with others and ‘pull’ changes made by others into their local repository to stay updated.\nConflict Resolution: Git provides tools to resolve conflicts manually in cases of conflicting changes, ensuring data integrity during collaboration.\nVersioning and Tagging: Git offers versioning and tagging capabilities, allowing users to mark specific points in history such as major releases or significant milestones.\n\n\n\n\nIn addition to exploring Git, we will also explore GitHub, a collaborative platform for hosting Git repositories. GitHub enhances Git’s capabilities by offering features like issue tracking, security measures to protect repositories, and GitHub Pages for creating project websites. 
Additionally, GitHub provides the option to set repositories as private until you are ready to share your work publicly.\n\n\n\n\n\n\nAlternatives flows for collaborative projects\n\n\n\n\nGitLab\nBitBucket\n\nWe will focus on GitHub for the remainder of this lesson due to its widespread usage and compatibility.\n\n\n\n\n\n\n\n\nWarning\n\n\n\nWe will discuss repositories for archiving experimental or large datasets in lesson 7.\n\n\n\n\nMoving from Git to GitHub involves transitioning from a local version control setup to a remote hosting platform. You will need a GitHub account for the exercise in this section.\n\n\n\n\n\n\nCreate a GitHub account\n\n\n\n\nIf you don’t have a GitHub account yet, click here\nInstall Git from Git webpage\n\n\n\nYou have two options when it comes to creating a repository for your project. First, you can start from scratch by creating a new repository and adding files to it as your project progresses. Alternatively, if you already have an existing folder structure for your project, you can initialize a repository directly from that folder. It is crucial to initiate version control in the early stages of a project to facilitate easy tracking of changes and effective management of the project’s version history from the beginning.\n\n\nIf you completed all the exercises in lesson 3, you should have a project data structure prepared. Otherwise, consider using one of your existing projects or creating a small toy example for practice using cookiecutter (see practical_workshop).\n\n\n\n\n\n\nGithub documentation link\n\n\n\n\nAdding locally hosted code to Github\n\n\n\n\n\n\n\n\n\nExercise 1: initialize a repository from an existing folder:\n\n\n\n\n\n\n\n\nInitialize the repository: Begin by running the command git init in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. 
See (git init) for more details.\nCreate a remote repository: Once the local repository is initialized, create an empty new repository on GitHub.\nConnect the remote repository: Add the GitHub repository URL to your local repository using the command git remote add origin <URL>. This associates the remote repository with the name “origin.”\nCommit changes: If you have files you want to add to your repository, stage them using git add ., then create a commit to save a snapshot of your changes with git commit -m \"add local folder\".\nPush to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using git push -u origin main.\n\n\n\n\n\n\n\n\n\nAlternatively to converting folders to repositories, you can create a new repository remotely, and then clone (git clone) it locally. Here, git init is not needed. You can move the files into the repository locally (git add, git commit, and git push). If you are creating a collaborative repository, you can now share it with your colleagues.\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nWrite useful and clear Git commits. Check out this post for tips.\n\n\n\n\n\n\n\nAfter setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, R Markdown files, or HTML reports, you can showcase them on a GitHub Page.\nOnce you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, R Markdown files, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you.\nFor simplicity, we recommend using Quarto or MkDocs. 
Visit their websites and follow the instructions to get started.\n\n\n\n\n\n\nTutorial links\n\n\n\n\nGet started in quarto: https://quarto.org/docs/get-started/. We recommend using the VS code tool, if you do, follow this tutorial.\nMkDocs materials to further customize MkDocs websites.\n\n\n\n\n\n\nWe provide an example of setting up Git, MkDocs, and a GitHub account, enabling you to replicate the process independently! (see Exercise 5 in the practical material)\n\n\n\n\nIn this lesson, we explored version control and utilized Git and GitHub to establish data analysis repositories from our Project folders. Additionally, we delved into creating a GitHub organization and leveraging GitHub Pages to showcase data analysis scripts and notebooks publicly. Remember to complete the corresponding exercise from the practical workshop to reinforce your knowledge.\n\n\n\nVersion Control and Code Repository Link.\nGit cheat sheet.", "crumbs": [ "Course material", "Key practices", @@ -1319,7 +1319,7 @@ "href": "develop/05_VC.html#version-control", "title": "5. Data Analysis with Version Control", "section": "", - "text": "Version control systematically tracks project changes, documenting alterations for understanding project evolution. 
It holds significant importance in research data management, software development, and data analysis, offering numerous advantages.\n\n\n\n\n\n\nAdvantages of using version control\n\n\n\n\nDocument Progress: Detailed change history aids understanding of project development and modifications.\nEnsure Data Integrity: Prevents accidental data loss or corruption, with each change tracked for easy recovery.\nFacilitate Collaboration: Enables seamless collaboration among team members, allowing multiple individuals to work concurrently without conflicts.\nReproducibility: Preserves project state for accurate validation and analysis.\nBranching and Experimentation: Allows the creation of alternative project versions for experimentation, without altering the main branch.\nGlobal Accessibility: Platforms like GitHub provide visibility for sharing, feedback, and contribution to open science.\n\n\n\n\n\n\n\n\n\nTake our course on Git & Github\n\n\n\nif you’re interested in delving deeper, explore our course on Git and GitHub.\nAlternatively, here are some examples and online resources to expand your understanding:\n\nGit and GitHub online resources\nGitHub documentation\nGit documentation\n\n\n\n\n\nGit is a widely adopted version control system that empowers developers and researchers to efficiently manage their project’s history, collaborate seamlessly, track changes, and ensure data integrity. 
Git operates on core principles and mechanisms:\n\nLocal Repository: Each user maintains a local repository on their computer, storing the complete project history for independent work.\nSnapshots, Not Files: Git captures snapshots of the entire project at different points instead of tracking individual file changes, ensuring data consistency.\nCommits: Users create ‘commits’ as snapshots of the project at specific moments, recording changes made to files along with explanatory commit messages.\nBranching: Git supports branching, enabling users to create separate lines of development for new features or bug fixes without affecting the main branch.\nMerging: Changes from one branch can be merged into another, facilitating the incorporation of new features or bug fixes back into the main project with a smooth merging process.\nDistributed Architecture: Git’s distributed nature means each user’s local repository is a complete copy of the project, enabling offline work and ensuring data redundancy.\nRemote Repositories: Users can connect and synchronize their local repositories with remote repositories hosted on platforms like GitHub, facilitating collaboration and project sharing.\nPush and Pull: Users ‘push’ their local changes to a remote repository to share with others and ‘pull’ changes made by others into their local repository to stay updated.\nConflict Resolution: Git provides tools to resolve conflicts manually in cases of conflicting changes, ensuring data integrity during collaboration.\nVersioning and Tagging: Git offers versioning and tagging capabilities, allowing users to mark specific points in history such as major releases or significant milestones.\n\n\n\n\nIn addition to exploring Git, we will also explore GitHub, a collaborative platform for hosting Git repositories. GitHub enhances Git’s capabilities by offering features like issue tracking, security measures to protect repositories, and GitHub Pages for creating project websites. 
Additionally, GitHub provides the option to set repositories as private until you are ready to share your work publicly.\n\n\n\n\n\n\nAlternatives flows for collaborative projects\n\n\n\n\nGitLab\nBitBucket\n\nWe will focus on GitHub for the remainder of this lesson due to its widespread usage and compatibility.\n\n\n\n\n\n\n\n\nWarning\n\n\n\nWe will discuss repositories for archiving experimental or large datasets in lesson 7.\n\n\n\n\nMoving from Git to GitHub involves transitioning from a local version control setup to a remote hosting platform. You will need a GitHub account for the exercise in this section.\n\n\n\n\n\n\nCreate a GitHub account\n\n\n\n\nIf you don’t have a GitHub account yet, click here.\nInstall Git from Git webpage\n\n\n\nYou have two options when it comes to creating a repository for your project. First, you can start from scratch by creating a new repository and adding files to it as your project progresses. Alternatively, if you already have an existing folder structure for your project, you can initialize a repository directly from that folder. It is crucial to initiate version control in the early stages of a project to facilitate easy tracking of changes and effective management of the project’s version history from the beginning.\n\n\nIf you completed all the exercises in lesson 3, you should have a project data structure prepared. Otherwise, consider using one of your existing projects or creating a small toy example for practice using cookiecutter (see practical_workshop).\n\n\n\n\n\n\nGithub documentation link\n\n\n\n\nAdding locally hosted code to Github\n\n\n\n\n\n\n\n\n\nExercise 1: initialize a repository from an existing folder:\n\n\n\n\n\n\n\n\nFirst, initialize the repository using the command git init. 
This command is run only once, even in collaborative projects (git init).\nOnce the repository is initialized, create a remote repository on GitHub.\nAdd the remote URL to your local git repository using git remote add origin <URL>`. This associates the remote URL with the name “origin”.\nEnsure you have at least one commit in your history by staging existing files with git add and then creating a snapshot, known as committing, with git commit.\nFinally, push your local commits to the remote repository and establish a tracking relationship using git push -u origin master.\n\n\n\n\n\n\n\n\n\nAlternatively to converting folders to repositories, you can create a new repository remotely, and then clone (git clone) it locally. Here, git init is not needed. You can move the files into the repository locally (git add, git commit, and git push). If you are creating a collaborative repository, you can now share it with your colleagues.\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nWrite useful and clear Git commits. Check out this post for tips.\n\n\n\n\n\n\n\nAfter setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, Rmarkdowns, or HTML reports, you can showcase them on a GitHub Page.\nOnce you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you.\nFor simplicity, we recommend using Quarto or MkDocs. Visit their websites and follow the instructions to get started.\n\n\n\n\n\n\nTutorial links\n\n\n\n\nGet started in quarto: https://quarto.org/docs/get-started/. 
We recommend using the VS code tool, if you do, follow this tutorial.\nMkDocs materials to further customize MkDocs websites.\n\n\n\n\n\n\nWe provide an example of setting up Git, MkDocs, and a GitHub account, enabling you to replicate the process independently! (see Exercise 5 in the practical material)", + "text": "Version control systematically tracks project changes, documenting alterations for understanding project evolution. It holds significant importance in research data management, software development, and data analysis, offering numerous advantages.\n\n\n\n\n\n\nAdvantages of using version control\n\n\n\n\nDocument Progress: Detailed change history aids understanding of project development and modifications.\nEnsure Data Integrity: Prevents accidental data loss or corruption, with each change tracked for easy recovery.\nFacilitate Collaboration: Enables seamless collaboration among team members, allowing multiple individuals to work concurrently without conflicts.\nReproducibility: Preserves project state for accurate validation and analysis.\nBranching and Experimentation: Allows the creation of alternative project versions for experimentation, without altering the main branch.\nGlobal Accessibility: Platforms like GitHub provide visibility for sharing, feedback, and contribution to open science.\n\n\n\n\n\n\n\n\n\nTake our course on Git & Github\n\n\n\nif you’re interested in delving deeper, explore our course on Git and GitHub.\nAlternatively, here are some examples and online resources to expand your understanding:\n\nGit and GitHub online resources\nGitHub documentation\nGit documentation\n\n\n\n\n\nGit is a widely adopted version control system that empowers developers and researchers to efficiently manage their project’s history, collaborate seamlessly, track changes, and ensure data integrity. 
Git operates on core principles and mechanisms:\n\nLocal Repository: Each user maintains a local repository on their computer, storing the complete project history for independent work.\nSnapshots, Not Files: Git captures snapshots of the entire project at different points instead of tracking individual file changes, ensuring data consistency.\nCommits: Users create ‘commits’ as snapshots of the project at specific moments, recording changes made to files along with explanatory commit messages.\nBranching: Git supports branching, enabling users to create separate lines of development for new features or bug fixes without affecting the main branch.\nMerging: Changes from one branch can be merged into another, facilitating the incorporation of new features or bug fixes back into the main project with a smooth merging process.\nDistributed Architecture: Git’s distributed nature means each user’s local repository is a complete copy of the project, enabling offline work and ensuring data redundancy.\nRemote Repositories: Users can connect and synchronize their local repositories with remote repositories hosted on platforms like GitHub, facilitating collaboration and project sharing.\nPush and Pull: Users ‘push’ their local changes to a remote repository to share with others and ‘pull’ changes made by others into their local repository to stay updated.\nConflict Resolution: Git provides tools to resolve conflicts manually in cases of conflicting changes, ensuring data integrity during collaboration.\nVersioning and Tagging: Git offers versioning and tagging capabilities, allowing users to mark specific points in history such as major releases or significant milestones.\n\n\n\n\nIn addition to exploring Git, we will also explore GitHub, a collaborative platform for hosting Git repositories. GitHub enhances Git’s capabilities by offering features like issue tracking, security measures to protect repositories, and GitHub Pages for creating project websites. 
Additionally, GitHub provides the option to set repositories as private until you are ready to share your work publicly.\n\n\n\n\n\n\nAlternatives flows for collaborative projects\n\n\n\n\nGitLab\nBitBucket\n\nWe will focus on GitHub for the remainder of this lesson due to its widespread usage and compatibility.\n\n\n\n\n\n\n\n\nWarning\n\n\n\nWe will discuss repositories for archiving experimental or large datasets in lesson 7.\n\n\n\n\nMoving from Git to GitHub involves transitioning from a local version control setup to a remote hosting platform. You will need a GitHub account for the exercise in this section.\n\n\n\n\n\n\nCreate a GitHub account\n\n\n\n\nIf you don’t have a GitHub account yet, click here\nInstall Git from Git webpage\n\n\n\nYou have two options when it comes to creating a repository for your project. First, you can start from scratch by creating a new repository and adding files to it as your project progresses. Alternatively, if you already have an existing folder structure for your project, you can initialize a repository directly from that folder. It is crucial to initiate version control in the early stages of a project to facilitate easy tracking of changes and effective management of the project’s version history from the beginning.\n\n\nIf you completed all the exercises in lesson 3, you should have a project data structure prepared. Otherwise, consider using one of your existing projects or creating a small toy example for practice using cookiecutter (see practical_workshop).\n\n\n\n\n\n\nGithub documentation link\n\n\n\n\nAdding locally hosted code to Github\n\n\n\n\n\n\n\n\n\nExercise 1: initialize a repository from an existing folder:\n\n\n\n\n\n\n\n\nInitialize the repository: Begin by running the command git init in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. 
See (git init) for more details.\nCreate a remote repository: Once the local repository is initialized, create an empty new repository on GitHub.\nConnect the remote repository: Add the GitHub repository URL to your local repository using the command git remote add origin <URL>. This associates the remote repository with the name “origin.”\nCommit changes: If you have files you want to add to your repository, stage them using git add ., then create a commit to save a snapshot of your changes with git commit -m \"add local folder\".\nPush to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using git push -u origin main.\n\n\n\n\n\n\n\n\n\nAlternatively to converting folders to repositories, you can create a new repository remotely, and then clone (git clone) it locally. Here, git init is not needed. You can move the files into the repository locally (git add, git commit, and git push). If you are creating a collaborative repository, you can now share it with your colleagues.\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nWrite useful and clear Git commits. Check out this post for tips.\n\n\n\n\n\n\n\nAfter setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, R Markdown files, or HTML reports, you can showcase them on a GitHub Page.\nOnce you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, R Markdown files, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you.\nFor simplicity, we recommend using Quarto or MkDocs. 
Visit their websites and follow the instructions to get started.\n\n\n\n\n\n\nTutorial links\n\n\n\n\nGet started in quarto: https://quarto.org/docs/get-started/. We recommend using the VS code tool, if you do, follow this tutorial.\nMkDocs materials to further customize MkDocs websites.\n\n\n\n\n\n\nWe provide an example of setting up Git, MkDocs, and a GitHub account, enabling you to replicate the process independently! (see Exercise 5 in the practical material)", "crumbs": [ "Course material", "Key practices", @@ -1374,7 +1374,7 @@ "href": "develop/05_VC.html#wrap-up", "title": "5. Data Analysis with Version Control", "section": "", - "text": "In this lesson, we explored version control and utilized Git and GitHub to establish data analysis repositories from our Project folders. Additionally, we delved into creating a GitHub organization and leveraging GitHub Pages to showcase data analysis scripts and notebooks publicly. Remember to complete the corresponding exercise from the practical workshop to reinforce your knowledge.\n\n\n\nVersion Control and Code Repository Link.", + "text": "In this lesson, we explored version control and utilized Git and GitHub to establish data analysis repositories from our Project folders. Additionally, we delved into creating a GitHub organization and leveraging GitHub Pages to showcase data analysis scripts and notebooks publicly. Remember to complete the corresponding exercise from the practical workshop to reinforce your knowledge.\n\n\n\nVersion Control and Code Repository Link.\nGit cheat sheet.", "crumbs": [ "Course material", "Key practices", @@ -1446,7 +1446,7 @@ "href": "develop/04_metadata.html#controlled-vocabularies-and-ontologies", "title": "4. Documentation for biodata", "section": "Controlled vocabularies and ontologies", - "text": "Controlled vocabularies and ontologies\nResearchers encountering inconsistent and non-standardized terms (e.g., gene names, disease names, cell types, protein domains, etc.) 
across datasets may face challenges in data integration. Thus, requiring additional curation time to enable meaningful comparisons. Standardized vocabularies streamline integration, improving consistency and comparability in analysis. Leveraging widely accepted ontologies in the documentation ensures consistent capture of experiment details in metadata fields, aiding data interpretation.\n\n\n\n\n\n\nExamples of ontology services\n\n\n\n\nUberon anatomy ontology\nGene ontology\nEnsembl gene IDs\nMedical Subject Headings (MeSH)\nChemical Entities of Biological Interest\nMicroarray Gene Expression Society Ontology (MGED)\nNCBI taxonomy\nMondo disease database\n\n\n\n\n\n\n\n\n\nOntology definition\n\n\n\n\n\n\n\nAn ontology is a structured framework representing concepts, attributes, and relationships within a specific domain, aiding knowledge organization and integration. Employing standardized vocabularies, it facilitates effective communication and reasoning between humans and computers. Ontologies are crucial for knowledge representation, data integration, and semantic interoperability, enhancing understanding and collaboration across complex domains.\n\n\n\n\n\nStandardization improves data discoverability and interoperability, enabling robust analysis, accelerating knowledge sharing, and facilitating cross-study comparisons. Ontologies act as universal translators, fostering harmonious data interpretation and collaboration across scientific disciplines.\nYou can find three examples of metadata tailored for different purposes NGS data examples: sample metadata, project metadata, and experimental metadata. We suggest exploring controlled vocabularies and metadata standards within your field and seeking additional specialized sources. 
You will find a few sources at the end of the page.", + "text": "Controlled vocabularies and ontologies\nResearchers encountering inconsistent and non-standardized terms (e.g., gene names, disease names, cell types, protein domains, etc.) across datasets may face challenges in data integration. Thus, requiring additional curation time to enable meaningful comparisons. Standardized vocabularies streamline integration, improving consistency and comparability in analysis. Leveraging widely accepted ontologies in the documentation ensures consistent capture of experiment details in metadata fields, aiding data interpretation.\n\n\n\n\n\n\nExamples of ontology services\n\n\n\n\nBiological ontologies for data scientists - Bionty\nAnatomy - Uberon\nTissue - Uberon\nChemical compounds - Chemical Entities of Biological Interest\nExperimentalFactor - Experimental Factor Ontology\nSpecies - NCBI Taxonomy, Ensembl Species\nDisease - Mondo, Human Disease\nGene - Ensembl, NCBI Gene, Gene ontology, Microarray Gene Expression Society Ontology (MGED)\nProtein - Uniprot\nCellLine - Cell Line Ontology\nCellType - Cell Ontology\nCellMarker - CellMarker\nPhenotype - Human Phenotype, Phecodes, PATO, Mammalian Phenotype, Zebrafish Phenotype\nPathway - Gene Ontology, Pathway Ontology\nDevelopmentalStage - Human Developmental Stages, Mouse Developmental Stages\nDrug - Drug Ontology\nEthnicity - Human Ancestry Ontology\nBFXPipeline - largely based on nf-core\nBioSample - NCBI BioSample attributes\nArticle indexing - Medical Subject Headings (MeSH)\n\n\n\n\n\n\n\n\n\nOntology definition\n\n\n\n\n\n\n\nAn ontology is a structured framework representing concepts, attributes, and relationships within a specific domain, aiding knowledge organization and integration. Employing standardized vocabularies, it facilitates effective communication and reasoning between humans and computers. 
Ontologies are crucial for knowledge representation, data integration, and semantic interoperability, enhancing understanding and collaboration across complex domains.\n\n\n\n\n\nStandardization improves data discoverability and interoperability, enabling robust analysis, accelerating knowledge sharing, and facilitating cross-study comparisons. Ontologies act as universal translators, fostering harmonious data interpretation and collaboration across scientific disciplines.\nYou can find three examples of metadata tailored for different purposes NGS data examples: sample metadata, project metadata, and experimental metadata. We suggest exploring controlled vocabularies and metadata standards within your field and seeking additional specialized sources. You will find a few sources at the end of the page.", "crumbs": [ "Course material", "Key practices", @@ -1470,7 +1470,7 @@ "href": "develop/04_metadata.html#database-and-data-catalogs", "title": "4. Documentation for biodata", "section": "Database and data catalogs", - "text": "Database and data catalogs\nMetadata can be used to create data catalogs, particularly beneficial for the efficient organization of experimental or sequencing data generated by researchers. While databases can range from simple tabular formats like Excel to sophisticated DataBase Management Systems (DBMS) like SQLite, the choice depends on factors such as complexity and volume of data. Leveraging a DBMS offers advantages like efficient data storage, enhanced security, and rapid data querying capabilities.\n\nTables as databases\nA browsable table can be created by recursively navigating through a project’s folder hierarchy using a script and generating a TSV file (tab-separated values) named, for example, database_YYYYMMDD.tsv. This table acts as a centralized repository for all project data, simplifying access and organization. 
Consistency in metadata structure across projects is vital for efficient data management and integration, as it aids in tracking all conducted assays. Adhering to a uniform metadata format enables the seamless inclusion of essential information from YAML files into the browsable table.\n\n\n\n\n\n\nExercise 2: Generate database tables from metadata\n\n\n\n\n\n\n\nWrite a script (R or Python) that recursively fetches metadata.yml files in a given path. It is important that each subdirectory contains its corresponding metadata.yml.\nRequirements:\n\nData folder structure: containing all project folders\nYAML metadata files associated with each project\n\nClick on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nquiet <- function(x) { suppressMessages(suppressWarnings(x)) }\nquiet(library(yaml))\nquiet(library(dplyr))\nquiet(library(lubridate))\n\n# Function to recursively fetch metadata.yml files\nget_metadata <- function(folder_path) {\n file_list <- list.files(path = folder_path, \n pattern = \"metadata\\\\.yml$\", \n recursive = TRUE, full.names = TRUE)\n metadata_list <- lapply(file_list, yaml::yaml.load_file)\n return(metadata_list)\n}\n\n# Specify the folder path\nfolder_path <- \"/path/to/your/folder\"\n\n# Fetch metadata from the specified folder\nmetadata <- get_metadata(folder_path)\n\n# Convert metadata to a data frame\nmetadata_df <- data.frame(matrix(unlist(metadata), \nncol = length(metadata), byrow = TRUE))\ncolnames(metadata_df) <- names(metadata[[1]])\n\n# Save the data frame as a TSV file\noutput_file <- paste0(\"database_\", format(Sys.Date(), \"%Y%m%d\"), \".tsv\")\nwrite.table(metadata_df, \n file = output_file, \n sep = \"\\t\", \n quote = FALSE, \n row.names = FALSE)\n\n# Print confirmation message\nprint(\"Database saved as\", output_file, \"\\n\")\n\n\n\n\n\n\n\n\n\n\n\n\nSQLite database\nAn alternative to the tabular format is SQLite, a lightweight and 
self-contained relational database management system known for its simplicity and efficiency. SQLite operates without the need for a separate server, making it ideal for scenarios requiring minimal resource usage. It excels in tasks involving structured data storage and retrieval, making it suitable for managing experiment metadata. Similar to the previous example, you can use a script that records all the information from the YAML file in a SQLite database.\n\n\n\n\n\n\nAdvantages of using SQLite database\n\n\n\n\nEfficient Querying: SQLite databases optimize querying and data retrieval, enabling fast and efficient extraction of specific information.\nStructured Organization: Databases provide structured and organized data storage, ensuring easy access and maintenance.\nData Integrity: SQLite databases enforce data integrity through constraints and validations, minimizing errors and inconsistencies.\nConcurrency and Multi-User Support: SQLite supports concurrent read access from multiple users, ensuring accessibility without compromising data integrity.\nScalability: It can handle growing volumes of data without significant performance degradation.\nModularity and Portability: Databases are self-contained and modular, simplifying data distribution and portability.\nSecurity and Access Control: SQLite offers security features like password protection and encryption, with granular control over user access.\nIndexing: Support for indexing accelerates data retrieval based on specific columns, particularly beneficial for large datasets.\nData Relationships: Databases allow for the establishment of relationships between tables, facilitating storage of interconnected data, such as project, assay, and sample information.\n\n\n\n\n\n\n\n\n\nExercise 3: Generate a SQLite database from metadata\n\n\n\n\n\n\n\nClick on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nquiet <- function(x) { 
suppressMessages(suppressWarnings(x)) }\nquiet(library(yaml))\nquiet(library(dplyr))\nquiet(library(lubridate))\nquiet(library(DBI))\n\n# Generate the metadata_df using the script from the example above (recursively fetching metadata.yml files)\n\n# Create an SQLite database and insert data\ndb_file <- paste0(\"database_\", format(Sys.Date(), \"%Y%m%d\"), \".sqlite\")\ncon <- dbConnect(SQLite(), db_file)\n\ndbWriteTable(con, \"metadata\", metadata_df, row.names = FALSE)\n\n# Print confirmation message\ncat(\"Database saved as\", db_file, \"\\n\")\n\n# Close the database connection\ndbDisconnect(con)\n\n\n\n\n\n\n\n\n\n\n\n\nCatalog browser\nYou can design a user-friendly catalog browser for your database using tools like Rshiny or Panel. These frameworks provide interfaces for dynamic search, filtering, and visualization, facilitating efficient exploration of database contents. Creating such a tool with Rshiny from both a TSV file and a SQLite database will be demonstrated below.\nHere’s an example of an SQLite database catalog created by the Brickman Lab at the Center for Stem Cell Medicine. It’s simple yet effective! Clicking on a data row opens the metadata.yml file, allowing access to detailed metadata for that assay.\n\n\nVideo\ntype:video\n\n\n\n\n\n\n\n\nExercise 4: Create your first catalog browser using Rshiny\n\n\n\n\n\n\n\nClick on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration.\n\nSolution A. 
From a TSV\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nR script\n\nquiet <- function(x) { suppressMessages(suppressWarnings(x)) }\nquiet(library(shiny))\nquiet(library(DT))\n\n# UI\nui <- fluidPage(\n titlePanel(\"TSV File Viewer\"),\n \n sidebarLayout(\n sidebarPanel(\n fileInput(\"file\", \"Choose a TSV file\", accept = c(\".tsv\"))\n ),\n \n mainPanel(\n DTOutput(\"table\")\n )\n )\n)\n\n# Server\nserver <- function(input, output) {\n \n data <- reactive({\n req(input$file)\n read.delim(input$file$datapath, sep = \"\\t\")\n })\n \n output$table <- renderDT({\n datatable(data())\n })\n}\n\n# Run the app\nshinyApp(ui, server)\n\n\n\n\n\n\nSolution B. From an SQLite database\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nR script\nquiet <- function(x) { suppressMessages(suppressWarnings(x)) }\nquiet(library(shiny))\nquiet(library(DT))\nquiet(library(DBI))\n\n# UI\nui <- fluidPage(\n titlePanel(\"SQLite Database Viewer\"),\n \n sidebarLayout(\n sidebarPanel(\n fileInput(\"db_file\", \"Choose an SQLite Database\", accept = c(\".sqlite\")),\n textInput(\"table_name\", \"Enter Table Name:\", value = \"\"),\n actionButton(\"load_button\", \"Load Table\")\n ),\n \n mainPanel(\n DTOutput(\"table\")\n )\n )\n)\n\n# Server\nserver <- function(input, output, session) {\n \n con <- reactive({\n if (!is.null(input$db_file)) {\n dbConnect(SQLite(), input$db_file$datapath)\n }\n })\n \n data <- reactive({\n req(input$load_button > 0, input$table_name, con())\n query <- glue::glue_sql(\"SELECT * FROM {dbQuoteIdentifier(con(), input$table_name)}\")\n dbGetQuery(con(), query)\n })\n \n output$table <- renderDT({\n datatable(data())\n })\n \n observeEvent(input$load_button, {\n output$table <- renderDT({\n datatable(data())\n })\n })\n \n # Disconnect from the database when app closes\n observe({\n on.exit(dbDisconnect(con()), add = TRUE)\n })\n}\n\n# Run the app\nshinyApp(ui, server)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nExercise 5: Add complex features to your catalog browser\n\n\n\n\n\n\n\nOnce you’ve 
finished the previous exercise, consider implementing these additional ideas to maximize the utility of your catalog browser.\n\nAdd a tab to create a project directory interactively (and fill up the metadata fields)\nModify existing entries\nVisualize results using Cirrocumulus", + "text": "Database and data catalogs\nMetadata can be used to create data catalogs, particularly beneficial for the efficient organization of experimental or sequencing data generated by researchers. While databases can range from simple tabular formats like Excel to sophisticated DataBase Management Systems (DBMS) like SQLite, the choice depends on factors such as complexity and volume of data. Leveraging a DBMS offers advantages like efficient data storage, enhanced security, and rapid data querying capabilities.\n\nTables as databases\nA browsable table can be created by recursively navigating through a project’s folder hierarchy using a script and generating a TSV file (tab-separated values) named, for example, database_YYYYMMDD.tsv. This table acts as a centralized repository for all project data, simplifying access and organization. Consistency in metadata structure across projects is vital for efficient data management and integration, as it aids in tracking all conducted assays. Adhering to a uniform metadata format enables the seamless inclusion of essential information from YAML files into the browsable table.\n\n\n\n\n\n\nExercise 2: Generate database tables from metadata\n\n\n\n\n\n\n\nWrite a script (R or Python) that recursively fetches metadata.yml files in a given path. 
It is important that each subdirectory contains its corresponding metadata.yml.\nRequirements:\n\nData folder structure: containing all project folders\nYAML metadata files associated with each project\n\nClick on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration.\nYou can find a thorough guided exercise in the practical material - Exercise 4.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n# Load required packages\npackages <- c(\"yaml\", \"ggplot2\", \"lubridate\")\n\n# Function to recursively fetch YAML files, read and convert them to a data frame\n\ndf = lapply(file_list, yaml::yaml.load_file)\n\n# Save the data frame as a TSV file\n\n\n\n\n\n\n\n\n\n\n\n\nSQLite database\nAn alternative to the tabular format is SQLite, a lightweight and self-contained relational database management system known for its simplicity and efficiency. SQLite operates without the need for a separate server, making it ideal for scenarios requiring minimal resource usage. It excels in tasks involving structured data storage and retrieval, making it suitable for managing experiment metadata. 
Similar to the previous example, you can use a script that records all the information from the YAML file in a SQLite database.\n\n\n\n\n\n\nAdvantages of using SQLite database\n\n\n\n\nEfficient Querying: SQLite databases optimize querying and data retrieval, enabling fast and efficient extraction of specific information.\nStructured Organization: Databases provide structured and organized data storage, ensuring easy access and maintenance.\nData Integrity: SQLite databases enforce data integrity through constraints and validations, minimizing errors and inconsistencies.\nConcurrency and Multi-User Support: SQLite supports concurrent read access from multiple users, ensuring accessibility without compromising data integrity.\nScalability: It can handle growing volumes of data without significant performance degradation.\nModularity and Portability: Databases are self-contained and modular, simplifying data distribution and portability.\nSecurity and Access Control: SQLite offers security features like password protection and encryption, with granular control over user access.\nIndexing: Support for indexing accelerates data retrieval based on specific columns, particularly beneficial for large datasets.\nData Relationships: Databases allow for the establishment of relationships between tables, facilitating storage of interconnected data, such as project, assay, and sample information.\n\n\n\n\n\n\n\n\n\nExercise 3: Generate a SQLite database from metadata\n\n\n\n\n\n\n\nClick on the hint to reveal the necessary libraries and some functions, which may serve as inspiration.\nYou can find a thorough guided exercise, complete with a code example, in the practical material - Exercise 4, option B.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n# Load required packages\npackages <- c(\"yaml\", \"ggplot2\", \"lubridate\", \"DBI\")\n\n# Function to recursively fetch YAML files, read and convert them to a data frame\n\ndf = lapply(file_list, yaml::yaml.load_file)\n\n# Create an 
SQLite database from a dataframe and insert data\ndbConnect(SQLite(), \"filenameXXX.sqlite\")\ndbWriteTable()\n\n\n\n\n\n\n\n\n\n\n\n\nCatalog browser\nTo further optimize the use of your metadata and improve the integration of all your lab metadata, you can design a user-friendly catalog browser for your database using tools like Rshiny or Panel. These frameworks provide interfaces for dynamic search, filtering, and visualization, facilitating efficient exploration of database contents.\nCreating such a tool with RShiny is straightforward and does not require extensive development knowledge, whether using a TSV file or a SQLite database. In the practical materials, we demonstrate both scenarios and showcase various functionalities for inspiration. SQLite files are particularly advantageous for data fetching and other operations due to their efficient querying and indexing capabilities.\nHere’s an example of an SQLite database catalog created by the Brickman Lab at the Center for Stem Cell Medicine. It’s simple yet effective! Clicking on a data row opens the metadata.yml file, allowing access to detailed metadata for that assay.\n\n\nVideo\ntype:video\n\n\n\n\n\n\n\n\nExercise 4: Create your first catalog browser using Rshiny\n\n\n\n\n\n\n\nGo to the practical material for complete exercise instructions and solutions. The code provided can serve as inspiration for you to adapt as needed.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nThese are some of the libraries required: install.packages(c(\"shiny\", \"DT\", \"DBI\"))\nYou need to define both a user interface (UI) and a server function. The UI (fluidPage()) outlines the app’s layout using for example, the sidebarLayout() and mainPanel() functions for input controls and output displays.\nThe server function manages data manipulation and user interactions. 
Use shinyApp() to launch the app once the UI and server are set up.\nHere is a simple example of a server function setup including the main parts (additional components provide advanced functionalities):\n server <- function(input, output, session) {\n # Define a reactive expression for data based on user inputs\n data <- reactive({\n req(input$dataInput) # Ensure data input is available\n # Load or manipulate data here\n })\n\n # Define an output table based on data\n output$dataTable <- renderTable({\n data() # Render the data as a table\n })\n\n # Observe a button click event and perform an action\n observeEvent(input$actionButton, {\n # Perform an action when the button is clicked\n })\n\n # Define cleanup tasks when the app stops\n onStop(function() {\n # Close connections or save state if necessary\n })\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nExercise 5: Add complex features to your catalog browser\n\n\n\n\n\n\n\nOnce you’ve finished the previous exercise, consider implementing these additional ideas to maximize the utility of your catalog browser.\n\nAdd functionality to select only certain columns with uiOutput(\"column_select\")\nAdd buttons to order numeric columns ascending or descending using radioButtons()\nUse SQL aggregation functions (e.g., SUM, COUNT, AVG) to perform custom data summaries and calculations.\nAdd a tab with tabPanel() to create a project directory interactively (and fill in the metadata fields). 
Tips: dir.create(), data.frame(), write.table()\nModify existing entries\nVisualize results using Cirrocumulus, an interactive visualization tool for large-scale single-cell genomics data.\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nExplore this example with advanced features such as a two-tab layout, filtering by numeric values and matching strings, and a color-customized dashboard here.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTip\n\n\n\n\nFor R Enthusiasts Explore demos from the R Shiny community to kickstart your projects or for inspiration.\nFor Python Enthusiasts 
If you want to dive deeper into Shiny apps and their various uses (such as dynamic plots or other interactive widgets), Shiny for Python provides live, interactive code throughout its entire tutorial. Additionally, it offers a great tool called Playground, where you can code and test your own app to explore how different features render.", "crumbs": [ "Course material", "Key practices", @@ -1530,7 +1530,7 @@ "href": "develop/06_pipelines.html#wrap-up", "title": "6. Processing and analyzing biodata", "section": "Wrap up", - "text": "Wrap up\nThis lesson emphasized the importance of reproducibility in computational research and provided practical techniques for achieving it. Using annotated notebooks, pipeline frameworks, and community-curated pipelines, such as those developed by the nf-core community, enhances reproducibility and readability.\n\nSources\n\nRDMkit, Elixir Data Management - Data Analysis\nCode documentation by Johns Hopkins Sheridan libraries. This link includes best practices for code documentation, style guides, R markdown, Jupyter Notebook, version control, and code repository.\nGuide to reproducible code in ecology and evolution\nBest practices for Scientific computing\nElixir Software Best Practices", + "text": "Wrap up\nThis lesson emphasized the importance of reproducibility in computational research and provided practical techniques for achieving it. Using annotated notebooks, pipeline frameworks, and community-curated pipelines, such as those developed by the nf-core community, enhances reproducibility and readability.\n\nSources\n\nRDMkit, Elixir Data Management - Data Analysis\nCode documentation by Johns Hopkins Sheridan libraries. 
This link includes best practices for code documentation, style guides, R markdown, Jupyter Notebook, version control, and code repository.\nGuide to reproducible code in ecology and evolution\nBest practices for Scientific computing\nElixir Software Best Practices\nFAIR Cookbook workflows", "crumbs": [ "Course material", "Key practices", @@ -1541,22 +1541,22 @@ "objectID": "index.html#research-data-management-for-biological-data", "href": "index.html#research-data-management-for-biological-data", "title": "Computational Research Data Management", - "section": "", - "text": "The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best practices guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. 
By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s NGS research landscape, as well as in other related fields to health and bioinformatics.\n\n\n\n\n\n\nCourse Overview\n\n\n\n\n📖 Syllabus:\n\n\nData Lifecycle Management\nData Management Plans (DMPs)\nData Organization and storage\nDocumentation standards for biodata\nVersion Control and Collaboration\nProcessing and analyzing biodata\nStoring and sharing biodata\n\n\n⏰ Total Time Estimation: X hours\n\n📁 Supporting Materials:\n\n👨‍💻 Target Audience: Ph.D., MSc, anyone interested in RDM for NGS data or other related fields within bioinformatics.\n👩‍🎓 Level: Beginner.\n🔒 License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.\n\n💰 Funding: This project was funded by the Novo Nordisk Fonden (NNF20OC0063268).\n\n\n\n\n\n\n\n\n\n\nCourse Requirements\n\n\n\n\nBasic understanding Next Generation Sequencing data and formats.\nCommand Line experience\nBasic programming experience\nQuarto or Mkdocs tools\n\n\n\nThis course offers participants with an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on practical understanding of RDM principles and the importance of efficient handling of large datasets. In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability.\nParticipants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Special attention is given to the development of Data Management Plans (DMPs) with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity. 
Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility.\nThe course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.\n\n\n\n\n\n\nCourse Goals\n\n\n\nBy the end of this workshop, you should be able to apply the following concepts in the context of Next Generation Sequencing data:\n\nUnderstand the Importance of Research Data Management (RDM)\nFamiliarize Yourself with FAIR and Open Science Principles\nDraft a Data Management Plan for your own Data\nEstablish File and Folder Naming Conventions\nEnhance Data with Descriptive Metadata\nImplement Version Control for Data Analysis\nSelect an Appropriate Repository for Data Archiving\nMake your data analysis and workflows reproducible and FAIR\n\n\n\n\n\n\n\n\n\nWarning\n\n\n\nThis is a computational workshop that focuses primarily on the digital aspect of our data. While wet lab Research Data Management (RDM) involving protocols, instruments, reagents, ELM or LIMS systems is integral to the entire RDM process, it won’t be covered in this course.\nAs part of effective data management, it’s crucial to prioritize strategies that ensure security and privacy. While these aspects are important, please note that they won’t be covered in our course. However, we highly recommend enrolling in the GDPR course offered by Center for Health Data Science, specially if you’re working with sensitive data. 
This course specifically focuses on GDPR compliance and will provide you with valuable insights and skills in managing data privacy and security.\n\n\n\n\n\nUniversity of Copenhagen\nUniversity Library of Southern Denmark\nTechnical University of Denmark\nAalborg University\nAarhus University\n\n\n\n\n\nRDMkit, ELIXIR (2021) Research Data Management Kit. A deliverable from the EU-funded ELIXIR-CONVERGE project (grant agreement 871075).\nUniversity of Copenhagen Research Data Management Team.\nMartin Proks and Sarah Lundregan, Brickman Lab, NNF Center for Stem Cell Biology (reNEW), University of Copenhagen.\nRichard Dennis, Data Steward, NNF Center for Stem Cell Biology (reNEW), University of Copenhagen.\nNBISweden." + "section": "Research Data Management for biological data", + "text": "Research Data Management for biological data\nThe course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best practices guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. 
By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s NGS research landscape, as well as in other fields related to health and bioinformatics.\n\n\n\n\n\n\nCourse Overview\n\n\n\n\n📖 Syllabus:\n\n\nData Lifecycle Management\nData Management Plans (DMPs)\nData Organization and storage\nDocumentation standards for biodata\nVersion Control and Collaboration\nProcessing and analyzing biodata\nStoring and sharing biodata\n\n\n⏰ Total Time Estimation: X hours\n\n📁 Supporting Materials:\n\n👨‍💻 Target Audience: Ph.D., MSc, anyone interested in RDM for NGS data or other related fields within bioinformatics.\n👩‍🎓 Level: Beginner.\n🔒 License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.\n\n💰 Funding: This project was funded by the Novo Nordisk Fonden (NNF20OC0063268).\n\n\n\n\n\n\n\n\n\n\nCourse Requirements\n\n\n\n\nBasic understanding of Next Generation Sequencing data and formats.\nCommand Line experience\nBasic programming experience\nQuarto or MkDocs tools\n\n\n\nThis course offers participants an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on practical understanding of RDM principles and the importance of efficient handling of large datasets. In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability.\nParticipants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Special attention is given to the development of Data Management Plans (DMPs) with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity. 
Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility.\nThe course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.\n\n\n\n\n\n\nCourse Goals\n\n\n\nBy the end of this workshop, you should be able to apply the following concepts in the context of Next Generation Sequencing data:\n\nUnderstand the Importance of Research Data Management (RDM)\nFamiliarize Yourself with FAIR and Open Science Principles\nDraft a Data Management Plan for your own Data\nEstablish File and Folder Naming Conventions\nEnhance Data with Descriptive Metadata\nImplement Version Control for Data Analysis\nSelect an Appropriate Repository for Data Archiving\nMake your data analysis and workflows reproducible and FAIR\n\n\n\n\n\n\n\n\n\nWarning\n\n\n\nThis is a computational workshop that focuses primarily on the digital aspect of our data. While wet lab Research Data Management (RDM) involving protocols, instruments, reagents, ELM or LIMS systems is integral to the entire RDM process, it won’t be covered in this course.\nAs part of effective data management, it’s crucial to prioritize strategies that ensure security and privacy. While these aspects are important, please note that they won’t be covered in our course. However, we highly recommend enrolling in the GDPR course offered by Center for Health Data Science, specially if you’re working with sensitive data. 
This course specifically focuses on GDPR compliance and will provide you with valuable insights and skills in managing data privacy and security.\n\n\n\nDanish institutional RDM links\n\nUniversity of Copenhagen\nUniversity Library of Southern Denmark\nTechnical University of Denmark\nAalborg University\nAarhus University\n\n\n\nAcknowledgements\n\nRDMkit, ELIXIR (2021) Research Data Management Kit. A deliverable from the EU-funded ELIXIR-CONVERGE project (grant agreement 871075).\nUniversity of Copenhagen Research Data Management Team.\nMartin Proks and Sarah Lundregan, Brickman Lab, NNF Center for Stem Cell Biology (reNEW), University of Copenhagen.\nRichard Dennis, Data Steward, NNF Center for Stem Cell Biology (reNEW), University of Copenhagen.\nNBISweden." }, { "objectID": "practical_workflows.html#snakemake", "href": "practical_workflows.html#snakemake", - "title": "Workflows", + "title": "FAIR Workflows", "section": "Snakemake", "text": "Snakemake\nSnakemake is a text-based workflow tool that uses a Python-based language plus domain-specific syntax. The workflow is decomposed into rules that define how to obtain output files from input files. Snakemake infers the dependencies and the execution order.\n\nBasics\n\nDefine rules\nGeneralise the rule by creating wildcards. You can refer to them by index or by name\nDependencies are determined top-down\n\nFor a given target, a rule that can be applied to create it is determined (a job). For the input files of the rule, go on recursively. If no target is specified, snakemake tries to apply the first rule\n\nRule all: target rule that collects results\n\n\n\nJob execution\nA job is executed if and only if at least one of the following holds: - the output file is a target and does not exist - the output file is needed by another executed job and does not exist - an input file is newer than the output file - an input file will be updated by another job (e.g. 
changes in rules) - execution is forced (‘--forceall’)\nYou can plot the DAG (directed acyclic graph) of the jobs.\n\n\nUseful command line interface\n# dry-run (-n), print shell commands (-p)\nsnakemake -n -p\n# Snakefile with a different name or in another location\nsnakemake --snakefile path/to/file.smk\n# dry-run (-n), print execution reason for each job\nsnakemake -n -r\n# Visualise DAG of jobs using Graphviz dot command\nsnakemake --dag | dot -Tsvg > dag.svg\n\n\nDefining resources\nrule myrule:\n resources: mem_mb=100 # (100 MB memory allocation)\n threads: X\n shell:\n \"command {threads}\"\nLet’s say you defined that the rule myrule needs 4 threads. If we execute the workflow with 8 cores as follows:\nsnakemake --cores 8\nthen 2 ‘myrule’ jobs will be executed in parallel.\nJobs are scheduled to maximize parallelization, with high-priority jobs scheduled first, all while satisfying resource constraints. This means:\nIf we allocate 100 MB for the execution of ‘myrule’ and we call snakemake as follows:\nsnakemake --resources mem_mb=100 --cores 8\nOnly one ‘myrule’ job can be executed at a time (not enough memory is provided for 2). The memory resource is useful for memory-intensive jobs, to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time your full workflow will take. We highly recommend doing so: take a subset of your dataset and give it a go! Log files will come in very handy for the resource estimation. Of course, the execution of jobs also depends on the availability of free resources (e.g. CPU cores).\nrule myrule:\n log: \"logs/myrule.log\"\n threads: X\n shell:\n \"command {threads}\"\nLog files need to define the same wildcards as the output files; otherwise, you will get an error.\n\n\nConfig files\nYou can also define values for wildcards or parameters in the config file. 
This is recommended when the pipeline might be used several times at different time points, to avoid unwanted modifications to the workflow. Parameterization is key for such cases.\n\n\nCluster execution\nWhen working on cluster systems, you can execute the workflow using the qsub submission command\nsnakemake --cluster qsub\n\n\nAdditional advanced features\n\nmodularization\nhandling temporary and protected files: very important for intermediate files that fill up our storage, are not used in the long run, and can be deleted once the final output is generated. This is done automatically by snakemake if you define them in your pipeline\nHTML5 reports\nrule parameters\ntracking tool versions and code changes: will force rerunning older jobs when code and software are modified/updated.\ndata provenance information per file\nPython API for embedding snakemake in other tools\n\n\n\nCreate an isolated environment to install dependencies\nBasic file structure\n| - config.yml\n| - requirements.txt (commonly also named environment.txt)\n| - rules/\n| | - myrules.smk\n| - scripts/\n| | - script1.py\n| - Snakefile\nCreate a conda environment, one per project!\n# create env\nconda create -n myworkflow --file requirements.txt\n# activate environment\nconda activate myworkflow\n# then execute snakemake\nUse git repositories to save your projects and pipelines!" }, { "objectID": "practical_workflows.html#sources", "href": "practical_workflows.html#sources", - "title": "Workflows", + "title": "FAIR Workflows", "section": "Sources", - "text": "Sources\n\nSnakemake tutorial\nSnakemake tutorial slides by Johannes Köster\nhttps://bioconda.github.io\nKöster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.\nKöster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014." 
+ "text": "Sources\n\nSnakemake tutorial\nSnakemake tutorial slides by Johannes Köster\nhttps://bioconda.github.io\nKöster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.\nKöster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014.\nfaircookbook workflows" }, { "objectID": "develop/practical_workshop.html#create-a-catalog-of-your-assay-folder", @@ -1580,7 +1580,7 @@ { "objectID": "practical_workflows.html#nextflow", "href": "practical_workflows.html#nextflow", - "title": "Workflows", + "title": "FAIR Workflows", "section": "Nextflow", "text": "Nextflow" }, @@ -1601,14 +1601,14 @@ "href": "develop/practical_workshop.html#organize-and-structure-your-datasets-and-data-analysis", "title": "Practical material", "section": "1. Organize and structure your datasets and data analysis", - "text": "1. Organize and structure your datasets and data analysis\nEstablishing a consistent file structure and naming conventions will help you efficiently manage your data. We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:\n\nData folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.\nProject folders: They contain all the essential files for a specific research project. 
Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.\n\nData and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nWhen organizing your data folders, separate assays from external resources and maintain a consistent structure. For example, organize genome references by species and further categorize them by versions. Make sure to include all relevant information, and refer to this lesson for additional tips on data organization.\nThis will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!\n\n\n\n\n\n\nData folders\nWhether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. 
Check this lesson for tips on naming conventions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nUse an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq) a keyword (2) that represents a unique element to that assay, and the date (3).\n<Assay-ID>_<keyword>_YYYYMMDD\nFor example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye.\n\n\n\n\n\nLet’s explore a potential folder structure and the types of files you might encounter within it.\n<data_type>_<keyword>_YYYYMMDD/\n├── README.md \n├── CHECKSUMS\n├── pipeline\n ├── pipeline.md\n ├── scripts/\n├── processed\n ├── fastqc/\n ├── multiqc/\n ├── final_fastq/\n└── raw\n ├── .fastq.gz \n └── samplesheet.csv\n\nREADME.md: This file contains a detailed description of the dataset commonly in markdown format. It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).\nmetadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.\npipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.\nprocessed: This folder contains the results from the preprocessing pipeline. The content vary depending on the specific pipeline used (create additional subdirectories as needed).\nraw: This folder holds the raw data.\n\n.fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.\nsamplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. 
If you are considering running nf-core pipelines, this file will be required.\n\n\n\n\nProject folders\nOn the other hand, we have another type of folder called Projects which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. This helps ensure that the analyses can be easily replicated and shared with others.\nThe Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:\n<project>_<keyword>_YYYYMMDD\n\n\n\n\n\n\nNaming examples\n\n\n\n\n\n\n\n\nRNASeq_Mouse_Brain_20230512: a project RNA sequencing data from a mouse brain experiment, created on May 12, 2023\nEHR_COVID19_Study_20230115: a project around electronic health records data for a COVID-19 study, created on January 15, 2023.\n\n\n\n\n\n\nNow, let’s explore an example of a folder structure and the types of files you might encounter within it.\n<project>_<keyword>_YYYYMMDD\n├── data\n│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link\n├── documents\n│ └── research_project_template.docx\n├── metadata.yml\n├── notebooks\n│ └── 01_data_processing.rmd\n│ └── 02_data_analysis.rmd\n│ └── 03_data_visualization.rmd\n├── README.md\n├── reports\n│ └── 01_data_processing.html\n│ └── 02_data_analysis.html\n│ ├── 03_data_visualization.html\n│ │ └── figures\n│ │ └── tables\n├── 
requirements.txt // env.yaml\n├── results\n│ ├── figures\n│ │ └── 02_data_analysis/\n│ │ └── heatmap_sampleCor_20230102.png\n│ ├── tables\n│ │ └── 02_data_analysis/\n│ │ └── DEA_treat-control_LFC1_p01.tsv\n│ │ └── SumStats_sampleCor_20230102.tsv\n├── pipeline\n│ ├── rules // processes \n│ │ └── step1_data_processing.smk\n│ └── pipeline.md\n├── scratch\n└── scripts\n\ndata: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.\ndocuments: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. It also includes the Data Management Plan.\n\nresearch_project_template.docx. If you download our template you will find a is a pre-filled Data Management Plan based on the Horizon Europe guidelines named ‘Non-sensitive_NGS_research_project_template.docx’.\n\nmetadata.yml: metadata file describing various keys of the project or experiment (see this lesson).\nnotebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.\nREADME.md: A detailed project description in markdown or plain-text format.\nreports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.\n\nfigures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.\n\nrequirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. 
If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.\nresults: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.\npipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.\nscratch: A folder designated for temporary files or workspace for experiments and development.\nscripts: Folder for helper scripts needed to run data analysis or reproduce the work.\n\n\n\nTemplate engine\nCreating a folder template is straightforward with cookiecutter a command-line tool that generates projects from templates (called cookiecutters). For example, it can help you set up a Python package project based on a Python package project template.\n\n\n\n\n\n\nCookiecutter templates\n\n\n\nHere are some template that you can use to get started, adapt and modify them to your own needs:\n\nPython package project\nSandbox test\nData science\nNGS data\n\nCreate your own template from scratch.\n\n\n\nQuick tutorial on cookiecutter\nBuilding a Cookiecutter template from scratch requires defining a folder structure, crafting a cookiecutter.json file, and outlining placeholders (keywords) that will be substituted when generating a new project. Here’s a step-by-step guide on how to proceed:\n\nStep 1: Create a Folder Template\nFirst, begin by creating a folder structure that aligns with your desired template design. For instance, let’s set up a simple Python project template:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\nIn this example, {cookiecutter.project_name} is a placeholder that will be replaced with the actual project name when the template is used. 
This directory contains a python script (‘main.py’), a subdirectory (‘tests’) with a second python script named after the project (‘test_{{cookiecutter.project_name}}.py’) and a ‘README.md’ file.\n\n\nStep 2: Create cookiecutter.json\nIn the root of your template folder, create a file named cookiecutter.json. This file will define the variables (keywords) that users will be prompted to fill in. For our Python project template, it might look like this:\n{\n \"project_name\": \"MyProject\",\n \"author_name\": \"Your Name\",\n \"description\": \"A short description of your project\"\n}\nWhen users generate a project based on your template, they will be prompted with these questions. The provided values (“responses”) will be used to substitute the placeholders in your template files.\nBeyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:\nFirst, modify the my_template/main.py file to include a placeholder inside its contents:\n# main.py\n\ndef hello():\n print(\"Hello, {{cookiecutter.project_name}}!\")\nThe ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.\nAfter running Cookiecutter, your generated ‘main.py’ file could appear as follows:\n# main.py\n\ndef hello():\n print(\"Hello, MyProject!\") # Assuming \"MyProject\" was entered as the project_name\n\n\nStep 3: Use Cookiecutter\nOnce your template is prepared, you can utilize Cookiecutter to create a project from it. Open a terminal and execute:\ncookiecutter path/to/your/template\nCookiecutter will prompt you to provide values for project_name, author_name, and description. 
Once you input these values, Cookiecutter will replace the placeholders in your template files with the entered values.\n\n\nStep 4: Review the Generated Project\nAfter the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.\n\n\n\n\n\n\nExercise 1: Create your own template\n\n\n\n\n\n\n\nUse Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a Github repository (recommended for this workshop). Feel free to tailor the template to your specific requirements—you don’t have to follow our examples exactly.\nRequirements\nWe assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.\nProject\n\nGo to our Cookicutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. \nOpen a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):\ngit clone <your URL to the template>\nIf you have a GitHub Desktop, click Add and select “Clone repository” from the options\nOpen the repository and navigate through the different directories\nModify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones. remove existing one or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and add the ‘requirements.txt’ file. 
Consider creating it, along with a subdirectory named ‘reports/figures’.\n├── results/\n│ ├── figures/\n├── requirements.txt\nHere’s an example of how to do it:\n# Open your terminal and navigate to your template directory. Then: \ncd \\{\\{\\ cookiecutter.project_name\\ \\}\\}/ \nmkdir reports \ntouch requirements.txt\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with git add\nCommit the changes with a meaningful commit message git commit -m \"update cookicutter template\"\nPush the changes to your forked repository on Github git push origin main (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\">\nFill up the variables and verify that the new structure (and folders) looks like you would expect. Have any new folders been added, or have some been removed?\n\n\n\n\n\n\n\n\n\n\n\n\nOptional Exercise 1, part B\n\n\n\n\n\n\n\nCreate a template from scratch using this tutorial scratch, it can be as basic as this one below or ‘Data folder’:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\n\nStep 1: Create a directory for the template.\nStep 2: Write a cookiecutter.json file with variables such as project_name and author.\nStep 3: Set up the folder structure by creating subdirectories and files as needed.\nStep 4: Incorporate cookiecutter variables in the names of files.\nStep 5: Use cookiecutter variables within scripts, such as printing a message that includes the project name." + "text": "1. Organize and structure your datasets and data analysis\nEstablishing a consistent file structure and naming conventions will help you efficiently manage your data. 
We will classify your data and data analyses into two distinct types of folders to ensure the data can be used and shared by many lab members while preventing modifications by any individual:\n\nData folders (assay or external databases and resources): They house the raw and processed datasets, alongside the pipeline/workflow used to generate the processed data, the provenance of the raw data, and quality control reports of the data. The data should be locked and set to read-only to prevent unintended modifications. This applies to experimental data generated in your lab as well as external resources. Provide an MD5 checksum file when you download them yourself to verify their integrity.\nProject folders: They contain all the essential files for a specific research project. Projects may use data from various resources or experiments, or build upon previous results from other projects. The data should not be copied or duplicated, instead, it should be linked directly from the source.\n\nData and data analysis are kept separate because a project may utilize one or more datasets to address a scientific question. Data can be reused in multiple projects over time, combined with other datasets for comparison, or used to build larger datasets. Additionally, data may be utilized by different researchers to answer various research questions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nWhen organizing your data folders, separate assays from external resources and maintain a consistent structure. For example, organize genome references by species and further categorize them by versions. 
Make sure to include all relevant information, and refer to this lesson for additional tips on data organization.\nThis will help you to keep your data tidied up, especially if you are working in a big lab where assays may be used for different purposes and by different people!\n\n\n\n\n\n\nData folders\nWhether your lab generates its own experimental data, receives it from collaborators, or works with previously published datasets, the data folder should follow a similar structure to the one presented here. Create a separate folder for each dataset, including raw files and processed files alongside the corresponding documentation and pipeline that generated the processed data. Raw files should remain untouched, and you should consider locking modifications to the final results once data preprocessing is complete. This precaution helps prevent unwanted changes to the data. Each subfolder should be named in a way that is distinct, easily readable and clear at a glance. Check this lesson for tips on naming conventions.\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nUse an acronym (1) that describes the type of NGS assay (RNAseq, ChIPseq, ATACseq) a keyword (2) that represents a unique element to that assay, and the date (3).\n<Assay-ID>_<keyword>_YYYYMMDD\nFor example CHIP_Oct4_20230101 is a ChIPseq assay made on 1st January 2023 with the keyword Oct4, so it is easily identifiable by the eye.\n\n\n\n\n\nLet’s explore a potential folder structure and the types of files you might encounter within it.\n<data_type>_<keyword>_YYYYMMDD/\n├── README.md \n├── CHECKSUMS\n├── pipeline\n ├── pipeline.md\n ├── scripts/\n├── processed\n ├── fastqc/\n ├── multiqc/\n ├── final_fastq/\n└── raw\n ├── .fastq.gz \n └── samplesheet.csv\n\nREADME.md: This file contains a detailed description of the dataset commonly in markdown format. 
It should include the provenance of the raw data (such as samples, laboratory protocols used, the aim of the project, folder structure, naming conventions, etc.).\nmetadata.yml: This metadata file outlines different keys and essential information, usually presented in YAML format. For more details, refer to this lesson.\npipeline.md: This file provides an overview of the pipeline used to process raw data, as well as the commands to run the pipeline. The pipeline itself and all the required scripts should be collected in the same directory.\nprocessed: This folder contains the results from the preprocessing pipeline. The content vary depending on the specific pipeline used (create additional subdirectories as needed).\nraw: This folder holds the raw data.\n\n.fastq.gz: For example, in NGS assays, there should be ‘fastq’ files.\nsamplesheet.csv: This file holds essential metadata for the samples, including sample identification, experimental variables, batch information, and other metrics crucial for downstream analysis. It is important that this file is complete and current, as it is key to interpreting results. If you are considering running nf-core pipelines, this file will be required.\n\n\n\n\nProject folders\nOn the other hand, we have another type of folder called Projects which refers to data analyses that are specific to particular tasks, such as those involved in preparing a potential article. In this folder, you will create a subfolder for each project that you or your lab is working on. Each Project subfolder should include project-specific information, data analysis pipelines, notebooks, and scripts used for that particular project. Additionally, you should include an environment file with all the required software and dependencies needed for the project, including their versions. 
This helps ensure that the analyses can be easily replicated and shared with others.\nThe Project folder should be named in a way that is unique, easy to read, distinguishable, and clear at a glance. For example, you might name it based on the main author’s initials, the dataset being analyzed, the project name, a unique descriptive element related to the project, or the part of the project you are responsible for, along with the date:\n<project>_<keyword>_YYYYMMDD\n\n\n\n\n\n\nNaming examples\n\n\n\n\n\n\n\n\nRNASeq_Mouse_Brain_20230512: a project RNA sequencing data from a mouse brain experiment, created on May 12, 2023\nEHR_COVID19_Study_20230115: a project around electronic health records data for a COVID-19 study, created on January 15, 2023.\n\n\n\n\n\n\nNow, let’s explore an example of a folder structure and the types of files you might encounter within it.\n<project>_<keyword>_YYYYMMDD\n├── data\n│ └── <ID>_<keyword>_YYYYMMDD <- symbolic link\n├── documents\n│ └── research_project_template.docx\n├── metadata.yml\n├── notebooks\n│ └── 01_data_processing.rmd\n│ └── 02_data_analysis.rmd\n│ └── 03_data_visualization.rmd\n├── README.md\n├── reports\n│ └── 01_data_processing.html\n│ └── 02_data_analysis.html\n│ ├── 03_data_visualization.html\n│ │ └── figures\n│ │ └── tables\n├── requirements.txt // env.yaml\n├── results\n│ ├── figures\n│ │ └── 02_data_analysis/\n│ │ └── heatmap_sampleCor_20230102.png\n│ ├── tables\n│ │ └── 02_data_analysis/\n│ │ └── DEA_treat-control_LFC1_p01.tsv\n│ │ └── SumStats_sampleCor_20230102.tsv\n├── pipeline\n│ ├── rules // processes \n│ │ └── step1_data_processing.smk\n│ └── pipeline.md\n├── scratch\n└── scripts\n\ndata: This folder contains symlinks or shortcuts to the actual data files, ensuring that the original files remain unaltered.\ndocuments: This folder houses Word documents, slides, or PDFs associated with the project, including data and project explanations, research papers, and more. 
It also includes the Data Management Plan.\n\nresearch_project_template.docx. If you download our template, you will find a pre-filled Data Management Plan based on the Horizon Europe guidelines, named ‘Non-sensitive_NGS_research_project_template.docx’.\n\nmetadata.yml: metadata file describing various keys of the project or experiment (see this lesson).\nnotebooks: This folder stores Jupyter, R Markdown, or Quarto notebooks containing the data analysis. Figures and tables used for the reports are organized under subfolders named after the notebook that created them for provenance purposes.\nREADME.md: A detailed project description in markdown or plain-text format.\nreports: Notebooks rendered as HTML, docx, or PDF files for sharing with colleagues or as formal data analysis reports.\n\nfigures: figures produced upon rendering notebooks. The figures will be saved under a subfolder named after the notebook that created them. This is for provenance purposes so we know which notebook created which figures.\n\nrequirements.txt: This file lists the necessary software, libraries, and their versions required to reproduce the code. If you’re using conda environments, you will also find the env.yaml file here, which outlines the specific environment configuration.\nresults: This folder contains analysis results, such as figures and tables. Organizing results by the pipeline, script, or notebook that generated them will make it easier to locate and interpret the data.\npipeline: A folder containing pipeline scripts or workflows for processing and analyzing data.\nscratch: A folder designated for temporary files or workspace for experiments and development.\nscripts: Folder for helper scripts needed to run data analysis or reproduce the work.\n\n\n\nTemplate engine\nCreating a folder template is straightforward with cookiecutter, a command-line tool that generates projects from templates (called cookiecutters). 
For example, it can help you set up a Python package project based on a Python package project template.\n\n\n\n\n\n\nCookiecutter templates\n\n\n\nHere are some templates that you can use to get started; adapt and modify them to your own needs:\n\nPython package project\nSandbox test\nData science\nNGS data\n\nCreate your own template from scratch.\n\n\n\nQuick tutorial on cookiecutter\nBuilding a Cookiecutter template from scratch requires defining a folder structure, crafting a cookiecutter.json file, and outlining placeholders (keywords) that will be substituted when generating a new project. Here’s a step-by-step guide on how to proceed:\n\nStep 1: Create a Folder Template\nFirst, begin by creating a folder structure that aligns with your desired template design. For instance, let’s set up a simple Python project template:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\nIn this example, {{cookiecutter.project_name}} is a placeholder that will be replaced with the actual project name when the template is used. This directory contains a Python script (‘main.py’), a subdirectory (‘tests’) with a second Python script named after the project (‘test_{{cookiecutter.project_name}}.py’), and a ‘README.md’ file.\n\n\nStep 2: Create cookiecutter.json\nIn the root of your template folder, create a file named cookiecutter.json. This file will define the variables (keywords) that users will be prompted to fill in. For our Python project template, it might look like this:\n{\n \"project_name\": \"MyProject\",\n \"author_name\": \"Your Name\",\n \"description\": \"A short description of your project\"\n}\nWhen users generate a project based on your template, they will be prompted for these values. 
The provided values (“responses”) will be used to substitute the placeholders in your template files.\nBeyond substituting placeholders in file and directory names, Cookiecutter can automatically populate text file contents with information. This feature is useful for offering default configurations or code file templates. Let’s enhance our earlier example by incorporating a placeholder within a text file:\nFirst, modify the my_template/main.py file to include a placeholder inside its contents:\n\n\nmain.py\n\n# main.py\ndef hello():\n print(\"Hello, {{cookiecutter.project_name}}!\")\n\nThe ‘{{cookiecutter.project_name}}’ placeholder is now included within the main.py file. When you execute Cookiecutter, it will automatically replace the placeholders in both file and directory names and within text file contents.\nAfter running Cookiecutter, your generated ‘main.py’ file could appear as follows:\n# main.py, assuming \"MyProject\" was entered as the project_name\ndef hello():\n print(\"Hello, MyProject!\") \n\n\nStep 3: Use Cookiecutter\nOnce your template is prepared, you can utilize Cookiecutter to create a project from it. Open a terminal and execute:\ncookiecutter path/to/your/template\nCookiecutter will prompt you to provide values for project_name, author_name, and description. Once you input these values, Cookiecutter will replace the placeholders in your template files with the entered values.\n\n\nStep 4: Review the Generated Project\nAfter the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.\n\n\n\n\n\n\nExercise 1: Create your own template\n\n\n\n\n\n\n\nUse Cookiecutter to create custom templates for your folders. You can do it from scratch (see Exercise 1, part B) or opt for one of our pre-made templates available as a Github repository (recommended for this workshop). 
Feel free to tailor the template to your specific requirements; you don’t have to follow our examples exactly.\nRequirements\nWe assume you have already gone through the requirements at the beginning of the practical lesson. This includes installing the necessary tools and setting up accounts as needed.\nProject\n\nGo to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. \nOpen a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):\ngit clone <your URL to the template>\nIf you have GitHub Desktop, click Add and select “Clone repository” from the options\nOpen the repository and navigate through the different directories\nModify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and the ‘requirements.txt’ file. Consider creating them, along with a subdirectory named ‘reports/figures’.\n├── reports/\n│ ├── figures/\n├── requirements.txt\nHere’s an example of how to do it:\n# Open your terminal and navigate to your template directory. 
Then: \ncd \\{\\{\\ cookiecutter.project_name\\ \\}\\}/ \nmkdir reports \ntouch requirements.txt\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with git add\nCommit the changes with a meaningful commit message git commit -m \"update cookicutter template\"\nPush the changes to your forked repository on Github git push origin main (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\">\nFill up the variables and verify that the new structure (and folders) looks like you would expect. Have any new folders been added, or have some been removed?\n\n\n\n\n\n\n\n\n\n\n\n\nOptional Exercise 1, part B\n\n\n\n\n\n\n\nCreate a template from scratch using this tutorial scratch, it can be as basic as this one below or ‘Data folder’:\nmy_template/\n|-- {{cookiecutter.project_name}}\n| |-- main.py\n|-- tests\n| |-- test_{{cookiecutter.project_name}}.py\n|-- README.md\n\nStep 1: Create a directory for the template.\nStep 2: Write a cookiecutter.json file with variables such as project_name and author.\nStep 3: Set up the folder structure by creating subdirectories and files as needed.\nStep 4: Incorporate cookiecutter variables in the names of files.\nStep 5: Use cookiecutter variables within scripts, such as printing a message that includes the project name." }, { "objectID": "develop/practical_workshop.html#data-documentation", "href": "develop/practical_workshop.html#data-documentation", "title": "Practical material", "section": "2. Data documentation", - "text": "2. Data documentation\nData documentation involves organizing, describing, and providing context for datasets and projects. While metadata concentrates on the data itself, README files provide a broader perspective on the overall project or resource.\n\nMetadata\n\n\n\n\n\n\nmetadata.yml\n\n\n\nChoose the format that best suits the project’s needs. 
In this workshop, we will focus on YAMl as it is highly used for configuration files (e.g., in conda or pipelines).\n\n\n\n\n\n\nFile formats\n\n\n\n\n\n\n\n\nXML (eXtensible Markup Language): uses custom tags to describe data and allows for a hierarchical structure.\nJSON (JavaScript Object Notation): lightweight and human-readable format that is easy to parse and generate.\nCSV (Comma-Separated Values) or TSV (tabulate-separate values): simple and widely supported for representing tabular formats. Easy to manipulate using software or programming languages. It is often use for sample metadata.\nYAML (YAML Ain’t Markup Language): human-readable data serialization format, commonly used as project configuration files.\n\nOthers such as RDF or HDF5.\n\n\n\n\n\nLink to the file format database.\n\n\nMetadata in biological datasets refers to the information that describes the data and provides context for how the data was collected, processed, and analyzed. Metadata is crucial for understanding, interpreting, and using biological datasets effectively. It also ensures that datasets are reusable, reproducible and understandable by other researchers. Some of the components may differ depending on the type of project, but there are general concepts that will always be shared across different projects:\n\nSample information and collection details\nBiological context (such experimental conditions if applicable)\nData description\nData processing steps applied to the raw data\nAnnotation and Ontology terms\nFile metadata (file type, file format, etc.)\nEthical and Legal Compliance (ownership, access, provenance)\n\n\n\n\n\n\n\nMetadata and controlled vocabularies\n\n\n\nTo maximize the usefulness of metadata, aim to use controlled vocabularies across all fields. Read more about data documentation and find ontology services examples in lesson 4. 
We encourage you to begin implementing them systematically on your own (under the “sources” section, you will find some helpful links to guide you putting them in practice).\nIf you work with NGS data, check out this recommendations and examples of metadata for samples, projects and datasets.\n\n\n\n\nREADME file\n\n\n\n\n\n\nREADME.md\n\n\n\nChoose the format that best suits the project’s needs. In this workshop, we will focused on Markdown as it is the most used format due to its balance of simplicity and expressive formatting options.\n\n\n\n\n\n\nFile formats\n\n\n\n\n\n\n\n\nMarkdown (.md): commonly used because is easy to read and write and is compatible across platforms (e.g., GitHub, GitLab). Supports formatting like headings, lists, links, images, and code blocks.\nPlain Text (.txt): Simple and straightforward format without any rich formatting and great for basic instructions. Lack the ability of structure content effectively.\nReStructuredText (.rst): commonly used for python projects. Supports advanced formatting (takes, links, images and code blocks) .\n\nOthers such as HTML, YAML and Notebooks.\n\n\n\n\n\nLink to the file format database\n\n\nThe README.md file is a markdown file that provides a comprehensive description of the data within a folder. Its rich text format (including bold, italic, links, etc.) allows you to explain the contents of the folder, as well as the reasons and methods behind its creation or collection. The content will vary depending on what it described (data or assays, project, software…).\nHere is an example of a README file for a bioinformatics project:\n\n\n\n\n\n\nREADME\n\n\n\n\n\n# TITLE\nClear and descriptive.\n# OVERVIEW\nIntroduction to the project including its aims, and its significance. 
Describe the main purpose and the biological questions being addressed.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nThis project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.\nUnderstanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.\n\n\n\n\n\n# TABLE OF CONTENTS (optional but helpful for others to navigate to different sections)\n# INSTALLATION AND SETUP\nList all prerequisites, software, dependencies, and system requirements needed for others to reproduce the project. If available, you may link to a Docker image, Conda YAML file, or requirements.txt file.\n# USAGE\nInclude command-line examples for various functionalities or steps and path for running a pipeline, if applicable.\n# DATASETS\nDescribe the data,, including its sources, format, and how to access it. If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nWe have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. 
These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.\n\n\n\n\n\n# RESULTS\nSummarize the results and key findings or outputs.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nOur analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.\nFurthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.\n\n\n\n\n\n# CONTRIBUTIONS AND CONTACT INFO\n# LICENSE\n\n\n\n\n\n\n\n\n\n\nExercise 2: modify the metadata.yml file in your Cookiecutter template\n\n\n\n\n\n\n\nIt is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!\n\nConsider changing variables (add/remove) in the metadata.yml file from the cookicutter template.\nModify the cookiecutter.json file. 
You could add new variables or change the default key and/or values:\n{\n\"project_name\": \"myProject\",\n\"project_slug\": \"{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}\",\n\"authors\": \"myName\",\n\"start_date\": \"{% now 'utc', '%Y%m%d' %}\",\n\"short_desc\": \"\",\n\"version\": \"0.1.0\"\n}\nThe metadata file will be filled accordingly.\nOptional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.\n\"__prompts__\": {\n \"project_name\": \"Project directory name [Example: project_short_description_202X]\",\n \"author\": \"Author of the project\",\n \"date\": \"Date of project creation, default is today's date\",\n \"short_description\": \"Provide a detailed description of the project (context/content)\"\n},\nModify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file. Hint below:\nproject: {{ cookiecutter.project_name }}\nauthor: {{ cookiecutter.author }}\ndate: {{ cookiecutter.date }}\ndescription: {{ cookiecutter.short_description }}\nModify the README.md file so that it includes the short description recorded by the cookiecutter.json file and the metadata at the top of the markdown file (top between lines of dashed).\n---\ntitle: {{ cookiecutter.project_name }}\ndate: \"{{ cookiecutter.date }}\"\nauthor: {{ cookiecutter.author }}\nversion: {{ cookiecutter.version }}\n---\n\nProject description\n----\n\n{{ cookiecutter.short_description }}\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with git add\nCommit the changes with a meaningful commit message git commit -m \"update cookicutter template\"\nPush the changes to your forked repository on Github git push origin main (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\">\nFill up the variables and 
verify that the modified information looks like you would expect." + "text": "2. Data documentation\nData documentation involves organizing, describing, and providing context for datasets and projects. While metadata concentrates on the data itself, README files provide a broader perspective on the overall project or resource.\n\nMetadata\n\n\n\n\n\n\nmetadata.yml\n\n\n\nChoose the format that best suits the project’s needs. In this workshop, we will focus on YAMl as it is highly used for configuration files (e.g., in conda or pipelines).\n\n\n\n\n\n\nFile formats\n\n\n\n\n\n\n\n\nXML (eXtensible Markup Language): uses custom tags to describe data and allows for a hierarchical structure.\nJSON (JavaScript Object Notation): lightweight and human-readable format that is easy to parse and generate.\nCSV (Comma-Separated Values) or TSV (tabulate-separate values): simple and widely supported for representing tabular formats. Easy to manipulate using software or programming languages. It is often use for sample metadata.\nYAML (YAML Ain’t Markup Language): human-readable data serialization format, commonly used as project configuration files.\n\nOthers such as RDF or HDF5.\n\n\n\n\n\nLink to the file format database.\n\n\nMetadata in biological datasets refers to the information that describes the data and provides context for how the data was collected, processed, and analyzed. Metadata is crucial for understanding, interpreting, and using biological datasets effectively. It also ensures that datasets are reusable, reproducible and understandable by other researchers. 
Some of the components may differ depending on the type of project, but there are general concepts that will always be shared across different projects:\n\nSample information and collection details\nBiological context (such experimental conditions if applicable)\nData description\nData processing steps applied to the raw data\nAnnotation and Ontology terms\nFile metadata (file type, file format, etc.)\nEthical and Legal Compliance (ownership, access, provenance)\n\n\n\n\n\n\n\nMetadata and controlled vocabularies\n\n\n\nTo maximize the usefulness of metadata, aim to use controlled vocabularies across all fields. Read more about data documentation and find ontology services examples in lesson 4. We encourage you to begin implementing them systematically on your own (under the “sources” section, you will find some helpful links to guide you putting them in practice).\nIf you work with NGS data, check out this recommendations and examples of metadata for samples, projects and datasets.\n\n\n\n\nREADME file\n\n\n\n\n\n\nREADME.md\n\n\n\nChoose the format that best suits the project’s needs. In this workshop, we will focused on Markdown as it is the most used format due to its balance of simplicity and expressive formatting options.\n\n\n\n\n\n\nFile formats\n\n\n\n\n\n\n\n\nMarkdown (.md): commonly used because is easy to read and write and is compatible across platforms (e.g., GitHub, GitLab). Supports formatting like headings, lists, links, images, and code blocks.\nPlain Text (.txt): Simple and straightforward format without any rich formatting and great for basic instructions. Lack the ability of structure content effectively.\nReStructuredText (.rst): commonly used for python projects. Supports advanced formatting (takes, links, images and code blocks) .\n\nOthers such as HTML, YAML and Notebooks.\n\n\n\n\n\nLink to the file format database\n\n\nThe README.md file is a markdown file that provides a comprehensive description of the data within a folder. 
Its rich text format (including bold, italic, links, etc.) allows you to explain the contents of the folder, as well as the reasons and methods behind its creation or collection. The content will vary depending on what it described (data or assays, project, software…).\nHere is an example of a README file for a bioinformatics project:\n\n\n\n\n\n\nREADME\n\n\n\n\n\n# TITLE\nClear and descriptive.\n# OVERVIEW\nIntroduction to the project including its aims, and its significance. Describe the main purpose and the biological questions being addressed.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nThis project aims to investigate gene expression patterns across various human tissues using Next Generation Sequencing (NGS) data. By analyzing the transcriptomes of different tissues, we seek to uncover tissue-specific gene expression profiles and identify potential markers associated with specific biological functions or diseases.\nUnderstanding tissue-specific gene expression is crucial for deciphering the molecular basis of health and disease. Identifying genes that are uniquely expressed in certain tissues can provide insights into tissue function, development, and potential therapeutic targets. This project contributes to our broader understanding of human biology and has implications for personalized medicine and disease research.\n\n\n\n\n\n# TABLE OF CONTENTS (optional but helpful for others to navigate to different sections)\n# INSTALLATION AND SETUP\nList all prerequisites, software, dependencies, and system requirements needed for others to reproduce the project. If available, you may link to a Docker image, Conda YAML file, or requirements.txt file.\n# USAGE\nInclude command-line examples for various functionalities or steps and path for running a pipeline, if applicable.\n# DATASETS\nDescribe the data,, including its sources, format, and how to access it. 
If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nWe have used internal datasets with IDs: RNA_humanSkin_20201030, RNA_humanBrain_20210102, RNA_humanLung_20220304.\nIn addition, we utilized publicly available NGS datasets from the GTEx (Genotype-Tissue Expression) project, which provides comprehensive RNA-seq data across multiple human tissues. These datasets offer a wealth of information on gene expression levels and isoform variations across diverse tissues, making them ideal for our analysis.\n\n\n\n\n\n# RESULTS\nSummarize the results and key findings or outputs.\n\n\n\n\n\n\nExample text\n\n\n\n\n\n\n\nOur analysis revealed distinct gene expression patterns among different human tissues. We identified tissue-specific genes enriched in brain tissues, highlighting their potential roles in neurodevelopment and function. Additionally, we found a set of genes that exhibit consistent expression across a range of tissues, suggesting their fundamental importance in basic cellular processes.\nFurthermore, our differential expression analysis unveiled significant changes in gene expression between healthy and diseased tissues, shedding light on potential molecular factors underlying various diseases. Overall, this project underscores the power of NGS data in unraveling intricate gene expression networks and their implications for human health.\n\n\n\n\n\n# CONTRIBUTIONS AND CONTACT INFO\n# LICENSE\n\n\n\n\n\n\n\n\n\n\nExercise 2: modify the metadata.yml file in your Cookiecutter template\n\n\n\n\n\n\n\nIt is time now to customize your Cookiecutter templates and modify the metadata.yml files so that they fit your needs!\n\nConsider changing variables (add/remove) in the metadata.yml file from the cookicutter template.\nModify the cookiecutter.json file. 
You could add new variables or change the default key and/or values:\n\n\ncookiecutter.json\n\n{\n\"project_name\": \"myProject\",\n\"project_slug\": \"{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}\",\n\"authors\": \"myName\",\n\"start_date\": \"{% now 'utc', '%Y%m%d' %}\",\n\"short_desc\": \"\",\n\"version\": \"0.1.0\"\n}\n\nThe metadata file will be filled accordingly.\nOptional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template.\n\n\ncookiecutter.json\n\n\"__prompts__\": {\n \"project_name\": \"Project directory name [Example: project_short_description_202X]\",\n \"author\": \"Author of the project\",\n \"date\": \"Date of project creation, default is today's date\",\n \"short_description\": \"Provide a detailed description of the project (context/content)\"\n},\n\nModify the metadata.yml file so that it includes the metadata recorded by the cookiecutter.json file. 
Hint below:\n\n\nmetadata.yml\n\nproject: {{ cookiecutter.project_name }}\nauthor: {{ cookiecutter.author }}\ndate: {{ cookiecutter.date }}\ndescription: {{ cookiecutter.short_description }}\n\nModify the README.md file so that it includes the short description recorded by the cookiecutter.json file and the metadata at the top of the markdown file (top between lines of dashed).\n\n\nREADME.md\n\n---\ntitle: {{ cookiecutter.project_name }}\ndate: \"{{ cookiecutter.date }}\"\nauthor: {{ cookiecutter.author }}\nversion: {{ cookiecutter.version }}\n---\n\nProject description\n----\n\n{{ cookiecutter.short_description }}\n\nCommit and push changes when you are done with your modifications\n\n\nStage the changes with git add\nCommit the changes with a meaningful commit message git commit -m \"update cookicutter template\"\nPush the changes to your forked repository on Github git push origin main (or the appropriate branch name)\n\n\nTest your template by using cookiecutter <URL to your GitHub repository \"cookicutter-template\">\nFill up the variables and verify that the modified information looks like you would expect." }, { "objectID": "develop/practical_workshop.html#overview", @@ -1662,6 +1662,25 @@ "href": "develop/practical_workshop.html#create-a-catalog-of-your-data-folder", "title": "Practical material", "section": "4. Create a catalog of your data folder", - "text": "4. Create a catalog of your data folder\nThe next step is to collect all the NGS datasets that you have created in the manner explained above. Since your folders all should contain the metadata.yml file in the same place with the same metadata, it should be very easy to iteratively go through all the folders and merge all the metadata.yml files into a one single table. This table can be then browsed easily with Microsoft Excel, for example. 
If you are interested in making a Shiny app or Python Panel tool to interactively browse the catalog, check out this lesson.\n\n\n\n\n\n\nExercise 4: create a metadata.tsv catalog\n\n\n\n\n\n\n\nWe will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your Assays folder), fetches all the metadata.yml files, and merges them. Finally, it will write a TSV file as an output.\n\nCreate a folder called dataset and change directory cd dataset\nFork this repository: a Cookiecutter template designed for NGS datasets. While you are welcome to create your own template from scratch, we recommend using this one to save time.\nRun the cookiecutter cc-data-template command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created). Execute the script below using R (or create your own script in Python). Adjust the folder_path variable so that it matches the path to the Assays folder. 
The resulting table will be saved in the same folder_path.\nOpen your database_YYYYMMDD.tsv table in a text editor from the command-line, or view it in Excel for better visualization.\n\n\nlibrary(yaml)\nlibrary(dplyr)\nlibrary(lubridate)\n\n# Function to read a YAML file and transform it into a dataframe format.\nread_yaml <- function(file_path) {\n # Read the YAML file and convert it to a data frame\n df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE)\n \n # Return the data frame\n return(df)\n}\n\n# Function to recursively fetch metadata.yml files\nget_metadata <- function(folder_path) {\n file_list <- list.files(path = folder_path, pattern = \"metadata\\\\.yml$\", recursive = TRUE, full.names = TRUE)\n\n metadata_list <- lapply(file_list, read_yaml)\n \n # Combine the list of data frames into a single data frame using dplyr::bind_rows()\n combined_metadata <- bind_rows(metadata_list)\n\n return(combined_metadata)\n}\n\n# Specify the folder path\nfolder_path <- \"/path/to/your/folder\"\n\n# Fetch metadata from the specified folder\nmetadata <- get_metadata(folder_path)\n\n# Save the data frame as a TSV file\noutput_file <- paste0(\"database_\", format(Sys.Date(), \"%Y%m%d\"), \".tsv\")\nwrite.table(metadata, file = output_file, sep = \"\\t\", quote = FALSE, row.names = FALSE)\n\n# Print confirmation message\ncat(\"Database saved as\", output_file, \"\\n\")" + "text": "4. Create a catalog of your data folder\nThe next step is to collect all the datasets that you have created in the manner explained above. Since your folders all should contain the metadata.yml file in the same place with the same metadata, it should be very easy to iteratively go through all the folders and merge all the metadata.yml files into a one single table. 
The table can be easily viewed in your terminal or even with Microsoft Excel.\n\n\n\n\n\n\nExercise 4: create a metadata.tsv catalog\n\n\n\n\n\n\n\nWe will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your Assays folder), fetches all the metadata.yml files, merges them, and writes a TSV file as output.\n\nCreate a folder called dataset and change into it with cd dataset\nFork this repository: a Cookiecutter template designed for NGS datasets. While you are welcome to create your own template from scratch, we recommend using this one to save time.\nRun the cookiecutter cc-data-template command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created).\nExecute the script below using R (or create your own script in Python). Adjust the folder_path variable so that it matches the path to the Assays folder. The resulting table will be saved in the same folder_path.\nOpen your database_YYYYMMDD.tsv table in a text editor from the command-line, or view it in Excel for better visualization.\n\n\nSolution A. 
From a TSV\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n# R version 4.3.2\n# RScript to read all yaml files in directory and save the metadata into a dataframe\nquiet <- function(package_name) {\n # Suppress warnings and messages while checking and installing the package\n suppressMessages(suppressWarnings({\n # Check if the package is available and load it\n if (!requireNamespace(package_name, quietly = TRUE)) {\n install.packages(package_name)\n }\n # Load the package\n library(package_name, character.only = TRUE)\n }))\n}\n\n# Check and install necessary libraries\nquiet(\"yaml\")\nquiet(\"dplyr\")\nquiet(\"lubridate\")\n\n\nread_yaml <- function(file_path) {\n # Read the YAML file and convert it to a data frame\n df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE)\n \n # Return the data frame\n return(df)\n}\n\n# Function to recursively fetch metadata.yml files\nget_metadata <- function(folder_path) {\n file_list <- list.files(path = folder_path, pattern = \"metadata\\\\.yml$\", recursive = TRUE, full.names = TRUE)\n\n metadata_list <- lapply(file_list, read_yaml)\n \n # Combine the list of data frames into a single data frame using dplyr::bind_rows()\n combined_metadata <- bind_rows(metadata_list)\n\n return(combined_metadata)\n}\n\n# Specify the folder path\nfolder_path <- \"./\" #/path/to/your/folder\n\n# Fetch metadata from the specified folder\ndf <- get_metadata(folder_path)\n\n# Save the data frame as a TSV file\noutput_file <- paste0(\"database_\", format(Sys.Date(), \"%Y%m%d\"), \".tsv\")\nwrite.table(df, file = output_file, sep = \"\\t\", quote = FALSE, row.names = FALSE)\n\n# Print confirmation message\ncat(\"Database saved as\", output_file, \"\\n\")\n\n\n\n\n\nExercise 4, option B: create a SQLite database \nAlternatively, create a SQLite database from a metadata. If you opt for this option in the exercise, you must still complete the first three steps outlined above. Read more from the RSQLite documentation.\n\nSolution B. 
SQLite database\n\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nprint(\"Assuming the libraries from Exercise 4 are already loaded and a dataframe has been generated from the YAML files...\")\n\n# check_and_install() form Exercise 4, and load the other packages. \nquiet(\"DBI\")\nquiet(\"RSQLite\")\n\n# Initialize a temporary in memory database and copy the data.frame into it\n\ndb_file_path <- paste0(\"database_\", format(Sys.Date(), \"%Y%m%d\"), \".sqlite\")\ncon <- dbConnect(RSQLite::SQLite(), db_file_path)\n\ndbWriteTable(con, \"metadata\", df, overwrite=TRUE) #row.names = FALSE,append =\n\n# Print confirmation message\ncat(\"Database saved as\", db_file_path, \"\\n\")\n\n# Close the database connection\ndbDisconnect(con)\n\n\n\n\n\n\n\n\n\n\n\nShiny apps\nTo get the most out of your metadata file and the ones from other colleagues, you can combine them and explore them by creating an interactive catalog browser. You can create interactive web apps straight from R or Python. Whether you have generated a tabulated-file or a sqlite database, browse through the metadata using Shiny. Shiny apps are perfect for researchers because they enable you to create interactive visualizations and dashboards with dynamic data inputs and outputs without needing extensive web development knowledge. Shiny provides a variety of user interface components such as forms, tables, graphs, and maps to help you organize and present your data effectively. It also allows you to filter, sort, and segment data for deeper insights.\n\n\n\n\n\n\nTip\n\n\n\n\nFor R Enthusiasts\n\nExplore demos from the R Shiny community to kickstart your projects or for inspiration.\n\nFor python Enthusiasts\n\nShiny for Python provides live, interactive code throughout its entire tutorial. 
Additionally, it offers a great tool called Playground, where you can code and test your own app to explore how different features render.\n\n\n\n\n\n\n\n\nExercise 5: Skill Booster, build an interactive catalog browser\n\n\n\n\n\n\n\nBuild an interactive web app straight from R or Python. Below, you will find an example of an R shiny app. In either case, you will need to define a user interface (UI) and a server function. The UI specifies the layout and appearance of the app, including input controls and output displays. The server function contains the app’s logic, handling data manipulation, and responding to user interactions. Once you set up the UI and server, you can launch the app!\nHere’s the UI and server function structure for an R Shiny app:\n# Don't forget to load shiny and DT libraries!\n\n# Specify the layout\nui <- fluidPage(\n titlePanel(...)\n # Define the appearance of the app\n sidebarLayout(\n sidebarPanel(...)\n mainPanel(...)\n )\n)\n\nserver <- function(input, output, session) {\n # Define a reactive expression for data based on user inputs\n data <- reactive({\n req(input$dataInput) # Ensure data input is available\n # Load or manipulate data here\n })\n\n # Define an output table based on data\n output$dataTable <- renderTable({\n data() # Render the data as a table\n })\n\n # Observe a button click event and perform an action\n observeEvent(input$actionButton, {\n # Perform an action when the button is clicked\n })\n\n # Define cleanup tasks when the app stops\n onStop(function() {\n # Close connections or save state if necessary\n })\n}\n# Run the app\nshinyApp(ui, server)\nIf you need more assistance, take a look at the code below (Hint).\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\n# R version 4.3.2\nprint(\"Assuming the libraries from Exercise 4 are already loaded and a dataframe has been generated from the YAML files...\")\n\n# check_and_install() form Exercise 4. 
\nquiet(\"shiny\")\nquiet(\"DT\")\n\n# UI\nui <- fluidPage(\n titlePanel(\"TSV File Viewer\"),\n \n sidebarLayout(\n sidebarPanel(\n fileInput(\"file\", \"Choose a TSV file\", accept = c(\".tsv\")),\n selectInput(\"filter_column\", \"Filter by Column:\", choices = c(\"n_samples\", \"technology\"), selected = \"technology\"),\n textInput(\"filter_value\", \"Filter Value:\", value = \"\"),\n # if only numbers, numericInput()\n radioButtons(\"sort_order\", \"Sort Order:\", choices = c(\"Ascending\", \"Descending\"), selected = \"Ascending\")\n ),\n \n mainPanel(\n DTOutput(\"table\")\n )\n )\n)\n\n# Server\nserver <- function(input, output) {\n \n data <- reactive({\n req(input$file)\n df <- read.delim(input$file$datapath, sep = \"\\t\")\n print(str(df))\n\n # Filter the DataFrame based on user input\n if (input$filter_column != \"\" && input$filter_value != \"\") {\n # Check if the column is numeric, and filter for value\n if (is.numeric(df[[input$filter_column]])) {\n df <- df[df[[input$filter_column]] >= as.numeric(input$filter_value), ]\n }\n # Check if the column is a string\n else if (is.character(df[[input$filter_column]])) {\n df <- df[df[[input$filter_column]] == input$filter_value, ]\n }\n }\n \n # Sort the DataFrame based on user input\n sort_order <- if (input$sort_order == \"Ascending\") TRUE else FALSE\n df <- df[order(df[[input$filter_column]], decreasing = !sort_order), ]\n df\n })\n \n output$table <- renderDT({\n datatable(data())\n })\n}\n\n# Run the app\nshinyApp(ui, server)\n\n\n\n\n\nIn the optional exercise below, you’ll find a code example for using an SQLite database as input instead of a tabulated file.\n\n\n\n\n\n\n\n\n\n\n\nExercise (optional)\n\n\n\n\n\n\n\nOnce you’ve finished the previous exercise, consider implementing these additional ideas to maximize the utility of your catalog browser.\n\nUse SQLite databases as input\nAdd a functionality to only select certain columns uiOutput(\"column_select\")\nFilter columns by value using 
column_filter_select()\nAdd multiple tabs using tabsetPanel()\nAdd buttons to order numeric columns ascending or descending using radioButtons()\nUse SQL aggregation functions (e.g., SUM, COUNT, AVG) to perform custom data summaries and calculations.\nAdd a tab tabPanel() to create a project directory interactively (and fill up the metadata fields), tips: dir.create(), data.frame(), write.table()\nModify existing entries\nVisualize results using Cirrocumulus, an interactive visualization tool for large-scale single-cell genomics data.\n\nIf you need some assistance, take a look at the code below (Hint).\n\n\n\n\n\n\nHint\n\n\n\n\n\n\n\nExplore an example with advanced features such as a two-tab layout, filtering by numeric values and matching strings, and a color-customized dashboard here." + }, + { + "objectID": "develop/practical_workshop.html#version-control-using-git-and-github", + "href": "develop/practical_workshop.html#version-control-using-git-and-github", + "title": "Practical material", + "section": "5. Version control using Git and GitHub", + "text": "5. Version control using Git and GitHub\nVersion control involves systematically tracking changes to a project over time, offering a structured way to document revisions and understand the progression of your work. In research data management and data analytics, it plays a critical role and provides numerous benefits.\nGit is a distributed version control system that helps developers and researchers efficiently manage project history, collaborate seamlessly, and maintain data integrity. On the other hand, GitHub is a web-based platform that builds on Git’s functionality by providing a centralized, collaborative hub for hosting Git repositories. 
It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects.\n\n\n\n\n\n\nCreate a GitHub organization for your lab or department\n\n\n\nGitHub users can create organizations, allowing groups to collaborate or create repositories under the same organization umbrella. You can create an educational organization on Github for free, by setting up a Github account for your lab.\nFollow these instructions to create a GitHub organization.\nOnce you’ve established your GitHub organization, be sure to create your repositories within the organization’s space rather than under your personal user account. This keeps your projects centralized and accessible to the entire group. Best practices for managing an organization on GitHub include setting clear access permissions, regularly reviewing roles and memberships, and organizing repositories effectively to keep your projects structured and easy to navigate.\n\n\n\nSetting up a GitHub repository for your project folder\nVersion controlling your data analysis folders becomes straightforward once you’ve established your Cookiecutter templates. After you’ve created several folder structures and metadata using your Cookiecutter template, you can manage version control by either converting those folders into Git repositories or copying a folder into an existing Git repository. Both approaches are explained in Lesson 5.\n\n\n\n\n\n\nExercise 6: initialize a repository from an existing folder:\n\n\n\n\n\n\n\n\nInitialize the repository: Begin by running the command git init in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. 
See (git init) for more details.\nCreate a remote repository: Once the local repository is initialized, create an empty new repository on GitHub (website or Github Desktop).\nConnect the remote repository: Add the GitHub repository URL to your local repository using the command git remote add origin <URL>. This associates the remote repository with the name “origin.”\nCommit changes: If you have files you want to add to your repository, stage them using git add ., then create a commit to save a snapshot of your changes with git commit -m \"add local folder\".\nPush to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using git push -u origin main.\n\n\n\n\n\n\n\n\n\n\n\n\nTips to write good commit messages\n\n\n\nIf you would like to know more about Git commits and the best way to make clear Git messages, check out this post!\n\n\n\n\nGitHub Pages\nAfter creating your repository and hosting it on GitHub, you can now add your data analysis reports—such as Jupyter Notebooks, R Markdown files, or HTML reports—to a GitHub Page website. Setting up a GitHub Page is straightforward, and we recommend following GitHub’s helpful tutorial. However, we will go through the key steps in the exercise below. There are several ways to create your web pages, but we suggest using Quarto as a framework to build a sleek, professional-looking website with ease. The folder templates from the previous exercise already contain the necessary elements to launch a webpage. Familiarizing yourself with the basics of Quarto will help you design a webpage that suits your preferences. Other common options include MkDocs. 
If you want to use MkDocs instead, click here and follow the instructions.\n\n\n\n\n\n\nTip\n\n\n\nHere are some useful links to get started with Github Pages:\n\nGithub Pages\nQuarto Github Pages\n\n\n\n\n\n\n\n\n\nExercise 7: Create a Github Page using Quarto\n\n\n\n\n\n\n\n\nHead over to GitHub and create a new public repository named username.github.io, where username is your username (or organization name) on GitHub. If the first part of the repository doesn’t exactly match your username, it won’t work, so make sure to get it right.\nGo to the folder where you want to store your project, and clone the new repository: git clone https://github.com/username/username.github.io (or use Github Desktop)\nCreate a new file named _quarto.yml\n\n\n_quarto.yml\n\nproject:\n type: website\n\nOpen the terminal ```{.bash filename=“Terminal”} # Add a .nojekyll file to the root of the repository not to do additional processing of your published site touch .nojekyll #copy NUL .nojekyll for windows\n# Render and push it to Github quarto render git commit -m “Publish site to docs/” git push ```\nIf you do not have a gh-pages, you can create one as follows\n\n\nTerminal\n\ngit checkout --orphan gh-pages\ngit reset --hard # make sure all changes are committed before running this!\ngit commit --allow-empty -m \"Initialising gh-pages branch\"\ngit push origin gh-pages\n\nBefore attempting to publish you should ensure that the Source branch for your repository is gh-pages and that the site directory is set to the repository root (/)\n\nIt is important to not check your _site directory into version control, add the output directory of your project to .gitignore\n\n\n.gitignore\n\n/.quarto/\n/_site/\n\nNow is time to publish your website\n\n\n.Terminal\n\nquarto publish gh-pages\n\nOnce you’ve completed a local publish, add a publish.yml GitHub Action to your project by creating this YAML file and saving it to .github/workflows/publish.yml. 
Read how to do it here" + }, + { + "objectID": "develop/examples/mkdocs_pages.html", + "href": "develop/examples/mkdocs_pages.html", + "title": "Build your GitHub Page using Mkdocs", + "section": "", + "text": "Build your GitHub Page using Mkdocs\n\n\n\n\n\n\nExercise 5: make a project folder and publish a data analysis webpage\n\n\n\n\n\n\n\n\nConfigure your main GitHub Page and its repo\nThe first step is to set up the main GitHub Page site and the repository that will host it. This is very simple, as you will only need to follow these steps. In a Markdown document, outline the primary objectives of the organization and provide an overview of ongoing research projects. After you have created the organization/usernamegithub.io, it is time to configure your Project repository webpage using MkDocs!\nStart a new project from Cookiecutter or use one from the previous exercise.\nIf you use a Project repo from the first exercise, go to the next paragraph. Using Cookiecutter, create a new data analysis project. Remember to fill up your metadata and description files! After you have created the folder, it would be best to initialize a Git repo following the instructions from the previous section.\nNext, link your data of interest (or create a small fake dataset) and make an example of a data analysis notebook/report (this could be just a scatter plot of a random matrix of values). Depending on your setup, you might be using Jupyter Notebooks or R Markdown files. The extensions that we have installed using pip allow you to directly add a Jupyter Notebook file to the mkdocs.yml navigation section. On the other hand, if you are using R Markdown files, you will have to knit your document into either an HTML page or a GitHub document.\nFor the purposes of this exercise, we have already included a basic index.md markdown file that can serve as the intro page of your repo, and a jupyter_example.ipynb with some code in it. 
You are welcome to modify them further to test them out!\nUse MkDocs to create your webpage\nWhen you are happy with your files and are ready to publish them, make sure to add, commit, and push the changes to the remote. Then, build up your webpage using MkDocs and the mkdocs gh-deploy command from the same directory where the mkdocs.yml file is. For example, if your mkdocs.yml for your Project folder is in /Users/JARH/Projects/project1_JARH_20231010/mkdocs.yml, do cd /Users/JARH/Projects/project1_JARH_20231010/ and then mkdocs gh-deploy. This requires a couple of changes in your GitHub organization settings.\nRemember to make sure that your markdowns, images, reports, etc., are included in the docs folder and properly set up in the navigation section of your mkdocs.yml file.\nFinally, we only need to set up the GitHub Project repo settings.\nPublishing your GitHub Page\nGo to your GitHub repo settings and configure the Page section. Since you are using the mkdocs gh-deploy command to publish your site in the gh-pages branch (as explained the the mkdocs documentation), we need to change where GitHub is fetching the website. You will need to configure the settings of this repository in GitHub so that the Page is taken from the gh-pages branch and the root folder.\n\n\n\nGitHub Pages setup\n\n\n\nBranch should be gh-pages\nFolder should be root\n\nAfter a couple of minutes, your webpage should be ready! 
You should be able to see your webpage through the link provided in the Page section!\n\nNow it is also possible to include this repository webpage in your main webpage <organization>.github.io by including the link of the repo website (https://<organization>.github.io/repo-name) in the navigation section of the mkdocs.yml file in the main organizationgithub.io repo.\n\n\n\n\n\n\n\n\n\nCopyrightCC-BY-SA 4.0 license", + "crumbs": [ + "Use cases", + "General", + "Build your GitHub Page using Mkdocs" + ] } ] \ No newline at end of file diff --git a/develop/03_DOD.qmd b/develop/03_DOD.qmd index bba631eb..93e5022e 100644 --- a/develop/03_DOD.qmd +++ b/develop/03_DOD.qmd @@ -234,6 +234,8 @@ Next, let's take a look at a possible folder structure and what kind of files yo Setting up folder structures manually for each new project can be time-consuming. Thankfully, tools like [Cookiecutter](https://github.com/cookiecutter/cookiecutter) offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using [cruft](https://github.com/cruft/cruft) alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version). :::{.callout-note title="Cookiecutter templates"} +- [Sandbox Project/Data analysis template](https://github.com/hds-sandbox/cookiecutter-template) +- [Sandbox Data/Assay template](https://github.com/hds-sandbox/cc-data-template) - Cookiecutter template for [Data science projects](https://github.com/drivendata/cookiecutter-data-science) - Brickmanlab template for [NGS data](https://github.com/brickmanlab/ngs-template): similar to the folder structures in the examples above. You can download and modify it to suit your needs. 
::: @@ -241,8 +243,6 @@ Setting up folder structures manually for each new project can be time-consuming ### Quick tutorial on cookiecutter :::{.callout-caution title="Sandbox Tutorial"} **Learn how to create your own template [here](./practical_workshop.qmd).** - -We offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the [Sandbox website](https://hds-sandbox.github.io/news/news.html). ::: diff --git a/develop/04_metadata.qmd b/develop/04_metadata.qmd index da01d264..f3755fd7 100644 --- a/develop/04_metadata.qmd +++ b/develop/04_metadata.qmd @@ -153,14 +153,26 @@ Researchers encountering inconsistent and non-standardized terms (e.g., gene nam :::{.callout-note} # Examples of ontology services -- [Uberon anatomy ontology](https://www.ebi.ac.uk/ols4/ontologies/uberon) -- [Gene ontology](https://geneontology.org/docs/tools-overview/) -- [Ensembl gene IDs](https://www.ebi.ac.uk/training/online/courses/ensembl-browsing-genomes/navigating-ensembl/investigating-a-gene/#:~:text=Ensembl%20gene%20IDs%20begin%20with,of%20species%20other%20than%20human) -- [Medical Subject Headings (MeSH)](https://www.ncbi.nlm.nih.gov/mesh) -- [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) -- [Microarray Gene Expression Society Ontology (MGED)](https://mged.sourceforge.net/ontologies/index.php) -- [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) -- [Mondo disease database](https://mondo.monarchinitiative.org/) +- Biological ontologies for data scientists - [Bionty](https://lamin.ai/docs/bionty) +- Anatomy - [Uberon](https://www.ebi.ac.uk/ols4/ontologies/uberon) +- Tissue - [Uberon](http://obophenotype.github.io/uberon) +- Chemical compounds[Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) +- ExperimentalFactor - [Experimental Factor Ontology](https://www.ebi.ac.uk/ols/ontologies/efo) +- Species - [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy), [Ensembl 
Species](https://useast.ensembl.org/info/about/species.html) +- Disease - [Mondo](https://mondo.monarchinitiative.org/), [Human Disease](https://disease-ontology.org/) +- Gene - [Ensembl](https://ensembl.org/), [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene), [Gene ontology](https://geneontology.org/docs/tools-overview/),[Microarray Gene Expression Society Ontology (MGED)](https://mged.sourceforge.net/ontologies/index.php) +- Protein - [Uniprot](https://www.uniprot.org/) +- CellLine - [Cell Line Ontology](https://github.com/CLO-ontology/CLO) +- CellType - [Cell Ontology](https://obophenotype.github.io/cell-ontology) +- CellMarker - [CellMarker](http://xteam.xbio.top/CellMarker) +- Phenotype - [Human Phenotype](https://hpo.jax.org/app), [Phecodes](https://phewascatalog.org/phecodes_icd10), [PATO](https://github.com/pato-ontology/pato), [Mammalian Phenotype](http://obofoundry.org/ontology/mp.html), [Zebrafish Phenotype](http://obofoundry.org/ontology/zp.html) +- Pathway - [Gene Ontology](https://bioportal.bioontology.org/ontologies/GO), [Pathway Ontology](https://bioportal.bioontology.org/ontologies/PW) +- DevelopmentalStage - [Human Developmental Stages](https://github.com/obophenotype/developmental-stage-ontologies/wiki/HsapDv), [Mouse Developmental Stages](https://github.com/obophenotype/developmental-stage-ontologies/wiki/MmusDv) +- Drug - [Drug Ontology](https://bioportal.bioontology.org/ontologies/DRON) +- Ethnicity - [Human Ancestry Ontology](https://github.com/EBISPOT/hancestro) +- BFXPipeline - largely based on [nf-core](https://nf-co.re/) +- BioSample - [NCBI BioSample attributes](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes) +- Articles Indexing [Medical Subject Headings (MeSH)](https://www.ncbi.nlm.nih.gov/mesh) ::: :::{.callout-definition} @@ -191,43 +203,19 @@ Requirements: Click on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration. 
-:::{.callout-hint} -```{.r} -quiet <- function(x) { suppressMessages(suppressWarnings(x)) } -quiet(library(yaml)) -quiet(library(dplyr)) -quiet(library(lubridate)) - -# Function to recursively fetch metadata.yml files -get_metadata <- function(folder_path) { - file_list <- list.files(path = folder_path, - pattern = "metadata\\.yml$", - recursive = TRUE, full.names = TRUE) - metadata_list <- lapply(file_list, yaml::yaml.load_file) - return(metadata_list) -} +You can find a thorough guided exercise in the practical material - [Exercise 4](https://hds-sandbox.github.io/RDM_NGS_course/develop/practical_workshop.html#step-4-review-the-generated-project). -# Specify the folder path -folder_path <- "/path/to/your/folder" +:::{.callout-hint} +```{.r .code-overflow-wrap} +# Load required packages +packages <- c("yaml", "ggplot2", "lubridate") -# Fetch metadata from the specified folder -metadata <- get_metadata(folder_path) +# Function to recursively fetch YAML files files, read and convert them to a data frame -# Convert metadata to a data frame -metadata_df <- data.frame(matrix(unlist(metadata), -ncol = length(metadata), byrow = TRUE)) -colnames(metadata_df) <- names(metadata[[1]]) +df = lapply(file_list, yaml::yaml.load_file) # Save the data frame as a TSV file -output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv") -write.table(metadata_df, - file = output_file, - sep = "\t", - quote = FALSE, - row.names = FALSE) - -# Print confirmation message -print("Database saved as", output_file, "\n") + ``` ::: ::: @@ -251,36 +239,31 @@ An alternative to the tabular format is SQLite, a lightweight and self-contained :::{.callout-exercise} # Exercise 3: Generate a SQLite database from metadata -Click on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration. +Click on the hint to reveal the necessary libraries and some functions, which may serve as inspiration. 
+ +You can find a thorough guided exercise, complete with code example, in the practical material - [Exercise 4, option B](https://hds-sandbox.github.io/RDM_NGS_course/develop/practical_workshop.html#step-4-review-the-generated-project). :::{.callout-hint} ```{.r .code-overflow-wrap} -quiet <- function(x) { suppressMessages(suppressWarnings(x)) } -quiet(library(yaml)) -quiet(library(dplyr)) -quiet(library(lubridate)) -quiet(library(DBI)) - -# Generate the metadata_df using the script from the example above (recursively fetching metadata.yml files) +# Load required packages +packages <- c("yaml", "ggplot2", "lubridate", "DBI") -# Create an SQLite database and insert data -db_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".sqlite") -con <- dbConnect(SQLite(), db_file) +# Function to recursively fetch YAML files files, read and convert them to a data frame -dbWriteTable(con, "metadata", metadata_df, row.names = FALSE) +df = lapply(file_list, yaml::yaml.load_file) -# Print confirmation message -cat("Database saved as", db_file, "\n") - -# Close the database connection -dbDisconnect(con) +# Create an SQLite database from a dataframe and insert data +dbConnect(SQLite(), "filenameXXX.sqlite") +dbWriteTable() ``` ::: ::: ### Catalog browser -You can design a user-friendly catalog browser for your database using tools like [Rshiny](https://www.rstudio.com/products/shiny/) or [Panel](https://panel.holoviz.org/). These frameworks provide interfaces for dynamic search, filtering, and visualization, facilitating efficient exploration of database contents. Creating such a tool with Rshiny from both a TSV file and a SQLite database will be demonstrated below. +To further optimize the use of your metadata and improve the integration of all your lab metadata, you can design a user-friendly catalog browser for your database using tools like [Rshiny](https://www.rstudio.com/products/shiny/) or [Panel](https://panel.holoviz.org/). 
These frameworks provide interfaces for dynamic search, filtering, and visualization, facilitating efficient exploration of database contents. + +Creating such a tool with RShiny is straightforward and does not require extensive development knowledge, whether using a TSV file or a SQLite database. In the [practical materials](https://hds-sandbox.github.io/RDM_NGS_course/develop/practical_workshop.html#create-a-catalog-of-your-data-folder), we demonstrate both scenarios and showcase various functionalities for inspiration. SQLite files are particularly advantageous for data fetching and other operations due to their efficient querying and indexing capabilities. Here's an example of an SQLite database catalog created by the [Brickman Lab](https://renew.ku.dk/research/reseach-groups/brickman-group/) at the Center for Stem Cell Medicine. It's simple yet effective! Clicking on a data row opens the metadata.yml file, allowing access to detailed metadata for that assay. @@ -288,114 +271,43 @@ Here's an example of an SQLite database catalog created by the [Brickman Lab](ht :::{.callout-exercise} # Exercise 4: Create your first catalog browser using Rshiny -Click on the hint to reveal the solution and a code example for the exercise, which may serve as inspiration. - -- Solution A. From a TSV +Go to the [practical material](https://hds-sandbox.github.io/RDM_NGS_course/develop/practical_workshop.html#create-a-catalog-of-your-data-folder) for complete exercise instructions and solutions. The code provided can serve as inspiration for you to adapt as needed. 
:::{.callout-hint .code-overflow-wrap} -R script -```{.r} - -quiet <- function(x) { suppressMessages(suppressWarnings(x)) } -quiet(library(shiny)) -quiet(library(DT)) - -# UI -ui <- fluidPage( - titlePanel("TSV File Viewer"), - - sidebarLayout( - sidebarPanel( - fileInput("file", "Choose a TSV file", accept = c(".tsv")) - ), - - mainPanel( - DTOutput("table") - ) - ) -) - -# Server -server <- function(input, output) { - - data <- reactive({ - req(input$file) - read.delim(input$file$datapath, sep = "\t") - }) - - output$table <- renderDT({ - datatable(data()) - }) -} +These are some of the libraries required: +`install.packages(c("shiny", "DT", "DBI"))` -# Run the app -shinyApp(ui, server) -``` -::: +You need to define both a user interface (UI) and a server function. The UI (`fluidPage()`) outlines the app's layout using, for example, the `sidebarLayout()` and `mainPanel()` functions for input controls and output displays. -- Solution B. From an SQLite database +The server function manages data manipulation and user interactions. Use `shinyApp()` to launch the app once the UI and server are set up. 
-:::{.callout-hint .code-overflow-wrap} -R script -```{.r} -quiet <- function(x) { suppressMessages(suppressWarnings(x)) } -quiet(library(shiny)) -quiet(library(DT)) -quiet(library(DBI)) - -# UI -ui <- fluidPage( - titlePanel("SQLite Database Viewer"), - - sidebarLayout( - sidebarPanel( - fileInput("db_file", "Choose an SQLite Database", accept = c(".sqlite")), - textInput("table_name", "Enter Table Name:", value = ""), - actionButton("load_button", "Load Table") - ), - - mainPanel( - DTOutput("table") - ) - ) -) - -# Server -server <- function(input, output, session) { - - con <- reactive({ - if (!is.null(input$db_file)) { - dbConnect(SQLite(), input$db_file$datapath) - } - }) - - data <- reactive({ - req(input$load_button > 0, input$table_name, con()) - query <- glue::glue_sql("SELECT * FROM {dbQuoteIdentifier(con(), input$table_name)}") - dbGetQuery(con(), query) - }) - - output$table <- renderDT({ - datatable(data()) - }) - - observeEvent(input$load_button, { - output$table <- renderDT({ - datatable(data()) +Here is a simple example of a server function setup including the main parts (additional components provide advanced functionalities): + +```{.r .code-overflow-wrap} + server <- function(input, output, session) { + # Define a reactive expression for data based on user inputs + data <- reactive({ + req(input$dataInput) # Ensure data input is available + # Load or manipulate data here + }) + + # Define an output table based on data + output$dataTable <- renderTable({ + data() # Render the data as a table + }) + + # Observe a button click event and perform an action + observeEvent(input$actionButton, { + # Perform an action when the button is clicked }) - }) - - # Disconnect from the database when app closes - observe({ - on.exit(dbDisconnect(con()), add = TRUE) - }) -} -# Run the app -shinyApp(ui, server) + # Define cleanup tasks when the app stops + onStop(function() { + # Close connections or save state if necessary + }) +} ``` ::: - ::: @@ -403,10 +315,24 
@@ shinyApp(ui, server) # Exercise 5: Add complex features to your catalog browser Once you've finished the previous exercise, consider implementing these additional ideas to maximize the utility of your catalog browser. -- Add a tab to create a project directory interactively (and fill up the metadata fields) +- Add functionality to select only certain columns using `uiOutput("column_select")` +- Add buttons to order numeric columns ascending or descending using `radioButtons()` +- Use SQL aggregation functions (e.g., SUM, COUNT, AVG) to perform custom data summaries and calculations. +- Add a tab with `tabPanel()` to create a project directory interactively (and fill in the metadata fields); tips: `dir.create()`, `data.frame()`, `write.table()` - Modify existing entries -- Visualize results using [Cirrocumulus](https://cirrocumulus.readthedocs.io/en/latest/) +- Visualize results using [Cirrocumulus](https://cirrocumulus.readthedocs.io/en/latest/), an interactive visualization tool for large-scale single-cell genomics data. + +:::{.callout-hint} +Explore this example with advanced features such as a two-tab layout, filtering by numeric values and matching strings, and a color-customized dashboard [here](./scripts/shiny_sqlite_advanced.r){ target="_blank"}. +::: + +::: +:::{.callout-tip} +- For R Enthusiasts +Explore [demos](https://shiny.posit.co/r/gallery/#feature-demos) from the R Shiny community to kickstart your projects or for inspiration. +- For Python Enthusiasts +If you want to dive deeper into Shiny apps and their various uses (such as dynamic plots or other interactive widgets), Shiny for Python provides live, interactive code throughout its entire tutorial. Additionally, it offers a great tool called [Playground](https://shinylive.io/py/examples/#basic-app), where you can code and test your own app to explore how different features render. 
::: ## Wrap up @@ -424,7 +350,6 @@ Other sources: - [Johns Hopkins Sheridan libraries, RDM](https://guides.library.jhu.edu/documenting_data/medical_research#s-lg-box-wrapper-31197839). They provide a list of medical metadata standards resources. - KU Leuven Guidance: - [Transcriptomics metadata standards and fields](https://faircookbook.elixir-europe.org/content/recipes/interoperability/transcriptomics-metadata.html#analysis-metadata) -- Biological ontologies for data scientists,[Bionty](https://lamin.ai/docs/bionty) - [NIH standardizing data collection](https://www.nlm.nih.gov/oet/ed/cde/tutorial/index.html) - [Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model](https://www.ohdsi.org/data-standardization/) @@ -432,4 +357,5 @@ Other sources: ### Tools and software - [Rightfield](https://rightfield.org.uk/): open source tool facilitates the integration of ontology terms into Excel spreadsheet. - [Owlready2](https://pypi.org/project/owlready2/): Python package, enables the loading of ontologies as Python objects. This versatile tool allows users to manipulate and store ontology classes, instances, and properties as needed. +- [Shiny Apps](https://shiny.posit.co/): easy interactive web apps for data science diff --git a/develop/05_VC.qmd b/develop/05_VC.qmd index 8ce1652d..54f6d701 100644 --- a/develop/05_VC.qmd +++ b/develop/05_VC.qmd @@ -80,7 +80,7 @@ We will discuss repositories for archiving experimental or large datasets in [le Moving from Git to GitHub involves transitioning from a local version control setup to a remote hosting platform. You will need a GitHub account for the exercise in this section. :::{.callout-tip title="Create a GitHub account"} -- If you don't have a GitHub account yet, click [here](https://github.com/signup). 
+- If you don't have a GitHub account yet, click [here](https://github.com/signup) - Install Git from [Git webpage](https://git-scm.com/downloads) ::: @@ -95,14 +95,14 @@ If you completed all the exercises in [lesson 3](./03_DOD.qmd), you should have :::{.callout-exercise} # Exercise 1: initialize a repository from an existing folder: -1. First, initialize the repository using the command `git init`. This command is run only once, even in collaborative projects ([`git init`](https://git-scm.com/docs/git-init)). -2. Once the repository is initialized, create a remote repository on GitHub. -3. Add the remote URL to your local git repository using git remote add origin \`. This associates the remote URL with the name "origin". -4. Ensure you have at least one commit in your history by staging existing files with `git add` and then creating a snapshot, known as committing, with `git commit`. -5. Finally, push your local commits to the remote repository and establish a tracking relationship using `git push -u origin master`. +1. Initialize the repository: Begin by running the command `git init` in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. See [`git init`](https://git-scm.com/docs/git-init) for more details. +2. Create a remote repository: Once the local repository is initialized, create a new, empty repository on GitHub. +3. Connect the remote repository: Add the GitHub repository URL to your local repository using the command `git remote add origin `. This associates the remote repository with the name "origin." +4. Commit changes: If you have files you want to add to your repository, stage them using `git add .`, then create a commit to save a snapshot of your changes with `git commit -m "add local folder"`. +5. 
Push to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using `git push -u origin main`. ::: -##### Set Up a Git Repository and copy your project folder +##### Setting Up a Git Repository and copying an existing folder Alternatively to converting folders to repositories, you can create a new repository remotely, and then clone (`git clone`) it locally. Here, `git init` is not needed. You can move the files into the repository locally (`git add`, `git commit`, and `git push`). If you are creating a collaborative repository, you can now share it with your colleagues. @@ -113,9 +113,9 @@ Write useful and clear Git commits. Check out [this post](https://www.convention ### Github pages -After setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, Rmarkdowns, or HTML reports, you can showcase them on a [GitHub Page](https://pages.github.com/). +After setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, R Markdown files, or HTML reports, you can showcase them on a [GitHub Page](https://pages.github.com/). -Once you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a [GitHub Page website](https://pages.github.com/). Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you. +Once you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, R Markdown files, or HTML reports, in a [GitHub Page website](https://pages.github.com/). 
Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you. For simplicity, we recommend using [Quarto](https://quarto.org/) or [MkDocs](https://www.mkdocs.org/). Visit their websites and follow the instructions to get started. @@ -134,3 +134,4 @@ In this lesson, we explored version control and utilized Git and GitHub to estab ### Sources - [Version Control and Code Repository Link](https://guides.library.jhu.edu/c.php?g=1096705&p=8066729). +- [Git cheat sheet](https://education.github.com/git-cheat-sheet-education.pdf). \ No newline at end of file diff --git a/develop/06_pipelines.qmd b/develop/06_pipelines.qmd index 8367b375..13d3b30c 100644 --- a/develop/06_pipelines.qmd +++ b/develop/06_pipelines.qmd @@ -60,3 +60,4 @@ This lesson emphasized the importance of reproducibility in computational resear - [Guide to reproducible code in ecology and evolution](https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf) - [Best practices for Scientific computing](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745) - [Elixir Software Best Practices](https://elixir-europe.org/platforms/tools/software-best-practices) +- [FAIR Cookbook workflows](https://faircookbook.elixir-europe.org/content/recipes/applied-examples/fair-workflows.html) diff --git a/develop/assets/other_metadata.txt b/develop/assets/other_metadata.txt deleted file mode 100644 index 6e697261..00000000 --- a/develop/assets/other_metadata.txt +++ /dev/null @@ -1,34 +0,0 @@ -https://lamin.ai/docs/bionty - -Gene - [Ensembl](https://ensembl.org/), [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene) - -Protein - [Uniprot](https://www.uniprot.org/) - -Species - [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy), [Ensembl Species](https://useast.ensembl.org/info/about/species.html) - -CellLine - [Cell Line Ontology](https://github.com/CLO-ontology/CLO) - -CellType - [Cell 
Ontology](https://obophenotype.github.io/cell-ontology) - -CellMarker - [CellMarker](http://xteam.xbio.top/CellMarker) - -Tissue - [Uberon](http://obophenotype.github.io/uberon) - -Disease - [Mondo](https://mondo.monarchinitiative.org/), [Human Disease](https://disease-ontology.org/) - -Phenotype - [Human Phenotype](https://hpo.jax.org/app), [Phecodes](https://phewascatalog.org/phecodes_icd10), [PATO](https://github.com/pato-ontology/pato), -[Mammalian Phenotype](http://obofoundry.org/ontology/mp.html), [Zebrafish Phenotype](http://obofoundry.org/ontology/zp.html) - -Pathway - [Gene Ontology](https://bioportal.bioontology.org/ontologies/GO), [Pathway Ontology](https://bioportal.bioontology.org/ontologies/PW) - -ExperimentalFactor - [Experimental Factor Ontology](https://www.ebi.ac.uk/ols/ontologies/efo) - -DevelopmentalStage - [Human Developmental Stages](https://github.com/obophenotype/developmental-stage-ontologies/wiki/HsapDv), [Mouse Developmental Stages](https://github.com/obophenotype/developmental-stage-ontologies/wiki/MmusDv) - -Drug - [Drug Ontology](https://bioportal.bioontology.org/ontologies/DRON) - -Ethnicity - [Human Ancestry Ontology](https://github.com/EBISPOT/hancestro) - -BFXPipeline - largely based on [nf-core](https://nf-co.re/) - -BioSample - [NCBI BioSample attributes](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes) \ No newline at end of file diff --git a/develop/examples/mkdocs_pages.qmd b/develop/examples/mkdocs_pages.qmd new file mode 100644 index 00000000..42e80a52 --- /dev/null +++ b/develop/examples/mkdocs_pages.qmd @@ -0,0 +1,45 @@ +--- +format: html +summary: Build git pages using mkdocs +--- +# Build your GitHub Page using Mkdocs + +:::{.callout-exercise} + +# Exercise 5: make a project folder and publish a data analysis webpage + +1. Configure your main GitHub Page and its repo + + The first step is to set up the main GitHub Page site and the repository that will host it. 
This is very simple, as you will only need to follow [these steps](https://pages.github.com/). In a Markdown document, outline the primary objectives of the organization and provide an overview of ongoing research projects. + After you have created the *organization/username*github.io, it is time to configure your `Project` repository webpage using MkDocs! + +2. Start a new project from Cookiecutter or use one from the previous exercise. + + If you use a `Project` repo from the first exercise, go to the next paragraph. Using Cookiecutter, create a new data analysis project. Remember to fill up your metadata and description files! After you have created the folder, it would be best to initialize a Git repo following the instructions from the [previous section](#creating-a-git-repo-online-and-copying-your-project-folder). + + Next, link your data of interest (or create a small fake dataset) and make an example of a data analysis notebook/report (this could be just a scatter plot of a random matrix of values). Depending on your setup, you might be using Jupyter Notebooks or R Markdown files. The extensions that we have installed using `pip` allow you to directly add a Jupyter Notebook file to the `mkdocs.yml` navigation section. On the other hand, if you are using R Markdown files, you will have to knit your document into either an HTML page or a GitHub document. + + For the purposes of this exercise, we have already included a basic `index.md` markdown file that can serve as the intro page of your repo, and a `jupyter_example.ipynb` with some code in it. You are welcome to modify them further to test them out! + +3. Use MkDocs to create your webpage + + When you are happy with your files and are ready to publish them, make sure to add, commit, and push the changes to the remote. Then, build up your webpage using MkDocs and the [`mkdocs gh-deploy`](https://www.mkdocs.org/user-guide/deploying-your-docs/) command from the same directory where the `mkdocs.yml` file is. 
For example, if your `mkdocs.yml` for your `Project` folder is in `/Users/JARH/Projects/project1_JARH_20231010/mkdocs.yml`, do `cd /Users/JARH/Projects/project1_JARH_20231010/` and then `mkdocs gh-deploy`. + This requires a couple of changes in your GitHub organization settings. + + Remember to make sure that your markdowns, images, reports, etc., are included in the `docs` folder and properly set up in the navigation section of your `mkdocs.yml` file. + + Finally, we only need to set up the GitHub `Project` repo settings. + +4. Publishing your GitHub Page + + Go to your GitHub repo settings and configure the Page section. Since you are using the `mkdocs gh-deploy` command to publish your site in the `gh-pages` branch (as explained in the MkDocs documentation), we need to change where GitHub fetches the website from. You will need to configure the settings of this repository in GitHub so that the Page is taken from the `gh-pages` branch and the `root` folder. + + ![GitHub Pages setup](../images/git_pages.png) + + - Branch should be `gh-pages` + - Folder should be `root` + + After a couple of minutes, your webpage should be ready! You should be able to see your webpage through the link provided in the Page section! + +Now it is also possible to include this repository webpage in your main webpage ``.*github.io by including the link of the repo website (https://``.*github.io/*repo-name*) in the navigation section of the `mkdocs.yml` file in the main *organization*github.io repo. 
+::: \ No newline at end of file diff --git a/develop/images/github_pages_quarto.png b/develop/images/github_pages_quarto.png new file mode 100644 index 00000000..4c326730 Binary files /dev/null and b/develop/images/github_pages_quarto.png differ diff --git a/develop/practical_workshop.qmd b/develop/practical_workshop.qmd index 130153d5..db59e006 100644 --- a/develop/practical_workshop.qmd +++ b/develop/practical_workshop.qmd @@ -1,6 +1,8 @@ --- title: Practical material -format: html +format: + html: + code-copy: true date-modified: last-modified date-format: long date: 2023-11-30 @@ -226,9 +228,8 @@ Beyond substituting placeholders in file and directory names, Cookiecutter can a First, modify the `my_template/main.py` file to include a placeholder inside its contents: -```{.python .code-overflow-wrap} +```{.python .code-overflow-wrap filename="main.py"} # main.py - def hello(): print("Hello, {{cookiecutter.project_name}}!") ``` @@ -238,10 +239,9 @@ The '{{cookiecutter.project_name}}' placeholder is now included within the main. After running Cookiecutter, your generated 'main.py' file could appear as follows: ```{.python .code-overflow-wrap} -# main.py - +# main.py, assuming "MyProject" was entered as the project_name def hello(): - print("Hello, MyProject!") # Assuming "MyProject" was entered as the project_name + print("Hello, MyProject!") ``` ##### Step 3: Use Cookiecutter @@ -446,7 +446,7 @@ It is time now to customize your Cookiecutter templates and modify the metadata. 0. Consider changing variables (add/remove) in the metadata.yml file from the cookicutter template. 1. Modify the `cookiecutter.json` file. You could add new variables or change the default key and/or values: - ```{.json .code-overflow-wrap} + ```{.json .code-overflow-wrap filename="cookiecutter.json"} { "project_name": "myProject", "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}", @@ -460,7 +460,7 @@ The metadata file will be filled accordingly. 
2. Optional: You can customize or remove this prompt message entirely, allowing you to tailor the text to your preferences for a unique experience each time you use the template. - ```{.json .code-overflow-wrap} + ```{.json .code-overflow-wrap filename="cookiecutter.json"} "__prompts__": { "project_name": "Project directory name [Example: project_short_description_202X]", "author": "Author of the project", @@ -471,7 +471,7 @@ The metadata file will be filled accordingly. 3. Modify the `metadata.yml` file so that it includes the metadata recorded by the `cookiecutter.json` file. Hint below: - ```{.json .code-overflow-wrap} + ```{.yml .code-overflow-wrap filename="metadata.yml"} project: {{ cookiecutter.project_name }} author: {{ cookiecutter.author }} date: {{ cookiecutter.date }} @@ -479,7 +479,7 @@ The metadata file will be filled accordingly. ``` 4. Modify the `README.md` file so that it includes the short description recorded by the `cookiecutter.json` file and the metadata at the top of the markdown file (top between lines of dashed). - ```{.md .code-overflow-wrap} + ```{.md .code-overflow-wrap filename="README.md"} --- title: {{ cookiecutter.project_name }} date: "{{ cookiecutter.date }}" @@ -524,27 +524,43 @@ Avoid long and complicated names and ensure your file names are both informative ## 4. Create a catalog of your data folder -The next step is to collect all the NGS datasets that you have created in the manner explained above. Since your folders all should contain the `metadata.yml` file in the same place with the same metadata, it should be very easy to iteratively go through all the folders and merge all the metadata.yml files into a one single table. This table can be then browsed easily with Microsoft Excel, for example. If you are interested in making a Shiny app or Python Panel tool to interactively browse the catalog, check out this [lesson](./04_metadata.qmd). 
+The next step is to collect all the datasets that you have created in the manner explained above. Since all your folders should contain the `metadata.yml` file in the same place with the same metadata, it should be very easy to iteratively go through all the folders and merge all the metadata.yml files into a single table. The table can be easily viewed in your terminal or even with Microsoft Excel. :::{.callout-exercise} # Exercise 4: create a metadata.tsv catalog -We will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your `Assays` folder), fetches all the `metadata.yml` files, and merges them. Finally, it will write a TSV file as an output. +We will make a small script in R (or you can make one with Python) that recursively goes through all the folders inside an input path (like your `Assays` folder), fetches all the `metadata.yml` files, merges them, and writes a TSV file as output. 1. Create a folder called `dataset` and change directory `cd dataset` -2. Fork [this repository](https://github.com/hds-sandbox/cc-data-template): a Cookiecutter template designed for NGS datasets. -While you are welcome to create your own template from scratch, we recommend using this one to save time. +2. Fork [this repository](https://github.com/hds-sandbox/cc-data-template): a Cookiecutter template designed for NGS datasets. *While you are welcome to create your own template from scratch, we recommend using this one to save time.* 3. Run the `cookiecutter cc-data-template` command at least twice to create multiple datasets or projects. Use different values each time to simulate various scenarios (do this in the dataset directory that you have previously created). -Execute the script below using R (or create your own script in Python). Adjust the `folder_path` variable so that it matches the path to the Assays folder. The resulting table will be saved in the same `folder_path`. -4. 
Open your `database_YYYYMMDD.tsv` table in a text editor from the command-line, or view it in Excel for better visualization. +4. Execute the script below using R (or create your own script in Python). **Adjust the `folder_path`** variable so that it matches the path to the Assays folder. The resulting table will be saved in the same `folder_path`. +5. Open your `database_YYYYMMDD.tsv` table in a text editor from the command-line, or view it in Excel for better visualization. + +- Solution A. From a TSV +:::{.callout-hint} ```{.r .code-overflow-wrap} +# R version 4.3.2 +# RScript to read all yaml files in directory and save the metadata into a dataframe +quiet <- function(package_name) { + # Suppress warnings and messages while checking and installing the package + suppressMessages(suppressWarnings({ + # Check if the package is available and load it + if (!requireNamespace(package_name, quietly = TRUE)) { + install.packages(package_name) + } + # Load the package + library(package_name, character.only = TRUE) + })) +} + +# Check and install necessary libraries +quiet("yaml") +quiet("dplyr") +quiet("lubridate") -library(yaml) -library(dplyr) -library(lubridate) -# Function to read a YAML file and transform it into a dataframe format. 
read_yaml <- function(file_path) { # Read the YAML file and convert it to a data frame df <- yaml::yaml.load_file(file_path) %>% as.data.frame(stringsAsFactors = FALSE) @@ -566,88 +582,292 @@ get_metadata <- function(folder_path) { } # Specify the folder path -folder_path <- "/path/to/your/folder" +folder_path <- "./" # or /path/to/your/folder # Fetch metadata from the specified folder -metadata <- get_metadata(folder_path) +df <- get_metadata(folder_path) # Save the data frame as a TSV file output_file <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".tsv") -write.table(metadata, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE) +write.table(df, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE) # Print confirmation message cat("Database saved as", output_file, "\n") ``` ::: -## 5. Version control of your data analysis using Git and GitHub -Version control is a systematic approach to tracking changes made to a project over time. It provides a structured means of documenting alterations, allowing you to revisit and understand the evolution of your work. In research data management and data analytics, version control is very important and gives you a lot of advantages. +**Exercise 4, option B: create a SQLite database** -[Git](https://git-scm.com/about) is a distributed version control system that enables developers and researchers to efficiently manage their project's history, collaborate seamlessly, and ensure data integrity. At its core, Git operates through the following principles and mechanisms: -On the other hand, [GitHub](https://github.com/) is a web-based platform that enhances Git's capabilities by providing a collaborative and centralized hub for hosting Git repositories. It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects. +Alternatively, create a SQLite database from the metadata. 
If you choose this option, you must still complete the first three steps outlined above. Read more from the [RSQLite documentation](https://www.rdocumentation.org/packages/RSQLite/versions/2.3.6). -:::{.callout-tip title="Create a GitHub organization for your lab or department"} -GitHub allows users to create organizations and teams that will collaborate or create repositories under the same umbrella organization. If you would like to create an educational organization in GitHub, you can do so for free! For example, you could create a GitHub account for your lab. +- Solution B. SQLite database -To create a GitHub organization, follow these [instructions](https://docs.github.com/en/organizations/collaborating-with-groups-in-organizations/creating-a-new-organization-from-scratch) +:::{.callout-hint} +```{.r .code-overflow-wrap} +print("Assuming the libraries from Exercise 4 are already loaded and a dataframe has been generated from the YAML files...") + +# quiet() from Exercise 4 installs (if needed) and loads the other packages. +quiet("DBI") +quiet("RSQLite") + +# Create (or open) an on-disk SQLite database file and copy the data.frame into it + +db_file_path <- paste0("database_", format(Sys.Date(), "%Y%m%d"), ".sqlite") +con <- dbConnect(RSQLite::SQLite(), db_file_path) + +dbWriteTable(con, "metadata", df, overwrite=TRUE) #row.names = FALSE,append = -After you have created the GitHub organization, make sure that you create your repositories under the organization space and not your user! +# Print confirmation message +cat("Database saved as", db_file_path, "\n") + +# Close the database connection +dbDisconnect(con) + +``` +::: ::: -### Creating a git repo online and copying your project folder +### Shiny apps -Version controlling your data analysis folders, a.k.a. `Project` folder, is very easy once you have set up your Cookiecutter templates. 
The simplest way of doing this is to first create a remote GitHub repository from the webpage (or from the Desktop app, if you are using it) with a proper project name. Then `git clone` that repository you just made into your `Projects` main folder. Then, use cookiecutter to create a project folder template and copy-paste the contents of the folder template to your cloned repo. Remember to fill up your metadata and description files! If you wish, you could already git add, commit, and push the first changes to the folders and continue from there on. +To get the most out of your metadata file and those of your colleagues, you can combine them and explore them by creating an interactive catalog browser. You can create interactive web apps straight from R or Python. Whether you have generated a tabular file or a SQLite database, browse through the metadata using [Shiny](https://shiny.posit.co/). Shiny apps are perfect for researchers because they enable you to create interactive visualizations and dashboards with dynamic data inputs and outputs without needing extensive web development knowledge. Shiny provides a variety of user interface components such as forms, tables, graphs, and maps to help you organize and present your data effectively. It also allows you to filter, sort, and segment data for deeper insights. -Go back to the course material [lesson 5](./05_VC.qmd) and read the differences between converting folders to git repositories and cloning a folder to an existing git repository. -:::{.callout-tip title="Tips to write good commit messages"} -If you would like to know more about Git commits and the best way to make clear Git messages, check out [this post](https://www.conventionalcommits.org/en/v1.0.0/)! +:::{.callout-tip} +- For R Enthusiasts + +Explore [demos](https://shiny.posit.co/r/gallery/#feature-demos) from the R Shiny community to kickstart your projects or for inspiration. 
+ +- For Python Enthusiasts + +Shiny for Python provides live, interactive code throughout its entire tutorial. Additionally, it offers a great tool called [Playground](https://shinylive.io/py/examples/#basic-app), where you can code and test your own app to explore how different features render. ::: -### GitHub Pages -Once you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, Rmarkdowns, or HTML reports, in a [GitHub Page website](https://pages.github.com/). Creating a GitHub page is very simple, and we really recommend that you follow the nice tutorial that GitHub has put for you. Nonetheless, we will see the main steps in the exercise below. +:::{.callout-exercise} +# Exercise 5: Skill Booster, build an interactive catalog browser + +Build an interactive web app straight from R or Python. Below, you will find an example of an R Shiny app. In either case, you will need to define a user interface (UI) and a server function. The UI specifies the layout and appearance of the app, including input controls and output displays. The server function contains the app's logic, handling data manipulation and responding to user interactions. Once you set up the UI and server, you can launch the app! + +Here's the UI and server function structure for an R Shiny app: + +```{.r .code-overflow-wrap} +# Don't forget to load shiny and DT libraries! + +# Specify the layout +ui <- fluidPage( + titlePanel(...), + # Define the appearance of the app + sidebarLayout( + sidebarPanel(...), + mainPanel(...) 
+ ) +) + +server <- function(input, output, session) { + # Define a reactive expression for data based on user inputs + data <- reactive({ + req(input$dataInput) # Ensure data input is available + # Load or manipulate data here + }) + + # Define an output table based on data + output$dataTable <- renderTable({ + data() # Render the data as a table + }) + + # Observe a button click event and perform an action + observeEvent(input$actionButton, { + # Perform an action when the button is clicked + }) + + # Define cleanup tasks when the app stops + onStop(function() { + # Close connections or save state if necessary + }) +} +# Run the app +shinyApp(ui, server) +``` +If you need more assistance, take a look at the code below (Hint). + +:::{.callout-hint} +```{.r .code-overflow-wrap} +# R version 4.3.2 +print("Assuming the libraries from Exercise 4 are already loaded and a dataframe has been generated from the YAML files...") + +# quiet() from Exercise 4 installs (if needed) and loads each package. +quiet("shiny") +quiet("DT") + +# UI +ui <- fluidPage( + titlePanel("TSV File Viewer"), + + sidebarLayout( + sidebarPanel( + fileInput("file", "Choose a TSV file", accept = c(".tsv")), + selectInput("filter_column", "Filter by Column:", choices = c("n_samples", "technology"), selected = "technology"), + textInput("filter_value", "Filter Value:", value = ""), + # if only numbers, numericInput() + radioButtons("sort_order", "Sort Order:", choices = c("Ascending", "Descending"), selected = "Ascending") + ), + + mainPanel( + DTOutput("table") + ) + ) +) + +# Server +server <- function(input, output) { + + data <- reactive({ + req(input$file) + df <- read.delim(input$file$datapath, sep = "\t") + str(df) # Print the structure of the loaded table to the console + + # Filter the DataFrame based on user input + if (input$filter_column != "" && input$filter_value != "") { + # If the column is numeric, keep rows at or above the given value + if (is.numeric(df[[input$filter_column]])) { + df <- df[df[[input$filter_column]] >= as.numeric(input$filter_value), ] + } + # Check if the 
column is a string + else if (is.character(df[[input$filter_column]])) { + df <- df[df[[input$filter_column]] == input$filter_value, ] + } + } + + # Sort the DataFrame based on user input + sort_order <- if (input$sort_order == "Ascending") TRUE else FALSE + df <- df[order(df[[input$filter_column]], decreasing = !sort_order), ] + df + }) + + output$table <- renderDT({ + datatable(data()) + }) +} + +# Run the app +shinyApp(ui, server) +``` +::: +In the optional exercise below, you'll find a code example for using an SQLite database as input instead of a tabulated file. +::: -There are many different ways to create your web pages. We recommend using Mkdocs and Mkdocs materials as a framework to create a nice webpage simply. The folder templates that we used as an example in the previous exercise already contain everything you need to start a webpage. Nonetheless, you will need to understand the basics of [MkDocs](https://www.mkdocs.org/) and [MkDocs materials](https://squidfunk.github.io/mkdocs-material/) to design a webpage to your liking. MkDocs is a static webpage generator that is very easy to use, while MkDocs materials is an extension of the tool that gives you many more options to customize your website. Check out their web pages to get started! :::{.callout-exercise} -# Exercise 5: make a project folder and publish a data analysis webpage +# Exercise (optional) +Once you've finished the previous exercise, consider implementing these additional ideas to maximize the utility of your catalog browser. -1. Configure your main GitHub Page and its repo +- Use SQLite databases as input +- Add a functionality to only select certain columns `uiOutput("column_select")` +- Filter columns by value using `column_filter_select()` +- Add multiple tabs using `tabsetPanel()` +- Add buttons to order numeric columns ascending or descending using `radioButtons()` +- Use SQL aggregation functions (e.g., SUM, COUNT, AVG) to perform custom data summaries and calculations. 
+- Add a tab with `tabPanel()` to create a project directory interactively (and fill in the metadata fields). Tips: `dir.create()`, `data.frame()`, `write.table()` +- Modify existing entries +- Visualize results using [Cirrocumulus](https://cirrocumulus.readthedocs.io/en/latest/), an interactive visualization tool for large-scale single-cell genomics data. - The first step is to set up the main GitHub Page site and the repository that will host it. This is very simple, as you will only need to follow [these steps](https://pages.github.com/). In a Markdown document, outline the primary objectives of the organization and provide an overview of ongoing research projects. - After you have created the *organization/username*github.io, it is time to configure your `Project` repository webpage using MkDocs! +If you need some assistance, take a look at the code below (Hint). -2. Start a new project from Cookiecutter or use one from the previous exercise. +:::{.callout-hint} +Explore an example with advanced features such as a two-tab layout, filtering by numeric values and matching strings, and a color-customized dashboard [here](./scripts/shiny_sqlite_advanced.R){ target="_blank"}. +::: +::: - If you use a `Project` repo from the first exercise, go to the next paragraph. Using Cookiecutter, create a new data analysis project. Remember to fill up your metadata and description files! After you have created the folder, it would be best to initialize a Git repo following the instructions from the [previous section](#creating-a-git-repo-online-and-copying-your-project-folder). +## 5. Version control using Git and GitHub - Next, link your data of interest (or create a small fake dataset) and make an example of a data analysis notebook/report (this could be just a scatter plot of a random matrix of values). Depending on your setup, you might be using Jupyter Notebooks or Rmarkdowns. 
The extensions that we have installed using `pip` allow you to directly add a Jupyter Notebook file to the `mkdocs.yml` navigation section. On the other hand, if you are using Rmarkdown, you will have to knit your document into either an HTML page or a GitHub document. - - For the purposes of this exercise, we have already included a basic `index.md` markdown file that can serve as the intro page of your repo, and a `jupyter_example.ipynb` with some code in it. You are welcome to modify them further to test them out! +Version control involves systematically tracking changes to a project over time, offering a structured way to document revisions and understand the progression of your work. In research data management and data analytics, it plays a critical role and provides numerous benefits. -3. Use MkDocs to create your webpage +[Git](https://git-scm.com/about) is a distributed version control system that helps developers and researchers efficiently manage project history, collaborate seamlessly, and maintain data integrity. On the other hand, [GitHub](https://github.com/) is a web-based platform that builds on Git's functionality by providing a centralized, collaborative hub for hosting Git repositories. It offers several key functionalities, such as tracking issues, security features to safeguard your repos, and GitHub Pages that allow you to create websites to showcase your projects. - When you are happy with your files and are ready to publish them, make sure to add, commit, and push the changes to the remote. Then, build up your webpage using MkDocs and the [`mkdocs gh-deploy`](https://www.mkdocs.org/user-guide/deploying-your-docs/) command from the same directory where the `mkdocs.yml` file is. For example, if your `mkdocs.yml` for your `Project` folder is in `/Users/JARH/Projects/project1_JARH_20231010/mkdocs.yml`, do `cd /Users/JARH/Projects/project1_JARH_20231010/` and then `mkdocs gh-deploy`. 
- This requires a couple of changes in your GitHub organization settings. +:::{.callout-tip title="Create a GitHub organization for your lab or department"} +GitHub users can create organizations, allowing groups to collaborate or create repositories under the same organization umbrella. You can create an educational organization on GitHub for free by setting up a GitHub account for your lab. - Remember to make sure that your markdowns, images, reports, etc., are included in the `docs` folder and properly set up in the navigation section of your `mkdocs.yml` file. +Follow these [instructions](https://docs.github.com/en/organizations/collaborating-with-groups-in-organizations/creating-a-new-organization-from-scratch) to create a GitHub organization. - Finally, we only need to set up the GitHub `Project` repo settings. +Once you've established your GitHub organization, be sure to create your repositories within the organization's space rather than under your personal user account. This keeps your projects centralized and accessible to the entire group. Best practices for managing an organization on GitHub include setting clear access permissions, regularly reviewing roles and memberships, and organizing repositories effectively to keep your projects structured and easy to navigate. -4. Publishing your GitHub Page - - Go to your GitHub repo settings and configure the Page section. Since you are using the `mkdocs gh-deploy` command to publish your site in the `gh-pages` branch (as explained the the mkdocs documentation), we need to change where GitHub is fetching the website. You will need to configure the settings of this repository in GitHub so that the Page is taken from the `gh-pages` branch and the `root` folder. +::: + +### Setting up a GitHub repository for your project folder + +Version controlling your data analysis folders becomes straightforward once you've established your Cookiecutter templates. 
After you've created several folder structures and metadata using your Cookiecutter template, you can manage version control by either converting those folders into Git repositories or copying a folder into an existing Git repository. Both approaches are explained in [Lesson 5](https://hds-sandbox.github.io/RDM_NGS_course/develop/05_VC.html#from-project-folders-to-git-repositories). + +:::{.callout-exercise} +# Exercise 6: Initialize a repository from an existing folder +1. Initialize the repository: Begin by running the command `git init` in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. See [`git init`](https://git-scm.com/docs/git-init) for more details. +2. Create a remote repository: Once the local repository is initialized, create a new, empty repository on GitHub (via the website or GitHub Desktop). +3. Connect the remote repository: Add the GitHub repository URL to your local repository using the command `git remote add origin <repository-URL>`, replacing `<repository-URL>` with the URL of your GitHub repository. This associates the remote repository with the name "origin." +4. Commit changes: If you have files you want to add to your repository, stage them using `git add .`, then create a commit to save a snapshot of your changes with `git commit -m "add local folder"`. +5. Push to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using `git push -u origin main`. +::: + +:::{.callout-tip title="Tips to write good commit messages"} +If you would like to know more about Git commits and how to write clear commit messages, check out [this post](https://www.conventionalcommits.org/en/v1.0.0/)! +::: + +### GitHub Pages + +After creating your repository and hosting it on GitHub, you can now add your data analysis reports (such as Jupyter Notebooks, R Markdown files, or HTML reports) to a [GitHub Page website](https://pages.github.com/). 
Setting up a GitHub Page is straightforward, and we recommend following GitHub's helpful tutorial. However, we will go through the key steps in the exercise below. There are several ways to create your web pages, but we suggest using Quarto as a framework to build a sleek, professional-looking website with ease. The folder templates from the previous exercise already contain the necessary elements to launch a webpage. Familiarizing yourself with the basics of Quarto will help you design a webpage that suits your preferences. Another common option is [MkDocs](https://squidfunk.github.io/mkdocs-material/). If you want to use MkDocs instead, click [here](./examples/mkdocs_pages.qmd) and follow the instructions. - ![GitHub Pages setup](./images/git_pages.png) +:::{.callout-tip} +Here are some useful links to get started with GitHub Pages: + +- [GitHub Pages](https://pages.github.com/) +- [Quarto GitHub Pages](https://quarto.org/docs/publishing/github-pages.html) +::: + +:::{.callout-exercise} +# Exercise 7: Create a GitHub Page using Quarto +1. Head over to GitHub and create a new public repository named `username.github.io`, where `username` is your username (or organization name) on GitHub. *If the first part of the repository doesn’t exactly match your username, it won’t work, so make sure to get it right.* +2. Go to the folder where you want to store your project, and clone the new repository: `git clone https://github.com/username/username.github.io` (or use GitHub Desktop). +3. Create a new file named `_quarto.yml` + + ```{.yml filename="_quarto.yml"} + project: + type: website + ``` - - Branch should be `gh-pages` - - Folder should be `root` +4. Open the terminal + ```{.bash filename="Terminal"} + # Add a .nojekyll file to the repository root so that GitHub Pages does not run Jekyll processing on your published site + touch .nojekyll # on Windows (cmd), use: copy NUL .nojekyll - After a couple of minutes, your webpage should be ready! 
You should be able to see your webpage through the link provided in the Page section! + # Render the site, then commit and push the project files to GitHub + quarto render + git add .nojekyll _quarto.yml + git commit -m "Add Quarto project files" + git push + ``` +5. If you do not have a `gh-pages` branch, you can create one as follows: + + ```{.bash filename="Terminal"} + git checkout --orphan gh-pages + git reset --hard # make sure all changes are committed before running this! + git commit --allow-empty -m "Initialising gh-pages branch" + git push origin gh-pages + git checkout main # return to your source branch (assumed here to be main) + ``` +6. Before attempting to publish, ensure that the source branch for your repository's GitHub Pages is `gh-pages` and that the site directory is set to the repository root (`/`). + + ![](./images/github_pages_quarto.png) + +7. Do not check your `_site` directory into version control. Add the output directory of your project to `.gitignore`: + + ```{.bash filename=".gitignore"} + /.quarto/ + /_site/ + ``` +8. Now it is time to publish your website: + + ```{.bash filename="Terminal"} + quarto publish gh-pages + ``` -Now it is also possible to include this repository webpage in your main webpage *organization*github.io by including the link of the repo website (https://*organization*github.io/*repo-name*) in the navigation section of the `mkdocs.yml` file in the main *organization*github.io repo. +9. Once you’ve completed a local publish, add a `publish.yml` GitHub Action to your project by creating this YAML file and saving it to `.github/workflows/publish.yml`. Read how to do it [here](https://quarto.org/docs/publishing/github-pages.html#github-action). ::: ## 6. 
Archive GitHub repositories on Zenodo diff --git a/develop/scripts/shiny_sqlite_advanced.R b/develop/scripts/shiny_sqlite_advanced.R new file mode 100644 index 00000000..99946538 --- /dev/null +++ b/develop/scripts/shiny_sqlite_advanced.R @@ -0,0 +1,169 @@ +#!/usr/bin/env Rscript + +# Author: Alba Refoyo Martinez +# Copyright: Copyright 2024, University of Copenhagen +# Email: gsd818@ku.dk +# License: MIT +# R version: 4.3.2 + +# Define the UI +ui <- fluidPage( + titlePanel("SQLite R Shiny App"), + + # Use tabsetPanel to add multiple tabs + tabsetPanel( + # Existing tab for browsing the SQLite database + tabPanel("Browse Database", + sidebarLayout( + sidebarPanel( + fileInput("db_file", "Select SQLite Database File", accept = c(".sqlite")), + uiOutput("table_select"), + uiOutput("column_filter_select"), + textInput("filter_value", "Find by value", ""), + actionButton("refresh", "Refresh Tables") + # UI output for selecting columns (populated based on the selected table) + # uiOutput("column_select"), + ), + mainPanel( + DTOutput("tableData") + ) + )), + + # New tab for creating a project directory and filling metadata fields + tabPanel("Create Project Directory", + sidebarLayout( + sidebarPanel( + textInput("project_name", "Project Name:", value = "MyProject"), + textInput("metadata_field1", "Metadata Field 1:", value = ""), + textInput("metadata_field2", "Metadata Field 2:", value = ""), + actionButton("create_project", "Create Project") + ), + mainPanel( + textOutput("message") # To display feedback messages + ) + ) + ) + ) +) + +# Define the server +server <- function(input, output, session) { + # Reactive value to hold the database connection + db_conn <- reactiveVal(NULL) + + # Observe changes in the file input + observeEvent(input$db_file, { + # Check if a file is uploaded + if (!is.null(input$db_file)) { + # Get the path to the uploaded file + db_path <- input$db_file$datapath + + # Disconnect any existing connection + if (!is.null(db_conn())) { + 
dbDisconnect(db_conn()) + } + + # Establish a new connection to the SQLite database + conn <- dbConnect(RSQLite::SQLite(), dbname = db_path) + db_conn(conn) + + # Update the list of tables + updateTableChoices() + } + }) + + # Function to update the list of tables in the database + updateTableChoices <- function() { + # Ensure there's a database connection + if (!is.null(db_conn())) { + # Retrieve the list of tables in the database + tables <- dbListTables(db_conn()) + # Update the choices in the select input + updateSelectInput(session, "table", choices = tables) + } + } + + # Observe the refresh button + observeEvent(input$refresh, { + updateTableChoices() + }) + + # Render the select input for tables + output$table_select <- renderUI({ + selectInput("table", "Select a table", choices = character(0)) + }) + + # Render the select input for columns (choices populated based on the selected table) + # output$column_select <- renderUI({ + # req(input$table) + # # Read data from the selected table + # data <- dbReadTable(db_conn(), input$table) + # # Get the column names from the data + # columns <- names(data) + # # Create a select input for the column choices + # selectInput("columns", "Select columns", choices = columns, multiple = TRUE) + # }) + + + # Render the select input for columns to filter by + output$column_filter_select <- renderUI({ + req(input$table) + data <- dbReadTable(db_conn(), input$table) + columns <- names(data) + selectInput("column_filter", "Filter by column", choices = columns) + }) + + # Display data from the selected table + output$tableData <- renderDT({ + req(input$table) + data <- dbReadTable(db_conn(), input$table) + + # If filtering by columns, ensure they are selected + # req(input$columns) + #filtered_data <- data[, input$columns, drop = FALSE] + + if (!is.null(input$column_filter) && input$filter_value != "") { + filtered_data <- data[data[[input$column_filter]] == input$filter_value, ] + } else { + filtered_data <- data + } + + 
datatable(filtered_data) + }) + + # Observe the create_project button + observeEvent(input$create_project, { + project_name <- input$project_name + metadata_field1 <- input$metadata_field1 + metadata_field2 <- input$metadata_field2 + + # Define the project directory path + project_dir <- file.path(getwd(), project_name) + + # Check if the directory already exists + if (dir.exists(project_dir)) { + output$message <- renderText("Directory already exists. Please choose a different project name.") + } else { + # Create the directory + dir.create(project_dir) + + # Save the metadata fields to a TSV file in the project directory + metadata <- data.frame(Field1 = metadata_field1, Field2 = metadata_field2) + metadata_file <- file.path(project_dir, "metadata.tsv") + write.table(metadata, metadata_file, sep = "\t", row.names = FALSE, col.names = TRUE) + + # Provide feedback to the user + output$message <- renderText(paste("Project created successfully in", project_dir)) + } + }) + + # Close the database connection when the app is stopped; isolate() lets us read the reactive value outside a reactive context + onStop(function() { + if (!is.null(isolate(db_conn()))) { + dbDisconnect(isolate(db_conn())) + } + }) +} + +# Run the app +shinyApp(ui = ui, server = server) \ No newline at end of file diff --git a/index.qmd b/index.qmd index 3756ad00..682ea8f6 100644 --- a/index.qmd +++ b/index.qmd @@ -21,6 +21,9 @@ summary: Index page, intro to course # You should hide the navigation if there are no subsections # You should hide the Table of Contents if there are no important titles --> +:::{.callout-warning title="Practical RDM workshop"} +We offer workshops on practical RDM for biodata. Keep an eye on the upcoming events on the [Sandbox website](https://hds-sandbox.github.io/news/news.html). 
+::: ## Research Data Management for biological data The course "Research Data Management (RDM) for biological data" is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best practices guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course's conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today's NGS research landscape, as well as in other related fields to health and bioinformatics. diff --git a/practical_workflows.qmd b/practical_workflows.qmd index 07b24eb7..0b722eb1 100644 --- a/practical_workflows.qmd +++ b/practical_workflows.qmd @@ -25,7 +25,7 @@ summary: workflow - Create reproducible analyses that can be adapted to new data with little effort ::: -# Workflows +# FAIR Workflows Data analysis typically involves the use of different tools, algorithms, and scripts. It often requires multiple steps to transform, filter, aggregate, and visualize data. The process can be time-consuming because each tool may demand specific inputs and parameter settings. As analyses become more complex, the importance of reproducible and scalable automated workflow management increases. Workflow management encompasses tasks such as parallelization, resumption, logging, and data provenance. @@ -167,10 +167,12 @@ Use git repositories to save your projects and pipelines! 
## Nextflow +# FAIR environments + ## Sources - [Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html#tutorial) - [Snakemake tutorial slides by Johannes Köster](https://slides.com/johanneskoester/snakemake-tutorial) - https://bioconda.github.io - Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012. - Köster, Johannes. "Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis", PhD thesis, TU Dortmund 2014. - +- [FAIR Cookbook workflows](https://faircookbook.elixir-europe.org/content/recipes/applied-examples/fair-workflows.html) \ No newline at end of file