Commit
Merge branch 'develop' into qc_report2
siebrenf authored Sep 13, 2023
2 parents 4d899b4 + 3edc246 commit d840a71
Showing 13 changed files with 267 additions and 44 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,10 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate

## [Unreleased]

### Added

- download samples directly from ENCODE by assay (ENCSR) and file (ENCFF) accession IDs.

### Changed

- the init, run, and explain commands display the supported workflows in their --help
6 changes: 3 additions & 3 deletions docs/content/gettingstarted.md
@@ -7,8 +7,8 @@
Download and install [mambaforge](https://github.com/conda-forge/miniforge) if not yet installed:

```console
user@comp:~$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
user@comp:~$ bash miniconda.sh # (make sure to say **yes** when asked to run conda init)
user@comp:~$ wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -O mambaforge.sh
user@comp:~$ bash mambaforge.sh # (make sure to say **yes** when asked to initialize conda)
user@comp:~$ source ~/.bashrc
```

@@ -25,7 +25,7 @@ user@comp:~$ conda config --add channels conda-forge
The most straightforward way to install seq2science is with [conda](https://docs.continuum.io/anaconda/) using the [bioconda](https://bioconda.github.io/) channel. To install seq2science in a fresh environment using bioconda:

```console
(base) user@comp:~$ conda create -n seq2science seq2science
(base) user@comp:~$ mamba create -n seq2science seq2science
```

This should install the *newest* official release of seq2science.
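After the environment is created, it has to be activated before use. A quick sanity check could look like this (using the environment name from the command above):

```console
(base) user@comp:~$ conda activate seq2science
(seq2science) user@comp:~$ seq2science --help
```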
2 changes: 2 additions & 0 deletions docs/content/seq2science_profiles.md
@@ -3,6 +3,8 @@
You might have custom settings that you always use, e.g. when running seq2science on a cluster.
For these cases we made generic profiles, which can be found here: https://github.com/vanheeringen-lab/seq2science-profiles

Support for profiles is **experimental**, and might not always work as intended. Please let us know if you come across any issues.

Running seq2science after installation of a profile is as easy as:

```
28 changes: 16 additions & 12 deletions docs/content/workflows/download_fastq.md
@@ -9,9 +9,9 @@ Downloading public data in bulk from the NCBI, ENA, and DDBJ databases has never

#### Download SRA file

The four most popular databases that store sequencing data are National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA), the DNA Data Bank of Japan (DDBJ), and the Genome Sequence Archive (GSA).
Only ENA and GSA store the actual fastq files, and DDBJ and NCBI store the raw data (as a sra file) from which a fastq can be derived.
For this reason for each sample will first be checked if it can be downloaded from ENA.
The five most popular databases that store sequencing data are the National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA), the DNA Data Bank of Japan (DDBJ), the Genome Sequence Archive (GSA), and the ENCODE project (ENCODE).
ENA, ENCODE, and GSA store the actual fastq files, while DDBJ and NCBI store the raw data (as an SRA file) from which a fastq can be derived.
For this reason, seq2science first checks for each sample on DDBJ and NCBI whether it can be downloaded directly from ENA as a fastq.
Otherwise the sample is downloaded in its raw format. To convert this data to a fastq it has to be "*dumped*".
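As a rough illustration of this per-sample decision, here is a simplified sketch (not the actual pipeline code); the field name `ena_fastq_ftp` matches the metadata key seq2science stores internally:

```python
# Simplified sketch of the per-run download decision, not seq2science's real implementation.
def download_route(run_metadata: dict) -> str:
    """Pick the download route for a single run based on its (simplified) metadata."""
    if run_metadata.get("ena_fastq_ftp"):
        # a fastq is hosted on ENA: download it directly
        return "ENA fastq download"
    # no direct fastq available: fetch the raw SRA file and "dump" it to fastq
    return "SRA download + fastq dump"


print(download_route({"ena_fastq_ftp": None}))  # -> "SRA download + fastq dump"
```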

### Filling out the samples.tsv
@@ -22,22 +22,26 @@ As an example, the `samples.tsv` could look something like this:

```
sample
ERX123 <-- EBI ENA experiment
ERR321 <-- EBI ENA run
GSMabc <-- GEO sample
SRX456 <-- SRA experiment
SRRxzy <-- SRA run
DRX890 <-- DDBJ experiment
DRR098 <-- DDBJ run
CRX123 <-- GSA experiment
CRX123 <-- GSA experiment
DRX890 <-- DDBJ experiment
DRR098 <-- DDBJ run
ENCSR765 <-- ENCODE assay
ENCFF432 <-- ENCODE fastq file
ERX123 <-- EBI ENA experiment
ERR321 <-- EBI ENA run
GSMabc <-- GEO sample
SRX456 <-- SRA experiment
SRRxzy <-- SRA run
```

#### Sample column

When downloading fastq files there is only one column in the samples.tsv.
This is the sample column, where each sample is specified.
Samples are specified by their accession id (e.g. GSM2837484).
(Accepted formats start with "GSM", "SRR", "SRX", "DRR", "DRX", "ERR" "ERX", or "CRX")
(Accepted formats start with "CRX", "DRR", "DRX", "ENCFF", "ENCSR", "ERR", "ERX", "GSM", "SRR", or "SRX")

When the specified ENCODE fastq file belongs to a paired-end sequencing run, both mates will be downloaded. The files are named after the sample, with R1 and R2 matching ENCODE's mate assignment.
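For reference, the pairing information comes from the public ENCODE JSON API. A minimal sketch of such a lookup is shown below; the accession is a placeholder, and the fields read here (`run_type`, `href`, `paired_with`) are the same ones seq2science uses internally:

```python
import json
import urllib.request

# Minimal sketch of an ENCODE file lookup; ENCFF000XXX is a placeholder accession.
accession = "ENCFF000XXX"
url = f"https://www.encodeproject.org/files/{accession}/?format=json"
with urllib.request.urlopen(url) as response:
    info = json.loads(response.read().decode("utf-8"))

print(info["run_type"])                                 # "single-ended" or "paired-ended"
print("https://www.encodeproject.org" + info["href"])   # direct fastq download link
if info["run_type"] == "paired-ended":
    print(info["paired_with"])                          # path of the mate file, e.g. "/files/ENCFF.../"
```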

#### Final notes

3 changes: 2 additions & 1 deletion seq2science/cli.py
@@ -188,7 +188,8 @@ def seq2science_parser(workflows_dir="./seq2science/workflows/"):
"-p",
"--profile",
metavar="PROFILE NAME",
help="Use a seq2science profile. Profiles can be taken from: https://github.com/s2s-profiles",
help="Use a seq2science profile. Profiles can be taken from: https://github.com/vanheeringen-lab/seq2science-profiles. "
"Profiles are an experimental feature and might not always work as intended. Please let us know if something doesn't work properly!",
)
subparser.add_argument(
"-c",
20 changes: 11 additions & 9 deletions seq2science/rules/configuration_generic.smk
@@ -447,7 +447,7 @@ def parse_pysradb():
sampledict.update(samples2metadata(missing_samples, config, logger))

pickle.dump(
{k: v for k, v in sampledict.items() if k.startswith(("ERR", "ERX", "SRR", "SRX", "GSM", "DRX", "DRR"))},
{k: v for k, v in sampledict.items() if k.startswith(("ERR", "ERX", "SRR", "SRX", "GSM", "DRX", "DRR", "CRX", "ENCSR", "ENCFF"))},
open(pysradb_cache, "wb"),
)

@@ -476,31 +476,31 @@ def parse_pysradb():
if (values["layout"] == "PAIRED") and values.get("ena_fastq_ftp") is not None
for run in values["runs"]
]
gsa_single_end = [
gsa_or_encode_single_end = [
run
for values in sampledict.values()
if (values["layout"] == "SINGLE") and values.get("gsa_fastq_http") is not None
if (values["layout"] == "SINGLE") and values.get("gsa_fastq_http", values.get("encode_fastq_http")) is not None
for run in values["runs"]
]
gsa_paired_end = [
gsa_or_encode_paired_end = [
run
for values in sampledict.values()
if (values["layout"] == "PAIRED") and values.get("gsa_fastq_http") is not None
if (values["layout"] == "PAIRED") and values.get("gsa_fastq_http", values.get("encode_fastq_http")) is not None
for run in values["runs"]
]
sra_single_end = [
run
for values in sampledict.values()
if (values["layout"] == "SINGLE")
for run in values.get("runs", [])
if (run not in ena_single_end and run not in gsa_paired_end)
if (run not in ena_single_end and run not in gsa_or_encode_single_end)
]
sra_paired_end = [
run
for values in sampledict.values()
if (values["layout"] == "PAIRED")
for run in values.get("runs", [])
if (run not in ena_paired_end and run not in gsa_paired_end)
if (run not in ena_paired_end and run not in gsa_or_encode_paired_end)
]

# get download link per run
@@ -515,6 +515,8 @@ def parse_pysradb():
run2download[run] = values["ena_fastq_ftp"][run]
if "gsa_fastq_http" in values:
run2download[run] = values["gsa_fastq_http"][run]
if "encode_fastq_http" in values:
run2download[run] = values["encode_fastq_http"][run]

# if samples are merged add the layout of the technical replicate to the config
failed_samples = dict()
@@ -539,9 +541,9 @@ def parse_pysradb():
logger.error("\n")
os._exit(1) # noqa

return sampledict, ena_single_end, ena_paired_end, gsa_single_end, gsa_paired_end, sra_single_end, sra_paired_end, run2download, pysradb_cache_lock
return sampledict, ena_single_end, ena_paired_end, gsa_or_encode_single_end, gsa_or_encode_paired_end, sra_single_end, sra_paired_end, run2download, pysradb_cache_lock

SAMPLEDICT, ENA_SINGLE_END, ENA_PAIRED_END, GSA_SINGLE_END, GSA_PAIRED_END, SRA_SINGLE_END, SRA_PAIRED_END, RUN2DOWNLOAD, PYSRADB_CACHE_LOCK = parse_pysradb()
SAMPLEDICT, ENA_SINGLE_END, ENA_PAIRED_END, GSA_OR_ENCODE_SINGLE_END, GSA_OR_ENCODE_PAIRED_END, SRA_SINGLE_END, SRA_PAIRED_END, RUN2DOWNLOAD, PYSRADB_CACHE_LOCK = parse_pysradb()
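# Rough illustration (simplified, not taken from the code above) of what these objects end up holding:
#   SAMPLEDICT["ENCSR765..."] -> {"layout": ..., "runs": [...], "encode_fastq_http": {...}}
#   RUN2DOWNLOAD["ERR....."]  -> ENA link(s) taken from "ena_fastq_ftp"
#   RUN2DOWNLOAD["CRR....."]  -> GSA link(s) taken from "gsa_fastq_http"
#   RUN2DOWNLOAD["ENCFF..."]  -> ENCODE link(s) taken from "encode_fastq_http"
# Runs without a direct fastq link are left to the sra2fastq_* rules instead.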

# workflow

20 changes: 10 additions & 10 deletions seq2science/rules/get_fastq.smk
@@ -1,7 +1,7 @@
"""
all rules/logic related to downloading public fastqs should be here.
"""
localrules: run2sra, ena2fastq_SE, ena2fastq_PE, gsa2fastq_SE, gsa2fastq_PE, runs2sample
localrules: run2sra, ena2fastq_SE, ena2fastq_PE, gsa_or_encode2fastq_SE, gsa_or_encode2fastq_PE, runs2sample


import os
@@ -71,12 +71,12 @@ rule run2sra:


# ENA > SRA
ruleorder: ena2fastq_SE > gsa2fastq_SE > sra2fastq_SE
ruleorder: ena2fastq_PE > gsa2fastq_PE > sra2fastq_PE
ruleorder: ena2fastq_SE > gsa_or_encode2fastq_SE > sra2fastq_SE
ruleorder: ena2fastq_PE > gsa_or_encode2fastq_PE > sra2fastq_PE
# PE > SE
ruleorder: ena2fastq_PE > ena2fastq_SE
ruleorder: sra2fastq_PE > sra2fastq_SE
ruleorder: gsa2fastq_PE > gsa2fastq_SE
ruleorder: gsa_or_encode2fastq_PE > gsa_or_encode2fastq_SE


rule sra2fastq_SE:
@@ -234,9 +234,9 @@ rule ena2fastq_PE:
shell("exit 1 >> {log} 2>&1")


rule gsa2fastq_SE:
rule gsa_or_encode2fastq_SE:
"""
Download single-end fastq files from the GSA.
Download single-end fastq files from the GSA or ENCODE DCC.
"""
output:
temp(expand("{fastq_dir}/runs/{{run}}.{fqsuffix}.gz", **config)),
@@ -247,7 +247,7 @@ rule gsa2fastq_SE:
benchmark:
expand("{benchmark_dir}/gsa2fastq_SE/{{run}}.benchmark.txt", **config)[0]
wildcard_constraints:
run="|".join(GSA_SINGLE_END) if len(GSA_SINGLE_END) else "$a",
run="|".join(GSA_OR_ENCODE_SINGLE_END) if len(GSA_OR_ENCODE_SINGLE_END) else "$a",
priority: 1
retries: 2
run:
@@ -261,9 +261,9 @@ rule gsa2fastq_PE:
shell("exit 1 >> {log} 2>&1")


rule gsa2fastq_PE:
rule gsa_or_encode2fastq_PE:
"""
Download paired-end fastq files from the GSA.
Download paired-end fastq files from the GSA or ENCODE DCC.
"""
output:
temp(expand("{fastq_dir}/runs/{{run}}_{fqext}.{fqsuffix}.gz", **config)),
@@ -274,7 +274,7 @@ rule gsa2fastq_PE:
benchmark:
expand("{benchmark_dir}/gsa2fastq_PE/{{run}}.benchmark.txt", **config)[0]
wildcard_constraints:
run="|".join(GSA_PAIRED_END) if len(GSA_PAIRED_END) else "$a",
run="|".join(GSA_OR_ENCODE_PAIRED_END) if len(GSA_OR_ENCODE_PAIRED_END) else "$a",
priority: 1
retries: 2
run:
117 changes: 112 additions & 5 deletions seq2science/util.py
@@ -6,6 +6,7 @@
import os
import sys
import glob
import json
import time
import colorsys
import tempfile
@@ -128,7 +129,7 @@ def samples2metadata_local(samples: List[str], config: dict, logger) -> dict:
):
SAMPLEDICT[sample] = dict()
SAMPLEDICT[sample]["layout"] = "PAIRED"
elif sample.startswith(("GSM", "DRX", "ERX", "SRX", "DRR", "ERR", "SRR", "CRX")):
elif sample.startswith(("GSM", "DRX", "ERX", "SRX", "DRR", "ERR", "SRR", "CRX", "ENCSR", "ENCFF")):
continue
else:
extend_msg = ""
@@ -143,7 +144,7 @@ def samples2metadata_local(samples: List[str], config: dict, logger) -> dict:
f"We checked directory '{config['fastq_dir']}' "
f"for gzipped files starting with '{sample}' and containing '{config['fqsuffix']}'.\n"
+ extend_msg
+ f"Since the sample did not start with either GSM, SRX, SRR, ERR, and DRR we "
+ f"Since the sample did not start with either GSM, SRX, SRR, ERX, ERR, DRX, DRR, CRX, ENCSR, or ENCFF we "
f"couldn't find it online..\n"
)
os._exit(1) # noqa
@@ -207,7 +208,7 @@ def samples2metadata_gsa(samples: List[str], logger) -> dict:
"runs": ["CRR1234", "CRR4321"],
"gsa_fastq_http": {CRR1234: [link_1, link_2], ...,
"SRR5678": {"layout": "SINGLE",
"CRR5678": {"layout": "SINGLE",
"runs": ["CRR5678"],
gsa_fastq_http: {CRR5678: [link_1]},
...
@@ -362,6 +363,110 @@ def samples2metadata_sra(samples: List[str], logger) -> dict:
return sampledict


def samples2metadata_encode(samples: List[str], logger) -> dict:
    """
    Get the metadata of ENCODE assay (ENCSR) and file (ENCFF) accessions.
    """
    metadata = {}
    for sample in samples:
        if sample.startswith("ENCSR"):
            metadata.update(ENCSR2metadata(sample))
        elif sample.startswith("ENCFF"):
            metadata.update(ENCFF2metadata(sample))
        else:
            assert False
    return metadata


def ENCFF2metadata(sample):
    """
    Get the metadata (layout, runs and download links) of a single ENCODE fastq file (ENCFF) accession.
    """
    url = f"https://www.encodeproject.org/files/{sample}/?format=json"
    response = urllib.request.urlopen(urllib.request.Request(url)).read()
    response = json.loads(response.decode('utf-8'))

    assert response["file_format"] == "fastq"

    metadata = {sample: {}}

    if response["run_type"] == "single-ended":
        metadata[sample]["layout"] = "SINGLE"
        # keyed by run, like the paired-end case, so the downstream run2download lookup can index it
        metadata[sample]["encode_fastq_http"] = {sample: "https://www.encodeproject.org" + response["href"]}
        metadata[sample]["runs"] = [sample]
    elif response["run_type"] == "paired-ended":
        metadata[sample]["layout"] = "PAIRED"
        mate = response["paired_with"].split("/")[2]
        mate_url = f"https://www.encodeproject.org/files/{mate}/?format=json"
        mate_response = urllib.request.urlopen(urllib.request.Request(mate_url)).read()
        mate_response = json.loads(mate_response.decode('utf-8'))

        # always store the download links in R1, R2 order
        if response["paired_end"] == "1":
            metadata[sample]["encode_fastq_http"] = {
                sample: [
                    "https://www.encodeproject.org" + response["href"],
                    "https://www.encodeproject.org" + mate_response["href"],
                ]
            }
        else:
            metadata[sample]["encode_fastq_http"] = {
                sample: [
                    "https://www.encodeproject.org" + mate_response["href"],
                    "https://www.encodeproject.org" + response["href"],
                ]
            }

        metadata[sample]["runs"] = [sample]
    else:
        assert False
    return metadata


def ENCSR2metadata(sample):
    """
    Get the metadata of an ENCODE assay (ENCSR) accession.
    An assay can consist of multiple isogenic and technical replicates.
    """
    url = f"https://www.encodeproject.org/search/?type=File&dataset=/experiments/{sample}/&file_format=fastq&format=json&frame=object"
    response = urllib.request.urlopen(urllib.request.Request(url)).read()
    response = json.loads(response.decode('utf-8'))

    metadata = {sample: {}}

    # check if all run types are the same
    assert len({file["run_type"] for file in response["@graph"]}) == 1
    layout = response["@graph"][0]["run_type"]

    if layout == "single-ended":
        metadata[sample]["layout"] = "SINGLE"
        metadata[sample]["encode_fastq_http"] = {
            x["accession"]: "https://www.encodeproject.org" + x["href"] for x in response["@graph"]
        }
        metadata[sample]["runs"] = list(metadata[sample]["encode_fastq_http"].keys())

    elif layout == "paired-ended":
        graph = {x["accession"]: x for x in response["@graph"]}

        metadata[sample]["layout"] = "PAIRED"
        metadata[sample]["runs"] = [x["accession"] for x in response["@graph"] if x["paired_end"] == "1"]

        # for each R1 run, store the R1 and R2 download links in order
        metadata[sample]["encode_fastq_http"] = {
            run: [
                "https://www.encodeproject.org" + graph[run]["href"],
                "https://www.encodeproject.org" + graph[graph[run]["paired_with"].split("/")[2]]["href"],
            ]
            for run in metadata[sample]["runs"]
        }
    else:
        assert False

    return metadata
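# Example (placeholder accessions) of the shapes returned by the two helpers above:
#
#   ENCFF2metadata("ENCFF000AAA")   # a paired-end fastq file
#   -> {"ENCFF000AAA": {"layout": "PAIRED",
#                       "runs": ["ENCFF000AAA"],
#                       "encode_fastq_http": {"ENCFF000AAA": ["<R1 url>", "<R2 url>"]}}}
#
#   ENCSR2metadata("ENCSR000BBB")   # a paired-end assay; one run per R1 fastq in the assay
#   -> {"ENCSR000BBB": {"layout": "PAIRED",
#                       "runs": ["ENCFF000CCC", "ENCFF000DDD"],
#                       "encode_fastq_http": {"ENCFF000CCC": ["<R1 url>", "<R2 url>"],
#                                             "ENCFF000DDD": ["<R1 url>", "<R2 url>"]}}}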


def samples2metadata(samples: List[str], config: dict, logger) -> dict:
local_samples = samples2metadata_local(samples, config, logger)
public_samples = [sample for sample in samples if sample not in local_samples.keys()]
@@ -371,7 +476,9 @@ def samples2metadata(samples: List[str], config: dict, logger) -> dict:

gsa_samples = [sample for sample in public_samples if sample.startswith("CRX")]
gsa_samples = samples2metadata_gsa(gsa_samples, logger)
rest_public_samples = [sample for sample in public_samples if sample not in gsa_samples]
encode_samples = [sample for sample in public_samples if sample.startswith(("ENCSR", "ENCFF"))]
encode_samples = samples2metadata_encode(encode_samples, logger)
rest_public_samples = [sample for sample in public_samples if sample not in (list(gsa_samples.keys()) + list(encode_samples.keys()))]

# chop public samples into smaller chunks, doing large queries results into
# pysradb decode errors..
@@ -383,7 +490,7 @@ def samples2metadata(samples: List[str], config: dict, logger) -> dict:
# just to be sure sleep in between to not go over our API limit
time.sleep(1)

return {**local_samples, **sra_samples, **gsa_samples}
return {**local_samples, **sra_samples, **gsa_samples, **encode_samples}


def sieve_bam(configdict):