prefer sra normalized format over sra lite #23
Comments
So good news and bad news... Good news first:
Q-scores range from 2 to 38, so this should get you what you need. However, I can add a way to allow the user to switch between SRA Normalized and SRA Lite, with Normalized being the default. Now the bad news:
It looks like ENA synced the SRA Lite version of the reads, and not the Normalized. This was also the case for
Hmmm, this bugs me because I usually use ENA as the default provider because they provide FASTQs directly. But I also want the original quality scores, which SRA-synced reads may or may not provide. The blog post above has an October 2021 date, so I'm unsure if after this date the reads synced from SRA to ENA have the SRA Lite Q scores. I'm wondering if a solution might be to add a third provider: |
This sounds like the best first pass solution to me |
Thanks for the quick reply & brainstorming on solutions. Just wanted to share this example where despite using the options Robert suggested, it still seemed to download SRA Lite formatted FASTQs. Even though the output explicitly states
# fastq-dl v2.0.2 installed via mamba
$ fastq-dl -a SRR13086318 --verbose --provider sra --only-provider
2023-08-03 10:09:37 DEBUG 2023-08-03 10:09:37:root:DEBUG - Querying ENA for metadata... fastq_dl.py:428
DEBUG 2023-08-03 10:09:37:root:DEBUG - --only-provider supplied, limiting queries to sra fastq_dl.py:431
DEBUG 2023-08-03 10:09:37:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:39 DEBUG 2023-08-03 10:09:39:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "POST /entrez/eutils/esearch.fcgi HTTP/1.1" 200 None connectionpool.py:456
DEBUG 2023-08-03 10:09:39:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:40 DEBUG 2023-08-03 10:09:40:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "GET connectionpool.py:456
/entrez/eutils/esummary.fcgi?db=sra&usehistory=n&retmode=json&query_key=1&WebEnv=MCID_64cbb522a06d0e3d496a66e5&retstart=0&retmax=500
HTTP/1.1" 200 None
DEBUG 2023-08-03 10:09:40:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:41 DEBUG 2023-08-03 10:09:41:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "GET connectionpool.py:456
/entrez/eutils/esearch.fcgi?db=sra&usehistory=n&retmode=json&term=SRR13086318 HTTP/1.1" 200 None
DEBUG 2023-08-03 10:09:41:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:42 DEBUG 2023-08-03 10:09:42:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "GET connectionpool.py:456
/entrez/eutils/efetch.fcgi?db=sra&usehistory=n&retmode=runinfo&query_key=1&WebEnv=MCID_64cbb5240bbbf858ca74f635&retstart=0&retmax=500
HTTP/1.1" 200 None
DEBUG 2023-08-03 10:09:42:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): www.ebi.ac.uk:443 connectionpool.py:1003
2023-08-03 10:10:00 DEBUG 2023-08-03 10:10:00:urllib3.connectionpool:DEBUG - https://www.ebi.ac.uk:443 "GET connectionpool.py:456
/ena/portal/api/filereport?result=read_run&fields=fastq_ftp&accession=SRP074197 HTTP/1.1" 200 None
2023-08-03 10:10:10 INFO 2023-08-03 10:10:10:root:INFO - Query: SRR13086318 fastq_dl.py:629
INFO 2023-08-03 10:10:10:root:INFO - Archive: sra fastq_dl.py:630
INFO 2023-08-03 10:10:10:root:INFO - Total Runs To Download: 1 fastq_dl.py:635
INFO 2023-08-03 10:10:10:root:INFO - Working on run SRR13086318... fastq_dl.py:654
DEBUG 2023-08-03 10:10:10:executor.process:DEBUG - Executing external command: bash -c 'prefetch SRR13086318 --max-size 10T -o SRR13086318.sra' __init__.py:1475
DEBUG 2023-08-03 10:10:10:executor.process:DEBUG - Constructing subprocess.Popen object .. __init__.py:1483
DEBUG 2023-08-03 10:10:10:executor.process:DEBUG - Joining synchronous process using subprocess.Popen.communicate() .. __init__.py:1504
2023-08-03 10:10:14 DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'prefetch SRR13086318 --max-size 10T -o __init__.py:1638
SRR13086318.sra').
DEBUG 2023-08-03 10:10:14:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-03 10:10:14:root:DEBUG - 2023-08-03T14:10:10 prefetch.3.0.3: Current preference is set to retrieve SRA Normalized Format files with full fastq_dl.py:93
base quality scores.
2023-08-03T14:10:11 prefetch.3.0.3: 1) Downloading 'SRR13086318'...
2023-08-03T14:10:11 prefetch.3.0.3: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to
current file availability.
2023-08-03T14:10:11 prefetch.3.0.3: Downloading via HTTPS...
2023-08-03T14:10:14 prefetch.3.0.3: HTTPS download succeed
2023-08-03T14:10:14 prefetch.3.0.3: 'SRR13086318' is valid
2023-08-03T14:10:14 prefetch.3.0.3: 1) 'SRR13086318' was downloaded successfully
2023-08-03T14:10:14 prefetch.3.0.3: 'SRR13086318' has 0 unresolved dependencies
DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Executing external command: bash -c 'fasterq-dump SRR13086318 --split-3 --mem 1G --threads 1' __init__.py:1475
DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Constructing subprocess.Popen object .. __init__.py:1483
DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Joining synchronous process using subprocess.Popen.communicate() .. __init__.py:1504
2023-08-03 10:10:35 DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'fasterq-dump SRR13086318 --split-3 --mem 1G __init__.py:1638
--threads 1').
DEBUG 2023-08-03 10:10:35:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-03 10:10:35:root:DEBUG - spots read : 841,910 fastq_dl.py:93
reads read : 1,683,820
reads written : 1,683,820
DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Executing external command: bash -c 'pigz --force -p 1 -n SRR13086318*.fastq' __init__.py:1475
DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Constructing subprocess.Popen object .. __init__.py:1483
DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Joining synchronous process using subprocess.Popen.communicate() .. __init__.py:1504
2023-08-03 10:12:39 DEBUG 2023-08-03 10:12:39:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'pigz --force -p 1 -n SRR13086318*.fastq'). __init__.py:1638
DEBUG 2023-08-03 10:12:39:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-03 10:12:39:root:DEBUG - fastq_dl.py:93
INFO 2023-08-03 10:12:39:root:INFO - Writing metadata to /home/curtis_kapsak/fastq-run-info.tsv
|
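For anyone wanting to check which format actually landed on disk after a run like the one above, here is a minimal sketch that reports the Phred score range found in a gzipped FASTQ. It assumes Phred+33 encoding and four-line FASTQ records; the function names and the file name in the usage note are hypothetical, not part of fastq-dl.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def phred_range(quality_lines: Iterable[str], offset: int = 33) -> Tuple[int, int]:
    """Return (min, max) Phred score over the given FASTQ quality strings.

    Assumes Phred+33 encoding unless another offset is passed.
    """
    lo = hi = None
    for qual in quality_lines:
        for ch in qual:
            q = ord(ch) - offset
            lo = q if lo is None or q < lo else lo
            hi = q if hi is None or q > hi else hi
    if lo is None:
        raise ValueError("no quality values found")
    return lo, hi


def fastq_quality_lines(path: str, max_reads: int = 10000) -> Iterator[str]:
    """Yield the quality line (4th line of each record) from a gzipped FASTQ.

    Assumes strict four-line records; stops after max_reads records.
    """
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:
                yield line.rstrip("\n")
```

Something like `phred_range(fastq_quality_lines("SRR13086318_1.fastq.gz"))` (file name assumed) would then give a quick read on the download: a result of `(30, 30)` suggests simplified SRA Lite scores, while a spread such as `(2, 38)` suggests the full quality scores came through.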
OK, yup I think I just got unlucky with this particular accession: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR13086318&display=metadata It seems to me that even the original FASTQs hosted on SRA are SRA Lite format. I tried using |
It's looking like, given Unfortunately, after that, not much can be done about what SRA is serving up. For Haha quite the can of worms that SRA Lite has opened! |
agreed! Thank you for digging into this one. I will submit a ticket to the SRA helpdesk and see what they can tell me. |
@kapsakcj as a band-aid, I released v2.0.3 which explicitly sets the preference to SRA Normalized by executing I have to restructure things, when I do that I'll add the This should at least allow you to move forward and know that we've provided SRA everything expected to get SRA Normalized format. |
I've hit an odd issue where fastq-dl pulls FASTQs without issue, but they are in SRA Lite format instead of the typical SRA Normalized format.
FASTQs in SRA Lite format have ? for all Q-scores for all bases, which equates to Q30. This leads to issues where trimmomatic or other typical downstream software is unable to detect the Phred quality encoding, and the Q-scores are not useful during assembly (and probably other applications that utilize the Q-scores).
FASTQs in SRA Normalized are the original format that contains the full base quality scores.
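To make the Q30 point concrete, here is a small sketch of the arithmetic, assuming the standard Phred+33 encoding: the score is the character's ASCII code minus 33. The helper names are hypothetical, and a uniform quality string is only a hint that reads are SRA Lite, not proof.

```python
def phred33(char: str) -> int:
    """Decode one FASTQ quality character, assuming Phred+33 encoding."""
    return ord(char) - 33


def has_uniform_quality(quality_strings) -> bool:
    """Return True if every base across all quality strings has the same character.

    Heuristic only: SRA Lite reads show a single repeated character,
    but an empty input or mixed characters return False.
    """
    seen = set()
    for qual in quality_strings:
        seen.update(qual)
        if len(seen) > 1:
            return False
    return len(seen) == 1


# '?' is ASCII 63, so 63 - 33 gives Q30
print(phred33("?"))
print(has_uniform_quality(["????????", "????"]))
```

With only one distinct character in the whole file, a downstream tool has no spread of values to work with, which is consistent with the encoding-detection trouble described above.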
Some examples where I encountered this issue
I'm guessing it will be a big effort, but would it be possible for fastq-dl to download the SRA-normalized format of FASTQs?
Not sure how ENA deals with this issue, but sra-toolkit has an option for using this format
More info: