read duplication #217

Open
necrolyte2 opened this issue Mar 24, 2016 · 4 comments

@necrolyte2
Member

Related #204

So today they filled their hard drive while running the pipeline.
The fastq files they are running are very large since they only ran 12 or 24 samples in the run.

An example project they have:
RawFastq (1G x 2) + Filtered Fastq (1G x 2) + trimmed fastq (1G x 2) + bam (790M) = 6.7G

They are running several of these samples, so you can see how this adds up.

At WRAIR this is less of an issue because we have de-duplication on the storage server.

Just as a test I tried gzipping one of the fastq files that was originally 1.2G and it came out 330M, which is a pretty great storage savings.

Maybe we should force gzip output from all stages?
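
A rough back-of-the-envelope for the numbers above, as a sketch only. It assumes binary units (1G = 1024M) and assumes every fastq compresses at the same ratio as the single test file (1.2G down to 330M), which may not hold in practice. The bam is left at its current size because BAM files are already compressed.

```python
# Sizes in MiB, taken from the example project above.
FASTQ_MB = 1024          # each fastq file is roughly 1G
FASTQ_FILES = 6          # raw, filtered, trimmed: 2 files each
BAM_MB = 790

# Ratio observed in the single gzip test (1.2G -> 330M).
# Assumption: the same ratio applies to every fastq file.
GZIP_RATIO = 330 / (1.2 * 1024)


def project_size_mb(fastq_ratio=1.0):
    """Total size of one sample's project, with fastq sizes scaled by fastq_ratio."""
    return FASTQ_FILES * FASTQ_MB * fastq_ratio + BAM_MB


uncompressed = project_size_mb()
compressed = project_size_mb(GZIP_RATIO)
print(f"uncompressed:       {uncompressed / 1024:.2f}G")
print(f"with gzipped fastq: {compressed / 1024:.2f}G")
```

Under those assumptions a sample drops from a bit under 7G to roughly a third of that, so most of the savings come from the six fastq files.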

@averagehat
Contributor

I think we could use the unzipped data in the next step, then gzip it afterwards, if that makes sense. Like:

convert_format files
ngs_filter files > filtered
gzip files
trim_reads filtered
gzip filtered

etc.
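
The ordering above (run the next stage on uncompressed data, then compress what it consumed) could be sketched in Python roughly as follows. `gzip_in_place` and `run_stage_then_compress` are hypothetical helpers, not functions from the pipeline, and `stage` is a stand-in for whatever actually calls ngs_filter or trim_reads.

```python
import gzip
import os
import shutil


def gzip_in_place(path):
    """Compress path to path + '.gz' and remove the original file."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path


def run_stage_then_compress(stage, inputs):
    """Run stage on uncompressed inputs, then gzip those inputs.

    stage is any callable that takes a list of input paths and returns
    the list of output paths it wrote (hypothetical interface).
    """
    outputs = stage(inputs)
    # Inputs are only compressed once the stage has finished reading them.
    for path in inputs:
        gzip_in_place(path)
    return outputs
```

Each stage's outputs stay uncompressed until the following stage has read them, so at most one stage's worth of uncompressed fastq sits on disk at a time.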

@necrolyte2
Member Author

This will work fine, since we can't just do gzip read/write due to Biopython incompatibility.

@averagehat
Contributor

Another thing is that right now ngs_filter symlinks data from convert-formats if no filtering is done.
We could maybe fix this by skipping the call to ngs_filter altogether within runsample. Are there any other symbolic links being used in the pipeline?
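
One way to answer the symlink question for a finished project is to scan its directory before compressing anything. This is only a sketch; `find_symlinks` is a hypothetical helper and the project layout is not assumed.

```python
import os


def find_symlinks(root):
    """Return the sorted paths of all symbolic links found under root."""
    links = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories in dirnames and
        # symlinks to files in filenames, so check both.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append(path)
    return sorted(links)
```

Anything it returns would need to be handled separately, since compressing the link target would leave the link dangling.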

@necrolyte2
Member Author

I don't remember any other symlinks besides runsamplesheet.sh symlinking consensus sequence files.
