
Run Meryl on multiple fastq files #49

Open
NicMAlexandre opened this issue Jul 12, 2024 · 6 comments

@NicMAlexandre

I have a large list of fastq files that will be used for the same database.

Is there a simple way to run the command on a list of files for the same Meryl database, or do I need to concatenate all of them into one large file?

I'm currently trying this:

while read i; do echo "meryl k=21 count $i output Species.meryl"; done >> job.list < Species_fastq.list

@brianwalenz
Member

If your large list isn't too large, you can do:

meryl k=21 count *fastq.gz output Species.meryl

Where 'too large' would generate a complaint about 'line too long' from bash, not meryl.
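To see how close a large glob would come to that limit, the kernel's argument-size cap can be queried directly. A quick illustration (the file names below are made up):

```shell
# The shell's "line too long" / "argument list too long" error comes from
# the kernel's ARG_MAX cap on the total size of an exec()'d command line:
getconf ARG_MAX

# Rough size, in bytes, that a glob expanding to 1000 hypothetical
# file names would contribute to the command line:
printf 'sample%04d.fastq.gz ' $(seq 1 1000) | wc -c
```

If the second number approaches the first, the single-invocation glob will fail and the per-file approach below is the way to go.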

What you're doing in the while loop above would count k-mers in each file individually, but overwrite the results each time. To do it properly, you'd want something like count $i output tmp$i.meryl, then follow up with meryl union tmp*.meryl output all.meryl.
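A sketch of that per-file approach, in the same job-list style as the original command (file names are illustrative, and I use union-sum for the merge, as later in this thread, since it sums the per-file counts; the plain union mentioned above is the maintainer's suggestion):

```shell
# Illustrative sketch; Species_fastq.list holds one fastq path per line.
# The two example read files here are made-up names:
printf '%s\n' reads_A.fastq.gz reads_B.fastq.gz > Species_fastq.list

# Emit one count job per file, each writing its own database
# (so nothing gets overwritten):
while read -r fq; do
  echo "meryl k=21 count $fq output tmp.$fq.meryl"
done < Species_fastq.list > job.list

# After every count job has finished, merge the partial databases:
echo "meryl union-sum tmp.*.meryl output Species.meryl" >> job.list

cat job.list
```

The count jobs are independent, so they can run in parallel; only the final merge has to wait for all of them.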

@NicMAlexandre
Author

NicMAlexandre commented Jul 13, 2024 via email

@damizuka

damizuka commented Oct 9, 2024

Hi, great program!
I'm trying the exact approach above for counting k-mers in multiple fastq files, but when I do meryl union I get the following output:

PROCESSING TREE #1 using 88 threads.
opUnion
  D2_S1_L001_R1_001.fastq.gz.meryl D2_S1_L001_R1_002.fastq.gz.meryl D2_S1_L001_R1_003.fastq.gz.meryl D2_S1_L001_R1_004.fastq.gz.meryl D2_S1_L001_R1_005.fastq.gz.meryl
  D2_S1_L001_R2_001.fastq.gz.meryl D2_S1_L001_R2_002.fastq.gz.meryl D2_S1_L001_R2_003.fastq.gz.meryl D2_S1_L001_R2_004.fastq.gz.meryl D2_S1_L001_R2_005.fastq.gz.meryl
  D2_S1_L002_R1_001.fastq.gz.meryl D2_S1_L002_R1_002.fastq.gz.meryl D2_S1_L002_R1_003.fastq.gz.meryl D2_S1_L002_R1_004.fastq.gz.meryl D2_S1_L002_R1_005.fastq.gz.meryl
  D2_S1_L002_R2_001.fastq.gz.meryl D2_S1_L002_R2_002.fastq.gz.meryl D2_S1_L002_R2_003.fastq.gz.meryl D2_S1_L002_R2_004.fastq.gz.meryl D2_S1_L002_R2_005.fastq.gz.meryl
  D2_S2_L001_R1_001.fastq.gz.meryl D2_S2_L001_R1_002.fastq.gz.meryl D2_S2_L001_R1_003.fastq.gz.meryl D2_S2_L001_R1_004.fastq.gz.meryl
  D2_S2_L001_R2_001.fastq.gz.meryl D2_S2_L001_R2_002.fastq.gz.meryl D2_S2_L001_R2_003.fastq.gz.meryl D2_S2_L001_R2_004.fastq.gz.meryl
  D2_S2_L002_R1_001.fastq.gz.meryl D2_S2_L002_R1_002.fastq.gz.meryl D2_S2_L002_R1_003.fastq.gz.meryl D2_S2_L002_R1_004.fastq.gz.meryl
  D2_S2_L002_R2_001.fastq.gz.meryl D2_S2_L002_R2_002.fastq.gz.meryl D2_S2_L002_R2_003.fastq.gz.meryl D2_S2_L002_R2_004.fastq.gz.meryl
output to all.meryl
Failed to open 'D2_S1_L001_R1_005.fastq.gz.meryl/0x100000.merylData' for reading: Too many open files
Failed to open 'D2_S1_L002_R2_005.fastq.gz.meryl/0x111101.merylData' for reading: Too many open files

How can I solve it?
Best

Dami Gonzalez, Bioinformatics
PhD candidate

@brianwalenz
Member

It looks like it is defaulting to using all CPUs on the machine. Each CPU will open one set of input/output files. The simple solution is to decrease (or explicitly set) the number of CPUs in use with threads=8 (or 4, or 16, or 32, etc.).
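A sketch of the fix (threads=8 and the database names are assumptions; the echo just shows the shape of the command so the snippet runs without meryl installed):

```shell
# Each worker thread opens its own set of input database files, so
# capping the thread pool keeps the total open-file count under the
# per-process limit, which ulimit -n reports:
echo "open-file limit: $(ulimit -n)"

# threads= is a global meryl option; 8 is an illustrative value.
echo "meryl threads=8 union-sum tmp.*.meryl output all.meryl"
```

With 88 threads each holding several database files open, the default limit (often 1024) is easy to exceed; a smaller thread count, or a raised ulimit -n, avoids the "Too many open files" failure.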

@damizuka

damizuka commented Oct 9, 2024

Thanks for the quick response! I'm running meryl again for every fastq; I'll try the new solution and let you know.
Best regards

@damizuka

Hi, the code works fine! I was wondering if it is normal that the resulting output folder from meryl union-sum is the same size as each individual database.
Best regards! :)
