Skip to content
This repository has been archived by the owner on May 7, 2019. It is now read-only.

Enable multi-lane support #20

Open
marchoeppner opened this issue Apr 26, 2018 · 4 comments
Open

Enable multi-lane support #20

marchoeppner opened this issue Apr 26, 2018 · 4 comments

Comments

@marchoeppner
Copy link

marchoeppner commented Apr 26, 2018

Some sequencing setups will split libraries across lanes. This is currently not modeled in the pipeline.

Using a CSV to keep track of IndividualID and sampleID, we could do something along these lines:

runBWAOutput_grouped_by_sample = runBWAOutput.groupTuple(by: [0,1])

process mergeBamFiles_bySample {

        tag "${indivID}|${sampleID}"
	
	input:
        set indivID, sampleID, file(aligned_bam_list) from runBWAOutput_grouped_by_sample

	output:
	set indivID,sampleID,file(merged_bam) into mergedBamFile_by_Sample

	script:
	merged_bam = sampleID + "merged.bam"

	"""
		java -jar -Djava.io.tmpdir=tmp/ -jar ${PICARD} MergeSamFiles \
			INPUT=${aligned_bam_list.join(' INPUT=')} \
			OUTPUT=${merged_bam} \
			CREATE_INDEX=false \
			CREATE_MD5_FILE=false \
			SORT_ORDER=coordinate
	"""
}
@andreas-wilm
Copy link

andreas-wilm commented Apr 26, 2018

Hi Marco,

I think this is important (more so for WGS samples). We use an approach where an input yaml file is created, that consists of a samples dictionary, with so called readunits as its members. The only mandatory readunit key is fq1 (fastq R1). The others are fq2, run_id, lane_id, library_id, flowcell_id. This allows to also construct a read group based on these values.

The input channel is then set up as follows:

def GetReadPair = { sk, rk ->
    tuple(file(params.samples[sk].readunits[rk]['fq1']),
          file(params.samples[sk].readunits[rk]['fq2']))
}

def GetReadUnitKeys = { sk ->
    params.samples[sk].readunits.keySet()
}

Channel
    .from(sample_keys)
    .map { sk -> tuple(sk, GetReadUnitKeys(sk).collect{GetReadPair(sk, it)}.flatten()) }
    .set { fastq_ch }

There might be more elegant ways. Probably makes sense to discuss this on Gitter...

@maxulysse
Copy link
Member

In Sarek, we have one BWA-mem | samtools sort process
cf: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L166-L187
Then we're grouping samples by the read groups:
https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L199-L205
And then merging the BAMs:
https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L207-L222

@vsmalladi
Copy link

vsmalladi commented Apr 26, 2018

For read pairs we do this:

// Define channel for raw reads
if (pairedEnd) {
  rawReads = designFilePaths
    .splitCsv(sep: '\t', header: true)
    .map { row -> [ row.sample_id, [row.fastq_read1, row.fastq_read2], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
} else {
rawReads = designFilePaths
  .splitCsv(sep: '\t', header: true)
  .map { row -> [ row.sample_id, [row.fastq_read1], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
}

@andreas-wilm
Copy link

Thanks @vsmalladi and @maxulysse for the references!

@apeltzer apeltzer added this to the ExoSeq V1.0 "Black Fox" milestone Aug 15, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants