Enable multi-lane support #20

marchoeppner · 2018-04-26T10:29:35Z

Some sequencing setups will split libraries across lanes. This is currently not modeled in the pipeline.

Using a CSV to keep track of IndividualID and sampleID, we could do something along these lines:

runBWAOutput_grouped_by_sample = runBWAOutput.groupTuple(by: [0,1])

process mergeBamFiles_bySample {

        tag "${indivID}|${sampleID}"
	
	input:
        set indivID, sampleID, file(aligned_bam_list) from runBWAOutput_grouped_by_sample

	output:
	set indivID,sampleID,file(merged_bam) into mergedBamFile_by_Sample

	script:
	merged_bam = sampleID + "merged.bam"

	"""
		java -jar -Djava.io.tmpdir=tmp/ -jar ${PICARD} MergeSamFiles \
			INPUT=${aligned_bam_list.join(' INPUT=')} \
			OUTPUT=${merged_bam} \
			CREATE_INDEX=false \
			CREATE_MD5_FILE=false \
			SORT_ORDER=coordinate
	"""
}

andreas-wilm · 2018-04-26T11:53:19Z

Hi Marco,

I think this is important (more so for WGS samples). We use an approach where an input yaml file is created, that consists of a samples dictionary, with so called readunits as its members. The only mandatory readunit key is fq1 (fastq R1). The others are fq2, run_id, lane_id, library_id, flowcell_id. This allows to also construct a read group based on these values.

The input channel is then set up as follows:

def GetReadPair = { sk, rk ->
    tuple(file(params.samples[sk].readunits[rk]['fq1']),
          file(params.samples[sk].readunits[rk]['fq2']))
}

def GetReadUnitKeys = { sk ->
    params.samples[sk].readunits.keySet()
}

Channel
    .from(sample_keys)
    .map { sk -> tuple(sk, GetReadUnitKeys(sk).collect{GetReadPair(sk, it)}.flatten()) }
    .set { fastq_ch }

There might be more elegant ways. Probably makes sense to discuss this on Gitter...

maxulysse · 2018-04-26T11:59:21Z

In Sarek, we have one BWA-mem | samtools sort process
cf: https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L166-L187
Then we're grouping samples by the read groups:
https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L199-L205
And then merging the BAMs:
https://github.com/SciLifeLab/Sarek/blob/master/main.nf#L207-L222

vsmalladi · 2018-04-26T12:11:36Z

For read pairs we do this:

// Define channel for raw reads
if (pairedEnd) {
  rawReads = designFilePaths
    .splitCsv(sep: '\t', header: true)
    .map { row -> [ row.sample_id, [row.fastq_read1, row.fastq_read2], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
} else {
rawReads = designFilePaths
  .splitCsv(sep: '\t', header: true)
  .map { row -> [ row.sample_id, [row.fastq_read1], row.experiment_id, row.biosample, row.factor, row.treatment, row.replicate, row.control_id ] }
}

andreas-wilm · 2018-04-27T01:12:28Z

Thanks @vsmalladi and @maxulysse for the references!

apeltzer added this to the ExoSeq V1.0 "Black Fox" milestone Aug 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable multi-lane support #20

Enable multi-lane support #20

marchoeppner commented Apr 26, 2018 •

edited by ewels

Loading

andreas-wilm commented Apr 26, 2018 •

edited by ewels

Loading

maxulysse commented Apr 26, 2018

vsmalladi commented Apr 26, 2018 •

edited by ewels

Loading

andreas-wilm commented Apr 27, 2018

Enable multi-lane support #20

Enable multi-lane support #20

Comments

marchoeppner commented Apr 26, 2018 • edited by ewels Loading

andreas-wilm commented Apr 26, 2018 • edited by ewels Loading

maxulysse commented Apr 26, 2018

vsmalladi commented Apr 26, 2018 • edited by ewels Loading

andreas-wilm commented Apr 27, 2018

marchoeppner commented Apr 26, 2018 •

edited by ewels

Loading

andreas-wilm commented Apr 26, 2018 •

edited by ewels

Loading

vsmalladi commented Apr 26, 2018 •

edited by ewels

Loading