-
Notifications
You must be signed in to change notification settings - Fork 10
Molecule iteration
The SingleCellMultiOmics API heavily relies on the Molecule object.
A Molecule is a entity which usually represents a single molecule in an experiment. The Molecule is a class which refers to the original molecular template (DNA/RNA). Usually the original template molecule is amplified and sequenced, resulting in one or more sequenced fragments which are evidence for the existence and properties of the Molecule.
A Fragment holds one or more reads which where sequenced as part of a single fragment. This is usually read 1 and read 2 of a Illumina read. The reads are stored as PySAM objects.
How fragments are associated to molecules is protocol specific. A variety of classes encode ways of associating fragments to molecules. Usually a combination of the unique molecular identifier (UMI) and mapping coordinates of the fragments are used to associate fragments to a molecule.
The example below iterates over all NLAIII molecules in my_bam.bam
and writes valid molecules to output.bam
import pysam
import singlecellmultiomics.molecule
import singlecellmultiomics.fragment
import pysamiterators
alignments = pysam.AlignmentFile('my_bam.bam')
reference = pysamiterators.iterators.CachedFasta( pysam.FastaFile('my_reference.fasta' ) )
with pysam.AlignmentFile('output.bam' , "wb",header=alignments.header) as output:
for i,molecule in enumerate(
singlecellmultiomics.molecule.MoleculeIterator(
alignments=alignments,
moleculeClass=singlecellmultiomics.molecule.NlaIIIMolecule,
fragmentClass=singlecellmultiomics.fragment.NLAIIIFragment,
molecule_class_args={
'reference':reference,
'site_has_to_be_mapped':True
}
)):
molecule.write_pysam(output)
To obtain the allele for your molecules you will need a VCF file which contains variants which can be used to distinguish the alleles. Make sure to use a BAM file which is generated by mapping to a variant masked reference, which you can generate using fastaMaskVariants.py
.
Then we can prepare the AlleleResolver which will try to assign an allele to a molecule.
If the VCF file contains more than two samples use the select_samples
argument to select to which alleles the molecule should be assigned.
allele_resolver = singlecellmultiomics.alleleTools.AlleleResolver(
'my_variants.vcf',
select_samples=['AlleleA','AlleleB'],
lazyLoad=True)
Then we can set up the molecule iterator as usual, and use the molecule.get_allele(allele_resolver)
to obtain the allele.