`Sequence.gc` methods that consider IUPAC nucleotide ambiguity #128

mdshw5 · 2017-11-13T19:34:18Z

The existing Sequence.gc method purposefully ignores characters other than G/C and uses the sequence length as a denominator to produce "fraction g/c". This has a few benefits:

- we can ignore IUPAC ambiguous DNA codes
- `len(sequence)` is faster to compute than counting occurrences of additional characters

The downside is that any non-GCAT characters (for example `N`) are included in the denominator, which lowers the reported GC fraction:

pyfaidx/pyfaidx/__init__.py

Lines 254 to 266 in 7b4d8d7

    
    @property
    def gc(self):
        """ Return the GC content of seq as a float
        >>> x = Sequence(name='chr1', seq='ATCGTA')
        >>> y = round(x.gc, 2)
        >>> y == 0.33
        True
        """
        g = self.seq.count('G')
        g += self.seq.count('g')
        c = self.seq.count('C')
        c += self.seq.count('c')
        return (g + c) / len(self.seq)
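
To make the denominator effect concrete, here is a minimal standalone sketch that mirrors the logic quoted above. The function name `gc_fraction_by_length` is hypothetical and not part of pyfaidx, and the sketch assumes Python 3 division.

```python
def gc_fraction_by_length(seq):
    """Count G/C in either case and divide by the full sequence length."""
    gc = sum(seq.count(base) for base in "GCgc")
    return gc / len(seq)


# Same four G/C bases in both inputs; only the number of N characters differs.
print(gc_fraction_by_length("GCGCATAT"))      # 4 G/C over 8 characters
print(gc_fraction_by_length("GCGCATATNNNN"))  # 4 G/C over 12 characters
```

Padding the same sequence with `N` lowers the reported fraction even though the known bases are unchanged.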

I'd welcome any pull request to implement something like:

- `Sequence.gc_iupac` method that counts ambiguity codes, e.g. S (G or C) and W (A or T), and also considers partially informative codes such as K (G or T). This is considerably more difficult than the current method and requires validating that the sequence contains only valid IUPAC letters
- `Sequence.gc_strict` method that counts G/C in the numerator and G/C/A/T in the denominator, implicitly ignoring all other characters. This is probably closest to what people expect as GC content
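
As a rough illustration of the two proposals, here is a sketch written as standalone functions rather than `Sequence` methods. It is not taken from any pull request referenced in this thread. The weighting in `IUPAC_GC_WEIGHT` is an assumption: each ambiguity code contributes the fraction of its allowed bases that are G or C, as if those bases were equally likely. Raising `ValueError` on empty or invalid input is likewise just one possible choice.

```python
from collections import Counter

# Assumed weights: share of G/C among the bases each IUPAC code allows.
IUPAC_GC_WEIGHT = {
    "A": 0.0, "T": 0.0, "U": 0.0, "W": 0.0,                   # no G/C possible
    "G": 1.0, "C": 1.0, "S": 1.0,                             # always G/C
    "R": 0.5, "Y": 0.5, "K": 0.5, "M": 0.5, "N": 0.5,         # half of the options are G/C
    "B": 2 / 3, "V": 2 / 3,                                   # C/G/T and A/C/G
    "D": 1 / 3, "H": 1 / 3,                                   # A/G/T and A/C/T
}


def gc_strict(seq):
    """G/C over G/C/A/T, ignoring every other character."""
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


def gc_iupac(seq):
    """Weighted G/C fraction over all characters, after validating the alphabet."""
    counts = Counter(seq.upper())
    invalid = set(counts) - set(IUPAC_GC_WEIGHT)
    if invalid:
        raise ValueError(f"non-IUPAC characters: {sorted(invalid)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty sequence")
    weighted = sum(IUPAC_GC_WEIGHT[base] * n for base, n in counts.items())
    return weighted / total


print(gc_strict("GCGCATATNNNN"))  # N is excluded from the denominator
print(gc_iupac("GCSSWWAT"))       # S counts as G/C, W as A/T
```

The validation step in `gc_iupac` is where most of the extra cost comes from: every distinct character has to be checked against the table, whereas `gc_strict` only looks up four counts.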


mdshw5 self-assigned this Nov 13, 2017

Bardia-Masudy mentioned this issue Jan 23, 2023

'Sequence.gc' Method for IUPAC bases #205

Merged

Bardia-Masudy referenced this issue Jan 23, 2023

implemented change into __init__ Sequence class

f773018
