Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What order should results be in? #62

Open
ivirshup opened this issue Apr 3, 2024 · 7 comments
Open

What order should results be in? #62

ivirshup opened this issue Apr 3, 2024 · 7 comments
Labels
API enhancement New feature or request P0‼️ High priority

Comments

@ivirshup
Copy link
Member

ivirshup commented Apr 3, 2024

Description of feature

Continuing from #59 (comment)

In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.

To solve this @thomas-reimonn added an order_by statement here

Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/ more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start.

We should also check what the bioc packages do here.

@nvictus, @emdann, @lauradmartens any thoughts?

@ivirshup ivirshup added enhancement New feature or request P0‼️ High priority API labels Apr 3, 2024
@ivirshup
Copy link
Member Author

ivirshup commented Apr 3, 2024

  • bedtools sort does chromosome then start
  • ensembldb
  • GenomicFeatures
    • Mentions order a bunch of different places in API docs, haven't full explored cases
    • Has parameters which control whether strand is considered when ordering exons

@emdann
Copy link
Member

emdann commented Apr 3, 2024

Definitely default to genomic location (chr+start), option to use another column?

@thomas-reimonn
Copy link
Contributor

Some of these tables don't have loci information. E.g. ibis.common.exceptions.IbisTypeError: Column 'chrom' is not found in table. Existing columns: 'gene_id', 'gene_name', 'gene_biotype', 'gene_seq_start', 'gene_seq_end', 'seq_name', 'seq_strand', 'seq_coord_system', 'description', 'gene_id_version', 'canonical_transcript'.

@ivirshup
Copy link
Member Author

ivirshup commented Apr 3, 2024

@thomas-reimonn Ah, I think chrom should be seq_name

@emdann, for something like exons, where "gene_id" or "transcript_id" has been selected, so we still sort by chromosome and start? Or should we sort by the gene/ transcript start?

And then, do we consider strand for sorting exons within a transcript?

@ivirshup
Copy link
Member Author

ivirshup commented Apr 3, 2024

It looks like tx2exon has exon_idx which I believe is the order of the exon inside the transcript

@lauradmartens
Copy link
Collaborator

I also agree by default it should be chrom + start. It seems like this is also what GenomicFeatures does and they have a separate function that returns ranges sorted by another value I think: e.g. transcriptsBy(txdb, "gene") and then the exons are sorted how they would appear in the transcript:
In the manual on transcriptsBy/exonsBy:

These functions return a GRangesList object where the ranges within each of the elements are
ordered according to the following rule:
When using exonsBy or cdsBy with by="tx", the returned exons or CDS parts are ordered by
ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the
transcriptsBy
ranges will be ordered by chromosome, strand, start, and end values.

@ivirshup
Copy link
Member Author

ivirshup commented Apr 3, 2024

I think the bioc packages go by chrom + start for the "primary" thing being queried (e.g. genes for .genes). But it looks like there is some strand specific behaviour in both if you grab exons within a gene query. Here's where I got to:

Demo

Setup

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb = EnsDb.Hsapiens.v86

library(GenomicFeatures)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                        package="GenomicFeatures")
txdb <- loadDb(samplefile)

ensembldb getting genes but also returning exons.

Only - strand:

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("-")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000227232        1 14404-29570      - |          29534        29570 ENSE00001890219 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          24738        24891 ENSE00003507205 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          18268        18366 ENSE00003477500 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17915        18061 ENSE00003565697 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17606        17742 ENSE00003475637 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17233        17368 ENSE00003502542 ENSG00000227232
  -------
  seqinfo: 308 sequences (1 circular) from GRCh38 genome

+ strand

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("+")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000223972        1 11869-14409      + |          11869        12227 ENSE00002234944 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12721 ENSE00003582793 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          13221        14409 ENSE00002312635 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12010        12057 ENSE00001948541 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12179        12227 ENSE00001671638 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12697 ENSE00001758273 ENSG00000223972
  -------
  seqinfo: 335 sequences (1 circular) from GRCh38 genome

GenomicFeatures

> head(genes(
    txdb,
    columns=c("exon_start", "exon_end", "exon_id"),
    filter=list(tx_strand=c("-"))
))

  1 gene was dropped because it has exons located on both strands of the same reference sequence or on more than one reference sequence, so cannot be represented by a single genomic
  range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 6 ranges and 3 metadata columns:
         seqnames              ranges strand |                        exon_start                          exon_end         exon_id
            <Rle>           <IRanges>  <Rle> |                     <IntegerList>                     <IntegerList>   <IntegerList>
       1    chr19   58858172-58874214      - |    58864770,58864658,58864294,...    58864865,58864693,58864563,... 461,460,459,...
   10186    chr13   39917029-40177356      - |    40177020,40174969,39952565,...    40177356,40175527,39952663,... 347,346,345,...
  131669     chr3 126200008-126236616      - | 126236437,126229507,126228428,... 126236616,126229637,126228521,... 170,169,168,...
  139716     chrX 153903527-153979858      - | 153979229,153944301,153941482,... 153979348,153944604,153941698,... 564,563,561,...
  201725     chr4 159587827-159593407      - |     159592768,159587827,159593149 159593202,159590920,159593407,... 180,179,181,...
  221150    chr13   21727734-21750741      - |    21750514,21746759,21746478,...    21750741,21746820,21746643,... 343,342,341,...
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API enhancement New feature or request P0‼️ High priority
Projects
None yet
Development

No branches or pull requests

4 participants