What order should results be in? #62

ivirshup · 2024-04-03T15:13:11Z

Description of feature

In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.

To solve this @thomas-reimonn added an order_by statement here

Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/ more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start.

We should also check what the bioc packages do here.

@nvictus, @emdann, @lauradmartens any thoughts?

The text was updated successfully, but these errors were encountered:

ivirshup · 2024-04-03T15:30:55Z

bedtools sort does chromosome then start
ensembldb
- Old release notes say "Chrom then start"
- Also offers order.by argument
GenomicFeatures
- Mentions order a bunch of different places in API docs, haven't full explored cases
- Has parameters which control whether strand is considered when ordering exons

emdann · 2024-04-03T15:40:33Z

Definitely default to genomic location (chr+start), option to use another column?

thomas-reimonn · 2024-04-03T15:49:21Z

Some of these tables don't have loci information. E.g. ibis.common.exceptions.IbisTypeError: Column 'chrom' is not found in table. Existing columns: 'gene_id', 'gene_name', 'gene_biotype', 'gene_seq_start', 'gene_seq_end', 'seq_name', 'seq_strand', 'seq_coord_system', 'description', 'gene_id_version', 'canonical_transcript'.

ivirshup · 2024-04-03T16:01:06Z

@thomas-reimonn Ah, I think chrom should be seq_name

@emdann, for something like exons, where "gene_id" or "transcript_id" has been selected, so we still sort by chromosome and start? Or should we sort by the gene/ transcript start?

And then, do we consider strand for sorting exons within a transcript?

ivirshup · 2024-04-03T16:33:50Z

It looks like tx2exon has exon_idx which I believe is the order of the exon inside the transcript

lauradmartens · 2024-04-03T17:07:41Z

I also agree by default it should be chrom + start. It seems like this is also what GenomicFeatures does and they have a separate function that returns ranges sorted by another value I think: e.g. transcriptsBy(txdb, "gene") and then the exons are sorted how they would appear in the transcript:
In the manual on transcriptsBy/exonsBy:

These functions return a GRangesList object where the ranges within each of the elements are
ordered according to the following rule:
When using exonsBy or cdsBy with by="tx", the returned exons or CDS parts are ordered by
ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the
transcriptsBy
ranges will be ordered by chromosome, strand, start, and end values.

ivirshup · 2024-04-03T18:37:54Z

I think the bioc packages go by chrom + start for the "primary" thing being queried (e.g. genes for .genes). But it looks like there is some strand specific behaviour in both if you grab exons within a gene query. Here's where I got to:

Demo

Setup

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb = EnsDb.Hsapiens.v86

library(GenomicFeatures)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                        package="GenomicFeatures")
txdb <- loadDb(samplefile)

`ensembldb` getting genes but also returning exons.

Only `-` strand:

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("-")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000227232        1 14404-29570      - |          29534        29570 ENSE00001890219 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          24738        24891 ENSE00003507205 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          18268        18366 ENSE00003477500 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17915        18061 ENSE00003565697 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17606        17742 ENSE00003475637 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17233        17368 ENSE00003502542 ENSG00000227232
  -------
  seqinfo: 308 sequences (1 circular) from GRCh38 genome

`+ strand`

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("+")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000223972        1 11869-14409      + |          11869        12227 ENSE00002234944 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12721 ENSE00003582793 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          13221        14409 ENSE00002312635 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12010        12057 ENSE00001948541 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12179        12227 ENSE00001671638 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12697 ENSE00001758273 ENSG00000223972
  -------
  seqinfo: 335 sequences (1 circular) from GRCh38 genome

`GenomicFeatures`

> head(genes(
    txdb,
    columns=c("exon_start", "exon_end", "exon_id"),
    filter=list(tx_strand=c("-"))
))

  1 gene was dropped because it has exons located on both strands of the same reference sequence or on more than one reference sequence, so cannot be represented by a single genomic
  range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 6 ranges and 3 metadata columns:
         seqnames              ranges strand |                        exon_start                          exon_end         exon_id
            <Rle>           <IRanges>  <Rle> |                     <IntegerList>                     <IntegerList>   <IntegerList>
       1    chr19   58858172-58874214      - |    58864770,58864658,58864294,...    58864865,58864693,58864563,... 461,460,459,...
   10186    chr13   39917029-40177356      - |    40177020,40174969,39952565,...    40177356,40175527,39952663,... 347,346,345,...
  131669     chr3 126200008-126236616      - | 126236437,126229507,126228428,... 126236616,126229637,126228521,... 170,169,168,...
  139716     chrX 153903527-153979858      - | 153979229,153944301,153941482,... 153979348,153944604,153941698,... 564,563,561,...
  201725     chr4 159587827-159593407      - |     159592768,159587827,159593149 159593202,159590920,159593407,... 180,179,181,...
  221150    chr13   21727734-21750741      - |    21750514,21746759,21746478,...    21750741,21746820,21746643,... 343,342,341,...
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

ivirshup added enhancement New feature or request P0‼️ High priority API labels Apr 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What order should results be in? #62

What order should results be in? #62

ivirshup commented Apr 3, 2024

ivirshup commented Apr 3, 2024 •

edited

Loading

emdann commented Apr 3, 2024

thomas-reimonn commented Apr 3, 2024

ivirshup commented Apr 3, 2024

ivirshup commented Apr 3, 2024

lauradmartens commented Apr 3, 2024

ivirshup commented Apr 3, 2024

Setup

`ensembldb` getting genes but also returning exons.

Only `-` strand:

`+ strand`

`GenomicFeatures`

What order should results be in? #62

What order should results be in? #62

Comments

ivirshup commented Apr 3, 2024

Description of feature

ivirshup commented Apr 3, 2024 • edited Loading

emdann commented Apr 3, 2024

thomas-reimonn commented Apr 3, 2024

ivirshup commented Apr 3, 2024

ivirshup commented Apr 3, 2024

lauradmartens commented Apr 3, 2024

ivirshup commented Apr 3, 2024

Setup

ensembldb getting genes but also returning exons.

Only - strand:

+ strand

GenomicFeatures

ivirshup commented Apr 3, 2024 •

edited

Loading

`ensembldb` getting genes but also returning exons.

Only `-` strand:

`+ strand`

`GenomicFeatures`