Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to make a TxDb object for the T2T-CHM13v2.0 genome (telomere to telomere Human genome) #1

Open
hpages opened this issue Mar 22, 2024 · 2 comments

Comments

@hpages
Copy link
Contributor

hpages commented Mar 22, 2024

[Moved from https://github.com/Bioconductor/GenomicFeatures/issues/65 on March 22, 2024]

Question: How to make a TxDb object for the T2T-CHM13v2.0 genome (telomere to telomere Human genome), a.k.a. the hs1 genome at UCSC.

Answer: Unfortunately, makeTxDbFromUCSC() doesn't support hs1 at the moment, so we're going to use the GFF file provided by NCBI.

  1. Download GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

  2. Import the GFF file as a GRanges object:

    library(rtracklayer)
    
    ## Takes < 1 min, consumes about 7Gb of RAM
    gff <- import("GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
    
  3. Note that the sequence names in the GRanges object are RefSeq accessions:

    seqlevels(gff)
    #  [1] "NC_060925.1" "NC_060926.1" "NC_060927.1" "NC_060928.1" "NC_060929.1"
    #  [6] "NC_060930.1" "NC_060931.1" "NC_060932.1" "NC_060933.1" "NC_060934.1"
    # [11] "NC_060935.1" "NC_060936.1" "NC_060937.1" "NC_060938.1" "NC_060939.1"
    # [16] "NC_060940.1" "NC_060941.1" "NC_060942.1" "NC_060943.1" "NC_060944.1"
    # [21] "NC_060945.1" "NC_060946.1" "NC_060947.1" "NC_060948.1"
    

    Let's change them to the official chromosome names:

    library(GenomeInfoDb)
    chrominfo <- getChromInfoFromNCBI("T2T-CHM13v2.0")
    seqlevels(gff) <- setNames(chrominfo$SequenceName, chrominfo$RefSeqAccn)
    seqlevels(gff)
    #  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
    # [16] "16" "17" "18" "19" "20" "21" "22" "X"  "Y"  "MT"
    
  4. Add the complete sequence info to the GRanges object:

    seqinfo(gff) <- Seqinfo(genome="T2T-CHM13v2.0")
    seqinfo(gff)
    # Seqinfo object with 25 sequences (1 circular) from T2T-CHM13v2.0 genome:
    #   seqnames seqlengths isCircular        genome
    #   1         248387328      FALSE T2T-CHM13v2.0
    #   2         242696752      FALSE T2T-CHM13v2.0
    #   3         201105948      FALSE T2T-CHM13v2.0
    #   4         193574945      FALSE T2T-CHM13v2.0
    #   5         182045439      FALSE T2T-CHM13v2.0
    #   ...             ...        ...           ...
    #   21         45090682      FALSE T2T-CHM13v2.0
    #   22         51324926      FALSE T2T-CHM13v2.0
    #   X         154259566      FALSE T2T-CHM13v2.0
    #   Y          62460029      FALSE T2T-CHM13v2.0
    #   MT            16569       TRUE T2T-CHM13v2.0
    
  5. Use makeTxDbFromGRanges() to make a TxDb object from the GRanges object:

    library(txdbmaker)
    
    ## This will emit 3 warnings that can be ignored.
    txdb <- makeTxDbFromGRanges(gff, taxonomyId=9606)
    
    txdb
    # TxDb object:
    ## Db type: TxDb
    ## Supporting package: GenomicFeatures
    ## Genome: T2T-CHM13v2.0
    ## Organism: Homo sapiens
    ## Taxonomy ID: 9606
    ## Nb of transcripts: 188205
    ## Db created by: txdbmaker package from Bioconductor
    ## Creation time: 2024-03-22 16:56:52 -0700 (Fri, 22 Mar 2024)
    ## txdbmaker version at creation time: 0.99.7
    ## RSQLite version at creation time: 2.3.5
    ## DBSCHEMAVERSION: 1.2
    

Note that if you need the UCSC chromosome names instead of the NCBI ones, you can switch them with seqlevelsStyle():

seqlevelsStyle(txdb)
# [1] "NCBI"

seqlevelsStyle(txdb) <- "UCSC"

seqlevelsStyle(txdb)
# [1] "UCSC"

seqlevels(txdb)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 

H.

@Claudiolete
Copy link

Hello, I am having problems with the getChromInfoFromNCBI function. When I try chrominfo <- getChromInfoFromNCBI("T2T-CHM13v2.0"), shows this: Error in download.file(url, destfile, method, quiet = TRUE):
it was not possible to open the following URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/'
Warning:
In download.file(url, destfile, method, quiet = TRUE) :
URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/': Timeout of 60 seconds was reached

Is this some trouble with my connection or with something else?

Thanks in advance

@hpages
Copy link
Contributor Author

hpages commented Sep 3, 2024

What's your sessionInfo()?

Is the following code working for you?

library(GenomeInfoDb)
list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/")

Please show the output you get.

If it doesn't work then yes it's some trouble with your connection. If you are behind an HTTP proxy (a common set up at many institutions), then it could also be a problem with the configuration of the proxy, in which case you would need to talk with your institution IT.

FWIW I just tried this again with BioC 3.19 (the latest Bioconductor release, requires R 4.4), and it works fine for me:

library(GenomeInfoDb)
list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/")
# [1] "GCA_009914755.1_T2T-CHM13v0.7" "GCA_009914755.2_T2T-CHM13v1.0"
# [3] "GCA_009914755.3_T2T-CHM13v1.1" "GCA_009914755.4_T2T-CHM13v2.0"

My sessionInfo():

R version 4.4.0 alpha (2024-04-03 r86327)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 23.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.4.r86327/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GenomeInfoDb_1.40.1 IRanges_2.38.1      S4Vectors_0.42.1   
[4] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
[1] httr_1.4.7              compiler_4.4.0          R6_2.5.1               
[4] tools_4.4.0             GenomeInfoDbData_1.2.12 UCSC.utils_1.0.0       
[7] jsonlite_1.8.8         

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants