---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
<!-- rmarkdown::render("README.Rmd") -->
<!-- Rscript -e "library(knitr); knit('README.Rmd')" -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
# Locally query GenBank <img src="https://raw.githubusercontent.com/ropensci/restez/master/logo.png" height="200" align="right"/>
[![R-CMD-check](https://github.com/ropensci/restez/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/restez/actions)
[![Coverage Status](https://coveralls.io/repos/github/ropensci/restez/badge.svg?branch=master)](https://coveralls.io/github/ropensci/restez?branch=master)
[![ROpenSci status](https://badges.ropensci.org/232_status.svg)](https://github.com/ropensci/software-review/issues/232)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/restez)](https://CRAN.R-project.org/package=restez)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6806029.svg)](https://doi.org/10.5281/zenodo.6806029)
[![status](https://joss.theoj.org/papers/10.21105/joss.01102/status.svg)](https://joss.theoj.org/papers/10.21105/joss.01102)
> NOTE: Starting with v2.0.0, the database backend changed from [MonetDBLite](https://github.com/MonetDB/MonetDBLite-R) to [duckdb](https://github.com/duckdb/duckdb). Because of this change, restez v2.0.0 or higher **is not compatible with databases built with previous versions of restez**.
Download parts of [NCBI's GenBank](https://www.ncbi.nlm.nih.gov/nuccore) to a local folder and create a simple SQL-like database. Use 'get' tools to query the database by accession IDs. [rentrez](https://github.com/ropensci/rentrez) wrappers are available, so that if sequences are not available locally they can be searched for online through [Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25500/).
See the [detailed tutorials](https://docs.ropensci.org/restez/articles/restez.html) for more information.
## Introduction
*Vous entrez, vous rentrez et, maintenant, vous .... restez!*
Downloading sequences and sequence information from GenBank and related NCBI taxonomic databases is often performed via the NCBI API, Entrez. Entrez, however, limits the number of requests a user can make, and downloading large amounts of sequence data this way can be inefficient. In programmatic workflows that make many Entrez calls, downloading may take days, weeks or even months.
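As a rough illustration of why request limits matter, the sketch below estimates a lower bound on the time needed for a batch of Entrez calls. The helper `entrez_min_hours()` is hypothetical and not part of restez. The default rates are assumptions based on NCBI's published E-utilities guidance (about 3 requests per second without an API key, about 10 with one), and real runs will be slower because each request also takes time to complete.

```r
# Hypothetical helper (not part of restez): lower bound on wall-clock hours
# for a batch of Entrez requests sent back to back at a fixed rate.
# The default of 3 requests per second is an assumption (no API key).
entrez_min_hours <- function(n_requests, requests_per_second = 3) {
  stopifnot(n_requests >= 0, requests_per_second > 0)
  n_requests / requests_per_second / 3600
}

# One million single-record requests, without and with an API key
hours_no_key <- entrez_min_hours(1e6)
hours_with_key <- entrez_min_hours(1e6, requests_per_second = 10)
print(round(c(no_key = hours_no_key, with_key = hours_with_key), 1))
```

Even under these optimistic assumptions, a million requests take more than a day, which is in line with the time scales mentioned above.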
This package aims to make sequence retrieval more efficient by allowing a user to download large sections of the GenBank database to their local machine and query this local database, either through package-specific functions or through Entrez wrappers. This process is more efficient because GenBank downloads are made via NCBI's FTP using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 20 GB of sequence information can be generated in less than 10 minutes.
<img src="https://raw.githubusercontent.com/ropensci/restez/master/paper/outline.png" height="500" align="center"/>
## Installation
Install from CRAN:
```{r cran-installation, include=TRUE, echo=TRUE, eval=FALSE}
install.packages("restez")
```
Or install the development version from r-universe:
```{r r-univ-installation, include=TRUE, echo=TRUE, eval=FALSE}
install.packages("restez", repos = "https://ropensci.r-universe.dev")
```
Or install the development version from GitHub (requires installing the `remotes` package first):
```{r gh-installation, include=TRUE, echo=TRUE, eval=FALSE}
# install.packages("remotes")
remotes::install_github("ropensci/restez")
```
## Quick Examples
> For more detailed information on the package's functions and detailed guides on downloading, constructing and querying a database, see the [detailed tutorials](https://docs.ropensci.org/restez/articles/restez.html).
### Setup
```{r presetup, include=FALSE}
rstz_pth <- tempdir()
restez::restez_path_set(filepath = rstz_pth)
restez::db_delete(everything = TRUE)
```
```{r restez-setup, echo=TRUE, eval=TRUE, results='hide'}
# Warning: running these examples may take a few minutes
library(restez)
# choose a location to store GenBank files
restez_path_set(rstz_pth)
```
```{r reassigndownload, include=FALSE, eval=TRUE}
db_download <- function() {
  restez::db_download(preselection = '20')
}
```
```{r gb-download, echo=TRUE, eval=TRUE, results='hide'}
# Run the download function
db_download()
# after download, create the local database
db_create()
```
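When called without arguments, `db_download()` asks which GenBank section to download. For scripted use, the selection can be passed up front with the `preselection` argument, as the hidden chunk in this README does. The wrapper below is a minimal sketch of that idea: `build_local_db()` is a hypothetical name, and the default `"20"` simply mirrors the value used in this README. Which GenBank section a given number refers to should be checked against the menu printed by `db_download()`.

```r
# Hypothetical wrapper (not part of restez): set the path, download a
# preselected GenBank section and build the database in one call.
build_local_db <- function(path, selection = "20") {
  if (!requireNamespace("restez", quietly = TRUE)) {
    stop("The 'restez' package is required for this sketch.")
  }
  # Folder where the GenBank files and the database will be stored
  restez::restez_path_set(filepath = path)
  # Skip the interactive menu by passing the selection directly
  restez::db_download(preselection = selection)
  # Build the local database from the downloaded files
  restez::db_create()
  invisible(path)
}

# Usage (not run here, since it downloads data):
# build_local_db(tempdir())
```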
### Query
```{r query, echo=TRUE, eval=TRUE}
# for reproducibility
set.seed(12345)
# get a random accession ID from the database
id <- sample(list_db_ids(), 1)
# you can extract:
# sequences
seq <- gb_sequence_get(id)[[1]]
str(seq)
# definitions
def <- gb_definition_get(id)[[1]]
print(def)
# organisms
org <- gb_organism_get(id)[[1]]
print(org)
# or whole records
rec <- gb_record_get(id)[[1]]
cat(rec)
```
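Retrieved sequences are often written out for use in other tools. The sketch below formats sequences as FASTA text using base R only. It assumes the sequences are held in a character vector named by accession ID; `to_fasta()` and the example sequences are hypothetical and not part of restez.

```r
# Hypothetical helper (not part of restez): format a named character
# vector of sequences as FASTA text, wrapping lines at `width` characters.
to_fasta <- function(seqs, width = 70L) {
  stopifnot(is.character(seqs), !is.null(names(seqs)), width > 0)
  records <- vapply(seq_along(seqs), function(i) {
    s <- seqs[[i]]
    # Start position of each line of the wrapped sequence
    starts <- seq(1L, max(nchar(s), 1L), by = width)
    lines <- substring(s, starts, starts + width - 1L)
    paste(c(paste0(">", names(seqs)[i]), lines), collapse = "\n")
  }, character(1))
  paste0(paste(records, collapse = "\n"), "\n")
}

# Made-up sequences standing in for sequences retrieved from the database
example_seqs <- c(ID_A = "ATGCATGCAT", ID_B = "GGCC")
fasta_text <- to_fasta(example_seqs, width = 4L)
cat(fasta_text)
```

The resulting text can be passed to `writeLines()` or `cat(..., file = ...)` to save it to disk.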
### Entrez wrappers
```{r entrez, echo=TRUE, eval=TRUE}
# use the entrez_* wrappers to access GB data
res <- entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
# if the id is not in the local database
# these wrappers will search online via the rentrez package
res <- entrez_fetch(db = 'nucleotide', id = c('S71333.1', id),
                    rettype = 'fasta')
cat(res)
```
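The "local first, online second" behaviour described above can be pictured with a small base R sketch. This is an illustration of the general pattern only, not restez's actual implementation: `fetch_with_fallback()`, `local_lookup()`, `remote_lookup()` and the toy data are all hypothetical stand-ins.

```r
# Hypothetical sketch (not restez's implementation): look IDs up locally
# and fall back to a remote source only for the IDs that were not found.
fetch_with_fallback <- function(ids, local_lookup, remote_lookup) {
  local_hits <- local_lookup(ids)  # named vector; IDs not found are absent
  missing_ids <- setdiff(ids, names(local_hits))
  remote_hits <- if (length(missing_ids) > 0) {
    remote_lookup(missing_ids)
  } else {
    character(0)
  }
  combined <- c(local_hits, remote_hits)
  # Return results in the order the IDs were requested
  combined[intersect(ids, names(combined))]
}

# Toy stand-ins for a local database and an online service
local_store <- c(AB000001 = "ATGC", AB000002 = "GGCC")
local_lookup <- function(ids) local_store[intersect(ids, names(local_store))]
remote_lookup <- function(ids) setNames(rep("NNNN", length(ids)), ids)

res_sketch <- fetch_with_fallback(c("S71333.1", "AB000001"),
                                  local_lookup, remote_lookup)
print(res_sketch)
```

Only the IDs missing from the local store reach `remote_lookup()`, which is what keeps the number of online requests small.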
## Contributing
Want to contribute? Check the [contributing page](https://docs.ropensci.org/restez/CONTRIBUTING.html).
## Licence
MIT
## Citation
Bennett et al. (2018). restez: Create and Query a Local Copy of GenBank in R.
*Journal of Open Source Software*, 3(31), 1102. https://doi.org/10.21105/joss.01102
## References
Benson, D. A., Karsch-Mizrachi, I., Clark, K., Lipman, D. J., Ostell, J., &
Sayers, E. W. (2012). GenBank. *Nucleic Acids Research*, 40(Database issue),
D48–D53. https://doi.org/10.1093/nar/gkr1202
Winter, D. J. (2017). rentrez: An R package for the NCBI eUtils API.
*PeerJ Preprints*, 5, e3179v2. https://doi.org/10.7287/peerj.preprints.3179v2
## Maintainer
[Joel Nitta](https://github.com/joelnitta)
This package was previously developed and maintained by Dom Bennett.
-----
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)