feat!: genome feature support #3

Merged 115 commits on May 22, 2024
Changes from 108 commits
Commits
e62f976
feat: add gff3 gz uri fields to db + models
davidlougheed May 7, 2024
7424442
chore: add schema, models, db functions for genome features
davidlougheed May 7, 2024
abfe955
feat(routes): define gff3 / feature API endpoints
davidlougheed May 7, 2024
2c1e01f
chore: work on gff3 parsing code for ingestion
davidlougheed May 7, 2024
2e1457f
chore: stub out db calls for genome feature ingest
davidlougheed May 7, 2024
7a2a7f3
fix(schema): missing semicolon
davidlougheed May 7, 2024
687dfc7
fix: issues with db interface
davidlougheed May 7, 2024
3a136bb
lint
davidlougheed May 7, 2024
87695c5
chore: work on workflow definitions
davidlougheed May 7, 2024
c7ea14a
Merge remote-tracking branch 'origin/main' into feat/genome-features
davidlougheed May 7, 2024
388b439
chore: bump version to 0.2.0
davidlougheed May 7, 2024
d971fc9
feat: allow putting GFF3 files to ingest annotations
davidlougheed May 8, 2024
b30747d
feat(workflows): optional gff3 ingestion in fasta ingest workflow
davidlougheed May 8, 2024
813008c
fix: workflow issues
davidlougheed May 8, 2024
89f2dcb
chore(workflows): implement gff3_annot workflow
davidlougheed May 8, 2024
5a7c310
chore(db): implement bulk feature ingestion
davidlougheed May 8, 2024
3b0b9bd
chore: add default tmp folder
davidlougheed May 8, 2024
94efe08
chore: create ingest tmp dir if needed
davidlougheed May 8, 2024
ccb5563
fix: mismatch between DB and Pydantic model for strand
davidlougheed May 8, 2024
60a5515
chore: properly ignore contents of tmp folder
davidlougheed May 8, 2024
44ef1fa
fix: misc issues with GFF feature ingestion
davidlougheed May 8, 2024
3aabd0c
fix: missing enumerate() in feature ingest
davidlougheed May 8, 2024
77991c6
fix: clear genome features before (re)ingesting them
davidlougheed May 8, 2024
57be8f5
fix: issues with ingesting features to db + ingest feature types
davidlougheed May 8, 2024
8229bd9
chore(sql): don't create unneeded table, add v0.2 migration
davidlougheed May 8, 2024
e764d76
chore: don't log everything as debug
davidlougheed May 8, 2024
a63dcfe
fix: issues with feature ingest
davidlougheed May 8, 2024
203ca58
test: genome feature testing
davidlougheed May 8, 2024
e99b2bf
fix(workflows): issues with fasta_ref workflow
davidlougheed May 8, 2024
d9842bf
fix(workflows): GFF3 pattern for fasta_ref workflow
davidlougheed May 8, 2024
820b603
fix: undo debug thing
davidlougheed May 8, 2024
2296e30
fix(features): use contig-batches to keep parents with child features
davidlougheed May 9, 2024
1e76c7b
chore: more logging for feature ingest
davidlougheed May 9, 2024
b9344e3
fix(db): syntax for feature_id in-clause
davidlougheed May 9, 2024
dd4ef44
fix: logger config
davidlougheed May 9, 2024
6572837
chore: log time taken to ingest batch of features
davidlougheed May 9, 2024
c5bb409
chore(db): implement attribute fetch, fix offset/limit perf issue
davidlougheed May 9, 2024
2ae4ee3
perf: add some missing SQL indices
davidlougheed May 10, 2024
cc00733
fix: handle missing genome when ingesting annotations
davidlougheed May 10, 2024
d3af68e
refact!: rewrite schema for better feature storage/indexing
davidlougheed May 10, 2024
c432588
fix: issue with v0.2 migration SQL
davidlougheed May 10, 2024
edf6e9c
lint
davidlougheed May 10, 2024
938f6b5
perf(db): release connections earlier in db manager
davidlougheed May 10, 2024
e8e5bed
fix(workflows): fix jumping the gun ingesting fasta+gff3 at once
davidlougheed May 10, 2024
6d9ad33
refact!: rewrite feature ingest to use async task-based flow
davidlougheed May 10, 2024
dd6b998
chore: increase log interval for feature ingest
davidlougheed May 10, 2024
c5d6165
refact: cleanup genomes router exc fn used once
davidlougheed May 10, 2024
b416d5c
chore: update gff3 ingest workflow for new task system
davidlougheed May 11, 2024
122c40d
fix: bad task status literal in Task model
davidlougheed May 11, 2024
1c9c34a
fix(db): issues with genome feature select query
davidlougheed May 11, 2024
ea3009f
perf: add missing indices to aid in querying+deletion
davidlougheed May 13, 2024
75c4983
feat: allow deleting features via tested endpoint
davidlougheed May 13, 2024
ee19841
fix(schema): missing on delete cascade for tasks genome fk
davidlougheed May 13, 2024
140810a
chore: update lockfile
davidlougheed May 13, 2024
2089a4b
fix: genome alias handling
davidlougheed May 14, 2024
a4dc701
fix(db): contig alias deserialization
davidlougheed May 14, 2024
dac9a7c
chore(db): update queued/running tasks to error on db startup
davidlougheed May 14, 2024
cdc781c
chore(workflows): update fasta workflow gff3 ingest fn
davidlougheed May 14, 2024
16298c6
chore(workflows): don't output any artifacts
davidlougheed May 14, 2024
35a855e
chore(deps): add fasta-checksum-utils as a dev util
davidlougheed May 14, 2024
e06649c
test: add hg38 chr1:1-100000 test "genome"
davidlougheed May 14, 2024
380775a
fix(db): alias deserialization
davidlougheed May 14, 2024
2d50000
test: fix async db tests
davidlougheed May 14, 2024
f47c761
test: creation of covid + hg38 subset genomes
davidlougheed May 14, 2024
49c3a80
test: iter test for features for hg38 subset + sars cov 2
davidlougheed May 14, 2024
e5037ef
fix(features): not properly removing Parent attr on ingest
davidlougheed May 14, 2024
8339def
chore(features): move standard GFF Name attr to constant
davidlougheed May 14, 2024
0d4097a
test: fix issues with test gencode file
davidlougheed May 14, 2024
bd64c4f
chore(db): log feature types being ingested
davidlougheed May 14, 2024
bf80fc3
test: fix ssl errors in streaming response tests
davidlougheed May 14, 2024
2cc6044
chore(db): log better error message when feature has missing parent
davidlougheed May 14, 2024
07b8249
test: start testing querying of genome features
davidlougheed May 14, 2024
e0611ef
docs: update status of annotation service
davidlougheed May 14, 2024
3a7cedd
fix(streaming): deprecation notice for streaming tcp connector init
davidlougheed May 14, 2024
3ca266e
chore(db): consistent use of jsonb
davidlougheed May 14, 2024
ce0317c
fix(db): correctly deduplicate attribute keys/value lookups
davidlougheed May 14, 2024
e2122e6
chore(db): create GIN index on genome feature attr values
davidlougheed May 14, 2024
1932d28
test: use mock instead of real site for streaming tests
davidlougheed May 15, 2024
bb93de5
test: feature_types summary route
davidlougheed May 15, 2024
6032870
chore(features): don't keep Name attribute in feature attrs
davidlougheed May 15, 2024
5583bc8
fix(db): issues with fetching contigs by checksum
davidlougheed May 15, 2024
3190f30
refact: separate query and filter genome feature fns
davidlougheed May 15, 2024
bf2999a
test: more db tests
davidlougheed May 15, 2024
ba52f99
fix: genome feature endpoints - either query or filter for now
davidlougheed May 15, 2024
b2dd332
fix(features): fix bad feature name extraction in some cases
davidlougheed May 15, 2024
b2bf066
test: filter genome features based on name
davidlougheed May 15, 2024
2ca7bbb
lint
davidlougheed May 15, 2024
77cb8df
chore(features): add docstrings
davidlougheed May 15, 2024
df2de43
refact(features): factor out iter_features closure
davidlougheed May 15, 2024
12630a0
chore: update dependencies
davidlougheed May 15, 2024
863e9d7
lint(features): add comments
davidlougheed May 15, 2024
8f19204
chore(features): fall back to none right away w/ gene name as feature…
davidlougheed May 15, 2024
736e705
fix(db): typo handling feature type filter queries for features
davidlougheed May 15, 2024
5ae8d38
test: fix hg38 subset genome contig consistency
davidlougheed May 15, 2024
2f8a8d5
fix(features): bad handling of gff fetch contig + coord off-by-1
davidlougheed May 15, 2024
114a1ff
test: more tests for filtering features
davidlougheed May 15, 2024
33b630e
perf: add a maximum feature response length limit
davidlougheed May 16, 2024
0c7276e
refact: move running tasks to error on startup instead of db init
davidlougheed May 16, 2024
0d76715
fix: bad query for tasks querying
davidlougheed May 16, 2024
1121e4d
feat: add name_q argument for text searching feature names
davidlougheed May 16, 2024
7a18f65
fix(features): bad log for feature attribute values
davidlougheed May 16, 2024
864f267
test: task list route
davidlougheed May 16, 2024
3204c07
lint(db): rm unused function
davidlougheed May 16, 2024
01ae03f
feat: include time taken in features query response
davidlougheed May 17, 2024
b575d44
chore(features): shorter logs for fallback to feature name/missing ID
davidlougheed May 17, 2024
c772412
fix: lifespan function for moving tasks
davidlougheed May 17, 2024
1425399
test: tasks detail endpoint
davidlougheed May 17, 2024
b73f11a
feat: fuzzy search params for feature search args: q/name
davidlougheed May 17, 2024
fcbe4fa
refact(features): simplify feature generator a bit
davidlougheed May 21, 2024
b7b8f68
refact: rewrite feature ingestion to be a bit more RESTy
davidlougheed May 21, 2024
93699a5
chore(streaming): add additional debug log when starting to stream URL
davidlougheed May 22, 2024
24b341d
fix: typo in fasta ref WDL
davidlougheed May 22, 2024
de199e3
chore(config): increase default file response chunk size
davidlougheed May 22, 2024
3320bbf
fix(workflows): another typo in fasta_ref
davidlougheed May 22, 2024
77b9e63
fix: feature ingestion
davidlougheed May 22, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -15,7 +15,7 @@ Reference data (genomes & annotations) service for the Bento platform.
* Bento-style genome ingestion: **DONE**
* API endpoint permissions: **DONE**
* RefGet implementation: _Partially done_
-* Annotation service: **Not done**
+* Annotation service: _Partially done_
* Tests: _Partially done_
* Documentation: **Not done**

5 changes: 4 additions & 1 deletion bento_reference_service/config.py
@@ -20,11 +20,14 @@ class Config(BentoBaseConfig):
    service_url_base_path: str = "http://127.0.0.1:5000"  # Base path to construct URIs from

    database_uri: str = "postgres://localhost:5432"
    data_path: Path = Path(__file__).parent / "data"
    file_ingest_tmp_dir: Path = Path(__file__).parent.parent / "tmp"  # Default to repository `tmp` folder
    file_ingest_chunk_size: int = 1024 * 256  # 256 KiB at a time

    file_response_chunk_size: int = 1024 * 16  # 16 KiB at a time
    response_substring_limit: int = 10000  # TODO: Refine default

    feature_response_record_limit: int = 1000


@lru_cache()
def get_config():
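
To illustrate how settings like `file_response_chunk_size` and `feature_response_record_limit` are typically consumed, here is a minimal sketch. It is not code from this PR: `SketchConfig`, `get_sketch_config`, `iter_chunks`, and `clamp_limit` are hypothetical names, and the real service may apply these values differently. Only the two default values are taken from the diff above.

```python
from dataclasses import dataclass
from functools import lru_cache
from io import BytesIO
from typing import BinaryIO, Iterator


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical stand-in for the service's Config class;
    # the two defaults mirror the values shown in the diff.
    file_response_chunk_size: int = 1024 * 16  # 16 KiB at a time
    feature_response_record_limit: int = 1000


@lru_cache()
def get_sketch_config() -> SketchConfig:
    # lru_cache on a zero-argument function returns the same instance on every call.
    return SketchConfig()


def iter_chunks(fh: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes until EOF."""
    while chunk := fh.read(chunk_size):
        yield chunk


def clamp_limit(requested: int, max_records: int) -> int:
    """Clamp a client-requested record count to the range [0, max_records]."""
    return max(0, min(requested, max_records))


if __name__ == "__main__":
    config = get_sketch_config()
    payload = BytesIO(b"x" * 40_000)
    sizes = [len(c) for c in iter_chunks(payload, config.file_response_chunk_size)]
    print(sizes)
    print(clamp_limit(5_000, config.feature_response_record_limit))
```

Under these assumptions, a server-side cap on record counts keeps a single feature query from returning an unbounded response, while a fixed chunk size bounds memory use per streamed file response.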