Issue 99 #102

HobnobMancer · 2022-11-29T13:04:48Z

Fix Issue #99

Improve handling of errors incurred when retrieving data from NCBI

Can use cached seqs in a JSON and/or FASTA file, and can use a combination of cache and db seqs

HobnobMancer · 2022-11-29T13:06:24Z

The initial problem stated in issue #99 is fixed.

However, GenBank accessions that are no longer listed in NCBI are still retried as many times as defined by args.retries (default 10). This unnecessarily increases running time and demand on the NCBI server.
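As an illustration of the behaviour still to be implemented, here is a minimal sketch of a retry loop that gives up immediately on an accession NCBI no longer lists and retries only failures that might be transient. It is not the code in this PR: `InvalidAccessionError`, `fetch_with_retries` and the `fetch` callable are hypothetical names, and the set of exceptions treated as transient is an assumption.

```python
from http.client import IncompleteRead


class InvalidAccessionError(Exception):
    """Hypothetical error: NCBI no longer lists this accession."""


def fetch_with_retries(fetch, accession, retries=10):
    """Call fetch(accession), retrying only failures that might be transient.

    Returns (record, status), where status is "ok", "invalid" or "failed".
    """
    for _attempt in range(retries):
        try:
            return fetch(accession), "ok"
        except InvalidAccessionError:
            # Retrying cannot help: the accession is not in NCBI.
            return None, "invalid"
        except (IncompleteRead, RuntimeError, OSError):
            # Possibly transient (dropped connection, partial read): try again.
            continue
    # Every attempt failed with a possibly transient error.
    return None, "failed"
```

With this shape, an invalid accession costs one request instead of `retries` requests, which is the saving in running time and server load described above.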

HobnobMancer · 2022-11-29T13:06:43Z

Unit test coverage could be increased to >= 60%

Catch runtime errors, incomplete reads, NotXMLError, TypeError and AttributeError. Improve detection of invalid IDs. Working on not retrying the connection for invalid IDs. Invalid IDs are kept separate from IDs whose query to NCBI was interrupted by a connection failure.

codecov · 2023-01-11T11:09:58Z

Codecov Report

Merging #102 (b3b9dc4) into master (47af82a) will decrease coverage by 2.05%.
The diff coverage is 3.69%.

@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
- Coverage   58.62%   56.57%   -2.06%     
==========================================
  Files          60       61       +1     
  Lines        5337     5541     +204     
==========================================
+ Hits         3129     3135       +6     
- Misses       2208     2406     +198

HobnobMancer · 2023-01-11T15:52:47Z

Changes implemented

The change in operation has been successful:

Invalid IDs are separated from IDs that suffered failed connections
Batches containing invalid IDs are parsed separately from, and before, failed-connection batches
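
The separation described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code merged here: `InvalidIdError`, `triage_batches`, `isolate_invalid_ids` and the `query` callable are hypothetical names, and the sketch does not handle a connection failure that occurs while a batch is being split into single IDs.

```python
class InvalidIdError(Exception):
    """Hypothetical error: the batch contains an ID that NCBI rejects."""


def triage_batches(batches, query):
    """Query each batch once and sort the outcomes into three groups."""
    successful = set()        # accessions retrieved without problems
    invalid_batches = []      # batches holding at least one invalid ID
    failed_connections = {}   # joined-ID key -> batch, kept for a later retry
    for batch in batches:
        try:
            successful.update(query(batch))
        except InvalidIdError:
            invalid_batches.append(batch)
        except ConnectionError:
            # Lists are unhashable, so the joined IDs serve as the dict key.
            failed_connections[",".join(batch)] = batch
    return successful, invalid_batches, failed_connections


def isolate_invalid_ids(batch, query):
    """Query the IDs of an invalid batch one by one to find the bad ones."""
    valid, invalid = set(), set()
    for accession in batch:
        try:
            valid.update(query([accession]))
        except InvalidIdError:
            invalid.add(accession)
    return valid, invalid
```

Keeping the two failure groups apart means only `failed_connections` needs to go back through the retry logic, while invalid IDs can be logged and dropped.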

Downloaded protein sequences are cached to a FASTA file.
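
For readers unfamiliar with this kind of cache, a minimal standard-library sketch of appending records to a FASTA file as they are retrieved, and reading them back, might look like this. The function names, the 60-character line width and the choice to let a later record replace an earlier one with the same accession are assumptions for illustration; the project's own implementation may differ.

```python
from pathlib import Path


def append_to_fasta_cache(cache_path, accession, sequence, width=60):
    """Append one record to the FASTA cache as soon as it is retrieved."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    with open(cache_path, "a") as handle:
        handle.write(f">{accession}\n" + "\n".join(lines) + "\n")


def load_fasta_cache(cache_path):
    """Read the cache into {accession: sequence}.

    A later record with the same accession replaces the earlier one.
    """
    sequences, current = {}, None
    if not Path(cache_path).exists():
        return sequences
    with open(cache_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                header = line[1:].split()
                current = header[0] if header else None
                if current is not None:
                    sequences[current] = ""
            elif current is not None:
                sequences[current] += line
    return sequences
```

Appending on retrieval, rather than writing once at the end, means an interrupted run still leaves the sequences fetched so far on disk.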

Updated information in the docs on caching.

Future developments notes

[1]
Unit tests raise a deprecated-feature warning:

tests/conftest.py:40
  /home/circleci/cazy_webscraper/tests/conftest.py:40: MovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
    Base = declarative_base()

cazy_webscraper currently requires sqlalchemy >= 1.4.20. The requirement should be pinned with == instead, and cazy_webscraper should be migrated to sqlalchemy 2.x on a new branch (see issue #106).

[2]
Add the protein description from NCBI (retrieved when downloading the sequence record from NCBI) to the local CAZyme database.

Add the protein description to a new column called genbank_description in the Genbanks table.
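
A rough sketch of what this schema change could look like, assuming the local database is SQLite and using the standard-library `sqlite3` module directly rather than the project's ORM models. Only the `Genbanks` table and `genbank_description` column names come from the note above; the other columns and the example values are placeholders.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Assumed minimal stand-in for the existing Genbanks table.
connection.execute(
    "CREATE TABLE Genbanks ("
    "genbank_id INTEGER PRIMARY KEY, genbank_accession TEXT UNIQUE)"
)
connection.execute("INSERT INTO Genbanks (genbank_accession) VALUES ('ABC123.1')")

# Add the new column; existing rows hold NULL until a description is stored.
connection.execute("ALTER TABLE Genbanks ADD COLUMN genbank_description TEXT")

# Store the description retrieved alongside the sequence record.
connection.execute(
    "UPDATE Genbanks SET genbank_description = ? WHERE genbank_accession = ?",
    ("hypothetical protein", "ABC123.1"),
)
row = connection.execute(
    "SELECT genbank_accession, genbank_description FROM Genbanks"
).fetchone()
```

In the project itself the column would more likely be added to the ORM model for the Genbanks table, with existing databases updated separately.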

HobnobMancer added 17 commits November 15, 2022 11:14

term program if closing early

94ee1be

add opt to use seqs from FASTA file

c258475

can use cached seqs in JSON and/or FASTA file, and can use a combination of cache and db seqs

shorten line lengths

53d2506

update docs with new CLI args

d53203c

add tutorial for using genbank seq cache

ec6cb54

cache seqs as retrieved. refactor code

8c41446

compress the cache directory once done

cf3e395

update version num

c504074

update installation instructions

3f84d95

update closing message unit tests

d01a69f

correct add to append for list

b396a49

add logging messages

b7c2f41

IncompleteRead error capture when posting ids

6512e0e

ignore blank lines in the acc file

297e1c3

use re to retrieve ncbi accessions

908dcce

add unit tests and remove unused imports

8e0bbd1

update unit tests

7546238

HobnobMancer added the bug, documentation and unit tests labels Nov 29, 2022

HobnobMancer added this to the cazy_webscraper v2 stable release milestone Nov 29, 2022

HobnobMancer self-assigned this Nov 29, 2022

HobnobMancer added 5 commits January 11, 2023 10:18

fix merge conflicts

8f28199

update version number

91422ea

refactor funcs

36c482b

define missing logger

9faaae0

update parser

c503b5a

change skip_download to bool not path

HobnobMancer added 19 commits January 11, 2023 10:36

update unit tests

82e9baf

cache uniprot accessions that could not be mapped to genbank

d2ff454

move ncbi-calling funcs to ncbi mod

ec3d6d4

move the functions that perform the calls to NCBI.Entrez to the NCBI module

update imports for getting prot seqs

2d0822b

process invalids containing batches after processing failed connectio…

75f3405

…n batches. Process each separately

update version number

a70c2b9

log when cache and downloaded seqs don't match

9a896e7

use downloaded seq and overwrite cached seq

correct syntax errors

d4ecf03

add missing commas, remove unneeded brackets

add missing args to func calls

b0f1384

add missing IncompleteRead import

53198c1

add missing IncompleteRead import

6bb4849

correct . to , typo

2af7469

add missing args param to func calls

3119346

use joined list as str as failed connections key

8ce63b6

correct typos success to successful

63f869b

correct old var name to new var name: gbk_acc_to_retrieve to acc_to_r…

2d08be9

…etrieve

correct list to set: successful_accessions

771b3d1

remove whitespace from ids

79350d2

HobnobMancer and others added 4 commits January 11, 2023 14:31

fix failed_connections bug

94d59d1

do not remove batch from dict of failed batches multiple times, only once finished parsing batch

update docs on cache files

4324751

Merge branch 'master' into issue_99

35c2fd8

change sqlalchemy requirement to fix v num

b8268e5

HobnobMancer mentioned this pull request Jan 11, 2023

Update to sqlalchemy 2.x #106

Open

Merge branch 'issue_99' of https://github.com/HobnobMancer/cazy_websc…

b3b9dc4

…raper into issue_99

HobnobMancer merged commit d8b34b8 into master Jan 11, 2023

HobnobMancer deleted the issue_99 branch April 4, 2023 10:08
