Issue 99 #102

Merged
merged 69 commits into from
Jan 11, 2023


HobnobMancer
Owner

Fix Issue #99

Improve handling of errors incurred when retrieving data from NCBI.

@HobnobMancer HobnobMancer added bug Something isn't working documentation Improvements or additions to documentation unit tests Add/update unit tests labels Nov 29, 2022
@HobnobMancer HobnobMancer self-assigned this Nov 29, 2022
@HobnobMancer
Owner Author

The initial problem stated in issue #99 is fixed.

However, GenBank accessions that are no longer listed in NCBI are still retried as many times as defined by args.retries (default 10). This unnecessarily increases running time and demand on the NCBI server.

@HobnobMancer
Owner Author

Unit test coverage could be increased to >= 60%

@codecov

codecov bot commented Jan 11, 2023

Codecov Report

Merging #102 (b3b9dc4) into master (47af82a) will decrease coverage by 2.05%.
The diff coverage is 3.69%.

@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
- Coverage   58.62%   56.57%   -2.06%     
==========================================
  Files          60       61       +1     
  Lines        5337     5541     +204     
==========================================
+ Hits         3129     3135       +6     
- Misses       2208     2406     +198     

HobnobMancer and others added 4 commits January 11, 2023 14:31
@HobnobMancer
Owner Author

HobnobMancer commented Jan 11, 2023

Changes implemented

The change in operation has been successful:

  1. Separate invalid IDs from IDs that suffered failed connections
  2. Parse batches containing invalid IDs separately from, and before, batches that failed because of a connection error
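
As a rough illustration of the two changes listed above, the following is a minimal sketch and not the code merged in this PR. `InvalidIdError`, `fetch_all`, and the `fetch` callable are hypothetical names introduced for the example; the sketch assumes the download function raises a distinct error when a batch contains an accession NCBI no longer lists, and `ConnectionError` when the connection fails.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple


class InvalidIdError(Exception):
    """Hypothetical error: the batch contains an accession the server rejects."""


def fetch_all(
    batches: List[List[str]],
    fetch: Callable[[List[str]], Dict[str, str]],
    retries: int = 10,
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (sequences, invalid_ids, unreachable_ids).

    `fetch` is an assumed callable mapping a batch of accessions to
    {accession: sequence}; it stands in for the real NCBI query.
    """
    sequences: Dict[str, str] = {}
    invalid_ids: List[str] = []
    unreachable: List[str] = []
    invalid_batches: List[List[str]] = []
    failed_batches: List[List[str]] = []

    # First pass: try every batch once and sort failures by cause.
    for batch in batches:
        try:
            sequences.update(fetch(batch))
        except InvalidIdError:
            invalid_batches.append(batch)
        except ConnectionError:
            failed_batches.append(batch)

    # Invalid-ID batches are handled first, one accession at a time,
    # so a bad accession is identified once and never retried.
    for batch in invalid_batches:
        for accession in batch:
            try:
                sequences.update(fetch([accession]))
            except InvalidIdError:
                invalid_ids.append(accession)
            except ConnectionError:
                failed_batches.append([accession])

    # Connection failures are retried last, up to `retries` attempts each.
    queue = deque((batch, retries) for batch in failed_batches)
    while queue:
        batch, attempts_left = queue.popleft()
        try:
            sequences.update(fetch(batch))
        except ConnectionError:
            if attempts_left > 1:
                queue.append((batch, attempts_left - 1))
            else:
                unreachable.extend(batch)
        except InvalidIdError:
            if len(batch) == 1:
                invalid_ids.extend(batch)
            else:
                # Split the batch so the bad accession can be isolated.
                queue.extend(([accession], attempts_left) for accession in batch)

    return sequences, invalid_ids, unreachable
```

In this sketch only connection failures consume the retry budget, which addresses the extra running time and NCBI load noted earlier in the thread.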

Downloaded protein sequences are cached to a FASTA file.
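
A minimal sketch of what such a FASTA cache could look like, using only the standard library. The function names, the 60-character line width, and the choice to use the accession as the whole header line are assumptions made for illustration; the cache format actually written by cazy_webscraper may differ.

```python
from pathlib import Path
from typing import Dict


def append_to_fasta_cache(cache_path: str, sequences: Dict[str, str], width: int = 60) -> None:
    """Append {accession: sequence} records to a FASTA file (hypothetical helper)."""
    with open(cache_path, "a", encoding="utf-8") as handle:
        for accession, seq in sequences.items():
            handle.write(f">{accession}\n")
            # Wrap the sequence into fixed-width lines.
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


def read_fasta_cache(cache_path: str) -> Dict[str, str]:
    """Read the cache back so already-downloaded accessions can be skipped."""
    records: Dict[str, str] = {}
    path = Path(cache_path)
    if not path.exists():
        return records
    current = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            records[current] += line.strip()
    return records
```

Appending after each successful batch means an interrupted run loses at most the batch in flight.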

Updated information in the docs on caching.

Notes for future development

[1]
Unit tests raise a deprecation warning:

tests/conftest.py:40
  /home/circleci/cazy_webscraper/tests/conftest.py:40: MovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
    Base = declarative_base()

cazy_webscraper currently requires sqlalchemy >= 1.4.20. The requirement should be pinned with == instead, and cazy_webscraper should be migrated to sqlalchemy 2.x on a new branch (see issue #106).

[2]
Add the protein description from NCBI (retrieved when downloading the sequence record from NCBI) to the local CAZyme database.

Add the protein description to a new column called genbank_description in the Genbanks table.
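
One possible shape for this change, sketched with the standard-library `sqlite3` module. The sketch assumes the local CAZyme database is a SQLite file and that the Genbanks table identifies rows by a `genbank_accession` column; both are assumptions for illustration, as are the helper names. A real implementation would more likely add the column to the project's SQLAlchemy model.

```python
import sqlite3


def add_description_column(conn: sqlite3.Connection) -> None:
    """Add genbank_description to Genbanks if it is not already present."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute("PRAGMA table_info(Genbanks)")]
    if "genbank_description" not in columns:
        conn.execute("ALTER TABLE Genbanks ADD COLUMN genbank_description TEXT")
        conn.commit()


def store_description(conn: sqlite3.Connection, accession: str, description: str) -> None:
    """Save the description taken from the downloaded NCBI sequence record."""
    conn.execute(
        # genbank_accession is an assumed column name.
        "UPDATE Genbanks SET genbank_description = ? WHERE genbank_accession = ?",
        (description, accession),
    )
    conn.commit()
```

Checking for the column first keeps the migration safe to run more than once against an existing database.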

@HobnobMancer HobnobMancer merged commit d8b34b8 into master Jan 11, 2023
@HobnobMancer HobnobMancer deleted the issue_99 branch April 4, 2023 10:08