WIP: Support multiple inputs during filter #697
Commits on Mar 10, 2021
Add initial I/O interface and tests
Adds tests and code for new `open_file`, `read_sequences`, and `write_sequences` functions loosely based on a proposed API [1]. These functions transparently handle compressed inputs and outputs using the xopen library.

The `open_file` function is a context manager that lightly wraps the `xopen` function and accepts either path strings or existing IO buffers. Both the read and write functions use this context manager to open files. Replacing the standard `open` call with `open_file` supports the common use case of writing to the same handle many times inside a for loop, so we maintain a Pythonic interface that also supports compressed file formats and path-or-buffer inputs. This context manager also enables input and output of any other file type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not infer the format of sequence files (e.g., FASTA, GenBank, etc.). Inferring file formats requires peeking at the first record in each given input, but peeking is not possible for the piped inputs that we want to support (e.g., piped gzip inputs from xopen). There are also no internal use cases for Augur to read multiple sequences of different formats, so I can't currently justify the complexity required to support type inference. Instead, I opted for the same approach used by BioPython, where the calling code must know the type of input file being passed. This isn't an unreasonable expectation for Augur's internal code.

I also considered inferring file type from filename extensions, the way xopen infers compression modes. However, filename extensions are less standardized across bioinformatics than we would need for this type of inference to work robustly.

Tests ignore BioPython and pycov warnings to minimize warning fatigue for issues we cannot address during test-driven development.

[1] #645
Commit 8a20b4f
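The path-or-buffer context manager described in this commit message could look roughly like the sketch below. It is not the code from this PR: it uses the standard library's `gzip` module as a stand-in for `xopen`, picks compression from a `.gz` extension only, and the signature of `open_file` shown here is an assumption made for illustration.

```python
import gzip
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_file(path_or_buffer, mode="rt"):
    """Yield a handle for a path, or pass an existing buffer through unchanged.

    Hypothetical signature; the real function wraps xopen rather than gzip.
    """
    if isinstance(path_or_buffer, io.IOBase):
        # Already an open handle: the caller owns it, so it is not closed here.
        yield path_or_buffer
        return

    # Stand-in for xopen: choose gzip by filename extension, else built-in open.
    opener = gzip.open if str(path_or_buffer).endswith(".gz") else open
    with opener(path_or_buffer, mode) as handle:
        yield handle


# Write to the same compressed handle many times inside a for loop.
path = os.path.join(tempfile.mkdtemp(), "sequences.fasta.gz")
with open_file(path, "wt") as handle:
    for name, sequence in [("seq1", "ACGT"), ("seq2", "GGCC")]:
        handle.write(f">{name}\n{sequence}\n")

# Read the compressed file back through the same interface.
with open_file(path) as handle:
    contents = handle.read()
```

Passing an already open buffer (for example an `io.StringIO`) yields that same object and leaves it open afterwards, which is what lets calling code mix paths and handles freely.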
Support compressed inputs/outputs for index
Adds support to augur index for compressed sequence inputs and index outputs.
Commit 0a9d742
Support compressed inputs/outputs for parse and mask
Adds tests for augur parse and mask and then refactors these modules to use the new read/write interface.

For augur parse, the refactor moves the logic from the original for loop into its own `parse_sequence` function, adds tests for this new function, and updates the body of the `run` function to call this function inside the for loop. This commit also replaces the Bio.SeqIO read and write functions with the new `read_sequences` and `write_sequences` functions. These functions support compressed input and output files based on the filename extensions.

For augur mask, the refactor moves the logic for masking individual sequences into its own `mask_sequence` function and replaces Bio.SeqIO calls with the new `read_sequences` and `write_sequences` functions. With `mask_sequence` factored out, we can easily define a generator for the output sequences and make a single call to `write_sequences`.
Commit c77bcb7
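The generator-plus-single-write pattern described for augur mask can be sketched as follows. The `mask_sequence` and `write_sequences` functions here are simplified stand-ins that operate on plain `(name, sequence)` tuples and an in-memory buffer; their signatures are assumptions for illustration, not Augur's actual ones, which work with BioPython records.

```python
import io


def mask_sequence(sequence, mask_sites):
    """Return a copy of `sequence` with the 0-based `mask_sites` replaced by N.

    Hypothetical helper; sites beyond the end of the sequence are ignored.
    """
    masked = list(sequence)
    for site in mask_sites:
        if 0 <= site < len(masked):
            masked[site] = "N"
    return "".join(masked)


def write_sequences(records, handle):
    """Write (name, sequence) pairs as FASTA and return how many were written."""
    count = 0
    for name, sequence in records:
        handle.write(f">{name}\n{sequence}\n")
        count += 1
    return count


records = [("seq1", "ACGTACGT"), ("seq2", "TTGA")]
mask_sites = {0, 5}

# Lazily mask each record; nothing is computed until the writer iterates.
masked_records = (
    (name, mask_sequence(sequence, mask_sites)) for name, sequence in records
)

# A single call consumes the generator and writes every masked record.
output = io.StringIO()
written = write_sequences(masked_records, output)
```

Because the masked records are produced lazily, only one sequence needs to be held in memory at a time, regardless of how many records the input contains.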
Commit 071023d
Commits on Mar 15, 2021
Commit f6c61f1
Commits on Mar 16, 2021
Add Zika build test for compressed inputs/outputs
Documents which steps of a standard build support compressed inputs/outputs by adding a copy of the Zika build test and corresponding expected compressed inputs/outputs.
Commit 46b8a65
Support compressed inputs in augur align
Adds support for compressed inputs (reference files and alignment sequences) in augur align by refactoring existing code to use Augur's `io` module. This is a work in progress and still requires focused work to add support for compressed output files.
Commit 6a71928
Commits on Mar 17, 2021
Support multiple inputs to filter
Work in progress prototyping how we could add support for multiple metadata, sequence, and sequence index inputs to augur filter. This would simplify workflows that aggregate filters across multiple input datasets (e.g., the ncov workflow).
Commit f53f921