Fix wrong input proteins management #37

Open: wants to merge 5 commits into base: dev


JeanMainguy (Member) commented Dec 19, 2024

This PR resolves the issue described in #35, related to how Binette handles proteins given as inputs.

Currently, Binette excludes unbinned contigs to avoid unnecessary computations (e.g., running Prodigal and Diamond on unused contigs). While this saves time, it creates a mismatch when protein files are provided, as unbinned contigs are still present in the protein data. This inconsistency triggers errors during the step that checks for differences between the assembly contigs and the proteins.

Solution

Filtered Protein File

  • The input protein file is now filtered to exclude genes from unbinned contigs.
  • A filtered version of the file is saved in <outdir>/temporary_files/ and passed to Diamond for gene annotation.
  • This ensures only genes from contigs of interest are annotated, resolving the mismatch.
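
The filtering step can be sketched as follows. This is a minimal illustration, not Binette's actual code: the function names `parse_fasta`, `contig_of`, and `filter_proteins` are hypothetical, and the sketch assumes Prodigal-style gene identifiers of the form `<contig>_<gene index>`, so that the contig name is everything before the last underscore of the first header token.

```python
from typing import Iterable, Iterator, Set, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def contig_of(gene_header: str) -> str:
    """Return the contig name for a gene header.

    Assumption: gene IDs look like '<contig>_<gene index>' (Prodigal style).
    """
    parts = gene_header.split()
    gene_id = parts[0] if parts else ""
    return gene_id.rsplit("_", 1)[0]


def filter_proteins(
    lines: Iterable[str], binned_contigs: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep only proteins whose contig belongs to at least one bin."""
    for header, seq in parse_fasta(lines):
        if contig_of(header) in binned_contigs:
            yield header, seq
```

Because the contig name is recovered with `rsplit("_", 1)`, contig names that themselves contain underscores (for example `NODE_1_length_500`) are handled correctly under the stated naming assumption.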

Additional changes

Temporary File Compression

  • Temporary files in <outdir>/temporary_files/, including the FAA file and Diamond results, are now compressed to save disk space.
  • The pyfastx index for the contig file is now stored in <outdir>/temporary_files/ instead of alongside the assembly file. This ensures files are written only to the intended directory and never next to the user's input files.
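
As an illustration of the compression change, the sketch below writes protein records to a gzip-compressed FAA file and reads them back using only the Python standard library. The helper names (`write_faa_gz`, `read_faa_gz`, `roundtrip_demo`) and the file name `filtered_proteins.faa.gz` are hypothetical; the PR does not state which compression format or file names Binette uses, so gzip is an assumption here.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def write_faa_gz(records: Iterable[Tuple[str, str]], out_path: Path) -> Path:
    """Write (header, sequence) records to a gzip-compressed FASTA file."""
    out_path = Path(out_path)
    # Create the temporary directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(out_path, "wt") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")
    return out_path


def read_faa_gz(path: Path) -> List[Tuple[str, str]]:
    """Read a gzip-compressed FASTA file with one sequence line per record."""
    records = []
    with gzip.open(path, "rt") as handle:
        header = None
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                header = line[1:]
            elif header is not None:
                records.append((header, line))
                header = None
    return records


def roundtrip_demo() -> List[Tuple[str, str]]:
    """Write two records under a throwaway output directory and read them back."""
    with tempfile.TemporaryDirectory() as outdir:
        # Hypothetical layout mirroring <outdir>/temporary_files/.
        faa = Path(outdir) / "temporary_files" / "filtered_proteins.faa.gz"
        write_faa_gz([("c1_1", "MKV"), ("c1_2", "MTT")], faa)
        return read_faa_gz(faa)
```

Opening the file with `gzip.open(..., "wt")` keeps the writing code identical to the uncompressed case, which is one reason text-mode gzip is a convenient choice for intermediate files.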
