use seqkit for nonredundant prodigal #34

Open
tijeco opened this issue Sep 26, 2021 · 0 comments
Labels
enhancement New feature or request

Comments

@tijeco
Collaborator

tijeco commented Sep 26, 2021

The current protocol uses pandas, which is memory intensive and probably won't scale well unless we swap to some larger-than-memory pandas alternative. I think it may be best to just use seqkit. Currently the expanded nonredundant file has the columns ,pephash,sample,contig,start,stop,strand,allStandardAA,seq. This can be handled using seqkit fx2tab with seq-hash, then plugging that into seqkit tab2fx with the hash as the header. What we have is fine for now, but this will definitely be needed when scaling. Honestly, at that point we should probably also use seqkit to split the nr data into max_threads number of files for parallelization (though that is an entirely different issue).
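To make the hash-as-header idea concrete, here is a rough Python sketch of the streaming approach. It is a stand-in for what the seqkit pipeline would do, not seqkit itself. The function names `read_fasta` and `write_nonredundant` are hypothetical, and MD5 is an assumption that may not match the digest seqkit emits. It also drops the sample/contig/start/stop/strand columns, which would need a separate lookup table keyed on the hash.

```python
import hashlib
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_nonredundant(src: TextIO, dst: TextIO) -> int:
    """Write each distinct sequence once, using its hash as the FASTA header.

    Only the set of hashes is kept in memory, never the full table of records.
    Returns the number of unique sequences written.
    """
    seen = set()
    for _header, seq in read_fasta(src):
        # Assumption: MD5 of the sequence string plays the role of pephash.
        pephash = hashlib.md5(seq.encode("ascii")).hexdigest()
        if pephash in seen:
            continue
        seen.add(pephash)
        dst.write(f">{pephash}\n{seq}\n")
    return len(seen)
```

Calling `write_nonredundant(open("prodigal.faa"), open("nr.faa", "w"))` (both paths hypothetical) would stream the proteins once. Memory grows with the number of unique hashes rather than with the full table that pandas loads.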

@tijeco tijeco added the enhancement New feature or request label Oct 7, 2021