ncbi_egapx: form seems to work and generates what looks like valid yaml #29

Merged
merged 4 commits into from
Sep 9, 2024

fubar2
Contributor

@fubar2 fubar2 commented Sep 5, 2024

Needs tests, but I cannot even run it here, let alone test it: I have no machine with 120GB or 31 cores, and those resource requirements are baked into the Docker image.

@fubar2 fubar2 changed the title form seems to work and generates what looks like valid yaml ncbi_egapx: form seems to work and generates what looks like valid yaml Sep 5, 2024
Owner

@richard-burhans richard-burhans left a comment

Looks good!

@richard-burhans richard-burhans marked this pull request as ready for review September 9, 2024 21:57
@richard-burhans richard-burhans merged commit 050f870 into richard-burhans:main Sep 9, 2024
11 checks passed
@nekrut

nekrut commented Oct 8, 2024

@fubar2 or @richard-burhans: we need to add an option that allows EGAPx to take a protein FASTA file as an annotation source (see https://github.com/ncbi/egapx?tab=readme-ov-file#input-data-format)

@fubar2
Contributor Author

fubar2 commented Oct 9, 2024

would allow EGAPx to take a file with Protein FASTA

@nekrut: If that protein FASTA is independent of the NCBI, then it may make sense to use it in the HMM.
It is easy to add another form input, but you probably need to be certain that the supplied protein annotation is statistically independent of the NCBI protein annotation for it to contribute useful, unbiased information.

If any NCBI protein FASTA exists for a taxon, it is AFAIK an output of the internal NCBI pipeline that has become egapx. So predicting proteins with egapx while relying on a FASTA that was predicted by the father of egapx may, AFAIK, yield biased and uninterpretable results, because some of the inputs used for prediction would lack statistical independence.

Not an expert but this is a good question for one of the NCBI authors.

@marco91sol

I believe we can implement this and run some tests to compare the results with and without the additional protein FASTA and HMM files. EGAPx treats these as optional parameters:
[image]
However, the challenge is that the taxid-protein set is not available for all taxa (I've already contacted the CGR group for more information). For instance, I encountered this issue while testing on sharks, where the taxid wasn't connected to any protein set.
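
For discussion, here is a minimal sketch of how the form's YAML generation might handle the optional inputs, so that a run with and without them can be compared. It is not the tool's actual code. The `genome`, `taxid` and `reads` keys follow the input example in the EGAPx README; the `proteins` and `hmm` key names, the function names, and the file names in the usage note are assumptions and should be checked against the README before use.

```python
def build_egapx_config(genome, taxid, reads, proteins=None, hmm=None):
    """Collect EGAPx inputs in a dict, leaving out optional entries that are unset."""
    config = {"genome": genome, "taxid": taxid, "reads": list(reads)}
    # ASSUMPTION: 'proteins' and 'hmm' are the key names EGAPx expects for the
    # optional protein FASTA and HMM inputs; verify against the EGAPx README.
    if proteins:
        config["proteins"] = proteins
    if hmm:
        config["hmm"] = hmm
    return config


def to_yaml(config):
    """Serialize the flat dict above as YAML.

    Handles only plain scalars and lists of plain scalars, and does no quoting,
    so values containing ': ' or ' #' would need a real YAML library instead.
    """
    lines = []
    for key, value in config.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"
```

Calling `to_yaml(build_egapx_config("genome.fa", 12345, ["reads_1.fq", "reads_2.fq"]))` (hypothetical inputs) produces a config with no protein entry, and passing `proteins="my_proteins.faa"` adds one line. When a taxid has no connected protein set, as in the shark case, a user-supplied file could be passed the same way, if EGAPx accepts it.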

4 participants