A question about the example. #1

Open
lllastronaut opened this issue Sep 24, 2024 · 1 comment
Labels: bug, data format, documentation, question

Comments

@lllastronaut

Hello, and thank you for your great work and for your contribution to the field of protein-based genome representation.

To embed a genome, I took the sample file you provided (test.faa) and converted the protein sequences into ESM2 embeddings with:

esm-embed --input ./test1.faa --outdir ./test_output --esm esm2_t6_8M --torch-hub test

I then tried to convert the ESM2 embeddings into a graph structure with:

pst graphify --file ./test_output/esm2_t6_8M_results.h5 --fasta-file ./test.faa --output ./test_graph.h5

The graph conversion failed with this error:

ValueError: FASTA file headers must be in prodigal format: '>scaffold_ptn#' with the additional metadata separated by ' # '

The problem seems to be the input file passed to --fasta-file (./test.faa). The README and the -h output say it should conform to the prodigal format (>scaffold_ptn#). I tried changing the header lines (the lines starting with >) in test.faa to that format (>SAMEA.110_1 or >SAMEA_1), but the same error still occurred. How can I solve this? I also noticed that no sample files are provided for the complete process. If I have misunderstood, could you describe the prodigal format in detail, or provide all the sample files needed to run the complete process? Thank you.

@cody-mar10
Member

Hi, thanks for reaching out.

I am updating the wiki to better explain all of this and will provide an example FASTA file to run the end-to-end pipeline.

The protein FASTA headers should look like this:

>SAMEA2737773_b1_ct14_vs2@Podoviridae__sp._ctp7i14@linear_1 # 143 # 1474 # -1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.371

which is the standard output from prodigal/pyrodigal. The fields are separated by ' # ' (a # with a space on each side), and the relevant fields for the graph format are the 1st (protein name and position in the scaffold) and the 4th (encoding strand).
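To make the header layout concrete, here is a minimal Python sketch that pulls the 1st and 4th fields out of a header like the one above. It is only an illustration, not the parser that pst uses: the function name `parse_prodigal_header` and the returned keys are hypothetical, and it assumes the position in the scaffold is the number after the last underscore of the protein name.

```python
def parse_prodigal_header(header: str) -> dict:
    """Extract protein name, scaffold, position, and strand from a prodigal-style header.

    Hypothetical helper for illustration; not part of pst or pyrodigal.
    """
    # Fields are separated by ' # ' (a '#' with a space on each side).
    fields = header.lstrip(">").strip().split(" # ")
    if len(fields) < 4:
        raise ValueError(f"not a prodigal-style header: {header!r}")

    protein = fields[0]
    # Assumption: the name is '<scaffold>_<protein number>', so split on the last underscore.
    scaffold, _, number = protein.rpartition("_")
    if not scaffold or not number.isdigit():
        raise ValueError(f"protein name is not '<scaffold>_<number>': {protein!r}")

    return {
        "protein": protein,
        "scaffold": scaffold,
        "position": int(number),
        "strand": int(fields[3]),  # 4th field: 1 or -1
    }


example = (
    ">SAMEA2737773_b1_ct14_vs2@Podoviridae__sp._ctp7i14@linear_1 # 143 # 1474 # -1 # "
    "ID=1_1;partial=00;start_type=ATG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.371"
)
info = parse_prodigal_header(example)
print(info["protein"], info["position"], info["strand"])
```

A header such as `>SAMEA_1` has no ' # '-separated metadata, so this sketch rejects it with a `ValueError`.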

It sounds like your protein FASTA files do not follow the prodigal format, which is fine, but the strand information is still needed. You can create your own tab-delimited file for the encoding strand that looks like this:

genome1_1   1
genome1_2   1
genome1_3   1
genome1_4   -1

The protein names go in the first column (they should be identical to the names in the FASTA file) and the strand goes in the second column. If you do not know which strand your proteins are encoded on, it should be OK to set them all to the same strand, but I haven't tested how that affects the genome representations.
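One way to produce such a file is sketched below in Python. This is an illustration under stated assumptions, not a pst utility: `strand_rows` and `write_strand_file` are hypothetical names, the protein name is assumed to be the first whitespace-delimited token of each header, and proteins without prodigal-style metadata get a default strand of 1 (the untested fallback mentioned above). How the resulting file is passed to pst is not covered here.

```python
import csv
import io


def strand_rows(fasta_lines, default_strand=1):
    """Yield (protein name, strand) for every header line in a protein FASTA file."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].strip().split(" # ")
        if len(fields) >= 4:
            # Prodigal-style header: the strand is the 4th ' # '-separated field.
            yield fields[0], int(fields[3])
        else:
            # Assumption: the name is the first whitespace-delimited token.
            tokens = fields[0].split()
            if tokens:
                yield tokens[0], default_strand


def write_strand_file(fasta_lines, out, default_strand=1):
    """Write one 'name<TAB>strand' row per protein to the open text handle `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for name, strand in strand_rows(fasta_lines, default_strand):
        writer.writerow([name, strand])


# Usage with in-memory data; with real files, pass open file handles instead.
fasta = [">genome1_1", "MKTAYIAK", ">genome1_2 # 5 # 100 # -1 # ID=1_2", "MSLLNV"]
buffer = io.StringIO()
write_strand_file(fasta, buffer)
print(buffer.getvalue(), end="")
```

Deriving the names from the FASTA headers themselves keeps the first column identical to the names in the FASTA file.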

cody-mar10 self-assigned this Oct 2, 2024
cody-mar10 added the bug, documentation, question, and data format labels Oct 2, 2024