What is the format for input using a csv file? #3

Open
DadongZ opened this issue Dec 18, 2019 · 7 comments

@DadongZ

DadongZ commented Dec 18, 2019

I have a large list of peptides and am wondering what format is expected when using a CSV file as input. I have tried a few ways, but none of them worked. Is there a template?

@ikizhvatov
Contributor

ikizhvatov commented Dec 18, 2019

Mentioning CSV in the help is confusing, I see. The input CSV should contain one peptide per line. I have added example_input.csv and the corresponding output CSV to the repo.
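
For illustration, here is a minimal sketch of producing a one-peptide-per-line file from a Python list. The peptide strings and the file name `my_peptides.csv` are arbitrary placeholders, and the absence of a header row is an assumption; `example_input.csv` in the repo remains the reference for the exact layout.

```python
# Placeholder peptides, for illustration only.
peptides = ["SIINFEKL", "GILGFVFTL", "NLVPMVATV"]


def write_peptide_file(peptides, path):
    """Write one peptide per line (assumed: no header, no extra columns)."""
    with open(path, "w") as handle:
        for peptide in peptides:
            # Strip stray whitespace so each line holds only the sequence.
            handle.write(peptide.strip() + "\n")


# "my_peptides.csv" is a hypothetical file name.
write_peptide_file(peptides, "my_peptides.csv")
```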

@DadongZ
Author

DadongZ commented Dec 26, 2019

Thanks! That's helpful.

@DadongZ
Author

DadongZ commented Jan 3, 2020

Thanks. I got the error below with a CSV file as input (~150K peptides):

Traceback (most recent call last):
  File "/home/dz33/.conda/envs/hmmhc/bin/hmmhc-predict", line 8, in <module>
    sys.exit(main())
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/cmdline.py", line 53, in main
    predictions = predictor.predict(peptides)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/hmmhc.py", line 70, in predict
    normalizedLogOdds = self.computeLogOdds(peptidesList)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/hmmhc.py", line 103, in computeLogOdds
    sequenceSetBlocks = self.toSequenceSetBlocks(peptideList)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/hmmhc.py", line 156, in toSequenceSetBlocks
    [list(p) for p in peptideList[rangeStart:rangeEnd]]
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/ghmm.py", line 948, in __init__
    internalInput = [self.emissionDomain.internalSequence(seq) for seq in sequenceSetInput]
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/ghmm.py", line 393, in internalSequence
    result = map(lambda i: self.index[i], result)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/ghmm.py", line 393, in <lambda>
    result = map(lambda i: self.index[i], result)
KeyError: 'X'

Any suggestions?

@ikizhvatov
Contributor

At least one of your peptide strings contains the character 'X', which denotes an unknown amino acid and is not supported by the predictor. 'U' is also not supported.

As a quick workaround, for now please input only peptide strings made up exclusively of the 20 amino acids ACDEFGHIKLMNPQRSTVWY.

We will consider filtering out peptides with unsupported amino acids in the tool itself.
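
In the meantime, pre-filtering could look roughly like the sketch below. It is not part of hmmhc; the function names are hypothetical, and it assumes one peptide per line with uppercase letters (lowercase input would be rejected as written). It also reports which characters caused each rejection, which helps when removing only 'X' and 'U' turns out not to be enough.

```python
SUPPORTED = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 amino acids listed above


def split_supported(peptides):
    """Return (kept, rejected); rejected pairs each peptide with its offending characters."""
    kept, rejected = [], []
    for peptide in peptides:
        peptide = peptide.strip()
        if not peptide:
            continue  # skip blank lines
        bad = sorted(set(peptide) - SUPPORTED)
        if bad:
            rejected.append((peptide, bad))
        else:
            kept.append(peptide)
    return kept, rejected


def filter_file(in_path, out_path):
    """Copy supported peptides from in_path to out_path; return the rejected ones."""
    with open(in_path) as src:
        kept, rejected = split_supported(src)
    with open(out_path, "w") as dst:
        for peptide in kept:
            dst.write(peptide + "\n")
    return rejected
```

Calling `filter_file("input.csv", "filtered.csv")` (both names are placeholders) writes the cleaned list and returns the rejected peptides, so unexpected characters such as '*' show up before the predictor is run.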

@DadongZ
Author

DadongZ commented Jan 3, 2020

I removed all peptides containing X or U but still got an error:

Traceback (most recent call last):
  File "/home/dz33/.conda/envs/hmmhc/bin/hmmhc-predict", line 8, in <module>
    sys.exit(main())
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/cmdline.py", line 53, in main
    predictions = predictor.predict(peptides)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/hmmhc.py", line 70, in predict
    normalizedLogOdds = self.computeLogOdds(peptidesList)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/hmmhc.py", line 103, in computeLogOdds
    sequenceSetBlocks = self.toSequenceSetBlocks(peptideList)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/hmmhc/hmmhc.py", line 156, in toSequenceSetBlocks
    [list(p) for p in peptideList[rangeStart:rangeEnd]]
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/ghmm.py", line 948, in __init__
    internalInput = [self.emissionDomain.internalSequence(seq) for seq in sequenceSetInput]
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/ghmm.py", line 393, in internalSequence
    result = map(lambda i: self.index[i], result)
  File "/home/dz33/.conda/envs/hmmhc/lib/python2.7/site-packages/ghmm.py", line 393, in <lambda>
    result = map(lambda i: self.index[i], result)
KeyError: '*'

@ikizhvatov
Contributor

As mentioned, peptide strings must contain only the 20 canonical amino acids. You have '*' (it usually denotes a stop codon) in at least one of the strings.
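
Translation tools commonly write '*' where a stop codon occurs, often at the very end of a translated CDS. Assuming that is where the '*' comes from here, a sketch like the following strips trailing stop markers and flags sequences that still contain one internally, rather than silently altering them. The function names are hypothetical and not part of hmmhc.

```python
def strip_terminal_stop(sequence):
    """Remove '*' characters from the end of a sequence only."""
    return sequence.strip().rstrip("*")


def has_internal_stop(sequence):
    """True if '*' remains after trailing stop markers are removed."""
    return "*" in strip_terminal_stop(sequence)
```

Sequences for which `has_internal_stop` returns `True` are worth inspecting by hand, since an internal stop may indicate a frame or annotation problem upstream.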

@ikizhvatov ikizhvatov self-assigned this Jan 3, 2020
@ikizhvatov ikizhvatov added the enhancement New feature or request label Jan 3, 2020
@DadongZ
Author

DadongZ commented Jan 3, 2020

A stop codon in a peptide? These are peptides from gene CDSs, though, and I have removed all X/U letters.
