Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieving the entire query sequence from a blast, not just the 'local' aligned HSP #2

Open
vmkalbskopf opened this issue Aug 11, 2021 · 3 comments

Comments

@vmkalbskopf
Copy link

Hi there

I am using PyBlast to find similar sequences (obviously) in a different genome. None of these are model organisms.
The queries are fasta files, each with the same gene but from different strains. So I have about 12 genes and 10 strains.
I need to get the length of the entire query, not just the length of the part of the query that aligned to the database. Do you know if that's accessible?

If not, I will have to blast each sequence in each fasta file separately, which will be slower than blasting the whole file at a time.
This is what it looks like now:

for q in qdir.glob('*.fasta'):
    bcl = BCLine6("blastn", query=q,
    subject=db, word_size=11, evalue=0.01, outfmt="evalue sstrand")
    res = bcl.run(ncore=8, quiet=True)
    print(f'query length = {len(q.seq)}')

But of course q is not the actual sequence record, but a file name.
And qlen is the length of the aligned query, not the length of the whole query sequence.

@vmkalbskopf
Copy link
Author

vmkalbskopf commented Aug 11, 2021

Darn

I'm now trying to blast each seq record in the fasta file separately, but getting an error from PyBlast:

    for q in qdir.glob('*.fasta'):
        for rec in SeqIO.parse(q, 'fasta'):
            bcl = BCLine6("blastn", query=rec,
            subject=db, word_size=11, evalue=0.01, outfmt="evalue sstrand")
            res = bcl.run(ncore=8, quiet=True)

Error:

Traceback (most recent call last):
  File "./blast_1.py", line 36, in <module>
    res = bcl.run(ncore=8, quiet=True)
  File "/home/victor/anaconda3/lib/python3.7/site-packages/pyblast/bloc.py", line 228, in run
    return self.treat_res(res)
  File "/home/victor/anaconda3/lib/python3.7/site-packages/pyblast/bloc.py", line 164, in treat_res
    return pd.concat([self.treat_res(r) for r in res])
  File "/home/victor/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 284, in concat
    sort=sort,
  File "/home/victor/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 331, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

Did I just not get a hit? Or is there something else going on?

@vmkalbskopf
Copy link
Author

OK! I feel like I'm turning this into my personal dev diary.
I've gone around the problem by opening the query fasta file separately from the blast, finding the seq record that got a hit, and getting the len of that seq.
Feel free to close this issue, unless you want to add more functionality. But I suspect biopython is the limiting factor here.

@jsgounot
Copy link
Owner

I'm a bit confused, are you sure that qlen is not what you're looking for? Do you have a concrete example where qlen (and not qlength) is not the query total length?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants