Add rdr and wrt string macros and transcribe #14
This looks fantastic! A couple of thoughts:
Goddamn muscle memory from slack... keep doing ctrl+enter for a newline /gripe
Now a couple of questions:
Overall, I think it's great, handles all of the pain points pretty effectively, and seems like it will be pretty extensible. Anyone who wants something more complicated is probably going to have to be comfortable digging in and implementing stuff on their own.
Yeah and probably also
It is, for two reasons: It's slow, and it quickly devolves into a mess because the error could be thrown from anywhere, and caught in any try/catch block in the stack frame, so it's hard to know what error you're dealing with. However, in this case I think it's fine. The speed doesn't matter because file operations are slow anyway, and because in this case we don't really care what the error is - we need to close the files, no matter what kind of error. For the same reason, Base Julia uses try/catch all over the place in its file handling code: https://github.com/JuliaLang/julia/blob/master/base/file.jl
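A minimal sketch of the "close no matter what kind of error" pattern described here, using only Base. The helper name `with_streams` is hypothetical and is not the PR's implementation; it only shows why the error type never needs to be inspected.

```julia
# Hypothetical helper, not part of the PR: run `f` on two streams and close
# both afterwards, whether or not `f` throws.
function with_streams(f, input::IO, output::IO)
    try
        return f(input, output)
    finally
        # `finally` runs on normal return and on any exception alike,
        # so the kind of error is irrelevant here.
        close(input)
        close(output)
    end
end

input = IOBuffer("ACGT\n")
output = IOBuffer()

# The body fails half-way; the exception still propagates to the caller,
# but both streams have been closed by then.
result = try
    with_streams(input, output) do i, o
        write(o, read(i, String))
        error("something went wrong mid-way")
    end
catch err
    err
end
```

The exception is not swallowed: the caller still sees it, and only the cleanup is guaranteed.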
Yeah... is it possible to detect whether the streams are passed in as a variable though? I don't think it is. I agree it would be better to only close the files if the
The extension will be interpreted as a compression extension. That could lead to confusing error messages since it will then look at the next extension (which might be no extension, i.e. the empty string) and complain that the extension is unknown. Not sure how to improve the error message.
I don't think I can predict how one could use it for arbitrary readers, but for FASTX, we could interpret the flag string as a collection of characters, each of which is a flag. E.g. doing
Soon™ :) I think the time bottleneck is just to discuss it, the actual implementation is really easy and I can do it in an afternoon.
We might also want to bikeshed the name
Ahh, I guess I was thinking that the user could do e.g.
The trouble with this approach is that if I make my own reader / writer, but I'm not done with the IO object after the transcribe, I can't keep using it. Some possible solutions:
What about a `close` keyword to `transcribe` which defaults to `true`?
I'm OK with that too. I kinda like the idea of the macro emitting the IO object as well. But I also think that, since their main purpose is simplicity, and there are other ways of achieving this goal, perhaps we shouldn't over-complicate it.
So, the macros do return an IO object.
Hm, I think this needs more work. I just realized that:
Ahh, I see. I sort of forgot that you can call
I think this is fine - other functions of this sort (
Do you mean you can't do
? That would indeed suck. But this can't possibly be true... I'm probably misunderstanding what you mean.
Transcribe not having setup
I think the lack of setup actually might be quite a problem. It means that in the code body of the function, you have to either not rely on any global-scope objects at all, or else risk having major, major slowdowns because you are working with globals.
Here, the access to the global I can see two options:
That is, the function body should return a single-argument function, which is then applied to each record. Is this too magical? Also, it opens up the risk of encountering the infamous "slow closure bug" #15276, but that may be fixed someday.
Macros not having do-block
I mean you can't do
because the do-syntax requires a function call on the syntax level, i.e. before the macro is expanded.
Does this not help:
?
Hm, when timing it I'm struggling to find a case where the type stability matters. Even when using the highly optimised v2 FASTX and v1 Automa, and converting 5.68 million FASTQ records to FASTA, and when
So yeah, probably this doesn't matter. Eh, whatever, I kind of like the current (as of commit df11fee) approach. The "slow closure" bug won't matter either, if dynamic dispatch is so cheap.
Update: Indeed, performance is hit by the slow closure bug. But with >4 M records/second, I don't think it's worth worrying about.
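For reference, the "function body returns a single-argument function" idea discussed earlier in the thread can be sketched roughly as follows. This is an illustration only: `transcribe_with_setup` is a hypothetical name, plain strings stand in for records, and the approach at commit df11fee may differ.

```julia
# Hypothetical sketch: `setup` runs once and returns the per-record function.
function transcribe_with_setup(setup, records, writer::IO)
    f = setup()                     # setup state is captured by the closure
    for record in records
        result = f(record)
        # Records mapped to `nothing` are skipped.
        result === nothing || println(writer, result)
    end
end

writer = IOBuffer()
transcribe_with_setup(["ACGT", "AC", "TTGACA"], writer) do
    min_length = 4                  # local setup state instead of a global
    record -> length(record) >= min_length ? record : nothing
end
output = String(take!(writer))
```

Because `min_length` is a local captured by the closure, the per-record function does not touch global scope.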
Can the macro handle the inclusion of FASTA fai and BAM bai index files?
Unfortunately, no. I'm not sure how that would be done, but that would be a good idea.
We should probably have that function anyway. But if indexing could be added to the macros that would be nice, too.
Okay @kescobo and others: I'm going to merge this tomorrow, and then BioJulia/FASTX.jl#111 unless there are any objections.
AFAIK, the only way Julia lets you defer work is through do-blocks. That requires a function to do the deferral. We could use

    open(rdr"foo.fna") do reader
        ...
    end

which is undoubtedly nice, as it's similar to the normal
Cases like the following are why

    julia> reader = rdr"foo.fna";

    julia> isopen(reader)
    true

    julia> open(i -> nothing, reader)

    julia> isopen(reader)
    false

It doesn't open - it closes. That's just super confusing. Also, what if a user tries to do:

    open(rdr"foo.fna", "r") do x
        ...
    end

? We'd have to throw an error, unless we want to support all the normal arguments of
Here are some reasons I think
Yeah - this sentiment feels more Julian too. Using
I still don't love
This commit implements the `@rdr_str` and `@wtr_str` macros, which autodetect the correct readers, writers and de/compressors to open a biological file based on the extensions of the path. The system is extensible to arbitrary biological formats, but the extensions of compression formats are hardcoded in this package.

I also add a dubious overload to `Base.open`, such that the reader and writer macros can be used like so:

```julia
open(rdr"foo.fna", wtr"bar.fq") do reader, writer
    ...
end
```
This is a whack at BioJulia/FASTX.jl#76. I've done it in this repo so it can affect all `AbstractReader`s and `AbstractWriter`s.

This PR implements two new high-level operations that are convenient shorthands for already existing operations. I would like to get any feedback, especially regarding
Reader and writer macros
This allows a user to type e.g. `wtr"dir/hiv.fna.gz"`, which expands to `FASTA.Writer(GzipCompressorStream(open("dir/hiv.fna.gz", "w")))`, and a similar `rdr"path"` macro. Here is how it works: Let's use `rdr"abc.fa.gzip.xz.gz"flag` as an example.

1. The file is opened: `open("abc.fa.gzip.xz.gz")`.
2. The compression extensions are read from the right: `.gz`, then `.xz`, then `.gzip`. From this it will generate nested `GzipDecompressorStream` and `XzDecompressorStream`. This is repeated in a while loop until there either is no extension at all, or the rightmost extension is not recognized as a compression extension.
3. The remaining extension is `fa` in this case, and the macro calls `readertype(::Val{:fa}, "flag")`. The idea is that downstream packages can override this method, e.g. FASTX would override it to `BioGenerics.IO.readertype(::Union{Val{:fa}, Val{:fna}}, flag) = FASTA.Reader`. The flag can be used as the downstream packages see fit, for any arguments to the readers/writers, but defaults to the empty string if not given.
4. The result is `FASTA.Reader(GzipDecompressorStream(XzDecompressorStream(GzipDecompressorStream(open("abc.fa.gzip.xz.gz")))))`.
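The extension-peeling loop described above can be sketched with Base only. Everything named here (`COMPRESSION_EXTS`, `sketch_readertype`, `parse_path`) is hypothetical, and symbols stand in for the real stream and reader types so that the example runs without FASTX or the codec packages. It is not the macro's actual code.

```julia
# Assumed set of compression extensions; the PR hardcodes its own list.
const COMPRESSION_EXTS = Dict("gz" => :Gzip, "gzip" => :Gzip, "xz" => :Xz)

# Hypothetical stand-in for the extensible `readertype` hook: downstream
# packages would add methods for their own extensions.
sketch_readertype(::Val{E}, flag) where {E} = error("unknown extension: $E")
sketch_readertype(::Union{Val{:fa}, Val{:fna}}, flag) = :FASTAReader

function parse_path(path::AbstractString, flag::AbstractString = "")
    codecs = Symbol[]               # innermost decompressor first
    rest = path
    while true
        stem, ext = splitext(rest)
        key = String(lstrip(ext, '.'))
        # Stop at the first extension that is not a compression extension
        # (this includes the empty extension).
        haskey(COMPRESSION_EXTS, key) || break
        push!(codecs, COMPRESSION_EXTS[key])
        rest = stem
    end
    _, ext = splitext(rest)
    format = sketch_readertype(Val(Symbol(lstrip(ext, '.'))), flag)
    return format, codecs
end

format, codecs = parse_path("abc.fa.gzip.xz.gz")
```

Here `codecs` lists the decompressors in the order they wrap the opened file, which matches the nesting in the expanded expression above.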
`transcribe` function
It's pretty easy, this is its definition:
I.e. you give it a function that takes the reader, then it reads all records of the reader, applies the function, then writes the result to the writer if it is not `nothing`.
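The definition itself did not survive in this copy of the thread. As a rough illustration of the behaviour described, and not the PR's actual code, here is a sketch that assumes the function is applied to each record, that the reader is any iterable, and that the writer is any `IO`. The name `transcribe_sketch` is hypothetical.

```julia
# Hypothetical sketch of the described behaviour, not the PR's definition.
function transcribe_sketch(f, reader, writer::IO)
    for record in reader
        result = f(record)
        # Results equal to `nothing` are dropped instead of written.
        result === nothing || println(writer, result)
    end
end

# Lines of text stand in for biological records here.
reader = eachline(IOBuffer("ACGT\n\nTTGA\n"))
writer = IOBuffer()
transcribe_sketch(reader, writer) do line
    isempty(line) ? nothing : lowercase(line)
end
output = String(take!(writer))
```

Returning `nothing` from the function therefore doubles as a filter.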
cc. @SabrinaJaye, @CiaranOMara, @kescobo