Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rdr and wrt string macros and transcribe #14

Merged
merged 1 commit into from
Sep 27, 2023

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Aug 8, 2022

This is a whack at BioJulia/FASTX.jl#76 . I've done it in this repo so it can affect all AbstractReaders and AbstractWriters.

This PR implements two new high-level operations that are convenient shorthands for already existing operations. I would like to get any feedback, especially regarding

  • Is it convenient enough? Could something be made nicer?
  • Is it extendible and flexible enough? Currently, I have FASTX on my mind and may not consider the general case. I.e. what would someone do with a VCF writer, or some writer that requires a footer, etc

Reader and writer macros

This allows a user to type e.g. wtr"dir/hiv.fna.gz", which expands to FASTA.Writer(GzipCompressorStream(open("dir/hiv.fna.gz", "w"))), and a similar rdr"path" macro. Here is how it works: Let's use rdr"abc.fa.gzip.xz.gz"flag as an example.

  • First, it generates open("abc.fa.gzip.xz.gz").
  • Then, it looks at each extension from right to left, checking for "compression extensions" in a hardcoded list. Here, it will find first .gz, then .xz, then .gzip. From this it will generate nested GzipDecompressorStream and XzDecompressorStream. This is repeated in a while loop until there either is no extension at all, or the rightmost extension is not recognized as a compression extension.
  • Then, it looks at the rightmost extension, fa in this cases and calls readertype(::Val{:fa}, "flag"). The idea is that downstream packages can override this method, e.g. FASTX would override it to BioGenerics.IO.readertype(::Union{Val{:fa}, Val{:fna}}, flag) = FASTA.Reader. The flag can be used as the downstream packages see fit, for any arguments to the readers/writers, but default to the empty string if not given.
  • The final code produced by this macro is FASTA.Reader(GzipDecompressorStream(XzDecompressorStream(GzipDecompressorStream(open("abc.fa.gzip.xz.gz")))))

transcribe function

It's pretty easy, this is its definition:

function transcribe(f::Function, writer::AbstractWriter, reader::AbstractReader)
    try
        for record in reader            
            transformed = f(record, reader)
            if transformed !== nothing
                write(writer, transformed)
            end
        end
    finally
        close(writer)
        close(reader)
    end
end

I.e. you give it a function that takes the reader, then it reads all records of the reader, applies the function, then write the result to the writer if it is not nothing

cc. @SabrinaJaye , @CiaranOMara , @kescobo

@kescobo
Copy link
Member

kescobo commented Aug 8, 2022

This looks fantastic! A couple of thoughts:

  1. BioGenerics.IO.readertype(::Union{Val{:fa}, Val{:fna}}, flag) = FASTA.Reader should include Val{:fasta}. I know - not part of this PR, but still :-P

@kescobo
Copy link
Member

kescobo commented Aug 8, 2022

Goddamn muscle memory from slack... keep doing ctrl+enter for a newline /gripe

  1. using try/catch in a function like that seems like non-idiomatic julia. There are better ways to handle opening / closing streams, no?
  2. related: if streams are being passed from outside the function, I don't think transcribe should be closing them.

Now a couple of questions:

  1. What happens if someone creates a reader flag that conflicts with a compression extension? This is unlikely to come up in practice, but presumably the compression pass happens first, so it gets consumed before it reaches the reader?
  2. How does the macro know if it's a Reader or a Writer? That is, whether it needs a XXXCompressorStream or a XXXDecompressorStream? Derp, I'm an idiot, there are 2 different Macros 🤦
  3. What kinds of things are we thinking for flags?
  4. When can I have this? I want this 🤣

Overall, I think it's great, handles all of the pain points pretty effectively, and seems like it will be pretty extensible. Anyone that wants something more complicated is probably going to have to be comfortable digging in and implementing stuff on their own.

@jakobnissen
Copy link
Member Author

jakobnissen commented Aug 9, 2022

Yeah and probably also .faa for amino acid FASTA extensions!

using try/catch in a function like that seems like non-idiomatic julia

It is, for two reasons: It's slow, and it quickly devolves into a mess because the error could be thrown from anywhere, and caught in any try/catch block in the stack frame, so it's hard to know what error you're dealing with.

However, in this case I think it's fine. The speed doesn't matter because file operations are slow anyway, and because in this case we don't really care what the error is - we need to close the files, no matter what kind of error. For the same reason, Base Julia uses try/catch all over the place in its file handling code: https://github.com/JuliaLang/julia/blob/master/base/file.jl

if streams are being passed from outside the function, I don't think transcribe should be closing them.

Yeah... is it possible to detect whether the streams are passed in as a variable though? I don't think it is. I agree it would be better to only close the files if the transcribe function is the only scope that has access to the file handles, but I don't think it's possible in Julia to check this. In some ways, closing the streams properly is the thing that makes designing concise APIs hard, in my opinion. And they really need to be closed, because writers won't write all their data until they are (or until they are flushed).

What happens if someone creates a reader flag that conflicts with a compression extension?

The extension will be interpreted as a compression extension. That could lead to confusing error messages since it will then look at the next extension (which might be no extension, i.e. the empty string) and complain that the extension is unknown. Not sure how to improve the error message.

What kinds of things are we thinking for flags?

I don't think I can predict how one could use it for arbitrary readers, but for FASTX, we could interpret the flag string as a collection of characters, each of which is a flag. E.g. doing rdr"path"c would see the c and set copy=false (in the v2 API), making it faster, just like the Inplace function suggested by Sabrina.

When can I have this?

Soon™ :) I think the time bottleneck is just to discuss it, the actual implementation is really easy and I can do it in an afternoon.

@jakobnissen
Copy link
Member Author

We might also want to bikeshed the name transcribe since FASTX v1 (the old version) exports this symbol. This is technically not breaking but could be a little annoying for people. stream is already taken. I kind of like transcribe though: It's short, well known to biologists, and not that common a word.

@kescobo
Copy link
Member

kescobo commented Aug 9, 2022

agree it would be better to only close the files if the transcribe function is the only scope that has access to the file handles, but I don't think it's possible in Julia to check this

Ahh, I guess I was thinking that the user could do eg open(#=something=#) do io... end, where transcribe is called inside the block, but I see now that the open is happening inside the macro call, so there's no handle to close on the outside 🤔 .

The trouble with this approach is that if I make my own reader / writer, but I'm not done with the IO object after the transcribe, I can't keep using them. Some possible solutions:

  1. transcribe just isn't flexible enough to handle this case. Users that want to do something complicated can roll their own. I actually think this is fine, and probably easier to maintain. Most use-cases are handled by this design.
  2. Offer 2 methods, one that's transcribe(f, ::Tuple{Reader, IO}, ::Tuple{Writer, IO}), and then have the macros spit out these tuples, and close the IO objects. The other is transcribe(f, ::Reader, ::Writer), and expects the user to handle the IO separately. I'm not sure how confusing this might be, and passing Tuples as arguments also seems non-ideomatic (though you could also make a 5 argument version that's the main implementation, and then use dispatch if you get the tuples from the macros).

@jakobnissen
Copy link
Member Author

jakobnissen commented Aug 9, 2022

What about a close keyword to transcribe which defaults to true?

@kescobo
Copy link
Member

kescobo commented Aug 9, 2022

I'm OK with that too. I kinda like the idea of the macro emitting the IO object as well. But I also think that, since their main purpose is simplicity, and there are other ways of achieving this goal, perhaps we shouldn't over-complicate it.

@jakobnissen
Copy link
Member Author

So, the macros does return an IO object. transcribe is just a shorthand for applying a function to each element of a reader, and dumping it to a writer.

@jakobnissen
Copy link
Member Author

Hm, I think this needs more work. I just realized that:

  • transcribe does not allow setting up any code that runs before the for loop of every record, say a dict to look up in or something
  • the rdr/wtr macros do not work with do-syntax, which sucks!

@kescobo
Copy link
Member

kescobo commented Aug 10, 2022

does return an IO object

Ahh, I see. I sort of forgot that you can call close directly on a reader/writer. Hmm... Maybe the file handle instead? Anyway, I don't think it's worth this complication for an obscure use-case.

transcribe does not allow setting up any code that runs before the for loop of every record, say a dict to look up in or something

I think this is fine - other functions of this sort (findall, map etc) don't either. I typically use a let block if I need something temporary. In principle, you could have a setup = ... argument or something, but that doesn't feel all that useful.

he rdr/wtr macros do not work with do-syntax, which sucks!

Do you mean you can't do

transcribe(rdr"thing.gz", wtr"thing2.gz") do stuff ... end

? That would indeed suck. But this can't possibly be true... I'm probably misunderstanding what you mean

@jakobnissen
Copy link
Member Author

Transcribe not having setup

I think the lack of setup actually might be quite a problem. It means that in the code body of the function, you have to either not rely on any global-scope objects at all, or else risk having major, major slowdowns because you are working with globals.
For example, suppose I wanted to do the following

fna = FASTARecord()
transcribe(wtr"path1", rdr"path2") do record, reader
    copy!(fna, record)
end

Here, the access to the global fna makes it super slow.

I can see two options:

  • Just have it be slow, whatever. People use Python
  • Rewrite it so you use it like this:
transcribe(wtr"path1", rdr"path2") do writer, reader
    fna = FASTARecord()
    record -> copy!(fna, record)
end

That is, the function body should return a single-argument function, which is then applied to each record. Is this too magical? Also, it opens up risk of encountering the infamous "slow closure bug" #15276, but that may be fixed someday.

Macros not having do-block

I mean you can't do

rdr"some/path.fna" do reader
    [stuff]
end

because the do-syntax requires a function call on the syntax level, i.e. before the macro is expanded.
We might need a wrapper function like with(rdr"path.fna") do .

@kescobo
Copy link
Member

kescobo commented Aug 11, 2022

I think the lack of setup actually might be quite a problem.

does this not help:

let fna = FASTARecord()
    transcribe(wtr"path1", rdr"path2") do record, reader
        copy!(fna, record)
    end
end

?

@jakobnissen
Copy link
Member Author

jakobnissen commented Aug 11, 2022

Hm, when timing it I'm struggling to find a case where the type stability matters. Even when using the highly optimised v2 FASTX and v1 Automa, and converting 5.68 million FASTQ records to FASTA, and when f is completely uninferrable and returns Any, then it still does about 4.5 million records/second.

So yeah. probably this doesn't matter. Eh, whatever, I kind of like the current (as of commit df11fee) approach. The "slow closure" bug won't matter either, if dynamic dispatch is so cheap.

Update: Indeed, performance is hit by the slow closure-bug. But with >4 M records/second, I don't think it's worth worrying about.

@CiaranOMara
Copy link
Member

Can the macro handle the inclusion of FASTA fai and BAM bai index files?

@jakobnissen
Copy link
Member Author

Unfortunately, no. I'm not sure how that would be done, but that would be a good idea.
Perhaps we should have a function index!(::Reader, x) that attaches an index to an existing reader? Then one could do:

with(rdr"/path/to.fna.gz") do reader
    index!(reader, "/path/path/index.fai")
    [ stuff ]
end

We should probably have that function anyway. But if indexing could be added to the macros that would be nice, too.

@jakobnissen
Copy link
Member Author

Okay @kescobo and others:

I'm going to merge this tomorrow, and then BioJulia/FASTX.jl#111 unless there are any objections.
I considered a few different approaches - let me summarize why I think the defer is the best:

  • The syntax rdr"path/to/file.fna.gz" do reader does not parse correctly, and that's not something we can do anything about, so that's not a possible avenue.
  • We could rely on automatic closing: Writers already automatically close their IO when they are being garbage collected, and we could make readers close when they reach EOF. However, I believe this behaviour is too "magical" and behind-the-curtain: Users do not know that it's important to close their files, and so may be dismayed when
    • They open too many files at once without closing (say, if the GC is not kicking in), and their OS yells at them
    • They don't understand why they write to a writer but it has no effect (because the filehandle is not flushed yet)
  • However, for safety, we still close writers when they are being GCd.

AFAIK, the only way Julia allows to defer work is through do-blocks. That requires a function to do the deferral.

We could use open, as suggested by @kescobo. It has the advantage that the usage is like this:

open(rdr"foo.fna") do reader
    ...
end

, which is undoubtedly nice, as it's similar to the normal open-do syntax. However.... I feel like the semantics is off. Just the other day I wrote this on Slack:

I've found that whenever you want to do things that are questionable semantically [ ... ] but you want it because it's practical/useful, it's almost always a trap and you shouldn't do it.

Cases like the following is why open is questionable semantically:

julia> reader = rdr"foo.fna";

julia> isopen(reader)
true

julia> open(i -> nothing, reader)

julia> isopen(reader)
false

It doesn't open - it closes. That's just super confusing. Also, what if a user tries to do:

open(rdr"foo.fna", "r") do x
    ...
end

? We'd have to throw an error, unless we want to support all the normal arguments of Base.open. And that'd be a mess. The basic problem is that what we want to do is NOT really open...

Here are some reasons I think defer is a good choice, even when it's a little obscure:

  • AFAIK, it's the default name for exactly this thing (namely cleaning up resouces) in other languages including Go and Zig.
  • defer is short and easy to type on a QUERTY keyboard

@kescobo
Copy link
Member

kescobo commented Sep 24, 2023

I've found that whenever you want to do things that are questionable semantically [ ... ] but you want it because it's practical/useful, it's almost always a trap and you shouldn't do it.

Yeah - this sentiment feels more julian too. Using open has a distinctly DWIM feeling, which I think is generally discouraged in julia. DWIM often feels superficially nice, which is why it's appealing, but often ends up in a regrettable place.

I still don't love defer, but I can live with it. It doesn't feel like deferring something, since presumably the thing happens when it's called. But I don't have many better suggestions, other than apply, which I like because you're applying some function to the reader/writer.

This commit implements the `@rdr_str` and `@wtr_str` macros, which autodetect
the correct readers, writers and de/compressors to open a biological file
based on the extensions of the path.

The system is extensible to arbitrary biological formats, but the extensions
of compression formats are hardcoded in this package.

I also add a dubious overload to `Base.open`, such that the readers and writer
macros can be used like so:
```julia
open(rdr"foo.fna", wtr"bar.fq") do reader, writer
    ...
end
```
@jakobnissen jakobnissen merged commit 28e7237 into BioJulia:master Sep 27, 2023
@jakobnissen jakobnissen deleted the highlevel branch September 27, 2023 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants