-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abstract ref_to_m5 from cram_io #589
base: develop
Are you sure you want to change the base?
Conversation
db1c511
to
16bcd0c
Compare
I've squashed the commits into two commits and made a pull request for the in-memory I/O at #590. The original history is now at https://github.com/wtsi-hgi/htslib/tree/ref-old |
I've had a look at the double buffering problem - it's hard to stop buffering happening in a hFILE, so I think it's best to have a implement a "wrapper hFILE", which contains extra information, but redirects all hFILE calls to the wrapper object. I've implemented this above and made |
Now the mem-io branch has been merged, is it possible to untangle this PR by removing those merges and rebasing on develop? I gave it a try, but it's a bit tricky and I ended up with something that didn't compile! You understand this bit better than me. |
hfile.c
Outdated
@@ -459,6 +488,14 @@ off_t hseek(hFILE *fp, off_t offset, int whence) | |||
|
|||
int hclose(hFILE *fp) | |||
{ | |||
if(fp->is_hFILE_wrapper){ | |||
fp->backend->close(fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generates a warning:
hfile.c: In function 'hclose':
hfile.c:493:27: warning: ignoring return value of function declared with attribute warn_unused_result [-Wunused-result]
fp->backend->close(fp);
ref.c
Outdated
} | ||
|
||
int ref_close(hFILE* hf){ | ||
int error_code = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused variable
ref.c
Outdated
|
||
if(hfref->mf) | ||
return mfclose(hfref->mf); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undefined return value here. Please add an else clause.
I've rebased the current |
For ease of reviewing the code, this is what I've done in this pull request:
If it would be easier to review, I could also move the changes that implement hFILE_wrappers into a new pull request. |
Please make sure that all tests pass, on both HTSlib and samtools. I get 13 failures when building samtools (develop) against this HTSlib branch. |
In the above commits:
|
aclocal.m4
Outdated
@@ -0,0 +1,290 @@ | |||
# generated automatically by aclocal 1.15.1 -*- Autoconf -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this file from the pull request and add it to .gitignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think aclocal.m4 will be recreated by autoreconf -i
or similar.
test/test_view -D test/xx#rg.tmp.cram > /dev/null && \ | ||
rm -rf $$REF_CACHE; \ | ||
else \ | ||
echo "Warning: testing of EBI fetching disabled"; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complete the message with instructions about how to enable this test.
@@ -121,6 +121,11 @@ AC_ARG_ENABLE([s3], | |||
[support Amazon AWS S3 URLs])], | |||
[], [enable_s3=check]) | |||
|
|||
AC_ARG_ENABLE([http-tests], | |||
[AS_HELP_STRING([--disable-http-tests], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe the option --disable-http-tests
in the Configuration section of the INSTALL file.
@@ -361,6 +365,26 @@ check test: $(BUILT_PROGRAMS) $(BUILT_TEST_PROGRAMS) | |||
test/test-regidx | |||
cd test && REF_PATH=: ./test.pl $${TEST_OPTS:-} | |||
|
|||
test_ebi_fetch: | |||
@if [ "$(ENABLE_HTTP_TESTS)" = true ] ; then \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add this script to a separate .sh
file and run it from here (similarly to what has been done for the tabix test above).
cram/cram_io.c
Outdated
fd->refs->fp = fp; | ||
fd->refs->fn = r->fn; | ||
r->is_md5 = 1; | ||
if (!(tag = sam_hdr_find_key(fd->header, ty, "M5", NULL))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add curly braces to the else block, add them to the if block, as well, even if there is only one instruction in the block.
cram/cram_io.c
Outdated
sanitise_SQ_lines(fd); | ||
if (fd->refs->fp) { | ||
if (bgzf_close(fd->refs->fp) != 0) | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct the indentation of this row.
cram/cram_io.c
Outdated
fd->refs = refs; | ||
if (fd->refs->fp) { | ||
if (bgzf_close(fd->refs->fp) != 0) | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct the indentation of this row.
cram/cram_io.c
Outdated
@@ -2383,7 +2168,7 @@ ref_entry *cram_ref_load(refs_t *r, int id, int is_md5) { | |||
return NULL; | |||
r->fn = e->fn; | |||
if (!(r->fp = bgzf_open_ref(r->fn, "r", is_md5))) | |||
return NULL; | |||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the tab to spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriuo All the code in the cram dir has mix of tabs and spaces. We should probably do a single commit to change it to match the rest of htslib, but for now it is like this because it was initially a direct copy from Staden io_lib which had the code in this layout.
We don't really need to fix these in cram dir in this PR, but we do when we move things to htslib's ref.c (which has been done).
AC_ARG_ENABLE([http-tests], | ||
[AS_HELP_STRING([--disable-http-tests], | ||
[disables tests that require querying to a http server])], | ||
[disable_http_tests=yes], [disable_http_tests=no]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it was decided that HTTP tests should be disabled by default. So the option should be renamed --enable-http-tests
and the associated variable should only be set to true when this option is selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Although @daviesrob may have a trivial perl http server (with no complex perl package dependencies) that could permit testing against a local server without needing network access. (I'm a bit wary of going that route though unless it's very little extra code to maintain.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall aim of this PR is very worth while and it's long overdue for the monstrosity of reference handling in cram_io.c to be tidied up and split into its own file. So thank you for working on this.
The basic restructuring seems sound, although I haven't yet tested it in anger and I see a few potential memory problems lurking. I'll give it some real world testing to see what happens.
I do wonder though whether the whole mFILE thing should be retired in favour of the in-memory hFILE layer istelf. When the CRAM code was written there was no hFILE layer present, but now we have two things serving the a similar purpose. However perhaps this improvement is best done as a subsequent PR.
PS. The main thing that mFILE can do whereas hFILE in-memory code cannot is utilise mmap.
hfile.c
Outdated
@@ -461,6 +489,15 @@ off_t hseek(hFILE *fp, off_t offset, int whence) | |||
|
|||
int hclose(hFILE *fp) | |||
{ | |||
if(fp->is_hFILE_wrapper){ | |||
// need to close this file anyway, ignore the return value | |||
(void)fp->backend->close(fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to check the return value here. You can't discard it with a void cast either as the compiler still notices! :-)
hfile.c: In function 'hclose':
hfile.c:494:9: warning: ignoring return value of function declared with attribute warn_unused_result [-Wunused-result]
(void)fp->backend->close(fp);
^
aclocal.m4
Outdated
@@ -0,0 +1,290 @@ | |||
# generated automatically by aclocal 1.15.1 -*- Autoconf -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think aclocal.m4 will be recreated by autoreconf -i
or similar.
AC_ARG_ENABLE([http-tests], | ||
[AS_HELP_STRING([--disable-http-tests], | ||
[disables tests that require querying to a http server])], | ||
[disable_http_tests=yes], [disable_http_tests=no]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Although @daviesrob may have a trivial perl http server (with no complex perl package dependencies) that could permit testing against a local server without needing network access. (I'm a bit wary of going that route though unless it's very little extra code to maintain.)
cram/cram_io.c
Outdated
@@ -2383,7 +2168,7 @@ ref_entry *cram_ref_load(refs_t *r, int id, int is_md5) { | |||
return NULL; | |||
r->fn = e->fn; | |||
if (!(r->fp = bgzf_open_ref(r->fn, "r", is_md5))) | |||
return NULL; | |||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriuo All the code in the cram dir has mix of tabs and spaces. We should probably do a single commit to change it to match the rest of htslib, but for now it is like this because it was initially a direct copy from Staden io_lib which had the code in this layout.
We don't really need to fix these in cram dir in this PR, but we do when we move things to htslib's ref.c (which has been done).
hfile.c
Outdated
errno = save; | ||
fp = ((hFILE_wrapper*)fp)->innerhf; | ||
|
||
free(fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong. I think you wanted something like
hFILE *wrapper = ((hFILE_wrapper*)fp)->innerhf;
free(fp);
fp = wrapper;
However I'd need to think about it a bit to know which is the inner vs outer we should actually be freeing here.
htslib/ref.h
Outdated
#ifndef HTSLIB_REF_H | ||
#define HTSLIB_REF_H | ||
|
||
#include "bgzf.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is bgzf.h here? I don't think this will work when installed either.
ref.c
Outdated
|
||
// The buffer is freed in mfclose, but we need something for | ||
// hclose to free (the caller of this), or it will error | ||
((hFILE_wrapper*)hf)->innerhf->buffer = malloc(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels icky. Is the comment really true about needing something? Note we can call free(NULL) fine these days. It's something that I still end up checking for out of habit, but it's been this way for decades now.
I haven't tested it, but would prefer ((hFILE_wrapper*)hf)->innerhf->buffer = NULL
if that works. Alternatively if we're about to free innerhf->buffer and it's the same pointer as in the mfile then we can call mfsteal
instead of mfclose
which leaves the data unfreed (maybe nuances with mmap?) and let the hclose free as normal?
ref.c
Outdated
else{ | ||
mf = NULL; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure about the above.
mfsteal
returns the buffer and size from the mFILE and then destroys (frees) the mFILE. It's designed to turn an mFILE into a C buffer basically. Thus after the mfsteal
call mf
is now undefined and we cannot dereference it and nor should we pass it to open_hfile_ref
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm looking further I see how it works. It only uses mf if mfsteal failed, otherwise it sets mf to NULL. mfsteal can only fail if mfdetach fails, which in turn only fails if we're using mmap.
I think the logic in this function is therefore correct (sorry), but I'm wondering if it's a symptom of my mfsteal being too complex in error handling.
At the very least we probably should do something like this to get to a consistent point whether or not the steal fails:
if (!seq) {
seq = mf->data;
mf->data = NULL;
mfdestroy(mf);
}
and remove the mf parameter from open_file_href
. This all ties up with the ref_close comments though too, as that's where href->mf
is used. Is the ref_close only expected to be doing mFILE things due to mmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for spamming. I've finally figured out what's happening here and I feel guilty for commenting on what turns out to be entirely my own demons!
So, just incase anyone else who is reading this cares...
The original code in cram_populate_ref was:
/* Otherwise search full REF_PATH; slower as loads entire file */
if ((mf = open_path_mfile(tag->str+3, ref_path, NULL))) {
size_t sz;
r->seq = mfsteal(mf, &sz);
if (r->seq) {
r->mf = NULL;
} else {
// keep mf around as we couldn't detach
r->seq = mf->data;
r->mf = mf;
}
r->length = sz;
r->is_md5 = 1;
} else {
(The comment would probably have been useful to keep, albeit with explanation that mfdetach won't free the data if we used mmap.)
Basically if the open_path_mfile
used mmap on an md5 file then we need to keep the file descriptor open as we need it for munmap later on. Otherwise once it's open we can just steal the pointer can close it. This would have been the original way it was done, with mmap appearing later on. With hindsight it probably should have kept mf valid in all scenarios, for simplicity, and never attempted the mfsteal part at all. We probably needed a "mf_close_if_able" function or similar that closed the file descriptor if not mmaped so it's no longer consuming resources, but keep the rest of the mFILE struct present.
Hence hFILE has now taken the role as replacement for ref_entry *r
and r->mf
. It's the same code as my original, just with different structure names. Sorry Thomas. :)
I think what all of this points towards is that mFILE is overkill for what we need here. It was designed as a transparent drop in for stdio FILE I/O so it supports read/write, seeking, update, etc. We really just need a load-or-mmap function, which is what the hfile mem layer provides (minus the missing mmap bit).
PS. My head hurts!
ref.c
Outdated
perror(path_tmp); | ||
|
||
// Not fatal - we have the data already so keep going. | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We lost a return 0
(NULL now) in the if (!fp)
clause from the original cram_io.c copy. The intent was if we fail to create a local temporary file name to cache the reference, then so be it - we do still have our reference and the code can continue to work, but for whatever reason (eg permission denied) being able to write to the cache is a non-fatal condition.
However continuing means we end up calling hwrite, hfclose etc on NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have validated that this is an issue and that adding return 0
after the comment fixes it.
Try setting REF_CACHE to a directory that you don't have write access to and running the test-ref tool. It'll segmentation fault.
Edit: not return 0, but something like this:
if (!fp) {
perror(path_tmp);
// Not fatal - we have the data already so keep going.
free(path);
return ref;
}
This still leaks 56 bytes somewhere though, but that appears to happen in the other code paths too:
==29772== 56 bytes in 1 blocks are definitely lost in loss record 4 of 7
==29772== at 0x4C2CFA7: malloc (vg_replace_malloc.c:296)
==29772== by 0x458038: hfile_init_fixed (hfile.c:127)
==29772== by 0x458108: create_hfile_mem (hfile.c:815)
==29772== by 0x458108: hopenv_mem (hfile.c:864)
==29772== by 0x459C4A: hopen (hfile.c:1086)
==29772== by 0x45AD62: mFILE_to_hFILE_ref (ref.c:245)
==29772== by 0x45AE53: m5_to_ref (ref.c:317)
==29772== by 0x43F105: cram_populate_ref (cram_io.c:1934)
==29772== by 0x43F105: cram_get_ref (cram_io.c:2280)
==29772== by 0x429F71: cram_decode_slice (cram_decode.c:2290)
==29772== by 0x42A4F3: cram_next_slice (cram_decode.c:3103)
==29772== by 0x42A4F3: cram_get_seq (cram_decode.c:3176)
==29772== by 0x42AAC0: cram_get_bam_seq (cram_decode.c:3236)
==29772== by 0x419EFD: sam_read1 (sam.c:1346)
==29772== by 0x40417F: main (test_view.c:285)
} | ||
|
||
return error_code; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a newline at the end of file.
cram/cram_io.c
Outdated
return -1; | ||
} | ||
r->seq = seq; | ||
r->mf = hfref->mf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we have hfref=ref
above on line 1981 and free(ref)
at line 2001, this is accessing freed memory. The free statement needs moving down a few lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, the free(ref)
isn't sufficient. If ref is an hFILE_wrapper then we also need to free the innerhf too. Eg:
if (ref->is_hFILE_wrapper)
free(((hFILE_wrapper *)ref)->innerhf); // hack!
free(ref); // Can't hclose as this would free the buffer
This is all feeling like the wrong solution and it's going to fail if we ever have other forms of wrappers, but without the above we leak memory.
I'm trying to recall the reasoning for the wrapper and why it can't just be an hFILE_mem
hfile class without any wrapping up. I think the wrapper is there because we have another type (hFILE_ref
), but perhaps we just want hFILE_mem expanded to be able to do the same as the ref one.
The ref version adds mf and file_name fields. mf is only needed due to mmap and the requirement to unmap at the end. File_name I can't see a use for in this code. If we have requirements for these then we may be better off adding them both to an extended hFILE_mem structure instead of introducing an entirely new wrapper layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbonfield The use of hfile_wrappers comes from the need to use the existing code in cram_populate_ref and functions that call it, without much modification - I didn't want to rewrite the logic in cram_populate_ref
and cram_get_ref
(and maybe other places). This required a structure that could hold both in-memory structures to be contained as well holding arbitrary hFiles generated from https://github.com/wtsi-hgi/htslib/blob/1a6a7e242d75c9b746653ee548d2bce430cb905b/ref.c#L304. The file name, length and mFile (if used) also needed to be stored, where the file name and length are used here: https://github.com/wtsi-hgi/htslib/blob/1a6a7e242d75c9b746653ee548d2bce430cb905b/cram/cram_io.c#L1989 and here: https://github.com/wtsi-hgi/htslib/blob/1a6a7e242d75c9b746653ee548d2bce430cb905b/cram/cram_io.c#L1983.
I could rewrite relevant code such that m5_to_ref
would use an extended in-memory hFile pointing either memory or mmapped data, where this extended structure would contain a flag indicating whether this is an mmapped structure and it's length. However, it may be better to work to merge this PR as it is with the inelegant implementation, then refractor the code at a later time.
This creates a new file: `ref.c`, which contains a function `m5_to_path`. This function returns a valid path to a reference from the MD5 checksum of a reference. This replaces logic in cram_io, open_trace_file and mFILE.
This adds functionality implemented in b39e724.
This pull request extracts the function
m5_to_ref
fromcram_io
and adds tests.m5_to_ref
and helper functions for it are added in a new file calledref.c
.I've also added a new
hFILE
extension called hFILE_ref, internal to ref.c as a wrapper hFILE to carry information required bycram_populate_ref
(such as file name, mFILE handle for MMapped files etc.). and also implemented the proposed in-memory I/O proposed in #417, so that thism5_to_ref
returns a hFILE. Sorry about the number of commits, it took a bit of trying to get this working - I could squash the commits down into one if necessary