Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add in-memory I/O using hFILE fixed buffers #590

Closed
wants to merge 9 commits into from

Conversation

ThomasHickman
Copy link
Contributor

This PR implements what was proposed in #417, by implementing writing for fixed buffers and implementing a wrapper.

To create a in-memory hFILE:

char* test_string = strdup("Test string");
hFILE hf = open("mem:", "rw:", test_string, 12);

you can also access the internal buffer using:

char* buffer;
size_t size;
hfile_mem_get_buffer(hf, &buffer, &size);

This PR also implements preloading the enter contents of a file into a fixed buffer:

hFILE hf = hopen("file/path", "r:", "preload");

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

Rob has a far greater knowledge of the hfile overloading code than myself so I'll let him add more comments on that side of thing, specifically the double buffering questions.

hfile.c Outdated
FILE *file = fopen(filename, mode);
if (!file) goto error;

fseek(file, 0, SEEK_END);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to check the return value of fseek, incase someone has given a non-seekable stream to this function. (Eg /dev/stdin)

hfile.c Outdated
char* buffer = malloc(len);
if(buffer == NULL)
{
errno = ENOMEM;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually setting errno isn't recommended. Malloc will do this itself, as would the fread below. Indeed we're obscuring the fread return as it may have other more meaningful errno values.

hfile.c Outdated
@@ -821,6 +862,41 @@ static int init_add_plugin(void *obj, int (*init)(struct hFILE_plugin *),
return 0;
}

hFILE *hopenv_mem(const char *filename, const char *mode, va_list args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the existing hopen_mem, which is strictly URL based rather than generic memory based, should be reimplemented in terms of the generic hopenv_mem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good idea, though in hopenv_mem there isn't (and I don't think it would be a good idea to) a way to specify a different buffer size from the length of the buffer that is filled - this is needed in hopen_mem. I think it would be a good to create a function create_hfile_mem, which would do the functionality in common with both functions (which I have implemented below)

@ThomasHickman
Copy link
Contributor Author

@jkbonfield thanks for the comments, the double buffering issue is actually to do with the hfile_ref structure in #589, so Rob could have a look at the issue there.

Copy link
Contributor

@valeriuo valeriuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review.

hfile.c Outdated
@@ -615,6 +615,39 @@ static hFILE *hopen_fd(const char *filename, const char *mode)
return NULL;
}

static hFILE *hpreload_fd(const char *filename, const char *mode)
{
if(!strchr(mode, 'r'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should check mode value. If NULL, strchr will seg fault.

hfile.c Outdated
static hFILE *create_hfile_mem(char* buffer, const char* mode, size_t buf_filled, size_t buf_size)
{
hFILE_mem *fp = (hFILE_mem *) hfile_init_fixed(sizeof(hFILE_mem), mode, buffer, buf_filled, buf_size);
if (fp == NULL) { free(buffer); return NULL; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should let the caller function free buffer.

hfile.c Outdated
@@ -735,6 +777,8 @@ static hFILE *hopen_mem(const char *url, const char *mode)
hts_decode_percent(buffer, &length, data);
}

return create_hfile_mem(buffer, mode, length, size);

hFILE_mem *fp = (hFILE_mem *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is unreachable.

hfile.c Outdated
hFILE_mem *fp = (hFILE_mem *)
hfile_init_fixed(sizeof (hFILE_mem), mode, buffer, length, size);
if (fp == NULL) { free(buffer); return NULL; }
return create_hfile_mem(buffer, mode, length, size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to free buffer. I would check the value returned by create_hfile_mem and, if NULL, free(buffer).

@ThomasHickman
Copy link
Contributor Author

Had another look at this as part of trying to solve the double buffering issue for the hfile_ref structure. I've changed this so the buffer doesn't get freed on hclose - it should be the caller's responsibility to free the buffer as it allocated it.

@dkj
Copy link
Member

dkj commented Oct 3, 2017

G131 asks @jmarshall do you have time and the inclination to take a peek?

@jmarshall
Copy link
Member

@dkj: Thanks for the notification, which I have only just seen; will take a peek.

@jkbonfield
Copy link
Contributor

The discussion at the start of hfile.c regarding mobile vs immobile windows is now incorrect, as in-memory streams can be "mobile" again. (I'm rather confused on the terminology though.) It probably needs to be explained in a different way, such as a fixed-size memory buffer, with a comment about variable-sized mobile in-memory buffers also being supported.

I'll look at updating this as I have a few other tweaks to do I think.

hfile.c Outdated
}

hFILE *fp = hpreload_fd(fname, mode);
if (fp == NULL && errno == ENOENT) errno = EPROTONOSUPPORT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure why this line exists. hpreload_fd can return NULL/ENOENT for valid reasons, and ENOENT is suitable then I think.

hfile.c Outdated
void hfile_destroy(hFILE *fp)
{
int save = errno;
if (fp) free(fp->buffer);
if (fp && fp->backend != &mem_backend) free(fp->buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extra check means data: hopens now leak memory. Why are we not freeing here?

Changing this back makes test/hfile pass valgrind leak checks provided we also remove the now redundant free(internal_buf) calls.

I think it is probably fine to accept that obtaining the internal buffer doesn't equate to now also owning this buffer. If we wish to have the concept of taking ownership (and hence freeing responsibilities) we'd need an additional flag somewhere.

@jkbonfield
Copy link
Contributor

I'm not convinced this works.

I created a noddy test case so I could strace it and see what happens:

    {
        char *b = malloc(1000000), *p = b;
        int len;
        printf("XYZZY: Start\n"); fflush(stdout);
        fin = hopen("/tmp/_", "r");
        while ((len = hread(fin, p, 15551)) > 0) {
            p += len;
        }
        len = p-b;
        if (hclose(fin) != 0) abort();
        printf("XYZZY: Read %d bytes\n", len); fflush(stdout);

        char c[15551];
        printf("XYZZY: Preload start\n"); fflush(stdout);
        fin = hopen("/tmp/_", "r:", "preload");
        printf("XYZZY: Preload End\n"); fflush(stdout);
        p = b;
        while ((len = hread(fin, c, 15551)) > 0) {
            if (memcmp(c, p, len)) {
                fail("mismatch at %d-%d\n", p-b, p-b+len);
            }
            p += len;
        }
        printf("XYZZY: Seek\n"); fflush(stdout);
        hseek(fin, 0, SEEK_SET);
        p = b;
        while ((len = hread(fin, c, 15551)) > 0) {
            if (memcmp(c, p, len)) {
                fail("mismatch at %d-%d\n", p-b, p-b+len);
            }
            p += len;
        }
        printf("XYZZY: Closing\n"); fflush(stdout);
        if (hclose(fin) != 0) abort();
    }

This simply rereads the file each time. It hasn't preloaded at all.

The cause is due to hopen and unknown_scheme. It runs find_scheme_handler(fname) but as "/tmp/_" is a non-url it simply returns NULL instead of returning &unknown_scheme. This then just calls hopen_fd and doesn't have any checking for "preload".

We could change this code in hopen, or make find_scheme_handler always return a scheme instead of returning NULL, so that the unknown scheme is always used (which boils own to straight hopen_fd mostly), but I'm not sure it's the right approach.

I assume the idea of adding a preload argument comes from #417. If this is what we're trying to do, then then hacking the unknown scheme handler to support preload feels wrong. I think hopen() itself should check for preload and if found then proceed with a read loop to preload the file. (If this fails then the entire open fails.) This would make preload uri agnostic too. I'll have a play and see if this works.

@jkbonfield
Copy link
Contributor

Hmm, adding this to va_args doesn't help either really. Normally with va_args we either have a scan along it until we hit a known terminator (eg NULL):

hopen(p->url, "r:", "httphdr:v", p->headers, NULL)

Or we have a format string that controls what the arguments are (think printf("%d %s", 10, "foo")), or we have some specific logic that knows precisely how many arguments are parsed. Eg in this bit of test/hfile.c

test/hfile.c:    fin = hopen("mem:", "r:", test_string, 12);

The problem here is adding preload into this just breaks it. We don't know what follows, if anything, and we can't stop processing our va list. The only thing we can do is check if the first argument is preload and then stop, but that then forbids preloading of anything else that wants to use va_list for its own purposes. In short, we make preload only viable for some things, which is perhaps what the intention was, but if so that's not at all clear.

If we're going down that road, eg file only, then frankly we'd be better off just defining it as "preload:filename" and adding a scheme handler for it instead of using the varargs interface. Actually we could even have preload:uri so that "preload:ftp://foo/bar.bam.bai" worked. This layering works, whereas adding it to varags doesn't.

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 7, 2017

@ThomasHickman I have a branch in my home dir at Sanger (~jkb/work/samtools_master/htslib) named mem_io_jkb. It's a work in progress, but attempts to fix some of the above. The preload now works and the buffer is owned by hfile still unless explicitly stolen, so we don't leak memory in the common use case.

I'm unsure though of whether this is better than the original implementation in #417 as I haven't had time to go over that code. There were ongoing discussions between @jmarshall and @pkrusche which fizzled out, but it does appear it mostly did the right thing already bar the preload bit. What was happening with that and this PR? Any comments @jmarshall? You said you'd take a look at this, and obviously also looked at the other PR. Which would you say is the best starting point to work from?

@ThomasHickman
Copy link
Contributor Author

@jkbonfield, thanks for the corrections, I'll add them into my branch. This PR is my attempt to implement suggestions by @jmarshall on the original pull request: #417 (comment) - I think this PR is a better place to go from than the original implementation, as this was written when the varargs version of hopen wasn't implemented and therefore uses a very inelegant method of setting the buffer (as detailed in the conversation around that pull request)
As for the interface for preloading files, I think that creating a new scheme for preloading is probably a better interface that the varargs method currently implemented. I'll have a look at implementing that now

jkbonfield and others added 2 commits December 12, 2017 16:26
The buffer used in preload is now freed.  If we want to take ownership
of it and avoid it being freed, the new hfile_mem_steal_buffer
function can be called.

Also changed the prototype of hfile_mem_get_buffer to return the
buffer directly instead of via a pointer to a pointer as it's
simpler and permits inline usage.

Changed preload to work on more than straight hFILE_fd, although in
the current incarnation it still won't work on anything that attempts
to use the varags hopen interface for anything else.
Replaces the vargargs method of specifying that the url should be preloaded.
hopen_preload has also been renamed to hpreload to make way for a hopen_preload function that acts as an interface for hfile_add_scheme_handler
@jkbonfield
Copy link
Contributor

Looks good. I tested it works on preload:ftp://blah too, as we'd expected. Running more tests but hopefully merge soon.

@@ -908,21 +1019,25 @@ static const struct hFILE_scheme_handler *find_scheme_handler(const char *s)

hFILE *hopen(const char *fname, const char *mode, ...)
{
hFILE *fp = NULL;
Copy link
Contributor

@jkbonfield jkbonfield Dec 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: ignore that - confused by diff here not showing entire function.

@jkbonfield
Copy link
Contributor

Squashed, resolved conflicts and merged as 8003166

@jkbonfield jkbonfield closed this Dec 14, 2017
@ThomasHickman ThomasHickman deleted the in-mem-io branch December 14, 2017 10:03
@jkbonfield jkbonfield mentioned this pull request Dec 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants