Add in-memory I/O using hFILE fixed buffers #590

ThomasHickman · 2017-09-11T15:57:44Z

This PR implements what was proposed in #417, by implementing writing for fixed buffers and implementing a wrapper.

To create an in-memory hFILE:

char* test_string = strdup("Test string");
hFILE *hf = hopen("mem:", "rw:", test_string, 12);

You can also access the internal buffer using:

char* buffer;
size_t size;
hfile_mem_get_buffer(hf, &buffer, &size);

This PR also implements preloading the entire contents of a file into a fixed buffer:

hFILE *hf = hopen("file/path", "r:", "preload");

jkbonfield

Thanks for the PR.

Rob has a far greater knowledge of the hfile overloading code than myself so I'll let him add more comments on that side of thing, specifically the double buffering questions.

jkbonfield · 2017-09-11T16:37:45Z

hfile.c

+    FILE *file = fopen(filename, mode);
+    if (!file) goto error;
+
+    fseek(file, 0, SEEK_END);


You need to check the return value of fseek, in case someone has given a non-seekable stream to this function (e.g. /dev/stdin).
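For illustration, a minimal sketch of the kind of checking being asked for, using only stdio. The helper name slurp_file is hypothetical and this is not the code in the PR; it just shows every fseek/ftell result being tested, and errno being left as the failing call set it rather than overwritten by hand.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not htslib code): read an entire seekable file into a
 * malloc'd buffer. Returns NULL on failure; errno is whatever the failing
 * library call left behind, preserved across the cleanup calls. */
char *slurp_file(const char *filename, size_t *len_out)
{
    char *buffer = NULL;
    long len;
    int save;
    FILE *file = fopen(filename, "rb");
    if (!file) return NULL;

    /* fseek fails on pipes and other non-seekable streams
     * (typically with ESPIPE on POSIX systems). */
    if (fseek(file, 0, SEEK_END) != 0) goto error;
    len = ftell(file);
    if (len < 0) goto error;
    if (fseek(file, 0, SEEK_SET) != 0) goto error;

    /* Allocate at least one byte so an empty file still yields a buffer. */
    buffer = malloc(len > 0 ? (size_t) len : 1);
    if (buffer == NULL) goto error;

    if (fread(buffer, 1, (size_t) len, file) != (size_t) len) goto error;

    fclose(file);
    *len_out = (size_t) len;
    return buffer;

error:
    save = errno;      /* free/fclose may change errno, so keep the original */
    free(buffer);
    fclose(file);
    errno = save;
    return NULL;
}
```

With this shape, a caller that passes a pipe gets NULL back from the first fseek check instead of a bogus length from ftell.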

jkbonfield · 2017-09-11T16:39:36Z

hfile.c

+    char* buffer = malloc(len);
+    if(buffer == NULL)
+    {
+        errno = ENOMEM;


Manually setting errno isn't recommended. Malloc will do this itself, as would the fread below. Indeed we're obscuring the fread return as it may have other more meaningful errno values.

jkbonfield · 2017-09-11T16:48:59Z

hfile.c

@@ -821,6 +862,41 @@ static int init_add_plugin(void *obj, int (*init)(struct hFILE_plugin *),
    return 0;
 }

+hFILE *hopenv_mem(const char *filename, const char *mode, va_list args)


I'm wondering if the existing hopen_mem, which is strictly URL based rather than generic memory based, should be reimplemented in terms of the generic hopenv_mem.

Looks like a good idea, though hopenv_mem has no way to specify a buffer size different from the length of the data that fills it (and I don't think adding one would be a good idea), which hopen_mem needs. I think it would be good to create a function create_hfile_mem that provides the functionality common to both functions (which I have implemented below).

ThomasHickman · 2017-09-12T09:00:08Z

@jkbonfield thanks for the comments, the double buffering issue is actually to do with the hfile_ref structure in #589, so Rob could have a look at the issue there.

valeriuo

Please review.

valeriuo · 2017-09-12T11:32:52Z

hfile.c

@@ -615,6 +615,39 @@ static hFILE *hopen_fd(const char *filename, const char *mode)
    return NULL;
 }

+static hFILE *hpreload_fd(const char *filename, const char *mode)
+{
+    if(!strchr(mode, 'r'))


You should check the value of mode. If it is NULL, strchr will segfault.

valeriuo · 2017-09-12T11:48:13Z

hfile.c

+static hFILE *create_hfile_mem(char* buffer, const char* mode, size_t buf_filled, size_t buf_size)
+{
+    hFILE_mem *fp = (hFILE_mem *) hfile_init_fixed(sizeof(hFILE_mem), mode, buffer, buf_filled, buf_size);
+    if (fp == NULL) { free(buffer); return NULL; }


You should let the caller function free buffer.

valeriuo · 2017-09-12T11:49:32Z

hfile.c

@@ -735,6 +777,8 @@ static hFILE *hopen_mem(const char *url, const char *mode)
        hts_decode_percent(buffer, &length, data);
    }

+    return create_hfile_mem(buffer, mode, length, size);
+
    hFILE_mem *fp = (hFILE_mem *)


This code is unreachable.

valeriuo · 2017-09-13T08:34:38Z

hfile.c

-    hFILE_mem *fp = (hFILE_mem *)
-        hfile_init_fixed(sizeof (hFILE_mem), mode, buffer, length, size);
-    if (fp == NULL) { free(buffer); return NULL; }
+    return create_hfile_mem(buffer, mode, length, size);


You still need to free buffer. I would check the value returned by create_hfile_mem and, if NULL, free(buffer).
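A minimal sketch of the ownership rule being suggested here, assuming a stand-in handle type. The names mem_handle, mem_handle_create and mem_handle_from_string are hypothetical and are not the htslib functions; the point is only that the helper never frees a buffer it did not allocate, and the caller that allocated it cleans up when the helper returns NULL.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an in-memory handle; not htslib's hFILE_mem. */
typedef struct {
    char *buffer;
    size_t filled;   /* bytes of valid data */
    size_t size;     /* total capacity of buffer */
} mem_handle;

/* Wraps an existing buffer. On failure returns NULL and does NOT free
 * buffer: it still belongs to whoever allocated it. */
mem_handle *mem_handle_create(char *buffer, size_t filled, size_t size)
{
    mem_handle *h;
    if (buffer == NULL || filled > size) return NULL;
    h = malloc(sizeof *h);
    if (h == NULL) return NULL;
    h->buffer = buffer;
    h->filled = filled;
    h->size = size;
    return h;
}

/* The caller allocated the buffer, so the caller frees it if wrapping fails. */
mem_handle *mem_handle_from_string(const char *s)
{
    size_t len = strlen(s);
    mem_handle *h;
    char *buffer = malloc(len + 1);
    if (buffer == NULL) return NULL;
    memcpy(buffer, s, len + 1);

    h = mem_handle_create(buffer, len, len + 1);
    if (h == NULL) free(buffer);
    return h;
}
```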

ThomasHickman · 2017-09-22T15:46:31Z

Had another look at this as part of trying to solve the double buffering issue for the hfile_ref structure. I've changed this so the buffer doesn't get freed on hclose - it should be the caller's responsibility to free the buffer as it allocated it.

dkj · 2017-10-03T15:21:18Z

G131 asks @jmarshall do you have time and the inclination to take a peek?

jmarshall · 2017-11-02T11:21:03Z

@dkj: Thanks for the notification, which I have only just seen; will take a peek.

jkbonfield · 2017-12-06T16:46:39Z

The discussion at the start of hfile.c regarding mobile vs immobile windows is now incorrect, as in-memory streams can be "mobile" again. (I'm rather confused on the terminology though.) It probably needs to be explained in a different way, such as a fixed-size memory buffer, with a comment about variable-sized mobile in-memory buffers also being supported.

I'll look at updating this as I have a few other tweaks to do I think.

jkbonfield · 2017-12-06T16:56:03Z

hfile.c

+    }
+
+    hFILE *fp = hpreload_fd(fname, mode);
+    if (fp == NULL && errno == ENOENT) errno = EPROTONOSUPPORT;


I'm unsure why this line exists. hpreload_fd can return NULL/ENOENT for valid reasons, and ENOENT is suitable then I think.

jkbonfield · 2017-12-06T17:40:10Z

hfile.c

 void hfile_destroy(hFILE *fp)
 {
    int save = errno;
-    if (fp) free(fp->buffer);
+    if (fp && fp->backend != &mem_backend) free(fp->buffer);


This extra check means data: hopens now leak memory. Why are we not freeing here?

Changing this back makes test/hfile pass valgrind leak checks provided we also remove the now redundant free(internal_buf) calls.

I think it is probably fine to accept that obtaining the internal buffer doesn't equate to now also owning this buffer. If we wish to have the concept of taking ownership (and hence freeing responsibilities) we'd need an additional flag somewhere.
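To make the borrow-versus-own distinction concrete, here is a rough sketch with an explicit ownership flag. All names (owned_mem, owns_buffer, and the four functions) are hypothetical and only illustrate the idea; they are not the hfile API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stream that tracks who must free its buffer. */
typedef struct {
    char *buffer;
    size_t length;
    int owns_buffer;   /* nonzero: closing the stream frees buffer */
} owned_mem;

/* Opens a stream over a private copy of data; the stream owns the copy. */
owned_mem *owned_mem_open(const char *data, size_t length)
{
    owned_mem *m = malloc(sizeof *m);
    if (m == NULL) return NULL;
    m->buffer = malloc(length > 0 ? length : 1);
    if (m->buffer == NULL) { free(m); return NULL; }
    memcpy(m->buffer, data, length);
    m->length = length;
    m->owns_buffer = 1;
    return m;
}

/* Borrow: the caller may look at the buffer but must not free it. */
char *owned_mem_get_buffer(owned_mem *m, size_t *length)
{
    if (length) *length = m->length;
    return m->buffer;
}

/* Steal: ownership moves to the caller, who must free the result. */
char *owned_mem_steal_buffer(owned_mem *m, size_t *length)
{
    char *buf = m->buffer;
    if (length) *length = m->length;
    m->buffer = NULL;
    m->length = 0;
    m->owns_buffer = 0;
    return buf;
}

/* Close frees the buffer only if the stream still owns it. */
void owned_mem_close(owned_mem *m)
{
    if (m == NULL) return;
    if (m->owns_buffer) free(m->buffer);
    free(m);
}
```

In the common case the caller just closes the stream and nothing leaks; only a caller that explicitly steals the buffer takes on the duty to free it.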

jkbonfield · 2017-12-07T10:30:32Z

I'm not convinced this works.

I created a noddy test case so I could strace it and see what happens:

    {
        char *b = malloc(1000000), *p = b;
        int len;
        printf("XYZZY: Start\n"); fflush(stdout);
        fin = hopen("/tmp/_", "r");
        while ((len = hread(fin, p, 15551)) > 0) {
            p += len;
        }
        len = p-b;
        if (hclose(fin) != 0) abort();
        printf("XYZZY: Read %d bytes\n", len); fflush(stdout);

        char c[15551];
        printf("XYZZY: Preload start\n"); fflush(stdout);
        fin = hopen("/tmp/_", "r:", "preload");
        printf("XYZZY: Preload End\n"); fflush(stdout);
        p = b;
        while ((len = hread(fin, c, 15551)) > 0) {
            if (memcmp(c, p, len)) {
                fail("mismatch at %d-%d\n", p-b, p-b+len);
            }
            p += len;
        }
        printf("XYZZY: Seek\n"); fflush(stdout);
        hseek(fin, 0, SEEK_SET);
        p = b;
        while ((len = hread(fin, c, 15551)) > 0) {
            if (memcmp(c, p, len)) {
                fail("mismatch at %d-%d\n", p-b, p-b+len);
            }
            p += len;
        }
        printf("XYZZY: Closing\n"); fflush(stdout);
        if (hclose(fin) != 0) abort();
    }

This simply rereads the file each time. It hasn't preloaded at all.

The cause is due to hopen and unknown_scheme. It runs find_scheme_handler(fname) but as "/tmp/_" is a non-url it simply returns NULL instead of returning &unknown_scheme. This then just calls hopen_fd and doesn't have any checking for "preload".

We could change this code in hopen, or make find_scheme_handler always return a scheme instead of returning NULL, so that the unknown scheme is always used (which mostly boils down to a straight hopen_fd), but I'm not sure it's the right approach.

I assume the idea of adding a preload argument comes from #417. If this is what we're trying to do, then hacking the unknown scheme handler to support preload feels wrong. I think hopen() itself should check for preload and, if found, proceed with a read loop to preload the file. (If this fails then the entire open fails.) This would make preload URI-agnostic too. I'll have a play and see if this works.

jkbonfield · 2017-12-07T11:50:03Z

Hmm, adding this to va_args doesn't help either really. Normally with va_args we either have a scan along it until we hit a known terminator (eg NULL):

hopen(p->url, "r:", "httphdr:v", p->headers, NULL)

Or we have a format string that controls what the arguments are (think printf("%d %s", 10, "foo")), or we have some specific logic that knows precisely how many arguments are parsed. Eg in this bit of test/hfile.c

test/hfile.c:    fin = hopen("mem:", "r:", test_string, 12);

The problem here is adding preload into this just breaks it. We don't know what follows, if anything, and we can't stop processing our va list. The only thing we can do is check if the first argument is preload and then stop, but that then forbids preloading of anything else that wants to use va_list for its own purposes. In short, we make preload only viable for some things, which is perhaps what the intention was, but if so that's not at all clear.
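A small sketch of the NULL-terminated style of varargs scanning mentioned above, to show why the callee has to know the convention in advance. The function has_option is hypothetical and only works because every caller agrees to pass nothing but strings followed by a NULL; a call such as the "mem:" one, which passes a buffer and a length, could not safely be scanned this way.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan a NULL-terminated list of option strings and
 * report whether `wanted` is among them. The callee cannot discover the
 * number or types of the arguments; it relies entirely on the agreed
 * terminator. */
int has_option(const char *wanted, ...)
{
    va_list args;
    const char *opt;
    int found = 0;

    va_start(args, wanted);
    while ((opt = va_arg(args, const char *)) != NULL) {
        if (strcmp(opt, wanted) == 0) found = 1;
    }
    va_end(args);
    return found;
}
```

Callers should pass the terminator as a pointer, e.g. has_option("preload", "httphdr", "preload", (const char *) NULL), since a bare NULL may be an integer constant in C.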

If we're going down that road, e.g. file only, then frankly we'd be better off just defining it as "preload:filename" and adding a scheme handler for it instead of using the varargs interface. Actually we could even have preload:uri so that "preload:ftp://foo/bar.bam.bai" worked. This layering works, whereas adding it to varargs doesn't.
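A rough sketch of the layering idea, assuming a hypothetical helper named preload_inner_url: the "preload:" handler only needs to strip its own prefix and hand the remainder to whichever opener handles the inner name, whether that is a plain path or another URL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if url starts with "preload:", return a pointer to
 * the inner filename or URL that follows the prefix; otherwise return NULL.
 * The returned pointer aliases the input string. */
const char *preload_inner_url(const char *url)
{
    static const char prefix[] = "preload:";
    size_t n = sizeof prefix - 1;

    if (strncmp(url, prefix, n) != 0) return NULL;
    return url + n;
}
```

For "preload:ftp://foo/bar.bam.bai" the inner part is "ftp://foo/bar.bam.bai", which can then be opened by the normal scheme lookup and read to completion before the caller sees the handle.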

jkbonfield · 2017-12-07T12:32:20Z

@ThomasHickman I have a branch in my home dir at Sanger (~jkb/work/samtools_master/htslib) named mem_io_jkb. It's a work in progress, but attempts to fix some of the above. The preload now works and the buffer is owned by hfile still unless explicitly stolen, so we don't leak memory in the common use case.

I'm unsure though of whether this is better than the original implementation in #417 as I haven't had time to go over that code. There were ongoing discussions between @jmarshall and @pkrusche which fizzled out, but it does appear it mostly did the right thing already bar the preload bit. What was happening with that and this PR? Any comments @jmarshall? You said you'd take a look at this, and obviously also looked at the other PR. Which would you say is the best starting point to work from?

ThomasHickman · 2017-12-12T16:23:13Z

@jkbonfield, thanks for the corrections; I'll add them to my branch. This PR is my attempt to implement the suggestions by @jmarshall on the original pull request: #417 (comment). I think this PR is a better starting point than the original implementation, which was written before the varargs version of hopen existed and therefore uses a very inelegant method of setting the buffer (as detailed in the conversation around that pull request).
As for the interface for preloading files, I think that creating a new scheme for preloading is probably a better interface than the varargs method currently implemented. I'll have a look at implementing that now.

The buffer used in preload is now freed. If we want to take ownership of it and avoid it being freed, the new hfile_mem_steal_buffer function can be called. Also changed the prototype of hfile_mem_get_buffer to return the buffer directly instead of via a pointer to a pointer, as it's simpler and permits inline usage. Changed preload to work on more than straight hFILE_fd, although in the current incarnation it still won't work on anything that attempts to use the varargs hopen interface for anything else.

Replaces the varargs method of specifying that the url should be preloaded. hopen_preload has also been renamed to hpreload to make way for a hopen_preload function that acts as an interface for hfile_add_scheme_handler.

jkbonfield · 2017-12-14T09:31:14Z

Looks good. I tested it works on preload:ftp://blah too, as we'd expected. Running more tests but hopefully merge soon.

jkbonfield · 2017-12-14T09:36:58Z

hfile.c

@@ -908,21 +1019,25 @@ static const struct hFILE_scheme_handler *find_scheme_handler(const char *s)

 hFILE *hopen(const char *fname, const char *mode, ...)
 {
+    hFILE *fp = NULL;


Edit: ignore that - confused by diff here not showing entire function.

jkbonfield · 2017-12-14T09:58:37Z

Squashed, resolved conflicts and merged as 8003166

Add in-memory I/O

80bb84f

ThomasHickman mentioned this pull request Sep 11, 2017

Abstract ref_to_m5 from cram_io #589

Open

jkbonfield reviewed Sep 11, 2017


ThomasHickman added 3 commits September 12, 2017 09:40

Add better error handling in hpreload_fd

58fb979

Add create_hfile_mem

29a6789

Move mem backend code to the correct section

5cd9e07

valeriuo suggested changes Sep 12, 2017


Fix issues with mem hfile

3972015

valeriuo reviewed Sep 13, 2017


Free hFILE buffer when erroring

3d11377

valeriuo approved these changes Sep 22, 2017


Stop freeing hfile mem buffers in hclose

3656732

jkbonfield reviewed Dec 6, 2017


jkbonfield and others added 2 commits December 12, 2017 16:26

Create preload scheme

5fa67e9

Replaces the varargs method of specifying that the url should be preloaded. hopen_preload has also been renamed to hpreload to make way for a hopen_preload function that acts as an interface for hfile_add_scheme_handler.

jkbonfield reviewed Dec 14, 2017


jkbonfield closed this Dec 14, 2017

ThomasHickman deleted the in-mem-io branch December 14, 2017 10:03

jkbonfield mentioned this pull request Dec 14, 2017

Add in-memory I/O #417

Closed
