-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve cudf::io::datasource::create(). #17115
base: branch-24.12
Are you sure you want to change the base?
Improve cudf::io::datasource::create(). #17115
Conversation
a060422
to
fe981b1
Compare
* all I/O operations when possible. | ||
*/ | ||
KVIKIO = 0, | ||
DEFAULT = KVIKIO, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're considering having a dynamic default - use different implementation based on file or system parameters. I don't want to lock the default down in this API.
I'm thinking that we could default to something like AUTO
here, or pass datasource_kind
as optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah AUTO
or some not-necessarily-KVIKIO
default would be fine. As it currently stands... there's no way of getting the existing behavior, i.e. the file_source
backend, which... seems like something that should be reviewed.
So that also means there's no way of getting as simple cuFile
-based implementation... a CUFILE
datasource kind would be handy.
I also played around with an io_uring
implementation... doesn't play nicely with GDS obviously, but should provide the fastest host-based reader if done correctly. Not including that in this patch, too much scope creep.
Overall it sounds like the parameterizing of datasource_kind
and datasource_params
as an API change is reasonable, though? The other option would be to expose all the *_source
classes and let callers create them directly.
* | ||
* N.B. GDS = GPUDirect Storage | ||
*/ | ||
enum class datasource_kind { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that an equivalent of what is currently env var "GDS" is mising :)
I think that's fine, just wanted to leave a not for viz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah that's basically just cuFile
device I/O for everything right? I think that's what I was referring to above when I said it would make sense to probably have a CUFILE
datasource_kind
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added datasource_kind::GDS
and corresponding (very simple) cufile_source
implementation.
fe981b1
to
d6f6d03
Compare
Introduce new `datasource_kind` and `datasource_params` data types, and update the cudf::io::datasource::create() signature to allow parameterized datasource creation. Additionally, implement new datasources: - host_source: base class that does simple host-based pread() calls - odirect_source: derivation of above that uses O_DIRECT - kvikio_source: simple Kvikio-based class (that does not fall back to mmap)
d6f6d03
to
6797a63
Compare
6797a63
to
47bccc7
Compare
@tpn Would you be able to avoid force-pushing here? It has a tendency to mess up review comments and we generally avoid it. We always squash on merge, so it is okay to have a messy intermediate history. |
@bdice oh interesting, didn't know that's your preferred style for cudf. Sure, I'll switch to a merge workflow going forward. |
* overhead associated with cache thrashing when memory is tight is | ||
* eliminated. | ||
*/ | ||
ODIRECT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind splitting the O_DIRECT source into a separate PR? I think it's fair to say that it's a separate feature (dependent of the create
API change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, these are more intertwined than I realized. Need to review this very carefully :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would definitely be possible to hoist-out O_DIRECT into a separate PR. Or possibly separate commits to ease review. But yeah things are pretty interrelated.
/** | ||
* @brief Base class for file input. Only implements direct device reads. | ||
*/ | ||
class file_source : public datasource { | ||
public: | ||
explicit file_source(char const* filepath) : _file(filepath, O_RDONLY) | ||
explicit file_source(char const* filepath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the class hierarchy is a bit messy now :)
ideally we'd have a single root class for reading from files, and derive from there. leaving this as a note to revisit once we're set on the options we want to expose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed re: class hierarchy getting messy. I agree a single root class would make sense, and I think it should be the new host_source
proposed by this PR, with all other datasources deriving from it. This is sort of interrelated with the next comment below, so I'll write a little more there.
* @return The number of bytes read on success. An exception is thrown on | ||
* error. | ||
*/ | ||
[[nodiscard]] size_t host_read(int fd, size_t offset, size_t read_size, uint8_t* dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is file_source without GDS, right? should file_source
be implemented like this? Look naive compared to these functions.
Another question is whether file_source
and host_source
should be merged somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The role of the existing file_source
needs to be reviewed in light of what's being proposed--not only that, but there's not even a way to create file_source
using the updated create()
factory routine, so it's technically dead code.
I don't think you can say host_source
is the same as the current file_source
minus GDS. The current file_source
is a hybrid GDS-kvikio implementation, and benefits from kvikio threadpool I/O if GDS isn't available. But based on my current understanding, it also issues small I/O (< device_read_threshold
) via kvikio, which may be farmed out to a threadpool, which doesn't seem ideal. (Although I say this without any empirical data.)
My proposal would be to retire the existing file_source
, which attempts to tackle a bunch of things depending on runtime environment, then just have very specific derived implementations that Just Do One Thing, then better logic in the create()
factory routine (and a default AUTO
) that picks the most sane option at runtime.
* environments, such as when small reads are done against files residing on | ||
* a network file system (including accelerated file systems like WekaFS). | ||
*/ | ||
HOST_MMAP, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the decision whether to memory map is separate from the way we read directly to device. I'm not sure if we want to keep it that way. I think this warrants an internal discussion :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah mmap is an interesting one. It can suck, hard, on network file systems. On the other hand... if the file resided on a (now-obsolete, unfortunately) pmem device (like an Optane DIMM), mmap would fly. My only reason for including it in the enum is because, well, hey, memory_mapped_source
exists, so why not. Not the most solid reasoning :-)
Apologies for the slow response; we've been thoroughly reevaluating our IO options, in part prompted by your PR :) We're planning to drastically change the available IO options to avoid redundancy and simplify the behavior. What we're aiming to end up with: I think under this proposal, if the performance numbers work out, we would end up with a base I'm writing all of this because, as far as I can we see, the expanded API is blocked on some of these changes; we'll change available options to a large extent, so exposing them now would imply breaking changes to |
@vuule that all sounds great. I anticipated that the PR would be far more likely to spur I/O model changes like you've described, versus just getting approved and merged as-is, so no worries there. Regarding keeping It sounds like it would make more sense to close this PR for now and do new ones for the upcoming work. I can take a stab at an |
This PR introduces new functionality to
cudf::io::datasource
that allows for greater control over backend datasource creation. Specifically, the static factorycreate()
method has been expanded to takedatasource_kind
anddatasource_params
that can be used to parameterize the datasource creation.Also introduced are three new datasources:
- host_source: base class that does simple host-based pread() calls
- odirect_source: derivation of above that uses O_DIRECT
- kvikio_source: simple Kvikio-based class (that does not fall back to mmap)
To NVIDIA cudf folks: this undoubtedly warrants some design discussion as I'm introducing new ways of doing things that might not align with in-flight or planned tasks for the datasource component. Happy to jump on a call and discuss. One thing that stands out to me is that the
KVIKIO
vsKVIKIO_COMPAT
vsKVIKIO_GDS
kinds feels a bit hacky and/or like a leaky abstraction, especially when you factor in alternate configuration like cufile or env vars.The idea behind the PR overall is that callers can have much more control over the exact kind of datasource they want created, including fast-fail if they're expecting to create a GDS-accelerated source but can't at runtime, for whatever reason.
The datasource_kind::ODIRECT is also extremely useful for eliminating the variance associated with the page cache when doing back to back runs of large data sets (where the presence or absence of data in the cache will have a huge impact on runtime).
Checklist