
Improve cudf::io::datasource::create(). #17115

Open
tpn wants to merge 3 commits into branch-24.12 from 17110-fea-improve-cudfiodatasourcecreate

Conversation

@tpn (Contributor) commented Oct 17, 2024

This PR introduces new functionality to cudf::io::datasource that allows for greater control over backend datasource creation. Specifically, the static factory create() method has been expanded to take datasource_kind and datasource_params arguments that parameterize datasource creation (a rough sketch of the expanded API follows the list below).

Also introduced are three new datasources:
- host_source: base class that does simple host-based pread() calls
- odirect_source: derivation of above that uses O_DIRECT
- kvikio_source: simple Kvikio-based class (that does not fall back to mmap)
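
For concreteness, a rough sketch of what the expanded factory could look like. The names, defaults, and the shape of datasource_params below are illustrative, assembled from this thread; they are not the final API:

// Illustrative sketch only; names and defaults are assumptions.
#include <memory>
#include <optional>
#include <string>

namespace cudf::io {

enum class datasource_kind {
  KVIKIO,         // kvikIO-backed reads (GDS when available)
  KVIKIO_COMPAT,  // kvikIO with compatibility (POSIX) mode forced on
  KVIKIO_GDS,     // kvikIO with GDS forced on; fails fast if unavailable
  HOST,           // plain host-based pread() calls (host_source)
  ODIRECT,        // host reads through O_DIRECT (odirect_source)
  HOST_MMAP,      // memory-mapped host reads (memory_mapped_source)
  DEFAULT = KVIKIO,
};

struct datasource_params {
  // Backend-specific knobs would live here, e.g. an O_DIRECT block size.
};

class datasource {
 public:
  static std::unique_ptr<datasource> create(
    std::string const& filepath,
    datasource_kind kind                    = datasource_kind::DEFAULT,
    std::optional<datasource_params> params = std::nullopt);
};

}  // namespace cudf::io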

To NVIDIA cudf folks: this undoubtedly warrants some design discussion, as I'm introducing new ways of doing things that might not align with in-flight or planned tasks for the datasource component. Happy to jump on a call and discuss. One thing that stands out to me is that the KVIKIO vs KVIKIO_COMPAT vs KVIKIO_GDS kinds feel a bit hacky and/or like a leaky abstraction, especially when you factor in alternate configuration paths like cuFile settings or environment variables.

The idea behind the PR overall is that callers get much more control over the exact kind of datasource they want created, including failing fast if they expect a GDS-accelerated source but one can't be created at runtime, for whatever reason.

The datasource_kind::ODIRECT option is also extremely useful for eliminating the variance the page cache introduces when doing back-to-back runs against large data sets (where the presence or absence of data in the cache has a huge impact on runtime).
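
For background on why this needs its own source class: O_DIRECT imposes alignment requirements that a plain pread() path doesn't have. The following is a stand-alone sketch of an aligned bounce-buffer read, not the implementation in this PR; the 4 KiB alignment is an assumption, and real code should query the device's logical block size:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes O_DIRECT in <fcntl.h> on glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

constexpr size_t kAlign = 4096;  // assumed block size for illustration

// Reads `size` bytes at `offset` from `path` into `dst`, returning the
// number of bytes actually copied.
size_t odirect_read(char const* path, size_t offset, size_t size, void* dst)
{
  int const fd = open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) { throw std::runtime_error("open(O_DIRECT) failed"); }

  // O_DIRECT requires the file offset, request size, and buffer address to
  // all be block-aligned, so read a widened aligned window into an aligned
  // bounce buffer and copy the requested slice out of it.
  size_t const aligned_offset = offset & ~(kAlign - 1);
  size_t const shift          = offset - aligned_offset;
  size_t const aligned_size   = (shift + size + kAlign - 1) / kAlign * kAlign;

  void* bounce = nullptr;
  if (posix_memalign(&bounce, kAlign, aligned_size) != 0) {
    close(fd);
    throw std::runtime_error("posix_memalign failed");
  }

  ssize_t const nread = pread(fd, bounce, aligned_size, aligned_offset);
  close(fd);
  if (nread < static_cast<ssize_t>(shift)) {
    free(bounce);
    throw std::runtime_error("pread(O_DIRECT) failed");
  }

  size_t const copied = std::min(size, static_cast<size_t>(nread) - shift);
  std::memcpy(dst, static_cast<char*>(bounce) + shift, copied);
  free(bounce);
  return copied;
}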

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Oct 17, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code.) on Oct 17, 2024
@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch 2 times, most recently from a060422 to fe981b1 on October 21, 2024 at 03:16
@tpn marked this pull request as ready for review on October 21, 2024 at 03:26
@tpn requested a review from a team as a code owner on October 21, 2024 at 03:26
@tpn changed the title from "WIP: Improve cudf::io::datasource::create()." to "Improve cudf::io::datasource::create()." on Oct 21, 2024
* all I/O operations when possible.
*/
KVIKIO = 0,
DEFAULT = KVIKIO,
Contributor:

We're considering having a dynamic default: use a different implementation based on file or system parameters. I don't want to lock the default down in this API.
I'm thinking that we could default to something like AUTO here, or pass datasource_kind as an optional.

Contributor Author:

Yeah AUTO or some not-necessarily-KVIKIO default would be fine. As it currently stands... there's no way of getting the existing behavior, i.e. the file_source backend, which... seems like something that should be reviewed.

So that also means there's no way of getting a simple cuFile-based implementation... a CUFILE datasource kind would be handy.

I also played around with an io_uring implementation... doesn't play nicely with GDS obviously, but should provide the fastest host-based reader if done correctly. Not including that in this patch, too much scope creep.

Overall it sounds like the parameterizing of datasource_kind and datasource_params as an API change is reasonable, though? The other option would be to expose all the *_source classes and let callers create them directly.

*
* N.B. GDS = GPUDirect Storage
*/
enum class datasource_kind {
Contributor:

I notice that an equivalent of what is currently the env var "GDS" is missing :)
I think that's fine, just wanted to leave a note for visibility.

Contributor Author:

Ah yeah, that's basically just cuFile device I/O for everything, right? I think that's what I was referring to above when I said it would probably make sense to have a CUFILE datasource_kind as well.

Contributor Author:

Just added datasource_kind::GDS and corresponding (very simple) cufile_source implementation.

@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from fe981b1 to d6f6d03 on October 23, 2024 at 21:34
Introduce new `datasource_kind` and `datasource_params` data types, and
update the cudf::io::datasource::create() signature to allow
parameterized datasource creation.

Additionally, implement new datasources:

    - host_source: base class that does simple host-based pread() calls
    - odirect_source: derivation of above that uses O_DIRECT
    - kvikio_source: simple Kvikio-based class (that does not fall back
      to mmap)
@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from d6f6d03 to 6797a63 on October 24, 2024 at 18:22
@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from 6797a63 to 47bccc7 on October 24, 2024 at 18:30
@bdice (Contributor) commented Oct 24, 2024

@tpn Would you be able to avoid force-pushing here? It has a tendency to mess up review comments and we generally avoid it. We always squash on merge, so it is okay to have a messy intermediate history.

@tpn (Contributor, Author) commented Oct 24, 2024

@bdice oh interesting, didn't know that's your preferred style for cudf. Sure, I'll switch to a merge workflow going forward.

* overhead associated with cache thrashing when memory is tight is
* eliminated.
*/
ODIRECT,
Contributor:

Would you mind splitting the O_DIRECT source into a separate PR? I think it's fair to say that it's a separate feature (independent of the create API change).

Contributor:

Actually, these are more intertwined than I realized. Need to review this very carefully :)

Contributor Author:

It would definitely be possible to hoist out O_DIRECT into a separate PR, or possibly into separate commits to ease review. But yeah, things are pretty interrelated.

/**
* @brief Base class for file input. Only implements direct device reads.
*/
class file_source : public datasource {
public:
explicit file_source(char const* filepath) : _file(filepath, O_RDONLY)
explicit file_source(char const* filepath)
Contributor:

The class hierarchy is a bit messy now :)
Ideally we'd have a single root class for reading from files, and derive from there. Leaving this as a note to revisit once we're set on the options we want to expose.

Contributor Author:

Agreed re: the class hierarchy getting messy. A single root class would make sense, and I think it should be the new host_source proposed by this PR, with all other datasources deriving from it (rough shape sketched below). This is sort of interrelated with the next comment below, so I'll write a little more there.
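
For illustration, that single-root shape might look like the following. This is a proposal sketch with bodies elided, not code from this PR:

// Hypothetical single-root hierarchy for file-backed sources.
class datasource { /* existing abstract interface */ };

class host_source : public datasource {
  // plain host-based pread() calls
};
class odirect_source : public host_source {
  // same reads, but through an O_DIRECT file descriptor
};
class kvikio_source : public host_source {
  // kvikIO-backed reads, no mmap fallback
};
class memory_mapped_source : public host_source {
  // host reads served from an mmap'd region
};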

* @return The number of bytes read on success. An exception is thrown on
* error.
*/
[[nodiscard]] size_t host_read(int fd, size_t offset, size_t read_size, uint8_t* dst)
Contributor:

This is file_source without GDS, right? Should file_source be implemented like this? It looks naive compared to these functions.
Another question is whether file_source and host_source should be merged somehow.

Contributor Author:

The role of the existing file_source needs to be reviewed in light of what's being proposed. Not only that, there's not even a way to create a file_source using the updated create() factory routine, so it's technically dead code.

I don't think you can say host_source is the same as the current file_source minus GDS. The current file_source is a hybrid GDS/kvikIO implementation, and benefits from kvikIO threadpool I/O if GDS isn't available. But based on my current understanding, it also issues small I/O (< device_read_threshold) via kvikIO, which may be farmed out to a threadpool, which doesn't seem ideal. (Although I say this without any empirical data.)

My proposal would be to retire the existing file_source, which attempts to tackle a bunch of things depending on the runtime environment, and instead have very specific derived implementations that Just Do One Thing, plus better logic in the create() factory routine (and a default AUTO) that picks the sanest option at runtime. Something like the sketch below.
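
As a rough sketch (gds_available and is_remote_path are hypothetical helpers here, not existing cudf or kvikIO APIs):

#include <memory>
#include <string>

// Hypothetical environment probes; declarations only.
bool gds_available();
bool is_remote_path(std::string const& path);

std::unique_ptr<datasource> create_auto(std::string const& filepath)
{
  if (gds_available()) {           // cuFile driver usable and FS supports GDS
    return std::make_unique<kvikio_source>(filepath);
  }
  if (is_remote_path(filepath)) {  // avoid mmap on network file systems
    return std::make_unique<host_source>(filepath);
  }
  return std::make_unique<memory_mapped_source>(filepath);
}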

* environments, such as when small reads are done against files residing on
* a network file system (including accelerated file systems like WekaFS).
*/
HOST_MMAP,
Contributor:

Currently, the decision whether to memory map is separate from the way we read directly to device. I'm not sure if we want to keep it that way. I think this warrants an internal discussion :)

Contributor Author:

Yeah, mmap is an interesting one. It can suck, hard, on network file systems. On the other hand, if the file resides on a (now-obsolete, unfortunately) pmem device like an Optane DIMM, mmap flies. My only reason for including it in the enum is that memory_mapped_source already exists, so why not. Not the most solid reasoning :-)

@vuule (Contributor) commented Oct 31, 2024

Apologies for the slow response; we've been thoroughly reevaluating our IO options, in part prompted by your PR :)

We're planning to drastically change the available IO options to avoid redundancy and simplify the behavior.

What we're aiming to end up with:

- kvikIO: include three levels of compatibility mode: force off, allow as fallback, force on (sketched just after this list). This gives us parity with the current options, with much less confusion. It does require changes on the kvikIO side, so we can't expose this ATM. We also need to change file_source to use kvikIO for host reads, thus removing the internal implementation.
- memory map: instead of direct file reads, memory map and copy from the buffer (for both host and device reads). Current memory mapping only applies to host reads, so some changes are needed here as well. If kvikIO adds support for memory mapping, this potentially affects the wording of the options too.
- odirect: as implemented in this PR, only with a fallback to kvikIO in compatibility mode. This way we don't re-implement "vanilla" file reads and can outsource that to kvikIO.
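
For reference, a minimal sketch of those three compatibility levels, assuming an enum along these lines (the name and value spellings are illustrative):

enum class compat_mode {
  OFF,    // compatibility off: force GDS, fail fast if it is unavailable
  ALLOW,  // prefer GDS, fall back to POSIX reads when GDS is unavailable
  ON,     // compatibility on: force POSIX reads, never use GDS
};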

I think under this proposal, if the performance numbers work out, we would end up with a base file_source class that only uses kvikIO, and a derived odirect_source class. Until memory mapping is added to kvikIO, we would also keep the derived memory_mapped_source class.
Do you see any issues with this approach?

I'm writing all of this because, as far as I can see, the expanded API is blocked on some of these changes; we'll change the available options to a large extent, so exposing them now would imply breaking changes to create() later. I believe work on odirect_source is not blocked, but it should probably derive from the (soon to be reworked) file_source.

@tpn (Contributor, Author) commented Nov 3, 2024

@vuule that all sounds great. I anticipated that the PR would be far more likely to spur I/O model changes like you've described than to simply get approved and merged as-is, so no worries there. Regarding keeping odirect: as long as it's exposed in such a way that it can be used by downstream cudf users (i.e., included in a public header file), that's really all that would be needed.

It sounds like it would make more sense to close this PR for now and do new ones for the upcoming work. I can take a stab at an odirect-only based one if that's easiest.

Labels
libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: In Progress
Status: External Contribution

3 participants