
Improve cudf::io::datasource::create(). #17115

Open
tpn wants to merge 3 commits into branch-24.12 from 17110-fea-improve-cudfiodatasourcecreate

Conversation

@tpn (Contributor) commented Oct 17, 2024

This PR introduces new functionality to cudf::io::datasource that allows for greater control over backend datasource creation. Specifically, the static factory create() method has been expanded to take datasource_kind and datasource_params arguments that parameterize datasource creation (a rough sketch of the expanded API follows the list below).

Also introduced are three new datasources:
- host_source: base class that does simple host-based pread() calls
- odirect_source: derivation of above that uses O_DIRECT
- kvikio_source: simple Kvikio-based class (that does not fall back to mmap)
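
For concreteness, a rough sketch of what the expanded factory could look like. The names, defaults, and the shape of datasource_params below are illustrative, assembled from this thread; they are not the final API:

// Illustrative sketch only; names and defaults are assumptions.
#include <memory>
#include <optional>
#include <string>

namespace cudf::io {

enum class datasource_kind {
  KVIKIO,         // kvikIO-backed reads (GDS when available)
  KVIKIO_COMPAT,  // kvikIO with compatibility (POSIX) mode forced on
  KVIKIO_GDS,     // kvikIO with GDS forced on; fails fast if unavailable
  HOST,           // plain host-based pread() calls (host_source)
  ODIRECT,        // host reads through O_DIRECT (odirect_source)
  HOST_MMAP,      // memory-mapped host reads (memory_mapped_source)
  DEFAULT = KVIKIO,
};

struct datasource_params {
  // Backend-specific knobs would live here, e.g. an O_DIRECT block size.
};

class datasource {
 public:
  static std::unique_ptr<datasource> create(
    std::string const& filepath,
    datasource_kind kind                    = datasource_kind::DEFAULT,
    std::optional<datasource_params> params = std::nullopt);
};

}  // namespace cudf::io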

To NVIDIA cudf folks: this undoubtedly warrants some design discussion, as I'm introducing new ways of doing things that might not align with in-flight or planned tasks for the datasource component. Happy to jump on a call and discuss. One thing that stands out to me is that the KVIKIO vs KVIKIO_COMPAT vs KVIKIO_GDS kinds feel a bit hacky and/or like a leaky abstraction, especially when you factor in alternate configuration paths like cuFile settings or environment variables.

The idea behind the PR overall is that callers get much more control over the exact kind of datasource they want created, including failing fast if they expect a GDS-accelerated source but one can't be created at runtime, for whatever reason.

The datasource_kind::ODIRECT option is also extremely useful for eliminating the variance the page cache introduces when doing back-to-back runs against large data sets (where the presence or absence of data in the cache has a huge impact on runtime).
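
For background on why this needs its own source class: O_DIRECT imposes alignment requirements that a plain pread() path doesn't have. The following is a stand-alone sketch of an aligned bounce-buffer read, not the implementation in this PR; the 4 KiB alignment is an assumption, and real code should query the device's logical block size:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes O_DIRECT in <fcntl.h> on glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

constexpr size_t kAlign = 4096;  // assumed block size for illustration

// Reads `size` bytes at `offset` from `path` into `dst`, returning the
// number of bytes actually copied.
size_t odirect_read(char const* path, size_t offset, size_t size, void* dst)
{
  int const fd = open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) { throw std::runtime_error("open(O_DIRECT) failed"); }

  // O_DIRECT requires the file offset, request size, and buffer address to
  // all be block-aligned, so read a widened aligned window into an aligned
  // bounce buffer and copy the requested slice out of it.
  size_t const aligned_offset = offset & ~(kAlign - 1);
  size_t const shift          = offset - aligned_offset;
  size_t const aligned_size   = (shift + size + kAlign - 1) / kAlign * kAlign;

  void* bounce = nullptr;
  if (posix_memalign(&bounce, kAlign, aligned_size) != 0) {
    close(fd);
    throw std::runtime_error("posix_memalign failed");
  }

  ssize_t const nread = pread(fd, bounce, aligned_size, aligned_offset);
  close(fd);
  if (nread < static_cast<ssize_t>(shift)) {
    free(bounce);
    throw std::runtime_error("pread(O_DIRECT) failed");
  }

  size_t const copied = std::min(size, static_cast<size_t>(nread) - shift);
  std::memcpy(dst, static_cast<char*>(bounce) + shift, copied);
  free(bounce);
  return copied;
}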

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Oct 17, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code.) on Oct 17, 2024
@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch 2 times, most recently from a060422 to fe981b1 on October 21, 2024 at 03:16
@tpn marked this pull request as ready for review on October 21, 2024 at 03:26
@tpn requested a review from a team as a code owner on October 21, 2024 at 03:26
@tpn changed the title from "WIP: Improve cudf::io::datasource::create()." to "Improve cudf::io::datasource::create()." on Oct 21, 2024
* all I/O operations when possible.
*/
KVIKIO = 0,
DEFAULT = KVIKIO,
Contributor:

We're considering having a dynamic default: use a different implementation based on file or system parameters. I don't want to lock the default down in this API.
I'm thinking that we could default to something like AUTO here, or pass datasource_kind as an optional.

Contributor Author:

Yeah AUTO or some not-necessarily-KVIKIO default would be fine. As it currently stands... there's no way of getting the existing behavior, i.e. the file_source backend, which... seems like something that should be reviewed.

So that also means there's no way of getting a simple cuFile-based implementation... a CUFILE datasource kind would be handy.

I also played around with an io_uring implementation... doesn't play nicely with GDS obviously, but should provide the fastest host-based reader if done correctly. Not including that in this patch, too much scope creep.

Overall it sounds like the parameterizing of datasource_kind and datasource_params as an API change is reasonable, though? The other option would be to expose all the *_source classes and let callers create them directly.

*
* N.B. GDS = GPUDirect Storage
*/
enum class datasource_kind {
Contributor:

I notice that an equivalent of what is currently the env var "GDS" is missing :)
I think that's fine, just wanted to leave a note for visibility.

Contributor Author:

Ah yeah, that's basically just cuFile device I/O for everything, right? I think that's what I was referring to above when I said it would probably make sense to have a CUFILE datasource_kind as well.

Contributor Author:

Just added datasource_kind::GDS and corresponding (very simple) cufile_source implementation.

@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from fe981b1 to d6f6d03 on October 23, 2024 at 21:34
Introduce new `datasource_kind` and `datasource_params` data types, and
update the cudf::io::datasource::create() signature to allow
parameterized datasource creation.

Additionally, implement new datasources:

    - host_source: base class that does simple host-based pread() calls
    - odirect_source: derivation of above that uses O_DIRECT
    - kvikio_source: simple Kvikio-based class (that does not fall back
      to mmap)
@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from d6f6d03 to 6797a63 on October 24, 2024 at 18:22
@tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from 6797a63 to 47bccc7 on October 24, 2024 at 18:30
@bdice (Contributor) commented Oct 24, 2024

@tpn Would you be able to avoid force-pushing here? It has a tendency to mess up review comments and we generally avoid it. We always squash on merge, so it is okay to have a messy intermediate history.

@tpn (Contributor, Author) commented Oct 24, 2024

@bdice oh interesting, didn't know that's your preferred style for cudf. Sure, I'll switch to a merge workflow going forward.

* overhead associated with cache thrashing when memory is tight is
* eliminated.
*/
ODIRECT,
Contributor:

Would you mind splitting the O_DIRECT source into a separate PR? I think it's fair to say that it's a separate feature (independent of the create API change).

Contributor:

Actually, these are more intertwined than I realized. Need to review this very carefully :)

Contributor Author:

It would definitely be possible to hoist out O_DIRECT into a separate PR, or possibly into separate commits to ease review. But yeah, things are pretty interrelated.

/**
* @brief Base class for file input. Only implements direct device reads.
*/
class file_source : public datasource {
public:
explicit file_source(char const* filepath) : _file(filepath, O_RDONLY)
explicit file_source(char const* filepath)
Contributor:

The class hierarchy is a bit messy now :)
Ideally we'd have a single root class for reading from files, and derive from there. Leaving this as a note to revisit once we're set on the options we want to expose.

Contributor Author:

Agreed re: the class hierarchy getting messy. A single root class would make sense, and I think it should be the new host_source proposed by this PR, with all other datasources deriving from it (rough shape sketched below). This is sort of interrelated with the next comment below, so I'll write a little more there.
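
For illustration, that single-root shape might look like the following. This is a proposal sketch with bodies elided, not code from this PR:

// Hypothetical single-root hierarchy for file-backed sources.
class datasource { /* existing abstract interface */ };

class host_source : public datasource {
  // plain host-based pread() calls
};
class odirect_source : public host_source {
  // same reads, but through an O_DIRECT file descriptor
};
class kvikio_source : public host_source {
  // kvikIO-backed reads, no mmap fallback
};
class memory_mapped_source : public host_source {
  // host reads served from an mmap'd region
};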

* @return The number of bytes read on success. An exception is thrown on
* error.
*/
[[nodiscard]] size_t host_read(int fd, size_t offset, size_t read_size, uint8_t* dst)
Contributor:

This is file_source without GDS, right? Should file_source be implemented like this? It looks naive compared to these functions.
Another question is whether file_source and host_source should be merged somehow.

Contributor Author:

The role of the existing file_source needs to be reviewed in light of what's being proposed. Not only that, there's not even a way to create a file_source using the updated create() factory routine, so it's technically dead code.

I don't think you can say host_source is the same as the current file_source minus GDS. The current file_source is a hybrid GDS/kvikIO implementation, and benefits from kvikIO threadpool I/O if GDS isn't available. But based on my current understanding, it also issues small I/O (< device_read_threshold) via kvikIO, which may be farmed out to a threadpool, which doesn't seem ideal. (Although I say this without any empirical data.)

My proposal would be to retire the existing file_source, which attempts to tackle a bunch of things depending on the runtime environment, and instead have very specific derived implementations that Just Do One Thing, plus better logic in the create() factory routine (and a default AUTO) that picks the sanest option at runtime. Something like the sketch below.
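
As a rough sketch (gds_available and is_remote_path are hypothetical helpers here, not existing cudf or kvikIO APIs):

#include <memory>
#include <string>

// Hypothetical environment probes; declarations only.
bool gds_available();
bool is_remote_path(std::string const& path);

std::unique_ptr<datasource> create_auto(std::string const& filepath)
{
  if (gds_available()) {           // cuFile driver usable and FS supports GDS
    return std::make_unique<kvikio_source>(filepath);
  }
  if (is_remote_path(filepath)) {  // avoid mmap on network file systems
    return std::make_unique<host_source>(filepath);
  }
  return std::make_unique<memory_mapped_source>(filepath);
}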

* environments, such as when small reads are done against files residing on
* a network file system (including accelerated file systems like WekaFS).
*/
HOST_MMAP,
Contributor:

Currently, the decision whether to memory map is separate from the way we read directly to device. I'm not sure if we want to keep it that way. I think this warrants an internal discussion :)

Contributor Author:

Yeah, mmap is an interesting one. It can suck, hard, on network file systems. On the other hand, if the file resides on a (now-obsolete, unfortunately) pmem device like an Optane DIMM, mmap flies. My only reason for including it in the enum is that memory_mapped_source already exists, so why not. Not the most solid reasoning :-)

@vuule (Contributor) commented Oct 31, 2024

Apologies for the slow response; we've been thoroughly reevaluating our IO options, in part prompted by your PR :)

We're planning to drastically change the available IO options to avoid redundancy and simplify the behavior.

What we're aiming to end up with:

- kvikIO: include three levels of compatibility mode: force off, allow as fallback, force on (sketched just after this list). This gives us parity with the current options, with much less confusion. It does require changes on the kvikIO side, so we can't expose this ATM. We also need to change file_source to use kvikIO for host reads, thus removing the internal implementation.
- memory map: instead of direct file reads, memory map and copy from the buffer (for both host and device reads). Current memory mapping only applies to host reads, so some changes are needed here as well. If kvikIO adds support for memory mapping, this potentially affects the wording of the options too.
- odirect: as implemented in this PR, only with a fallback to kvikIO in compatibility mode. This way we don't re-implement "vanilla" file reads and can outsource that to kvikIO.
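
For reference, a minimal sketch of those three compatibility levels, assuming an enum along these lines (the name and value spellings are illustrative):

enum class compat_mode {
  OFF,    // compatibility off: force GDS, fail fast if it is unavailable
  ALLOW,  // prefer GDS, fall back to POSIX reads when GDS is unavailable
  ON,     // compatibility on: force POSIX reads, never use GDS
};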

I think under this proposal, if the performance numbers work out, we would end up with a base file_source class that only uses kvikIO, and a derived odirect_source class. Until memory mapping is added to kvikIO, we would also keep the derived memory_mapped_source class.
Do you see any issues with this approach?

I'm writing all of this because, as far as I can see, the expanded API is blocked on some of these changes; we'll change the available options to a large extent, so exposing them now would imply breaking changes to create() later. I believe work on odirect_source is not blocked, but it should probably derive from the (soon to be reworked) file_source.

@tpn (Contributor, Author) commented Nov 3, 2024

@vuule that all sounds great. I anticipated that the PR would be far more likely to spur I/O model changes like you've described than to simply get approved and merged as-is, so no worries there. Regarding keeping odirect: as long as it's exposed in such a way that it can be used by downstream cudf users (i.e., included in a public header file), that's really all that would be needed.

It sounds like it would make more sense to close this PR for now and do new ones for the upcoming work. I can take a stab at an odirect-only based one if that's easiest.

Labels
libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: In Progress
Status: External Contribution

3 participants