
[DRAFT] refactor to grid stride for data_type_detection kernel #14694

Conversation

@DanialJavady96 (Contributor) commented Jan 2, 2024

Just a small draft to make sure I am not misunderstanding anything before I modify more code. Please let me know if this is not how the grid stride is meant to be implemented. Addresses #14066.
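For reference, the grid-stride pattern being applied looks roughly like this. This is a hedged sketch with illustrative names, not the actual `data_type_detection` kernel signature:

```cuda
// Sketch of a grid-stride loop; `row_count` and the row-processing body
// are placeholders, not the real kernel internals.
__global__ void data_type_detection_sketch(std::size_t row_count)
{
  // Total threads in the grid; each thread hops by this stride, so a
  // fixed-size launch can cover any number of rows.
  auto const stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
  for (std::size_t row =
         static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       row < row_count;
       row += stride) {
    // ... detect the data types present in row `row` ...
  }
}
```

The key property is that correctness no longer depends on the launch configuration: any grid/block combination covers all rows, and the launch size becomes purely a throughput knob.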

This benchmark was run with higher occupancy, using the row count to derive the grid size:

## csv_read_data_type
### [0] NVIDIA GeForce RTX 3080

| data_type |      io       | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |     45x | 335.173 ms | 5.05% | 335.136 ms | 5.05% |        800975115 |         1.202 GiB |       668.564 MiB |
|     FLOAT | DEVICE_BUFFER |     35x | 430.598 ms | 3.25% | 430.556 ms | 3.25% |        623463015 |         1.041 GiB |       713.885 MiB |
|   DECIMAL | DEVICE_BUFFER |     26x | 594.638 ms | 3.35% | 594.632 ms | 3.35% |        451431261 |         1.501 GiB |       783.420 MiB |
| TIMESTAMP | DEVICE_BUFFER |     22x | 686.076 ms | 3.85% | 686.071 ms | 3.85% |        391264596 |         2.281 GiB |       814.268 MiB |
|  DURATION | DEVICE_BUFFER |     18x | 839.339 ms | 3.25% | 839.347 ms | 3.25% |        319814809 |         2.588 GiB |       971.320 MiB |
|    STRING | DEVICE_BUFFER |     76x | 198.889 ms | 4.69% | 198.884 ms | 4.69% |       1349709651 |       859.526 MiB |       277.082 MiB |

Below is a run with a static grid size of 256 blocks, at a much lower occupancy of 15.7%:

# Benchmark Results
## csv_read_data_type
### [0] NVIDIA GeForce RTX 3080

| data_type |      io       | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |     53x | 287.857 ms | 7.67% | 287.807 ms | 7.67% |        932692483 |         1.202 GiB |       668.564 MiB |
|     FLOAT | DEVICE_BUFFER |     46x | 330.720 ms | 6.69% | 330.697 ms | 6.69% |        811726551 |         1.041 GiB |       713.885 MiB |
|   DECIMAL | DEVICE_BUFFER |     34x | 449.532 ms | 6.43% | 449.526 ms | 6.43% |        597151885 |         1.501 GiB |       783.420 MiB |
| TIMESTAMP | DEVICE_BUFFER |     30x | 514.682 ms | 2.84% | 514.679 ms | 2.84% |        521559027 |         2.281 GiB |       814.268 MiB |
|  DURATION | DEVICE_BUFFER |     24x | 630.456 ms | 3.94% | 630.444 ms | 3.94% |        425788224 |         2.588 GiB |       971.320 MiB |
|    STRING | DEVICE_BUFFER |     90x | 166.848 ms | 2.63% | 166.836 ms | 2.63% |       1608974352 |       859.526 MiB |       277.082 MiB |

For reference, cudaOccupancyMaxPotentialBlockSize reports that a block size of 128 with a grid size of 612 achieves maximum occupancy on this GPU.

Some findings:

On my 3080, I am seeing better performance at 15.7% occupancy than at 65-75% occupancy. I do not have access to a workstation GPU, so I cannot profile what would be a practical scenario for a cuIO user. Given that this kernel is largely memory-bound, it is hard for me to simulate a realistic workload, and I hesitate to draw conclusions from benchmarks on my consumer GPU.

I am also curious what kind of performance gains should be expected from the transition to a grid stride. From my understanding, the main problems (which compound each other) are:

  1. Non-coalesced memory access patterns, a consequence of having a single thread process an entire row. This worsens with average row size; I think assigning more threads per row would help.
  2. Suboptimal caching.
  3. Heavy reliance on atomicAdd.
  4. Heavy divergence: since the kernel is generic, it has to handle a variety of types, and many optimizations are lost to that generality.
  5. Highly variable work balance between threads, also a consequence of the variable row lengths.
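To make point 1 concrete: with one thread per variable-length row, adjacent lanes in a warp read addresses that are a full row apart, so loads cannot coalesce. An illustrative sketch (not the actual kernel code):

```cuda
// Thread t scans bytes [row_starts[t], row_starts[t+1]); thread t+1 starts
// a whole row later. Within a warp, the 32 lanes therefore touch addresses
// spread across up to 32 rows, and each load splits into many memory
// transactions instead of one coalesced one.
for (auto pos = row_starts[row]; pos < row_starts[row + 1]; ++pos) {
  char const c = data[pos];  // per-lane stride == row length, not 1 byte
  // ... classify character c ...
}
```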

So I'm curious: would it make more sense to prioritize modernizing the CSV reader? I don't have perspective on the timeline for that, so I can't comment.

From this paper, IIUC, there is a lot of interest in transitioning toward DFAs and converting to columnar format. There are interesting strategies, such as using bitmaps to store the positions of different symbols, scanning them to find which column and row each symbol belongs to, and then using a stable radix sort to convert to columnar format. This would solve many of the problems mentioned above, because the data layout would be converted into something much easier to operate on. That said, I imagine the DFA will be a challenge to get right. Since I lack context and experience with these workflows, I'm curious how progress is faring there and whether I could help somehow.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot) commented Jan 2, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jan 2, 2024
@PointKernel PointKernel added cuIO cuIO issue tech debt improvement Improvement / enhancement to an existing function Performance Performance related issue labels Jan 2, 2024
@PointKernel (Member) commented:

/ok to test

@PointKernel PointKernel added the non-breaking Non-breaking change label Jan 2, 2024
@PointKernel (Member) commented:

@DanialJavady96 Can you update the copyright year to 2024 to unblock the CI?

@@ -658,6 +658,21 @@ set_source_files_properties(
PROPERTIES COMPILE_DEFINITIONS "_FILE_OFFSET_BITS=64"
)

set_source_files_properties(
Contributor review comment on this diff:
I don't think you've changed anything in csv_gpu.cu that requires CMake alterations. Was there an improperly resolved merge conflict somewhere?

@bdice (Contributor) left a review comment:

Currently the grid size is chosen such that the number of threads is equal to the row count. This is the core problem we want to address. We want to enable kernels to act on tables larger than the maximum CUDA grid size (currently this is 2^31 - 1 blocks for 1D grids, and blocks have a max of 1024 threads). See #10368.

It looks like the main change so far is to use a for loop with the grid stride, which is half the solution. The other half of the solution is to change the way that the kernel is launched so that it doesn't depend on the row count -- just the kernel itself, to maximize throughput. You may need to do some benchmarking to determine the best strategy, but I think you'll want cudaOccupancyMaxPotentialBlockSize. Search the codebase for grid_size and you'll see various ways to pick a grid size. It's okay to benchmark on your RTX 3080 for testing purposes, and we can validate the final code for any perf changes on a larger GPU once the PR is ready to merge.

First, compute a grid size with cudaOccupancyMaxPotentialBlockSize (or some other approach).

// Calculate actual block count to use based on records count
int const block_size = csvparse_block_dim;
int const grid_size = (row_starts.size() + block_size - 1) / block_size;

Second, remove this comment once it's no longer true:

* Data is processed in one row/record at a time, so the number of total
* threads (tid) is equal to the number of rows.
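The launch-side half of the change could look something like this. This is a sketch under the assumption that the kernel is named `data_type_detection` as in the PR title; `CUDF_CUDA_TRY` is libcudf's CUDA error-checking macro:

```cuda
// Ask the occupancy API for a block size, and the minimum grid size that
// achieves maximum occupancy, for this specific kernel.
int min_grid_size = 0;
int block_size    = 0;
CUDF_CUDA_TRY(cudaOccupancyMaxPotentialBlockSize(
  &min_grid_size, &block_size, data_type_detection));

// Don't launch more blocks than there is work for; small inputs would
// otherwise spin up blocks whose grid-stride loops never execute.
auto const num_rows  = row_starts.size();
auto const grid_size = std::min<std::size_t>(
  min_grid_size, (num_rows + block_size - 1) / block_size);

data_type_detection<<<grid_size, block_size, 0, stream.value()>>>(/* ... */);
```

With the grid-stride loop in place, this launch is correct for any row count, including tables larger than the maximum CUDA grid size.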

@DanialJavady96 (Contributor, Author) replied:
> *(quoted @bdice's review comment above)*

Ah yeah, as I mentioned in my OP, I did use cudaOccupancyMaxPotentialBlockSize to see what configuration it would suggest, and it reported 612 thread blocks with a block size of 128 threads for my 3080. I should have left that as a comment on that line of code instead; my bad. I benchmarked with both a constant-sized grid and one that is not dependent on the row count. The PR currently just shows the code where the number of threads deployed is unchanged.

The benchmark I included shows that it performs better at much lower occupancy (15% vs. ~75%), and I wanted to rule out spec differences, since my 3080 has much less memory than a V100, for example.

@bdice (Contributor) commented Jan 3, 2024

Ah, thanks. I read the PR description but got a little lost while reading it. I appreciate the clarification.

It surprises me that a lower occupancy grid/block configuration is performing better for you. @vuule Would you have any insights on this? I don't know the CSV kernels very well.

@DanialJavady96 (Contributor, Author) replied:

> *(quoted @bdice's comment above)*

All good! I get tunnel-visioned easily and definitely could have worded this better and more concisely. Something for me to work on 😅

@GregoryKimball (Contributor) commented:

Thank you @DanialJavady96 for studying this issue. For now we should focus on preventing thread indexing overflow and not include performance tuning. Please also see #10368 and #13910 for reference.

@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed tech debt labels Feb 23, 2024