[GraphBolt] sample_neighbors() on CPU with prob/mask is 14x slower than w/o prob/mask #7462
Comments
As @mfbalin suggested,
I'm trying to reproduce your result. I didn't change any configuration or parameter options, so I assume those options (non-dist, gb, homo, nc, no prob, etc.) are aligned with the current example code. For the prob/mask data, do we regenerate it: (a) once for the whole code execution, (b) once per batch of training, or (c) once per iteration? (A sketch of the three plans follows this comment.) I tried all three prob/mask generation plans, but I manually put the regeneration directly in the batch or iteration loop, which might be optimizable. I ran on the
I hypothesized that one reason for the huge performance difference is that we recompute the whole prob/mask data per iteration, which can be really costly for a huge dataset. For the prob data, 57.52/3.43 = 16.76, which is roughly the 14x difference, but it confuses me that the mask data doesn't show a comparably large difference. I'm also confused by your result:
P.S. Shall we update the link to the node classification example in the description? It has been moved to a different path.
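To make the three generation plans concrete, here is a minimal sketch; the attribute names, the `torch.rand`/`torch.randint` choices, and attaching via `graph.edge_attributes` are assumptions for illustration, not the exact example code:

```python
import torch

def regenerate_prob_mask(graph):
    # Assumed attachment point: the sampling graph's edge_attributes dict.
    num_edges = graph.total_num_edges
    graph.edge_attributes["prob"] = torch.rand(num_edges)
    graph.edge_attributes["mask"] = torch.randint(0, 2, (num_edges,), dtype=torch.bool)

# (a) once for the whole execution: call regenerate_prob_mask(graph) before training.
# (b) once per batch of training:   call it at the top of each training batch loop.
# (c) once per iteration:           call it inside the innermost minibatch loop.
```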
For the issue I hit when reporting, the prob/mask data is generated once when we load the dataset, so it is not changed during model training. So according to the results you shared above, you didn't reproduce the issue. Could you file a draft PR with your reproduction code changes? Then I could review it and find out where the discrepancy is. Feel free to update the example path accordingly.
@Rhett-Ying @az15240 You probably ran the GPU version of the kernels. It is expected that the GPU results are this fast. You need to move the graph to the CPU to measure CPU sampling time.
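For example, something like the following keeps the sampling kernels on the CPU while timing; this is a sketch, and `.to("cpu")` on the loaded graph is an assumption about the dataset object rather than the exact example code:

```python
# Keep the graph on the CPU so the CPU sampling kernels are what gets measured;
# only later stages (feature copy, training) should be moved to the GPU.
graph = dataset.graph.to("cpu")
print(graph.csc_indptr.device)  # sanity check: should report the CPU device before benchmarking
```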
Please check out PR #7580 for more information!
@Rhett-Ying How did you calculate the average time for sampling? I think it is more direct if we focus on the number of iterations per second. This is what I get using the CPU, and it seems like the runtime isn't too bad. None: 2.32it/s
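For reference, the two metrics relate roughly like this; the sketch assumes the tqdm it/s counter covers the whole training iteration while "Average time for sampling" covers only the sampling stage, and the numbers plugged in are illustrative:

```python
iterations_per_second = 2.32                          # e.g. the "None: 2.32it/s" figure above
seconds_per_iteration = 1.0 / iterations_per_second   # ~0.43 s per full training iteration
average_sampling_time = 0.08                          # hypothetical per-iteration sampling time
sampling_share = average_sampling_time / seconds_per_iteration
print(f"{seconds_per_iteration:.2f}s/it, sampling is {sampling_share:.0%} of the iteration")
```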
Try using
You might want to use
@az15240 As mentioned above, please make sure you're using the CPU for sampling. This issue happens with CPU sampling only. I think we already see the inefficiency of CPU sampling with prob/mask from the regression you previously added.
I don't know why, but I got a
Notes:
NumPick decides the number of nodes/edges to pick, for example when there is a conflict between the fanout number and the number of non-zero prob/mask entries (see the sketch after these notes).
The first option does not change the performance, since it modifies TemporalNumPick instead of NumPick.
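A rough illustration of the decision NumPick makes, for the without-replacement case; this is a Python sketch rather than the actual C++ logic, and the convention that a fanout of -1 means "take all eligible neighbors" is my reading of the API:

```python
import torch

def num_pick(fanout, neighbor_probs):
    """Sketch: how many neighbors can actually be picked for one node."""
    # Only edges with a non-zero prob/mask entry are eligible for sampling.
    num_eligible = int((neighbor_probs != 0).sum())
    if fanout == -1:
        # Assumed convention: fanout of -1 means "take every eligible neighbor".
        return num_eligible
    # Conflict case: the requested fanout exceeds the eligible edges, so cap it.
    return min(fanout, num_eligible)

print(num_pick(10, torch.tensor([0.2, 0.0, 0.7])))  # -> 2, not 10
```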
One hypothesis to explain the huge sampling time is that, in the NonUniformPickOp function, the following takes too long to execute:

```c++
auto positive_probs_indices = probs.nonzero().squeeze(1);
auto num_positive_probs = positive_probs_indices.size(0);
```

Meanwhile, when (see dgl/graphbolt/src/fused_csc_sampling_graph.cc, lines 1227 to 1229 in ed50c17)
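To check that hypothesis, one could time `nonzero()` on many small per-neighborhood slices versus a single call over the whole prob tensor. This is a rough micro-benchmark sketch, not the actual GraphBolt code path; the Python loop overhead exaggerates the gap, but the per-call cost of `nonzero()` on tiny tensors is the point:

```python
import time
import torch

probs = torch.rand(1_000_000)
probs[probs < 0.5] = 0.0   # roughly half the entries become zero
degree = 20                # pretend every node has 20 neighbors

start = time.time()
for off in range(0, probs.numel(), degree):
    neighborhood = probs[off:off + degree]
    idx = neighborhood.nonzero().squeeze(1)   # what NonUniformPickOp does per neighborhood
per_node_time = time.time() - start

start = time.time()
idx_all = probs.nonzero().squeeze(1)          # one vectorized call, for comparison
vectorized_time = time.time() - start

print(f"per-node nonzero: {per_node_time:.3f}s vs single call: {vectorized_time:.4f}s")
```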
Some progress that I currently have:
Data for 4.4: https://docs.google.com/spreadsheets/d/1QrH-A4Fch0McHxux7pwaNgSkaoH4HslwFHSYdZ_ADwc/edit?usp=sharing, obtained by running benchmark_graphbolt_sampling.py. I think we can take the 4.4 change.
🔨Work Item
IMPORTANT:
Project tracker: https://github.com/orgs/dmlc/projects/2
Description
The numbers below are from the node classification example on the latest master branch (2024.06.14) with CPU sampling.
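For reference, a sketch of how the prob/mask option plugs into the sampling datapipe; the `prob_name` argument, the attribute names, and the surrounding wiring are assumptions about the GraphBolt API usage in the example, not copied from it:

```python
import dgl.graphbolt as gb

# "no prob" run: omit prob_name; "prob"/"mask" runs: point it at the edge attribute.
datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
datapipe = datapipe.sample_neighbor(graph, fanouts=[10, 10, 10], prob_name="prob")
datapipe = datapipe.fetch_feature(features, node_feature_keys=["feat"])
dataloader = gb.DataLoader(datapipe)
```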
prob data
non-dist, gb, homo, nc, no prob
Training...
Training: 3it [00:05, 1.55s/it]---- Average time for sampling: 0.082120824418962
Training: 10it [00:13, 1.19s/it]---- Average time for sampling: 0.0836272995453328
Training: 16it [00:19, 1.12s/it]---- Average time for sampling: 0.08516605570912361
Training: 23it [00:27, 1.14s/it]---- Average time for sampling: 0.0846009589266032
Training: 30it [00:35, 1.13s/it]---- Average time for sampling: 0.0846672392077744
Training: 36it [00:42, 1.12s/it]---- Average time for sampling: 0.08583860701570908
non-dist, gb, homo, nc, prob
Training: 4it [00:22, 5.03s/it]---- Average time for sampling: 1.1615502193570137
Training: 11it [00:52, 4.32s/it]---- Average time for sampling: 1.228120240289718
Training: 18it [01:22, 4.19s/it]---- Average time for sampling: 1.2553682390290002
Training: 24it [01:47, 4.19s/it]---- Average time for sampling: 1.2268523721955717
Training: 31it [02:16, 4.18s/it]---- Average time for sampling: 1.2365567354112863
Training: 38it [02:46, 4.21s/it]---- Average time for sampling: 1.2496669557721665
mask data
GB + NoMask
Training...
Training: 3it [00:00, 3.37it/s]---- Average time for sampling: 0.0064875221811234954
Training: 10it [00:02, 4.05it/s]---- Average time for sampling: 0.006722455704584717
Training: 16it [00:04, 3.92it/s]---- Average time for sampling: 0.008144200344880422
Training: 23it [00:06, 3.68it/s]---- Average time for sampling: 0.008010426000691951
Training: 30it [00:07, 4.15it/s]---- Average time for sampling: 0.008361106105148793
Training: 36it [00:09, 4.31it/s]---- Average time for sampling: 0.00815806492852668
GB + Mask
Training...
Training: 3it [00:01, 2.61it/s]---- Average time for sampling: 0.04655262678861618
Training: 10it [00:03, 3.52it/s]---- Average time for sampling: 0.05128640588372946
Training: 16it [00:04, 3.61it/s]---- Average time for sampling: 0.054167306640495856
Training: 23it [00:06, 3.64it/s]---- Average time for sampling: 0.052452172990888356
Training: 30it [00:08, 3.74it/s]---- Average time for sampling: 0.05273009695112705
Training: 36it [00:10, 3.49it/s]---- Average time for sampling: 0.05279933324394127
Critical call: dgl/graphbolt/src/fused_csc_sampling_graph.cc, lines 1322 to 1328 in ed50c17
Depending work items or issues