
[GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. #6983

Merged

merged 40 commits into dmlc:master from gb_refactor_neighbor_sampler
Feb 1, 2024

Conversation

mfbalin
Collaborator

@mfbalin mfbalin commented Jan 20, 2024

Description

We expose the different steps of the graph sampling operation so that optimizations can be implemented later.

[image attachment]
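The split described in the PR can be pictured as chaining small per-layer stages instead of running one monolithic sampler. Below is a minimal pure-Python sketch of that idea; the stage names mirror the discussion (`sample_per_layer`, `compact_per_layer`), but the toy graph, signatures, and logic are illustrative assumptions, not GraphBolt's actual API:

```python
import random

# Toy adjacency list; purely illustrative, not GraphBolt's graph structure.
GRAPH = {0: [1, 2, 3], 1: [0, 2], 2: [0, 3], 3: [1]}

def minibatch_source(all_seeds, batch_size):
    """Stage 1: split seed nodes into minibatches."""
    for i in range(0, len(all_seeds), batch_size):
        yield all_seeds[i:i + batch_size]

def sample_per_layer(batches, fanout):
    """Stage 2: for each minibatch, sample up to `fanout` neighbors per seed."""
    for seeds in batches:
        edges = [(s, n) for s in seeds
                 for n in random.sample(GRAPH[s], min(fanout, len(GRAPH[s])))]
        yield seeds, edges

def compact_per_layer(sampled):
    """Stage 3: deduplicate node IDs and relabel edges to a dense 0..N-1 range."""
    for seeds, edges in sampled:
        unique = list(dict.fromkeys(list(seeds) + [n for _, n in edges]))
        remap = {nid: i for i, nid in enumerate(unique)}
        yield unique, [(remap[u], remap[v]) for u, v in edges]

# The stages compose like datapipes: each one wraps the previous iterator,
# so later optimizations can swap out or reorder individual stages.
pipeline = compact_per_layer(
    sample_per_layer(minibatch_source([0, 1, 2, 3], batch_size=2), fanout=2))
for unique_nodes, compact_edges in pipeline:
    print(unique_nodes, compact_edges)
```

Because each stage only consumes an iterator and yields one, an optimization pass can later replace any single stage without touching the others.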

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL equals a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred to in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@mfbalin mfbalin marked this pull request as draft January 20, 2024 02:55
@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

Commit ID: b4c045f

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

Commit ID: a26e354

Build ID: 2

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin mfbalin marked this pull request as ready for review January 20, 2024 03:08
@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

Commit ID: 58a1190

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. [DO NOT MERGE][GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. Jan 20, 2024
@peizhou001
Collaborator

Incorporating the concept of pre-process aligns with our plan. However, I am hesitant about transforming sample_per_layer and compact_per_layer into datapipes; this seems overly intricate and somewhat excessively optimized. In my view, a more straightforward and adaptable approach involves offering fundamental ops such as sample and compact, allowing users to assemble these ops as needed.

Additionally, there is a dependency between sample and compact, as well as between two layers of sample, preventing us from capitalizing on scheduling benefits by splitting them.
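For concreteness, the fundamental `compact` op discussed here essentially deduplicates global node IDs and renumbers edges into a dense local range. A minimal sketch follows; the function name, signature, and ID ordering are hypothetical illustrations, not DGL's actual API:

```python
def compact(seeds, edges):
    # Seeds get the lowest local IDs so a downstream layer can find them
    # at the front of the unique-node list.
    unique = list(dict.fromkeys(list(seeds) + [n for e in edges for n in e]))
    remap = {nid: i for i, nid in enumerate(unique)}
    compact_edges = [(remap[u], remap[v]) for u, v in edges]
    return unique, compact_edges

# Global IDs in, dense local IDs out.
nodes, local_edges = compact([10, 20], [(10, 99), (20, 10)])
print(nodes)        # [10, 20, 99]
print(local_edges)  # [(0, 2), (1, 0)]
```

Keeping this separate from sampling means a customized compact (e.g. one that avoids redundant work) could be swapped in later without touching the sampling stage.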

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

Incorporating the concept of pre-process aligns with our plan. However, I am hesitant about transforming sample_per_layer and compact_per_layer into datapipes; this seems overly intricate and somewhat excessively optimized. In my view, a more straightforward and adaptable approach involves offering fundamental ops such as sample and compact, allowing users to assemble these ops as needed.

Additionally, there is a dependency between sample and compact, as well as between two layers of sample, preventing us from capitalizing on scheduling benefits by splitting them.

I am not sure I fully understand your point of view. I guess discussing this in a meeting is the best way forward as communication over text can be slow and prone to misunderstanding. Thank you very much for the early feedback!

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

I wanted to separate sampling and the compact operation because implementing the Cooperative Minibatching idea will require us to customize the compact operation later to avoid unnecessary work.

Cooperative Minibatching: https://arxiv.org/pdf/2310.12403.pdf

@peizhou001
Collaborator

Incorporating the concept of pre-process aligns with our plan. However, I am hesitant about transforming sample_per_layer and compact_per_layer into datapipes; this seems overly intricate and somewhat excessively optimized. In my view, a more straightforward and adaptable approach involves offering fundamental ops such as sample and compact, allowing users to assemble these ops as needed.
Additionally, there is a dependency between sample and compact, as well as between two layers of sample, preventing us from capitalizing on scheduling benefits by splitting them.

I am not sure I fully understand your point of view. I guess discussing this in a meeting is the best way forward as communication over text can be slow and prone to misunderstanding. Thank you very much for the early feedback!

Just ping me when needed; we can have a short meeting or discuss it in the weekly meeting.

@Rhett-Ying
Collaborator

Splitting a large datapipe (stage) into smaller ones (sub-stages) opens up opportunities for better pipelining and scheduling, and thus better performance down the road. It could also benefit from optimizations in specific calls (ops) such as sample and compact. But the tradeoff between the scheduling benefits and the overhead incurred should be carefully measured and profiled. I also think the scheduling logic (or the top-level DataLoader) is the major part that needs a sophisticated design to accommodate various cases.

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

@Rhett-Ying the modified example is runnable. I will run the two versions and report if there is any regression. In my experience so far, the runtime hasn't been affected at all; I will provide actual data on it, though.

@dgl-bot
Collaborator

dgl-bot commented Jan 23, 2024

Commit ID: eedb8d1

Build ID: 4

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

sample_neighbor:

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ time python ../node_classification.py --mode=cuda-cuda
Training in cuda-cuda mode.
Loading data...
The dataset is already preprocessed.
Training...
Training: 193it [00:03, 56.63it/s]
Evaluating: 39it [00:00, 82.79it/s]
Epoch 00000 | Loss 1.3463 | Accuracy 0.8546 | Time 3.4102
Training: 193it [00:03, 58.04it/s]
Evaluating: 39it [00:00, 93.24it/s]
Epoch 00001 | Loss 0.6015 | Accuracy 0.8772 | Time 3.3272
Training: 193it [00:03, 62.76it/s]
Evaluating: 39it [00:00, 89.93it/s]
Epoch 00002 | Loss 0.4923 | Accuracy 0.8855 | Time 3.0765
Training: 193it [00:03, 56.94it/s]
Evaluating: 39it [00:00, 86.70it/s]
Epoch 00003 | Loss 0.4435 | Accuracy 0.8899 | Time 3.3910
Training: 193it [00:03, 60.06it/s]
Evaluating: 39it [00:00, 93.32it/s]
Epoch 00004 | Loss 0.4156 | Accuracy 0.8923 | Time 3.2145
Training: 193it [00:03, 59.32it/s]
Evaluating: 39it [00:00, 84.14it/s]
Epoch 00005 | Loss 0.3964 | Accuracy 0.8964 | Time 3.2547
Training: 193it [00:03, 59.73it/s]
Evaluating: 39it [00:00, 83.30it/s]
Epoch 00006 | Loss 0.3812 | Accuracy 0.8964 | Time 3.2326
Training: 193it [00:03, 58.72it/s]
Evaluating: 39it [00:00, 83.97it/s]
Epoch 00007 | Loss 0.3815 | Accuracy 0.8977 | Time 3.2878
Training: 193it [00:03, 63.09it/s]
Evaluating: 39it [00:00, 93.26it/s]
Epoch 00008 | Loss 0.3708 | Accuracy 0.9005 | Time 3.0602
Training: 193it [00:03, 62.66it/s]
Evaluating: 39it [00:00, 85.37it/s]
Epoch 00009 | Loss 0.3534 | Accuracy 0.9019 | Time 3.0814
Testing...
598it [00:02, 232.05it/s]
598it [00:02, 224.06it/s]
598it [00:02, 289.34it/s]
Test accuracy 0.7621

real    0m51.041s
user    2m32.037s
sys     0m13.134s

sample_neighbor2:

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ time python ../node_classification.py --mode=cuda-cuda
Training in cuda-cuda mode.
Loading data...
The dataset is already preprocessed.
Training...
/home/mfbalin/.local/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/common.py:141: UserWarning: Local function is not supported by pickle, please use regular python function or functools.partial instead.
  warnings.warn(
Training: 193it [00:03, 54.45it/s]
Evaluating: 39it [00:00, 75.89it/s]
Epoch 00000 | Loss 1.2735 | Accuracy 0.8571 | Time 3.5468
Training: 193it [00:03, 61.34it/s]
Evaluating: 39it [00:00, 81.49it/s]
Epoch 00001 | Loss 0.5809 | Accuracy 0.8773 | Time 3.1485
Training: 193it [00:03, 59.74it/s]
Evaluating: 39it [00:00, 71.88it/s]
Epoch 00002 | Loss 0.4913 | Accuracy 0.8861 | Time 3.2323
Training: 193it [00:03, 63.60it/s]
Evaluating: 39it [00:00, 86.88it/s]
Epoch 00003 | Loss 0.4441 | Accuracy 0.8912 | Time 3.0362
Training: 193it [00:03, 62.74it/s]
Evaluating: 39it [00:00, 79.67it/s]
Epoch 00004 | Loss 0.4180 | Accuracy 0.8940 | Time 3.0776
Training: 193it [00:03, 61.73it/s]
Evaluating: 39it [00:00, 84.68it/s]
Epoch 00005 | Loss 0.4019 | Accuracy 0.8961 | Time 3.1277
Training: 193it [00:03, 61.69it/s]
Evaluating: 39it [00:00, 77.86it/s]
Epoch 00006 | Loss 0.3848 | Accuracy 0.8976 | Time 3.1301
Training: 193it [00:02, 65.94it/s]
Evaluating: 39it [00:00, 67.68it/s]
Epoch 00007 | Loss 0.3744 | Accuracy 0.8991 | Time 2.9285
Training: 193it [00:02, 65.85it/s]
Evaluating: 39it [00:00, 79.20it/s]
Epoch 00008 | Loss 0.3684 | Accuracy 0.9015 | Time 2.9325
Training: 193it [00:02, 65.26it/s]
Evaluating: 39it [00:00, 84.56it/s]
Epoch 00009 | Loss 0.3579 | Accuracy 0.9016 | Time 2.9591
Testing...
598it [00:02, 224.16it/s]
598it [00:02, 212.81it/s]
598it [00:02, 266.85it/s]
Test accuracy 0.7675

real    0m50.712s
user    2m34.052s
sys     0m13.463s

@mfbalin mfbalin changed the title [DO NOT MERGE][GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. [GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. Jan 24, 2024
@dgl-bot
Collaborator

dgl-bot commented Jan 24, 2024

Commit ID: 3d2ed98

Build ID: 5

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 24, 2024

Commit ID: 105924a

Build ID: 6

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 24, 2024

Commit ID: bfb28ec

Build ID: 7

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: 144134c

Build ID: 38

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: 96bac52

Build ID: 39

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: 718cab8

Build ID: 40

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jan 31, 2024

I am quite happy with the final refactored version. It is quite elegant, in my opinion, thanks to the reviewers' help.

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: ee3a7d7

Build ID: 41

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs
Collaborator

LGTM, @peizhou001 please take another look and approve.

Collaborator

@peizhou001 peizhou001 left a comment

Approve with small comment.

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 11895f3969dd631d6d4847f380047ac4e1268b73

Build ID: 42

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 6ab2e75

Build ID: 43

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 6f880c0

Build ID: 44

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 5d907ee

Build ID: 45

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin merged commit 50eb101 into dmlc:master Feb 1, 2024
2 checks passed
@mfbalin mfbalin deleted the gb_refactor_neighbor_sampler branch February 1, 2024 07:07
5 participants