
[GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. #6983

Merged

merged 40 commits into dmlc:master from gb_refactor_neighbor_sampler
Feb 1, 2024

Conversation

mfbalin
Collaborator

@mfbalin mfbalin commented Jan 20, 2024

Description

We expose the different steps of the graph sampling operation so that optimizations can be implemented later.

[image attachment]
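The split described in the PR can be pictured as chaining small per-layer stages instead of running one monolithic sampler. Below is a minimal pure-Python sketch of that idea; the stage names mirror the discussion (`sample_per_layer`, `compact_per_layer`), but the toy graph, signatures, and logic are illustrative assumptions, not GraphBolt's actual API:

```python
import random

# Toy adjacency list; purely illustrative, not GraphBolt's graph structure.
GRAPH = {0: [1, 2, 3], 1: [0, 2], 2: [0, 3], 3: [1]}

def minibatch_source(all_seeds, batch_size):
    """Stage 1: split seed nodes into minibatches."""
    for i in range(0, len(all_seeds), batch_size):
        yield all_seeds[i:i + batch_size]

def sample_per_layer(batches, fanout):
    """Stage 2: for each minibatch, sample up to `fanout` neighbors per seed."""
    for seeds in batches:
        edges = [(s, n) for s in seeds
                 for n in random.sample(GRAPH[s], min(fanout, len(GRAPH[s])))]
        yield seeds, edges

def compact_per_layer(sampled):
    """Stage 3: deduplicate node IDs and relabel edges to a dense 0..N-1 range."""
    for seeds, edges in sampled:
        unique = list(dict.fromkeys(list(seeds) + [n for _, n in edges]))
        remap = {nid: i for i, nid in enumerate(unique)}
        yield unique, [(remap[u], remap[v]) for u, v in edges]

# The stages compose like datapipes: each one wraps the previous iterator,
# so later optimizations can swap out or reorder individual stages.
pipeline = compact_per_layer(
    sample_per_layer(minibatch_source([0, 1, 2, 3], batch_size=2), fanout=2))
for unique_nodes, compact_edges in pipeline:
    print(unique_nodes, compact_edges)
```

Because each stage only consumes an iterator and yields one, an optimization pass can later replace any single stage without touching the others.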

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL equals a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred to in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@mfbalin mfbalin marked this pull request as draft January 20, 2024 02:55
@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

Commit ID: b4c045f

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

Commit ID: a26e354

Build ID: 2

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin mfbalin marked this pull request as ready for review January 20, 2024 03:08
@dgl-bot
Collaborator

dgl-bot commented Jan 20, 2024

Commit ID: 58a1190

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. [DO NOT MERGE][GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. Jan 20, 2024
@peizhou001
Collaborator

Incorporating the concept of pre-process aligns with our plan. However, I am hesitant about transforming sample_per_layer and compact_per_layer into datapipes; this seems overly intricate and somewhat excessively optimized. In my view, a more straightforward and adaptable approach involves offering fundamental ops such as sample and compact, allowing users to assemble these ops as needed.

Additionally, there is a dependency between sample and compact, as well as between two layers of sample, preventing us from capitalizing on scheduling benefits by splitting them.
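For concreteness, the fundamental `compact` op discussed here essentially deduplicates global node IDs and renumbers edges into a dense local range. A minimal sketch follows; the function name, signature, and ID ordering are hypothetical illustrations, not DGL's actual API:

```python
def compact(seeds, edges):
    # Seeds get the lowest local IDs so a downstream layer can find them
    # at the front of the unique-node list.
    unique = list(dict.fromkeys(list(seeds) + [n for e in edges for n in e]))
    remap = {nid: i for i, nid in enumerate(unique)}
    compact_edges = [(remap[u], remap[v]) for u, v in edges]
    return unique, compact_edges

# Global IDs in, dense local IDs out.
nodes, local_edges = compact([10, 20], [(10, 99), (20, 10)])
print(nodes)        # [10, 20, 99]
print(local_edges)  # [(0, 2), (1, 0)]
```

Keeping this separate from sampling means a customized compact (e.g. one that avoids redundant work) could be swapped in later without touching the sampling stage.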

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

Incorporating the concept of pre-process aligns with our plan. However, I am hesitant about transforming sample_per_layer and compact_per_layer into datapipes; this seems overly intricate and somewhat excessively optimized. In my view, a more straightforward and adaptable approach involves offering fundamental ops such as sample and compact, allowing users to assemble these ops as needed.

Additionally, there is a dependency between sample and compact, as well as between two layers of sample, preventing us from capitalizing on scheduling benefits by splitting them.

I am not sure I fully understand your point of view. I guess discussing this in a meeting is the best way forward as communication over text can be slow and prone to misunderstanding. Thank you very much for the early feedback!

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

I wanted to separate sampling and the compact operation because implementing the Cooperative Minibatching idea will require us to customize the compact operation later to avoid unnecessary work.

Cooperative Minibatching: https://arxiv.org/pdf/2310.12403.pdf

@peizhou001
Collaborator

Incorporating the concept of pre-process aligns with our plan. However, I am hesitant about transforming sample_per_layer and compact_per_layer into datapipes; this seems overly intricate and somewhat excessively optimized. In my view, a more straightforward and adaptable approach involves offering fundamental ops such as sample and compact, allowing users to assemble these ops as needed.
Additionally, there is a dependency between sample and compact, as well as between two layers of sample, preventing us from capitalizing on scheduling benefits by splitting them.

I am not sure I fully understand your point of view. I guess discussing this in a meeting is the best way forward as communication over text can be slow and prone to misunderstanding. Thank you very much for the early feedback!

Just ping me when needed; we can have a short meeting or discuss it in the weekly meeting.

@Rhett-Ying
Collaborator

Splitting a large datapipe (stage) into smaller ones (sub-stages) opens up opportunities for better pipelining and scheduling, and thus better performance down the road. It could also benefit from optimizations in specific calls (ops) such as sample and compact. But the tradeoff between the scheduling benefits and the overhead incurred should be carefully measured and profiled. I also think the scheduling logic (or the top-level DataLoader) is the major part that needs a sophisticated design to accommodate various cases.

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

@Rhett-Ying the modified example is runnable. I will run the two versions and report if there is any regression. In my experience so far, the runtime hasn't been affected at all; I will provide actual data on it, though.

@dgl-bot
Collaborator

dgl-bot commented Jan 23, 2024

Commit ID: eedb8d1

Build ID: 4

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jan 23, 2024

sample_neighbor:

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ time python ../node_classification.py --mode=cuda-cuda
Training in cuda-cuda mode.
Loading data...
The dataset is already preprocessed.
Training...
Training: 193it [00:03, 56.63it/s]
Evaluating: 39it [00:00, 82.79it/s]
Epoch 00000 | Loss 1.3463 | Accuracy 0.8546 | Time 3.4102
Training: 193it [00:03, 58.04it/s]
Evaluating: 39it [00:00, 93.24it/s]
Epoch 00001 | Loss 0.6015 | Accuracy 0.8772 | Time 3.3272
Training: 193it [00:03, 62.76it/s]
Evaluating: 39it [00:00, 89.93it/s]
Epoch 00002 | Loss 0.4923 | Accuracy 0.8855 | Time 3.0765
Training: 193it [00:03, 56.94it/s]
Evaluating: 39it [00:00, 86.70it/s]
Epoch 00003 | Loss 0.4435 | Accuracy 0.8899 | Time 3.3910
Training: 193it [00:03, 60.06it/s]
Evaluating: 39it [00:00, 93.32it/s]
Epoch 00004 | Loss 0.4156 | Accuracy 0.8923 | Time 3.2145
Training: 193it [00:03, 59.32it/s]
Evaluating: 39it [00:00, 84.14it/s]
Epoch 00005 | Loss 0.3964 | Accuracy 0.8964 | Time 3.2547
Training: 193it [00:03, 59.73it/s]
Evaluating: 39it [00:00, 83.30it/s]
Epoch 00006 | Loss 0.3812 | Accuracy 0.8964 | Time 3.2326
Training: 193it [00:03, 58.72it/s]
Evaluating: 39it [00:00, 83.97it/s]
Epoch 00007 | Loss 0.3815 | Accuracy 0.8977 | Time 3.2878
Training: 193it [00:03, 63.09it/s]
Evaluating: 39it [00:00, 93.26it/s]
Epoch 00008 | Loss 0.3708 | Accuracy 0.9005 | Time 3.0602
Training: 193it [00:03, 62.66it/s]
Evaluating: 39it [00:00, 85.37it/s]
Epoch 00009 | Loss 0.3534 | Accuracy 0.9019 | Time 3.0814
Testing...
598it [00:02, 232.05it/s]
598it [00:02, 224.06it/s]
598it [00:02, 289.34it/s]
Test accuracy 0.7621

real    0m51.041s
user    2m32.037s
sys     0m13.134s

sample_neighbor2:

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ time python ../node_classification.py --mode=cuda-cuda
Training in cuda-cuda mode.
Loading data...
The dataset is already preprocessed.
Training...
/home/mfbalin/.local/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/common.py:141: UserWarning: Local function is not supported by pickle, please use regular python function or functools.partial instead.
  warnings.warn(
Training: 193it [00:03, 54.45it/s]
Evaluating: 39it [00:00, 75.89it/s]
Epoch 00000 | Loss 1.2735 | Accuracy 0.8571 | Time 3.5468
Training: 193it [00:03, 61.34it/s]
Evaluating: 39it [00:00, 81.49it/s]
Epoch 00001 | Loss 0.5809 | Accuracy 0.8773 | Time 3.1485
Training: 193it [00:03, 59.74it/s]
Evaluating: 39it [00:00, 71.88it/s]
Epoch 00002 | Loss 0.4913 | Accuracy 0.8861 | Time 3.2323
Training: 193it [00:03, 63.60it/s]
Evaluating: 39it [00:00, 86.88it/s]
Epoch 00003 | Loss 0.4441 | Accuracy 0.8912 | Time 3.0362
Training: 193it [00:03, 62.74it/s]
Evaluating: 39it [00:00, 79.67it/s]
Epoch 00004 | Loss 0.4180 | Accuracy 0.8940 | Time 3.0776
Training: 193it [00:03, 61.73it/s]
Evaluating: 39it [00:00, 84.68it/s]
Epoch 00005 | Loss 0.4019 | Accuracy 0.8961 | Time 3.1277
Training: 193it [00:03, 61.69it/s]
Evaluating: 39it [00:00, 77.86it/s]
Epoch 00006 | Loss 0.3848 | Accuracy 0.8976 | Time 3.1301
Training: 193it [00:02, 65.94it/s]
Evaluating: 39it [00:00, 67.68it/s]
Epoch 00007 | Loss 0.3744 | Accuracy 0.8991 | Time 2.9285
Training: 193it [00:02, 65.85it/s]
Evaluating: 39it [00:00, 79.20it/s]
Epoch 00008 | Loss 0.3684 | Accuracy 0.9015 | Time 2.9325
Training: 193it [00:02, 65.26it/s]
Evaluating: 39it [00:00, 84.56it/s]
Epoch 00009 | Loss 0.3579 | Accuracy 0.9016 | Time 2.9591
Testing...
598it [00:02, 224.16it/s]
598it [00:02, 212.81it/s]
598it [00:02, 266.85it/s]
Test accuracy 0.7675

real    0m50.712s
user    2m34.052s
sys     0m13.463s

@mfbalin mfbalin changed the title [DO NOT MERGE][GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. [GraphBolt] Refactor NeighborSampler and expose fine-grained datapipes. Jan 24, 2024
@dgl-bot
Collaborator

dgl-bot commented Jan 24, 2024

Commit ID: 3d2ed98

Build ID: 5

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 24, 2024

Commit ID: 105924a

Build ID: 6

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 24, 2024

Commit ID: bfb28ec

Build ID: 7

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: 144134c

Build ID: 38

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: 96bac52

Build ID: 39

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: 718cab8

Build ID: 40

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jan 31, 2024

I am quite happy with the final refactored version. It is quite elegant, in my opinion, thanks to the reviewers' help.

@dgl-bot
Collaborator

dgl-bot commented Jan 31, 2024

Commit ID: ee3a7d7

Build ID: 41

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs
Collaborator

LGTM, @peizhou001 please take another look and approve.

Collaborator

@peizhou001 peizhou001 left a comment

Approve with small comment.

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 11895f3969dd631d6d4847f380047ac4e1268b73

Build ID: 42

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 6ab2e75

Build ID: 43

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 6f880c0

Build ID: 44

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 1, 2024

Commit ID: 5d907ee

Build ID: 45

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin merged commit 50eb101 into dmlc:master Feb 1, 2024
2 checks passed
@mfbalin mfbalin deleted the gb_refactor_neighbor_sampler branch February 1, 2024 07:07
5 participants