[WIP] Evaluation on AMD 16-Core CPU Bare Metal via Latitude.sh Hardware Cloud #306

Open · wants to merge 62 commits into main

Conversation

@sourcesync (Collaborator) commented Aug 30, 2024

What is this PR?

This PR provides competition evaluation on new hardware, based on an AMD 16-core CPU (bare metal).

A little background first. I, Harsha, and Amir Ingber were interviewed by Harald Carlens of MLContests at NeurIPS 2023. Harald introduced us to Victor Chiea of Latitude.sh. Latitude graciously donated credits for use of their hardware cloud, which provides many flavors of CPUs and GPUs.

As a first step, the decision was made to evaluate on a Latitude system similar to the ones used for the 2023 competition. This PR is the result of that ongoing effort.

How do I get started? How do I view the track rankings on this hardware?

The track rankings are here. Also included are track Pareto plots, a detailed hardware inventory, the commands used, and additional notes.

Why is this PR still WIP? How can I help?

  • The streaming track rankings are not yet available (I'm not sure yet how that's done, but I'm working on it :)
  • The track rankings appear to be a bit different from the 2023 competition rankings, so it would be great if anyone who performed 2023 evaluations could provide a sanity check.
  • Any other feedback is very much appreciated.

@magdalendobson (Collaborator) commented:

Thanks so much to you and Latitude for this comparison! Really interesting to see. Posting a few preliminary thoughts right away since this PR is still a work in progress.

From looking over the revised rankings, the thing that jumps out to me the most is that SCANN's submission seems to do very poorly compared to the rankings on the competition machine. It might be worth looking into this one a little bit and trying to understand the discrepancy--it seems like by far the largest jump in rankings on the board. I'm OOF until mid-September but I would be interested in looking into this when I return.

Another thing that occurs to me is that at least the baseline for OOD (DiskANN) sets the number of threads for query time as an explicit parameter in the configuration. So unless you changed their config, it would be running with 8 threads on your 16 core machine, while other algorithms may automatically adjust the number of threads they use to the number of available threads. It might be good to try to standardize this--I did some spot checking on other algorithms and didn't find any other instances where the number of query-time threads is set explicitly to 8 (my first thought with SCANN, but I didn't find evidence of this), but it's definitely possible I missed some.
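
To make the contrast concrete, here is a hypothetical sketch (not taken from any submission's code) of the difference between a hard-coded query-thread count and one detected at runtime:

import os

# Hypothetical illustration only -- not from any submission's config.
hard_coded_threads = 8                  # stays at 8 regardless of the machine
detected_threads = os.cpu_count() or 1  # 16 on the Latitude bare-metal box
print(f"hard-coded: {hard_coded_threads}, detected: {detected_threads}")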

I am happy to help with producing results with the streaming track, I use that code frequently and recently contributed some new runbooks. I did not completely understand whether the problem was with running the algorithms or producing a ranking--let me know and I can probably help out.

@sourcesync (Collaborator, Author) commented Aug 30, 2024

> Thanks so much to you and Latitude for this comparison! Really interesting to see. Posting a few preliminary thoughts right away since this PR is still a work in progress.

Great @magdalendobson! My responses in-line...

> From looking over the revised rankings, the thing that jumps out to me the most is that SCANN's submission seems to do very poorly compared to the rankings on the competition machine. It might be worth looking into this one a little bit and trying to understand the discrepancy--it seems like by far the largest jump in rankings on the board. I'm OOF until mid-September but I would be interested in looking into this when I return.

Great. Assuming all else is equal, perhaps the difference is hardware-related (bare metal instead of virtualized, a different CPU, a different NVMe drive, etc.). As you suggest, this should be verified with some additional debugging.

> Another thing that occurs to me is that at least the baseline for OOD (DiskANN) sets the number of threads for query time as an explicit parameter in the configuration. So unless you changed their config, it would be running with 8 threads on your 16 core machine, while other algorithms may automatically adjust the number of threads they use to the number of available threads. It might be good to try to standardize this--I did some spot checking on other algorithms and didn't find any other instances where the number of query-time threads is set explicitly to 8 (my first thought with SCANN, but I didn't find evidence of this), but it's definitely possible I missed some.

Yeah, I did not change any configs. If I recall correctly, the competition leverages Docker to limit/standardize the use of the underlying resources, and I did not change any of the default behavior in this regard.
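
For context, a sketch of the kind of resource capping Docker supports (the competition harness's exact flags and limits aren't confirmed here; --cpus and --memory are standard docker run options):

# Hedged example only -- the actual competition invocation may differ.
docker run --cpus=8 --memory=64g <algorithm-image> ...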

> I am happy to help with producing results with the streaming track, I use that code frequently and recently contributed some new runbooks. I did not completely understand whether the problem was with running the algorithms or producing a ranking--let me know and I can probably help out.

Great.

Just as an example, I'm running the following commands for streaming DiskANN. It appears to run OK, but I can't extract the results with either data_export.py or plot.py. I think I'm missing something super obvious.

python install.py --neurips23track streaming --algorithm diskann   # SUCCESS

python3 run.py --dataset msturing-30M-clustered --algorithm diskann --neurips23track streaming --runbook_path neurips23/streaming/final_runbook.yaml   # SUCCESS

sudo chmod ugo+rw -R ./results/   # SUCCESS

python data_export.py --recompute --output /tmp/export.csv   # SUCCESS

cat /tmp/export.csv | grep streaming  # NO MATCHES

python plot.py --neurips23track streaming --output neurips23/latitude/streaming.png --raw --recompute --dataset msturing-30M-clustered   # ERROR, see stack trace below
Traceback (most recent call last):
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/plot.py", line 183, in <module>
    runs = compute_metrics(dataset.get_groundtruth(k=args.count),
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/benchmark/plotting/utils.py", line 50, in compute_metrics
    for i, (properties, run) in enumerate(res):
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/benchmark/results.py", line 76, in load_all_results
    for root, _, files in os.walk(get_result_filename(dataset, count, \
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/benchmark/results.py", line 17, in get_result_filename
    raise RuntimeError('Need runbook_path to store results')
RuntimeError: Need runbook_path to store results
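
For what it's worth, the trace suggests streaming results are stored under a path derived from the runbook, so reading them back without a runbook path can't locate them. A rough reconstruction of the failing check (not the actual source of benchmark/results.py, which may differ in signature and detail):

# Rough reconstruction from the traceback above -- not the actual source.
def get_result_filename(dataset=None, count=None, definition=None,
                        query_arguments=None, neurips23track=None,
                        runbook_path=None):
    # Streaming-track result directories are keyed by the runbook that
    # produced them, so the path must be supplied to find them again.
    if neurips23track == 'streaming' and runbook_path is None:
        raise RuntimeError('Need runbook_path to store results')
    ...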

@magdalendobson (Collaborator) commented:

Paging @arron2003 to take a look at SCANN results here--in your experience do results on this hardware look accurate to you? Any thoughts on whether there may be an easily addressed issue?

@magdalendobson (Collaborator) commented:

@sourcesync the plotting code (e.g. plot.py) isn't usually used to generate streaming results, since streaming only generates one result (average recall) per run. Usually we would use data_export.py, as you did in your first command. Would you be able to post both the output of the command and the contents of the resulting CSV file so we can debug more?
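
For example, something like this would show the CSV header plus any streaming rows (assuming the track column literally contains the word "streaming", which is how it was grepped above):

# Hypothetical check -- show the CSV header and any streaming rows.
head -1 /tmp/export.csv && grep -i streaming /tmp/export.csv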

@sourcesync (Collaborator, Author) commented Sep 12, 2024

> @sourcesync the plotting code (e.g. plot.py) isn't usually used to generate streaming results, since streaming only generates one result (average recall) per run. Usually we would use data_export.py, as you did in your first command. Would you be able to post both the output of the command and the contents of the resulting CSV file so we can debug more?

Thanks @magdalendobson! I put a copy of the data export into this PR branch. I didn't notice 'streaming' in the track column. Is there a specialized method to export streaming-track results?

@arron2003 (Contributor) commented:

> Paging @arron2003 to take a look at SCANN results here--in your experience do results on this hardware look accurate to you? Any thoughts on whether there may be an easily addressed issue?

Can you share which VM was used for this, and how I can reproduce it?
I think the issue might be that we are using 8 threads in the submission.
https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/neurips23/ood/scann/scann.py#L57

For the previous Azure VM with 16 vCPUs, there were only 8 physical cores, and thus 8 threads.
My guess is that if we change the batch size and num_threads, results would be much better.
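
As a quick sanity check on the core layout, something like this standard lscpu invocation would show it (a suggestion, not output from the PR's hardware inventory):

# Compare logical CPUs, threads per core, and cores per socket.
lscpu | grep -E 'CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'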

@sourcesync (Collaborator, Author) commented Sep 12, 2024

> Paging @arron2003 to take a look at SCANN results here--in your experience do results on this hardware look accurate to you? Any thoughts on whether there may be an easily addressed issue?

> Can you share which VM was used for this, and how I can reproduce it? I think the issue might be that we are using 8 threads in the submission. https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/neurips23/ood/scann/scann.py#L57

> For the previous Azure VM with 16 vCPUs, there were only 8 physical cores, and thus 8 threads. My guess is that if we change the batch size and num_threads, results would be much better.

Hi @arron2003! This is a bare-metal system. I put a detailed hardware inventory in this README; scroll down to Hardware Inventory. These systems were donated by Latitude.sh. If you need access, I would need to give you credentials and instructions; let me know if that would be useful. I can also run some commands on your behalf if that's easier.

@arron2003 (Contributor) commented:

It would be helpful if you could share the credentials with me.

  1. As a shot in the dark, you can use the following command to bump the thread count to 16:
sed -i 's/set_num_threads(8)/set_num_threads(16)/g;s/batch_size=12500/batch_size=6250/g' neurips23/ood/scann/scann.py
  2. However, I still think something else is missing. Usually ScaNN performs much better on bare-metal machines compared to cloud VMs. That's why I'd like to dig a bit into the issue.
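
A quick way to confirm the substitution took effect is just to grep the file the sed touches (hypothetical check, not part of the benchmark tooling):

grep -nE 'set_num_threads|batch_size' neurips23/ood/scann/scann.py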

@sourcesync (Collaborator, Author) commented:

> It would be helpful if you could share the credentials with me.
>
>   1. As a shot in the dark, you can use the following command to bump the thread count to 16:
> sed -i 's/set_num_threads(8)/set_num_threads(16)/g;s/batch_size=12500/batch_size=6250/g' neurips23/ood/scann/scann.py
>   2. However, I still think something else is missing. Usually ScaNN performs much better on bare-metal machines compared to cloud VMs. That's why I'd like to dig a bit into the issue.

OK @arron2003, let's get you access. What's the best way to share VPN and login credentials with you privately? I can send the credentials to an email of your choice, or I can invite you to my Slack for a DM (I'll need an email in that case as well). Or something else?

@arron2003 (Contributor) commented:

You can find me at my github handle @ gmail dot com.

@sourcesync (Collaborator, Author) commented:

> You can find me at my github handle @ gmail dot com.

OK @arron2003, sent.

@sourcesync (Collaborator, Author) commented:

@arron2003, hey, I went ahead and merged your remote main into my PR branch. Here are the new rankings. Does that look OK?

@arron2003 (Contributor) commented Sep 13, 2024 via email

@sourcesync (Collaborator, Author) commented:

FYI, I also updated the OOD graph in the README, @arron2003.
