[WIP] Evaluation on AMD 16-Core CPU Bare Metal via Latitude.sh Hardware Cloud #306

Open · wants to merge 62 commits into main

Conversation

@sourcesync (Collaborator) commented Aug 30, 2024

What is this PR?

This PR provides competition evaluation on new hardware, based on an AMD 16-core CPU (bare metal).

A little background first. I, Harsha, and Amir Ingber were interviewed by Harald Carlens of MLContests at NeurIPS 2023. Harald introduced us to Victor Chiea of Latitude.sh. Latitude graciously donated credits for use of their hardware cloud, which provides many flavors of CPUs and GPUs.

As a first step, the decision was made to evaluate on a Latitude system similar to the ones used for the 2023 competition. This PR is the result of that ongoing effort.

How do I get started? How do I view the track rankings on this hardware?

The track rankings are here. Also included are track Pareto plots, a detailed hardware inventory, the commands used, and additional notes.

Why is this PR still WIP? How can I help?

  • The streaming track rankings are not yet available (I'm not sure yet how that's done, but I'm working on it :)
  • The track rankings appear to be a bit different from the 2023 competition rankings, so it would be great if anyone who performed 2023 evaluations could provide a sanity check.
  • Any other feedback is very much appreciated.

@magdalendobson (Collaborator) commented:

Thanks so much to you and Latitude for this comparison! Really interesting to see. Posting a few preliminary thoughts right away since this PR is still a work in progress.

From looking over the revised rankings, the thing that jumps out to me the most is that SCANN's submission seems to do very poorly compared to the rankings on the competition machine. It might be worth looking into this one a little bit and trying to understand the discrepancy--it seems like by far the largest jump in rankings on the board. I'm OOF until mid-September but I would be interested in looking into this when I return.

Another thing that occurs to me is that at least the baseline for OOD (DiskANN) sets the number of threads for query time as an explicit parameter in the configuration. So unless you changed their config, it would be running with 8 threads on your 16 core machine, while other algorithms may automatically adjust the number of threads they use to the number of available threads. It might be good to try to standardize this--I did some spot checking on other algorithms and didn't find any other instances where the number of query-time threads is set explicitly to 8 (my first thought with SCANN, but I didn't find evidence of this), but it's definitely possible I missed some.
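
To make the contrast concrete, here is a hypothetical sketch (not taken from any submission's code) of the difference between a hard-coded query-thread count and one detected at runtime:

import os

# Hypothetical illustration only -- not from any submission's config.
hard_coded_threads = 8                  # stays at 8 regardless of the machine
detected_threads = os.cpu_count() or 1  # 16 on the Latitude bare-metal box
print(f"hard-coded: {hard_coded_threads}, detected: {detected_threads}")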

I am happy to help with producing results with the streaming track, I use that code frequently and recently contributed some new runbooks. I did not completely understand whether the problem was with running the algorithms or producing a ranking--let me know and I can probably help out.

@sourcesync (Collaborator, Author) commented Aug 30, 2024

> Thanks so much to you and Latitude for this comparison! Really interesting to see. Posting a few preliminary thoughts right away since this PR is still a work in progress.

Great @magdalendobson! My responses in-line...

> From looking over the revised rankings, the thing that jumps out to me the most is that SCANN's submission seems to do very poorly compared to the rankings on the competition machine. It might be worth looking into this one a little bit and trying to understand the discrepancy--it seems like by far the largest jump in rankings on the board. I'm OOF until mid-September but I would be interested in looking into this when I return.

Great. Assuming all else is equal, perhaps the difference is hardware-related (bare metal instead of virtualized, a different CPU, a different NVMe drive, etc.). As you suggest, this should be verified with some additional debugging.

> Another thing that occurs to me is that at least the baseline for OOD (DiskANN) sets the number of threads for query time as an explicit parameter in the configuration. So unless you changed their config, it would be running with 8 threads on your 16 core machine, while other algorithms may automatically adjust the number of threads they use to the number of available threads. It might be good to try to standardize this--I did some spot checking on other algorithms and didn't find any other instances where the number of query-time threads is set explicitly to 8 (my first thought with SCANN, but I didn't find evidence of this), but it's definitely possible I missed some.

Yeah, I did not change any configs. If I recall correctly, the competition leverages Docker to limit/standardize the use of the underlying resources, and I did not change any of the default behavior in this regard.
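
For context, a sketch of the kind of resource capping Docker supports (the competition harness's exact flags and limits aren't confirmed here; --cpus and --memory are standard docker run options):

# Hedged example only -- the actual competition invocation may differ.
docker run --cpus=8 --memory=64g <algorithm-image> ...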

> I am happy to help with producing results with the streaming track, I use that code frequently and recently contributed some new runbooks. I did not completely understand whether the problem was with running the algorithms or producing a ranking--let me know and I can probably help out.

Great.

Just as an example, I'm running the following commands for streaming DiskANN. It appears to run OK, but I can't extract the results with either data_export.py or plot.py. I think I'm missing something super obvious.

python install.py --neurips23track streaming --algorithm diskann   # SUCCESS

python3 run.py --dataset msturing-30M-clustered --algorithm diskann --neurips23track streaming --runbook_path neurips23/streaming/final_runbook.yaml   # SUCCESS

sudo chmod ugo+rw -R ./results/   # SUCCESS

python data_export.py --recompute --output /tmp/export.csv   # SUCCESS

cat /tmp/export.csv | grep streaming  # NO MATCHES

python plot.py --neurips23track streaming --output neurips23/latitude/streaming.png --raw --recompute --dataset msturing-30M-clustered   # ERROR, see stack trace below
Traceback (most recent call last):
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/plot.py", line 183, in <module>
    runs = compute_metrics(dataset.get_groundtruth(k=args.count),
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/benchmark/plotting/utils.py", line 50, in compute_metrics
    for i, (properties, run) in enumerate(res):
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/benchmark/results.py", line 76, in load_all_results
    for root, _, files in os.walk(get_result_filename(dataset, count, \
  File "/home/gwilliams/Projects/BigANN/big-ann-benchmarks/benchmark/results.py", line 17, in get_result_filename
    raise RuntimeError('Need runbook_path to store results')
RuntimeError: Need runbook_path to store results
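
For what it's worth, the trace suggests streaming results are stored under a path derived from the runbook, so reading them back without a runbook path can't locate them. A rough reconstruction of the failing check (not the actual source of benchmark/results.py, which may differ in signature and detail):

# Rough reconstruction from the traceback above -- not the actual source.
def get_result_filename(dataset=None, count=None, definition=None,
                        query_arguments=None, neurips23track=None,
                        runbook_path=None):
    # Streaming-track result directories are keyed by the runbook that
    # produced them, so the path must be supplied to find them again.
    if neurips23track == 'streaming' and runbook_path is None:
        raise RuntimeError('Need runbook_path to store results')
    ...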

@magdalendobson (Collaborator) commented:

Paging @arron2003 to take a look at SCANN results here--in your experience do results on this hardware look accurate to you? Any thoughts on whether there may be an easily addressed issue?

@magdalendobson (Collaborator) commented:

@sourcesync the plotting code (e.g. plot.py) isn't usually used to generate streaming results, since streaming only generates one result (average recall) per run. Usually we would use data_export.py, as you did in your first command. Would you be able to post both the output of the command and the contents of the resulting CSV file so we can debug more?
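
For example, something like this would show the CSV header plus any streaming rows (assuming the track column literally contains the word "streaming", which is how it was grepped above):

# Hypothetical check -- show the CSV header and any streaming rows.
head -1 /tmp/export.csv && grep -i streaming /tmp/export.csv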

@sourcesync (Collaborator, Author) commented Sep 12, 2024

> @sourcesync the plotting code (e.g. plot.py) isn't usually used to generate streaming results, since streaming only generates one result (average recall) per run. Usually we would use data_export.py, as you did in your first command. Would you be able to post both the output of the command and the contents of the resulting CSV file so we can debug more?

Thanks @magdalendobson! I put a copy of the data export into this PR branch. I didn't notice 'streaming' in the track column. Is there a specialized method to export streaming-track results?

@arron2003 (Contributor) commented:

> Paging @arron2003 to take a look at SCANN results here--in your experience do results on this hardware look accurate to you? Any thoughts on whether there may be an easily addressed issue?

Can you share which VM was used for this, and how I can reproduce it?
I think the issue might be that we are using 8 threads in the submission.
https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/neurips23/ood/scann/scann.py#L57

For the previous Azure VM with 16 vCPUs, there were only 8 physical cores, and thus 8 threads.
My guess is that if we change the batch size and num_threads, results would be much better.
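
As a quick sanity check on the core layout, something like this standard lscpu invocation would show it (a suggestion, not output from the PR's hardware inventory):

# Compare logical CPUs, threads per core, and cores per socket.
lscpu | grep -E 'CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'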

@sourcesync (Collaborator, Author) commented Sep 12, 2024

> Paging @arron2003 to take a look at SCANN results here--in your experience do results on this hardware look accurate to you? Any thoughts on whether there may be an easily addressed issue?

> Can you share which VM was used for this, and how I can reproduce it? I think the issue might be that we are using 8 threads in the submission. https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/neurips23/ood/scann/scann.py#L57

> For the previous Azure VM with 16 vCPUs, there were only 8 physical cores, and thus 8 threads. My guess is that if we change the batch size and num_threads, results would be much better.

Hi @arron2003! This is a bare-metal system. I put a detailed hardware inventory in this README; scroll down to Hardware Inventory. These systems were donated by Latitude.sh. If you need access, I would need to give you credentials and instructions; let me know if that would be useful. I can also run some commands on your behalf if that's easier.

@arron2003 (Contributor) commented:

It would be helpful if you could share the credentials with me.

  1. As a shot in the dark, you can use the following command to bump the thread count to 16:
sed -i 's/set_num_threads(8)/set_num_threads(16)/g;s/batch_size=12500/batch_size=6250/g' neurips23/ood/scann/scann.py
  2. However, I still think something else is missing. Usually ScaNN performs much better on bare-metal machines compared to cloud VMs. That's why I'd like to dig a bit into the issue.
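
A quick way to confirm the substitution took effect is just to grep the file the sed touches (hypothetical check, not part of the benchmark tooling):

grep -nE 'set_num_threads|batch_size' neurips23/ood/scann/scann.py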

@sourcesync (Collaborator, Author) commented:

> It would be helpful if you could share the credentials with me.
>
>   1. As a shot in the dark, you can use the following command to bump the thread count to 16:
> sed -i 's/set_num_threads(8)/set_num_threads(16)/g;s/batch_size=12500/batch_size=6250/g' neurips23/ood/scann/scann.py
>   2. However, I still think something else is missing. Usually ScaNN performs much better on bare-metal machines compared to cloud VMs. That's why I'd like to dig a bit into the issue.

OK @arron2003, let's get you access. What's the best way to share VPN and login credentials with you privately? I can send the credentials to an email of your choice, or I can invite you to my Slack for a DM (I'll need an email in that case as well). Or something else?

@arron2003 (Contributor) commented:

You can find me at my github handle @ gmail dot com.

@sourcesync (Collaborator, Author) commented:

> You can find me at my github handle @ gmail dot com.

OK @arron2003, sent.

@sourcesync (Collaborator, Author) commented:

@arron2003, hey, I went ahead and merged your remote main into my PR branch. Here are the new rankings. Does that look OK?

@arron2003 (Contributor) commented Sep 13, 2024 via email

@sourcesync (Collaborator, Author) commented:

FYI, I also updated the OOD graph in the README, @arron2003.
