[bug report] about the mem_limit of dockers #547

Open
ShawnShawnYou opened this issue Oct 16, 2024 · 1 comment
Comments

@ShawnShawnYou

Hi Team,

I've been trying to test several algorithms on the benchmark and used the following command:

python3 run.py --parallelism 31 --dataset gist-960-euclidean --runs 5 --force

I found that many algorithms failed with error 137. Checking the log, I saw that some algorithms were allocated less memory than others. The machine I used is the same as Erikbern's (i.e., an r6i.16xlarge machine on AWS with 512 GB of memory). Part of the log reads as follows:

[Screenshot of the run log showing the per-container memory limits]

We expect each algorithm to be limited to about 512 GB / 32 = 16 GB of memory. However, as the log shows, these algorithms are only allocated about 11 GB, which is far less than expected.

About the fix

Checking the code, I found a bug in how mem_limit is set at Line 73 of ann_benchmarks/main.py:

mem_limit = int((psutil.virtual_memory().available - memory_margin) / args.parallelism)

When using "available," the algorithms in the first batch will get 16 GB of memory, while the algorithms in the latter batches will get less than 16 GB of memory.

So I think this line should be modified as follows to compute the correct memory limit:

mem_limit = int((psutil.virtual_memory().total - memory_margin) / args.parallelism)

Or am I misunderstanding how mem_limit is supposed to be set? Thanks!
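For context on why the wrong value surfaces as error 137: the computed limit ends up as the Docker container's memory cap, and a container that exceeds it is OOM-killed (exit code 137 = 128 + SIGKILL). A rough, hypothetical sketch of that path using the docker SDK (image name, command, and values are placeholders, not the repository's actual runner code):

```python
# Hypothetical sketch of how a computed mem_limit is enforced by Docker;
# the image name and command are placeholders, not ann-benchmarks internals.
import docker
import psutil

memory_margin = 4 * 1024**3   # assumed margin
parallelism = 32              # assumed parallelism

mem_limit = int((psutil.virtual_memory().total - memory_margin) / parallelism)

client = docker.from_env()
container = client.containers.run(
    "ann-benchmarks-example",            # placeholder image name
    command="python3 run_algorithm.py",  # placeholder command
    mem_limit=mem_limit,   # exceeding this cap gets the container SIGKILLed,
    detach=True,           # which shows up as exit code 137
)
```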

Environment
an r6i.16xlarge machine on AWS

@maumueller
Collaborator

Thanks @ShawnShawnYou. I'm a bit split here, because that change only works for machines that are completely devoted to running the benchmark. In a setting where you share the machine, it would be wrong to split it up like you propose.
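A rough numeric illustration of that concern (the numbers are assumed, not taken from this issue): if another workload already occupies part of the machine, splitting the total rather than the available memory promises more than is actually free.

```python
# Illustration only, with assumed numbers: total-based splitting on a shared
# machine can hand out more memory than is actually free.
total = 512 * 1024**3           # physical memory of the machine
other_workload = 200 * 1024**3  # memory already used by other processes (assumed)
available = total - other_workload
parallelism = 32

per_container_from_total = total // parallelism    # ~16 GiB each
promised = per_container_from_total * parallelism  # 512 GiB in total
print(promised > available)  # True: the containers can collectively exceed
                             # what is free and get OOM-killed (exit 137)
```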

The problem only appears when a container is done and a new one is spawned, doesn't it?
