VGG16 scoring model taking up > 500 GB of RAM #1305

Open
ShreyaKapoor18 opened this issue Oct 9, 2024 · 12 comments

@ShreyaKapoor18

ShreyaKapoor18 commented Oct 9, 2024

Hi,

I was trying to score a VGG16 model on our own cluster, i.e. by running a local instance.
When I use the following layer names

['avgpool' , 'features', 'classifier']

i.e. the more high-level layers, the RAM consumption is fine and scoring works.
However, when I switch to the more detailed layer names

['features.1', 'features.2', 'features.3', 'features.4', 'features.5', 'features.6', 'features.7', 'features.8', 'features.9', 'features.10', 'features.11', 'features.12', 'features.13', 'features.14', 'features.15', 'features.16', 'features.17', 'features.18', 'features.19', 'features.20', 'features.21', 'features.22', 'features.23', 'features.24', 'features.25', 'features.26', 'features.27', 'features.28', 'features.29', 'features.30', 'classifier.0', 'classifier.1', 'classifier.2', 'classifier.3', 'classifier.4', 'classifier.5', 'classifier.6']

it takes more than 500 GB of RAM and runs out of memory (OOM).
The thing is, with the high-level layers
my scores are

    "imagenet_trained": {
        "V4": "0.3685106454430227",
        "IT": "0.5185169380743393",
        "V1": "0.09256884647589658",
        "V2": "0.2600441204932774"
    }

In V1 the score without training is higher than the ImageNet-trained score, which is a weird effect since
the weights are random. I know a randomly initialized network can sometimes match by chance as
a statistical artefact, but this occurs in two separate iterations:

    "no_training": {
        "V4": "0.3413502290434312",
        "IT": "0.2947047868783302",
        "V1": "0.2026004427555423",
        "V2": "0.1448800686541028"
    },
    "no_training_2": {
        "V4": "0.33954039465787034",
        "IT": "0.29491768114613165",
        "V1": "0.1974565275931902",
        "V2": "0.15089219267469867"
    }
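
To make the comparison easier to read, here is a minimal sketch that takes the three score blocks exactly as posted above and computes, per region, the trained score minus the mean of the two untrained runs. The function name `trained_minus_untrained` is only illustrative; nothing here calls Brain-Score.

```python
import json

# Scores copied from the issue; the values are strings in the original output.
scores = json.loads("""
{
  "imagenet_trained": {"V4": "0.3685106454430227", "IT": "0.5185169380743393",
                       "V1": "0.09256884647589658", "V2": "0.2600441204932774"},
  "no_training":      {"V4": "0.3413502290434312", "IT": "0.2947047868783302",
                       "V1": "0.2026004427555423", "V2": "0.1448800686541028"},
  "no_training_2":    {"V4": "0.33954039465787034", "IT": "0.29491768114613165",
                       "V1": "0.1974565275931902", "V2": "0.15089219267469867"}
}
""")


def trained_minus_untrained(scores, trained="imagenet_trained"):
    """Per region: trained score minus the mean of all untrained runs."""
    untrained = [run for name, run in scores.items() if name != trained]
    deltas = {}
    for region, value in scores[trained].items():
        baseline = sum(float(run[region]) for run in untrained) / len(untrained)
        deltas[region] = float(value) - baseline
    return deltas


deltas = trained_minus_untrained(scores)
for region in ("V1", "V2", "V4", "IT"):
    # A negative value means the untrained runs scored higher on average.
    print(f"{region}: {deltas[region]:+.3f}")
```

With the numbers above, V1 is the only region where the difference is negative.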

I am using the following public benchmarks for scoring my model

    benchmark_identifiers = ['MajajHong2015public.V4-pls', 'MajajHong2015public.IT-pls',
                             'FreemanZiemba2013public.V1-pls', 'FreemanZiemba2013public.V2-pls']

Any help would be greatly appreciated.

Best regards,
Shreya

@mike-ferguson
Member

Hi Shreya, thanks for opening an issue! Usually in Brain-Score, earlier layers of a model are more expensive to score in terms of RAM, as their activations tend to be much larger than those of later layers. It could also be that, for VGG16, the more granular layers are larger themselves, or are full convolutional layers as opposed to, say, a pooling or ReLU layer (I am not entirely sure here, as I would need a refresher on the VGG16 architecture). As for random weights scoring higher, I am looping in @mschrimpf, who may be able to offer more scientific insight into what might be occurring.
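
To make the size argument concrete, here is a minimal sketch that lists each `features.N` layer with its kind and flattened activation size. It assumes the standard torchvision VGG16 layout (13 conv layers, each followed by a ReLU, plus five max-pools) and a 224x224 input. `NUM_STIMULI` is a hypothetical count, not taken from any benchmark, and the sketch says nothing about how Brain-Score stores or reduces activations internally.

```python
# Standard VGG16 configuration: numbers are conv output channels,
# "M" is a 2x2 max-pool. A ReLU follows every conv.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_feature_layers(input_size=224):
    """Return (name, kind, flattened_size) for features.0 ... features.30."""
    layers, channels, size, index = [], 3, input_size, 0
    for item in VGG16_CFG:
        if item == "M":
            size //= 2  # 2x2 max-pool with stride 2 halves height and width
            layers.append((f"features.{index}", "pool", channels * size * size))
            index += 1
        else:
            channels = item  # 3x3 conv with padding 1 keeps height and width
            for kind in ("conv", "relu"):
                layers.append((f"features.{index}", kind, channels * size * size))
                index += 1
    return layers


def gigabytes(num_values, num_stimuli, bytes_per_value=4):
    """Memory for float32 activations of `num_stimuli` images, in GB (1e9 bytes)."""
    return num_values * num_stimuli * bytes_per_value / 1e9


NUM_STIMULI = 1000  # hypothetical stimulus count, for illustration only
for name, kind, n in vgg16_feature_layers():
    print(f"{name:12s} {kind:5s} {n:>9,d} values  {gigabytes(n, NUM_STIMULI):6.2f} GB")
```

Under these assumptions the conv layers are `features.0, 2, 5, 7, 10, 12, 14, 17, 19, 21, 24, 26, 28`, and a layer in the first block holds 128 times as many values per image as `features.30`.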

@ShreyaKapoor18
Author

Dear Mike,

Thanks a lot for your answer! It is really helpful; yes, more layers do lead to higher computational cost. I just wanted to know how an untrained network
could match so well. I guess that will fill the gap.

Best regards,
Shreya

@ShreyaKapoor18
Author

ShreyaKapoor18 commented Nov 8, 2024

Hi @mike-ferguson
In such a case, do people usually compute the Brain-Score on the convolutional layers only?

Best regards,
Shreya

@mike-ferguson
Member

Hi @ShreyaKapoor18, sorry for the late reply, I was out of office for a couple of days. As to your question, I think it depends! For the most part, people tend to pick the conv layers when hand-selecting layers to score, because they align nicely with the conceptual framework. However, I have also seen pooling layers and even ReLU/activation-function layers passed in manually. I do not know off the top of my head which tend to align better, and I am sure Martin and others have looked into this, but I think passing in all conv layers is a reasonable thing to do!

@ShreyaKapoor18
Author

Hi Mike,

Thanks for your reply. I did just that, passing only the conv layers, but I am still running OOM.
It's just a bit confusing what the state of the art is for such comparisons.
The other methods I use to align networks to the brain are usually not this computationally expensive, so I am not able to compare Brain-Score results with them for these networks. I guess it is an open question.

Best regards,
Shreya Kapoor

@mike-ferguson
Member

@ShreyaKapoor18 Gotcha. Have you tried submitting recently on our website? If you do that, I should be able to see the logs and troubleshoot what exactly is eating up so much memory. We also have a new layer-mapping procedure that should drastically cut down on RAM usage, but at the moment it is only available when submitting through our site (or a PR); we are still working on deploying that fix to local scoring.

@ShreyaKapoor18
Author

Hi @mike-ferguson
I submitted the models on your website 23 hours ago, but I have not received any email saying whether they failed or were scored.
My email id is [email protected]. What could the reason be?

Best regards,
Shreya Kapoor

@KartikP
Collaborator

KartikP commented Nov 30, 2024

@ShreyaKapoor18

Hi Shreya, a number of PRs were automatically made yesterday with a model identifier called "vgg16_less_variation_iteration=1". Do these happen to be yours? I'll link a few of the latest ones below.

If yes, they also seem to be running into the same issue with the use of ModelCommitment. Specifically, only the identifier has been passed as an argument, but activations_model and layers also need to be passed. You already have these in the model.py file; you just need to refer to them in `__init__.py` when adding your model to the registry.

You can refer to the resnet50_tutorial model.

As to why you were not notified: we only send out emails on the status of scoring, but your submissions were failing at the previous step (unit tests), for which we do not send notifications. You point out an important gap in our communication, and we'll try to implement at least some notification (whether a GitHub comment or a full email) to close it.

Let me know if this helps or not.

Latest web submissions to the repo
#1496
#1495

If you want to see the reason for failure, you can scroll through the checks at the bottom of the PR. You will see "Brain-Score Plugins Unit tests (AWS Jenkins, AWS Execution) — Build Failure". You can click on details > Console Output or the Console Output (Parsed).

The common errors:

  • model_registry['vgg16_less_variation_iteration=1'] = lambda: ModelCommitment(identifier='vgg16_less_variation_iteration=1')
    E TypeError: ModelCommitment.__init__() missing 2 required positional arguments: 'activations_model' and 'layers'

  • ERROR: Could not find a version that satisfies the requirement open_clip (from unknown) (from versions: none)
    ERROR: No matching distribution found for open_clip

I would personally try to resolve the first error and see if that also handles the open_clip error. If open_clip appears again, we may need to find the version you are using with your local environment and just add it to setup.

@ShreyaKapoor18
Author

Hi @KartikP

Yes, these were my submissions.
Thanks a lot for your reply; I really appreciate how clear it was. I fixed the errors and submitted again. However, I don't know yet whether it's running or was cancelled somehow.

Best regards,
Shreya

@KartikP
Collaborator

KartikP commented Nov 30, 2024

@ShreyaKapoor18 I'm just keeping an eye on your web submissions. It seems you made some of the changes but not all of the required ones (https://github.com/brain-score/vision/pull/1499/files). The console log for this PR shows:

>   model_registry['vgg16_less_variation_iteration=1'] = lambda: ModelCommitment(identifier='vgg16_less_variation_iteration=1',activations_model=get_model('vgg16_less_variation_iteration=1'), layers=get_layers('vgg16_less_variation_iteration=1'))
E   NameError: name 'get_model' is not defined

brainscore_vision/models/vgg16_less_variation_1/__init__.py:4: NameError

Specifically, the `__init__.py` file needs to import your get_model() and get_layers() from the model.py file.
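
The two failures quoted in this thread (the `TypeError` and the `NameError`) can be reproduced in a minimal sketch. To run without brainscore_vision installed, it uses stand-ins: `model_registry`, `ModelCommitment`, `get_model` and `get_layers` below are hypothetical replacements. The import paths in the comments are from memory and should be checked against the resnet50_tutorial plugin.

```python
# Stand-ins so this sketch runs on its own. In a real plugin's __init__.py
# these would be imported instead (paths are assumptions, verify them against
# the resnet50_tutorial plugin):
#   from brainscore_vision import model_registry
#   from brainscore_vision.model_helpers.brain_transformation import ModelCommitment
#   from .model import get_model, get_layers
model_registry = {}


class ModelCommitment:  # hypothetical stand-in with the same required arguments
    def __init__(self, identifier, activations_model, layers):
        self.identifier = identifier
        self.activations_model = activations_model
        self.layers = layers


def get_model(identifier):  # stand-in for get_model() in model.py
    return f"<activations model for {identifier}>"


def get_layers(identifier):  # stand-in for get_layers() in model.py
    return ["features.2", "features.7"]  # placeholder layer names


# Broken: only the identifier is passed. Because the registry stores a lambda,
# the TypeError appears when the entry is called, not when the file is imported.
broken = lambda: ModelCommitment(identifier='vgg16_less_variation_iteration=1')

# Fixed: pass all three arguments, using the helpers imported from model.py.
model_registry['vgg16_less_variation_iteration=1'] = lambda: ModelCommitment(
    identifier='vgg16_less_variation_iteration=1',
    activations_model=get_model('vgg16_less_variation_iteration=1'),
    layers=get_layers('vgg16_less_variation_iteration=1'))

try:
    broken()
except TypeError as error:
    print("broken entry:", error)

model = model_registry['vgg16_less_variation_iteration=1']()
print("fixed entry:", model.identifier, model.layers)
```

The registry key is kept as a string literal, as in the console log above, rather than being built from a variable.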

To explain our workflow: when you make a web submission, it creates a Pull Request into this repo on your behalf. If you go to the Pull Requests tab, you will see your web submission and all the files associated with it, and you can find its status there. Once unit tests pass, we perform layer mapping, then the PR is merged, and once merged the model is scored. Once scored, you will get an email on success or failure, and your score will appear on the leaderboard after 24 hrs.

I'm also seeing some newer PRs from your web submissions that have a broken `__init__.py` file that attempts to run tests instead of adding your model to the registry: https://github.com/brain-score/vision/pull/1503/files

If you would like, you can message me on Slack and we can confirm the code before submitting.

@ShreyaKapoor18
Author

Thanks @KartikP this is awesome!
Thank you very much! I now have the hang of it and think it should work!

Best regards,
Shreya

@ShreyaKapoor18
Author

Hi everyone,

I have submitted multiple models and they seem to work! However, when I use the method for submitting multiple models at once, as in
https://github.com/brain-score/vision/tree/master/brainscore_vision/models/scaling_models

I get a duplicate-IDs error. To check that, I ran the following:


python brainscore_vision score --model_identifier='alexnet_no_variation_1' --benchmark_identifier='Ferguson2024convergence-value_delta'

(myenv) shreya@faustaff-010-020-035-084 objaverse-training % python brainscore_vision score --model_identifier='alexnet_less_variation_1' --benchmark_identifier='Ferguson2024convergence-value_delta'
/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/metrics/__init__.py:16: FutureWarning: xarray subclass Score should explicitly define __slots__
  class Score(DataAssembly):
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/shreya/Documents/GitHub/objaverse-training/brainscore_vision/__main__.py", line 20, in <module>
    fire.Fire()
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/Documents/GitHub/objaverse-training/brainscore_vision/__main__.py", line 9, in score
    result = _score_function(model_identifier, benchmark_identifier, conda_active)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_vision/__init__.py", line 103, in score
    return wrap_score(__file__,
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/conda_score.py", line 88, in wrap_score
    result = score_function(model_identifier, benchmark_identifier)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_vision/__init__.py", line 74, in _run_score
    model: BrainModel = load_model(model_identifier)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_vision/__init__.py", line 65, in load_model
    import_plugin('brainscore_vision', 'models', identifier)
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/import_plugin.py", line 105, in import_plugin
    importer = ImportPlugin(library_root=library_root, plugin_type=plugin_type, identifier=identifier,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/import_plugin.py", line 32, in __init__
    self.plugin_dirname = self.locate_plugin()
                          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/import_plugin.py", line 59, in locate_plugin
    assert plugin_registrations_count > 0, f"No registrations found for {self.identifier}"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: No registrations found for alexnet_less_variation_1
(myenv) shreya@faustaff-010-020-035-084 objaverse-training % python brainscore_vision score --model_identifier='alexnet_no_variation_1' --benchmark_identifier='Ferguson2024convergence-value_delta'
/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/metrics/__init__.py:16: FutureWarning: xarray subclass Score should explicitly define __slots__
  class Score(DataAssembly):
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/shreya/Documents/GitHub/objaverse-training/brainscore_vision/__main__.py", line 20, in <module>
    fire.Fire()
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/Documents/GitHub/objaverse-training/brainscore_vision/__main__.py", line 9, in score
    result = _score_function(model_identifier, benchmark_identifier, conda_active)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_vision/__init__.py", line 103, in score
    return wrap_score(__file__,
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/conda_score.py", line 88, in wrap_score
    result = score_function(model_identifier, benchmark_identifier)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_vision/__init__.py", line 74, in _run_score
    model: BrainModel = load_model(model_identifier)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_vision/__init__.py", line 65, in load_model
    import_plugin('brainscore_vision', 'models', identifier)
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/import_plugin.py", line 105, in import_plugin
    importer = ImportPlugin(library_root=library_root, plugin_type=plugin_type, identifier=identifier,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/import_plugin.py", line 32, in __init__
    self.plugin_dirname = self.locate_plugin()
                          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/shreya/anaconda3/envs/myenv/lib/python3.11/site-packages/brainscore_core/plugin_management/import_plugin.py", line 59, in locate_plugin
    assert plugin_registrations_count > 0, f"No registrations found for {self.identifier}"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: No registrations found for alexnet_no_variation_1
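
In the tracebacks above, `__main__.py` runs from the local clone while `brainscore_vision/__init__.py` and `brainscore_core` are loaded from site-packages, so the lookup may be searching the installed package rather than the clone. The following is a hedged diagnostic sketch that counts how often an identifier is registered under a given models directory. It assumes the plugin loader finds a model by looking for a literal `model_registry['<identifier>']` line in each plugin's `__init__.py`, which is an assumption about `locate_plugin` that this thread does not confirm. `find_registrations` is a hypothetical helper, not part of Brain-Score.

```python
import re
import tempfile
from pathlib import Path


def find_registrations(models_dir, identifier):
    """Return the __init__.py files under `models_dir` that register `identifier`.

    Assumes registrations look like model_registry['<identifier>'] = ...
    with the identifier written as a string literal.
    """
    pattern = re.compile(
        r"model_registry\[\s*(['\"])" + re.escape(identifier) + r"\1\s*\]")
    hits = []
    for init_file in sorted(Path(models_dir).glob("*/__init__.py")):
        if pattern.search(init_file.read_text(encoding="utf-8")):
            hits.append(init_file)
    return hits


# Self-contained demo on a throwaway directory with one fake plugin.
with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "alexnet_demo"
    plugin.mkdir()
    (plugin / "__init__.py").write_text(
        "model_registry['alexnet_no_variation_1'] = lambda: None\n",
        encoding="utf-8")
    print(len(find_registrations(tmp, "alexnet_no_variation_1")))    # one match
    print(len(find_registrations(tmp, "alexnet_less_variation_1")))  # no match
```

Pointing `find_registrations` at the `models` directory of the installed package and at the one in the local clone would show where, if anywhere, each identifier is registered. Zero hits would be consistent with "No registrations found", and more than one with a duplicate-ID error.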
