This repository has been archived by the owner on May 1, 2023. It is now read-only.

bug: logging to wandb does not work #487

Open
MartinBernstorff opened this issue Apr 19, 2023 · 1 comment

Comments


MartinBernstorff commented Apr 19, 2023

🤖 This is the current blocker for a full hyperparameter search.

I currently have to write the manuscript, so I am unable to debug this in detail. We need a permanent solution shared across projects, hence this issue.

Is this related to the latest internet shutdown? Maybe so, but we have already opened access to the wandb URLs I could find: https://api.wandb.ai.
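To check the firewall hypothesis from the training server itself, a minimal sketch along these lines might help. It only uses the Python standard library and tests whether a TCP connection to a hostname can be opened at all. The helper names `can_connect` and `report` are hypothetical, and port 443 is an assumption based on the `https://` URL above.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def report(hosts, port: int = 443) -> None:
    """Print one line per hostname saying whether it could be reached."""
    for host in hosts:
        status = "reachable" if can_connect(host, port) else "blocked or unreachable"
        print(f"{host}:{port} {status}")
```

Calling `report(["api.wandb.ai"])` on the server would show whether the one hostname named here is reachable; further hostnames can be appended to the list once they are known. Note that this only tests the TCP connection, so a proxy that accepts the connection but filters the traffic would still show up as reachable.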

  • Train single model - seemed to work fine
  • Run full model training using the train_models_in_parallel script
  • It only runs exactly 10 models, one for each lookahead/model architecture combination
    • I.e. it doesn't start more than one model for each combination, even though we get
      [image]
      • This implies there's something wrong with our hyperparameter search.
        • Running the command directly from the command line
          • It seems that this does not sync correctly. It gets stuck on the last line.
            [image]
        • This suggests that a wandb/hydra interaction is causing the problem, but only when using --multirun
          • Testing without multirun
            • Huh, same problem!
        • Testing again, but with just a simple "train model", i.e. circumventing Hydra's CLI
          • Still not working, that's weird!

It appears we might be hitting rate limiting, since training a single model worked fine the first time and then stopped working.

Proposed next steps for debugging:

  • Check if we can train and upload even a single model (note that syncing continues after [image])

  • If we can, but cannot train in parallel, it appears to be a wandb problem. Potential next steps:

    • Write performance to disk and drop wandb support
    • Switch to another provider (local/remote)
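The "write performance to disk" option could be sketched roughly as follows. This is not the project's existing logging code. It assumes each training run appends its metrics as one JSON line to its own file, so that runs trained in parallel never write to the same file. The function names, the file layout, and the metric names in the usage note are all hypothetical.

```python
import json
import time
from pathlib import Path


def log_metrics(log_dir, run_name, metrics):
    """Append one JSON line of metrics to <log_dir>/<run_name>.jsonl."""
    path = Path(log_dir) / f"{run_name}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": time.time(), **metrics}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path


def load_metrics(log_dir):
    """Read all records back from every run file in log_dir, sorted by file name."""
    rows = []
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    # The file name (without suffix) identifies the run.
                    rows.append({**json.loads(line), "run": path.stem})
    return rows
```

A run would call something like `log_metrics("results", "run_a", {"auc": 0.5})` after evaluation, and `load_metrics("results")` collects everything afterwards for comparison across the hyperparameter search.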

If anyone is up for debugging, they're more than welcome to go ahead and collect thoughts here! @HLasse, @sarakolding, @signekb, @bokajgd, @erikperfalk.

MartinBernstorff changed the title from "Wandb training does not work" to "bug: logging to wandb does not work" on Apr 19, 2023
MartinBernstorff commented

I have written to wandb support to ask whether there are other hostnames we need to open access to.
