Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make debug-test fails with 404 error #1954

Closed
mikesmit opened this issue Nov 7, 2024 · 5 comments · Fixed by PolicyEngine/policyengine-core#307 or #1962
Closed

make debug-test fails with 404 error #1954

mikesmit opened this issue Nov 7, 2024 · 5 comments · Fixed by PolicyEngine/policyengine-core#307 or #1962

Comments

@mikesmit
Copy link
Collaborator

mikesmit commented Nov 7, 2024

When running make debug-test on the policyengine-api package I get the following error:

==================================== ERRORS ====================================
________________________ ERROR collecting tests/python _________________________
/usr/lib64/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1204: in _gcd_import
    ???
<frozen importlib._bootstrap>:1176: in _find_and_load
    ???
<frozen importlib._bootstrap>:1147: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:690: in _load_unlocked
    ???
../policyengine-app/.venv/lib64/python3.11/site-packages/_pytest/assertion/rewrite.py:184: in exec_module
    exec(co, module.__dict__)
tests/python/conftest.py:12: in <module>
    download_microdata()
policyengine_api/download_microdata.py:11: in download_microdata
    dataset.download()
../policyengine-app/.venv/lib64/python3.11/site-packages/policyengine_core/data/dataset.py:308: in download
    raise ValueError(
E   ValueError: Invalid response code 404 for url https://api.github.com/repos/PolicyEngine/ukda/releases/tags/1.9.0.
@mikesmit
Copy link
Collaborator Author

mikesmit commented Nov 7, 2024

From slack chat:
Nikhil Woodruff
Today at 4:43 AM
Ah right, ok: so what’s happening here is the API uses our UK microdata for UK impacts. We’re not allowed to share outside the UK org, and the GH actions and live server use a token to download it. I recently changed the API to downloading it at startup time rather than whenever the first UK impact is received

Nikhil Woodruff
Today at 4:44 AM
We should probably add either a try except clause or env variable to not do this when testing non-UK parts

@mikesmit
Copy link
Collaborator Author

mikesmit commented Nov 7, 2024

Looking to see if I can locate "I recently changed the API to downloading it at startup time rather than whenever the first UK impact is received"...

Looks like it's related to Split simulations into chunks #1938

Specifically this bit adding download_microdata.py

@mikesmit
Copy link
Collaborator Author

mikesmit commented Nov 8, 2024

Chatted with @nikhilwoodruff this morning. TLDR: we want to load the UK data on demand, but without causing a race condition.

What's going wrong and why

  1. We used to load the file when needed to disk and then into memory
  2. When we added more than one worker task this caused a race condition where more than one worker would try to get the file. The "looser" would find the file, but it would be partially downloaded causing an error
  3. The solution was to pre-download the file before running any workers BUT
  4. The UK data cannot be downloaded to any old host (I.e. my desktop) because of legal issues

SO
When you try to run the tests, then it immediately attempts to load UK data even though that is not necessary and fails because you can't access it without special permission.

Solutions

Preferred solution: We're loading this into memory anyway. Instead of caching it on disk we could just load it directly into memory once per worker

Pros:

  • no issues with multiple processes borking each other
  • simple to implement
    Cons:
  • Cost of network traffic to get the file X the number of workers (5). - since this is only when we restart the container... seems fine.

Alternative - Atomic download

Download to a randomly-named tmp file and then use a move (if you move within the same disk partition this is atomic, not a copy) to the name you want.

Then either the file is there or not. If two processes do it at the same time, they will all download it, but one will "win".

Pros:

  • We download the file slightly less
  • This provides a model for caching things to disk we don't want always in memory
    Cons:
  • slightly more complicated than just caching to memory.

@mikesmit
Copy link
Collaborator Author

mikesmit commented Nov 8, 2024

Looking a bit more closely... I'm not 100% confident that I know all the places file_path is used. I think I'll just solve this one problem of making the file download atomic (option 2) in this one case.

@mikesmit
Copy link
Collaborator Author

mikesmit commented Nov 8, 2024

put in a pull request to core here: PolicyEngine/policyengine-core#307 that will make the download/write atomic.

This should let us just back out the preloading of the microdata which will in turn mean we don't get errors for trying to download the UK microdata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant