
Reduce the number of workers on CS from 5 to 3 #103

Merged: 1 commit from reduce-num-workers into master on Apr 21, 2024
Conversation

kcreekdev (Contributor)

It looks like we are running into some memory issues. This PR attempts to resolve the following error (a configuration sketch follows the log):

async task https://workers.ofralabs.net/api/v1/jobs/callback/3e976dbf-0b59-4baa-bbaf-96d3c20f75b9/ <function sim at 0x79d699030d60> None
getting task_kwargs
got task_kwargs {'meta_param_dict': {'year': [{'value': 2024}], 'time_path': [{'value': False}], 'data_source': [{'value': 'CPS'}]}, 'adjustment': {'OG-USA Parameters': {}, 'Tax-Calculator Parameters': {}}}
Meta_param_dict =  {'year': [{'value': 2024}], 'time_path': [{'value': False}], 'data_source': [{'value': 'CPS'}]}
adjustment dict =  {'OG-USA Parameters': {}, 'Tax-Calculator Parameters': {}}
/opt/conda/lib/python3.11/site-packages/distributed/client.py:3157: UserWarning: Sending large graph of size 246.75 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
/home/OG-USA/ogusa/macro_params.py:110: FutureWarning: The default fill_method='pad' in Series.pct_change is deprecated and will be removed in a future version. Either fill in any non-leading NA values prior to calling pct_change or specify 'fill_method=None' to not fill NA values.
  fred_data_q["GDP Per Capita"].pct_change(periods=4, freq="QE").mean()
Running current law policy baseline
Year:  2033
Running current law policy baseline
Year:  2028
year= 2030 age= all ages
year= 2029 age= all ages
Running current law policy baseline
Year:  2032
Running current law policy baseline
Year:  2029
Running current law policy baseline
Year:  2034
year= 2033 age= all ages
year= 2025 age= all ages
Running current law policy baseline
Year:  2024
Running current law policy baseline
Year:  2026
year= 2028 age= all ages
year= 2032 age= all ages
Running current law policy baseline
Year:  2031
Running current law policy baseline
Year:  2025
year= 2027 age= all ages
year= 2024 age= all ages
year= 2034 age= all ages
Running current law policy baseline
Year:  2030
Running current law policy baseline
Year:  2027
year= 2031 age= all ages
year= 2026 age= all ages
Using baseline tax parameters from  /home/OG-USA/cs-config/cs_config/OUTPUT_BASELINE/TxFuncEst_baseline.pkl
BW =  11 begin year =  2024 end year =  2034
Finished tax function loop through 11 years and 1 ages per year.
Tax function estimation time: 21.276 sec
2024-04-21 12:17:15,760 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.78 GiB -- Worker memory limit: 5.40 GiB
[... the same unmanaged-memory warning repeats 17 more times between 12:17:17 and 12:18:10, with unmanaged memory fluctuating between 3.80 GiB and 4.29 GiB against the 5.40 GiB worker limit ...]
2024-04-21 12:18:17,003 - distributed.worker.memory - WARNING - gc.collect() took 11.642s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
2024-04-21 12:18:17,003 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 4.34 GiB -- Worker memory limit: 5.40 GiB
2024-04-21 12:18:17,005 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 4.34 GiB -- Worker memory limit: 5.40 GiB
2024-04-21 12:18:17,008 - distributed.core - ERROR - Exception while handling op scatter
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/distributed/core.py", line 970, in _handle_comm
    result = await result
             ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/distributed/scheduler.py", line 6145, in scatter
    await self.replicate(keys=keys, workers=workers, n=n)
  File "/opt/conda/lib/python3.11/site-packages/distributed/scheduler.py", line 7041, in replicate
    for ws in random.sample(tuple(workers - ts.who_has), count):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/random.py", line 456, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
2024-04-21 12:18:17,042 - distributed.worker.memory - WARNING - Worker is at 42% memory usage. Resuming worker. Process memory: 2.31 GiB -- Worker memory limit: 5.40 GiB
SS using initial guess factors for r and TR of 1.0 and 1.0 , respectively.
saving results...
resp 201 https://workers.ofralabs.net/api/v1/jobs/callback/3e976dbf-0b59-4baa-bbaf-96d3c20f75b9/
2024-04-21 12:18:17,300 - distributed.scheduler - ERROR - Removing worker 'tcp://127.0.0.1:41757' caused the cluster to lose scattered data, which can't be recovered: {'Specifications-0d5568a19df056c079a960f9db15690f'} (stimulus_id='handle-worker-cleanup-1713701897.300584')
2024-04-21 12:18:20,494 - distributed.nanny - WARNING - Worker process still alive after 3.1999992370605472 seconds, killing
2024-04-21 12:18:20,494 - distributed.nanny - WARNING - Worker process still alive after 3.199999389648438 seconds, killing
2024-04-21 12:18:20,495 - distributed.nanny - WARNING - Worker process still alive after 3.199999389648438 seconds, killing
2024-04-21 12:18:20,496 - distributed.nanny - WARNING - Worker process still alive after 3.199999389648438 seconds, killing
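
For context on the fix itself: with Dask's default memory_limit="auto", a local cluster splits the machine's memory evenly across workers, so the 5.40 GiB per-worker limit in the log (5 workers, roughly 27 GiB total) would rise to about 9 GiB per worker with only 3 workers. The sketch below is illustrative only; the cluster construction, parameter values, and the names specs and run_simulation are assumptions, not the actual cs-config code.

# Illustrative sketch only -- the actual cs-config change may differ.
from dask.distributed import Client, LocalCluster

# Fewer workers on the same machine means a larger share of RAM per worker.
# With memory_limit="auto", Dask divides available memory by n_workers, so
# dropping from 5 workers (5.40 GiB each in the log above) to 3 gives each
# worker roughly 9 GiB before it hits the pause/terminate thresholds.
cluster = LocalCluster(
    n_workers=3,            # reduced from 5
    threads_per_worker=1,   # hypothetical; keep whatever cs-config uses
    memory_limit="auto",    # split available memory across the 3 workers
)
client = Client(cluster)

# The "Sending large graph of size 246.75 MiB" warning also suggests
# scattering large inputs once and passing futures to tasks, e.g.:
#   specs_future = client.scatter(specs)
#   result = client.submit(run_simulation, specs_future)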


codecov-commenter commented Apr 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.41%. Comparing base (3a10258) to head (5036e85).

Additional details and impacted files:
@@           Coverage Diff           @@
##           master     #103   +/-   ##
=======================================
  Coverage   75.41%   75.41%           
=======================================
  Files          11       11           
  Lines         850      850           
=======================================
  Hits          641      641           
  Misses        209      209           
Flag         Coverage Δ
unittests    75.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

jdebacker (Member)

@kcreekdev thanks for the PR. Merging.

jdebacker merged commit e4655db into master on Apr 21, 2024
10 checks passed
jdebacker deleted the reduce-num-workers branch on Apr 21, 2024 at 13:05