segmentation fault #66
Same here.
@ebrevdo Any ideas here? I don't know anything about the details of tf-agents, but the only thing I can think of that could cause a segmentation fault is the shape of the data not matching expectations. WDYT?
@samarth-robo When you create the Table, do you give it a signature?

@tfboyd I wonder if part of our release/nightly build could be to create a debug version of Reverb and upload it to a GCS bucket, so when people get segfaults like this we can point them to a debug build they can use? Debug builds come out to ~200 MB each; perhaps we could just do it for release versions?

@samarth-robo @elhamAm do you have a small repro we can use to try and debug?
@ebrevdo yes, I give it the signature, following this example (tf_agents/experimental/distributed/examples/sac/sac_reverb_server.py). Unfortunately I cannot share the env, but the rest should be OK. I will work on putting an example together. Repro is made more difficult by the fact that the segfault happens randomly after a few hours of training.
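For reference, here is a minimal sketch of how the table gets its signature in my setup, modeled on that example (the spec shapes, table name, capacity, and port below are placeholders, not my actual configuration):

```python
import reverb
import tensorflow as tf

# Placeholder per-step spec; in the real setup this comes from the
# agent's collect_data_spec, as in the linked SAC example.
sequence_length = 2
step_spec = {
    'observation': tf.TensorSpec([17], tf.float32),
    'action': tf.TensorSpec([6], tf.float32),
    'reward': tf.TensorSpec([], tf.float32),
}

# The table signature describes full items, i.e. sequences of steps.
signature = tf.nest.map_structure(
    lambda spec: tf.TensorSpec([sequence_length] + spec.shape.as_list(),
                               spec.dtype),
    step_spec)

table = reverb.Table(
    name='uniform_table',                          # placeholder name
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=100_000,                              # placeholder capacity
    rate_limiter=reverb.rate_limiters.MinSize(1),
    signature=signature,
)
server = reverb.Server(tables=[table], port=8008)  # placeholder port
```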
Are you able to build a version of Reverb? Have you worked with Bazel before? If so, I can give you instructions on how to get better stack traces. If not, I may be able to provide you a pip wheel to install in a few days.
@ebrevdo I made this repository to reproduce the issue. It uses a random environment. But (un)fortunately I have not seen a segfault using that code yet. When I examined the segfaults with GDB, the backtraces pointed to Reverb. But does this new information indicate the issue is in the env (which uses MuJoCo and robosuite) or in its interaction with tf-agents? I am not sure now. I have not worked with Bazel before. I would appreciate a pip wheel. Thanks!
Oh, actually I ran the above repository again and it crashed. Unfortunately I was not running it with GDB this time. Here is the error message:
INFO:__main__:Train 4789/15000: reward=1.1835, episode_length=125.0000, steps=598625.0000, episodes=4789.0000, collect_speed=619.5698, train_speed=68.3351, loss=0.9424
Traceback (most recent call last):
File "/home/xxx/research/reverb_segfault_repro/trainer.py", line 238, in <module>
trainer.train()
File "/home/xxx/research/reverb_segfault_repro/trainer.py", line 183, in train
losses: LossInfo = learner.run(iterations=n_sgd_steps)
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tf_agents/train/learner.py", line 246, in run
loss_info = self._train(iterations, iterator, parallel_iterations)
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
result = self._call(*args, **kwds)
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 924, in _call
results = self._stateful_fn(*args, **kwds)
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 3039, in __call__
return graph_function._call_flat(
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 1963, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 591, in call
outputs = execute.execute(
File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot unpack column 4097 in chunk 2341602335940456937 which has 6 columns.
[[{{node while/body/_103/while/IteratorGetNext}}]] [Op:__inference__train_327771]
Function call stack:
_train
[reverb/cc/platform/default/server.cc:84] Shutting down replay server
Segmentation fault (core dumped)
This is not the same error as the one I mentioned at the top, but I have experienced this one before. I will run it with GDB again and report a stack trace if possible. @ebrevdo is that repository useful to you?
Interesting. I would have expected a cleaner error message if you created your Table with a signature. I'll try to repro on Monday as well.
@ebrevdo it is not crashing after I removed the asynchronicity, i.e. ensured that data was not being written to the Reverb replay buffer while it was being sampled from.
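Concretely, the loop now strictly alternates between writing and sampling, roughly like this (simplified sketch; `push_experience`, `num_iterations`, and `num_env_steps` are stand-ins for what my Ray actor and trainer actually do, and only `learner.run(iterations=n_sgd_steps)` matches the real code):

```python
def train_loop(push_experience, learner, num_iterations, num_env_steps, n_sgd_steps):
    """Non-concurrent loop: writing to Reverb fully finishes before sampling."""
    for _ in range(num_iterations):
        # 1) Collect: the (single) actor steps the env and pushes trajectories
        #    to the Reverb server; this call blocks until writing is done.
        push_experience(num_env_steps)

        # 2) Train: only now is the table sampled, so reads never overlap
        #    with in-flight writes.
        losses = learner.run(iterations=n_sgd_steps)
    return losses
```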
I'll leave this open until we can figure out what's going on or move everyone over to TrajectoryWriter. Thanks for the report and the repro, and for the additional details about parallel write/read (that should work just fine).
I think the error there is that we're reading from some bad memory, and likely related to the segfaults. The
@qstanczyk Could this be related to your PR "Batch sample responses"?
The reason why I thought concurrent write/read would not work is this blue note in the documentation of ReverbReplayBuffer.as_dataset() (https://www.tensorflow.org/agents/api_docs/python/tf_agents/replay_buffers/ReverbReplayBuffer#as_dataset). If you want to test concurrent write/read, revert the commit ebfba0ab7c474b3831279a07e5e65e8af98f4269 (samarth-robo/reverb_segfault_repro@ebfba0a).
That note doesn't apply to Reverb, since the iterator for Reverb is aware of updates to the service. Basically it should just work :)
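In other words, a pattern along these lines should be safe (a sketch, not verbatim from any codebase; `reverb_replay_buffer` is a tf_agents `ReverbReplayBuffer`, and `train_step`/`num_train_steps` are placeholders):

```python
def train_from_reverb(reverb_replay_buffer, train_step, num_train_steps):
    """Create the dataset once; it keeps reflecting items added by writers."""
    dataset = reverb_replay_buffer.as_dataset(
        sample_batch_size=256,  # placeholder
        num_steps=2,            # placeholder
    )
    iterator = iter(dataset)
    for _ in range(num_train_steps):
        # Sampling here is fine even while actors are concurrently writing
        # new items on the server side.
        experience, sample_info = next(iterator)
        train_step(experience)
```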
I checked out your repo (the initial check-in, not the subsequent one), installed the same version of TF-Agents, and built my own Reverb 0.4.0, using TF 2.6.0 and Python 3.9. I ran trainer.py and so far I haven't seen any errors or segfaults; I'm at train step 1695/15000. How long did it take until you saw the error?
Thanks. I also have Python 3.9.5 and TF 2.6.0. I'm assuming tf-agents 0.9.0 pulls in reverb 0.4.0? It is difficult to find out the version of pip-installed Reverb. I usually saw the segfault a little later, around iteration 5000; for example, the one above was at 4789. It does not happen every time, but I have seen it at least twice.
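(For the record, the closest I could get was querying the installed distribution directly, assuming the wheel is the PyPI `dm-reverb` package:)

```python
# Print the version of the installed dm-reverb wheel (Python 3.8+).
import importlib.metadata
print(importlib.metadata.version('dm-reverb'))
```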
@ebrevdo I was wondering if you have been able to reproduce? I do see these segfaults intermittently. Here is another one, which provides some more information. This one had non-concurrent read/write.
*** SIGSEGV received at time=1631229295 on cpu 0 ***
PC: @ 0x7fcd62ab6a8f (unknown) deepmind::reverb::internal::UnpackChunkColumn()
@ 0x7fcdb2966980 1792 (unknown)
@ 0x7fcd62ab6e41 320 deepmind::reverb::internal::UnpackChunkColumnAndSlice()
@ 0x7fcd62ab7274 32 deepmind::reverb::internal::UnpackChunkColumnAndSlice()
@ 0x7fcd62a78165 880 deepmind::reverb::(anonymous namespace)::LocalSamplerWorker::FetchSamples()
@ 0x7fcd62a7014f 144 deepmind::reverb::Sampler::RunWorker()
@ 0x7fcdaf923039 (unknown) execute_native_thread_routine
@ 0x7fc66800f0a0 (unknown) (unknown)
@ 0x7fcd62abbca0 (unknown) (unknown)
@ 0x75058b4808ec8348 (unknown) (unknown)
Segmentation fault (core dumped)
@samarth-robo are you running with a -g2 compiled Reverb? Any chance of you running this in GDB and getting a full stack trace? I wonder if that would give more info. @tfboyd do we have debug-build pip packages available now?
Looks like you're trying to access something that's been freed. GDB may help us identify what object it is. My guess is that either
@ebrevdo here is a GDB session I had copy-pasted into a Google doc some time ago. It is not the exact same session mentioned in my last comment, but the errors are in the same
To answer your other question, I get Reverb from pip. I am not sure whether that one was compiled with -g2.
Checkpointing + daemontools/supervise is an effective workaround for such crashes.
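If it helps, the Reverb side of that workaround can look roughly like this (a sketch based on Reverb's checkpointing API; the path, table configuration, port, and names are placeholders):

```python
import reverb

# Checkpoints written here survive a crash; a restarted server pointed at
# the same path should reload the most recent checkpoint.
checkpointer = reverb.checkpointers.DefaultCheckpointer(
    path='/tmp/reverb_checkpoints')  # placeholder path

server = reverb.Server(
    tables=[
        reverb.Table(
            name='uniform_table',  # placeholder
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            max_size=100_000,
            rate_limiter=reverb.rate_limiters.MinSize(1),
        ),
    ],
    port=8008,  # placeholder
    checkpointer=checkpointer,
)

# Force a checkpoint periodically (or before risky phases) from a client;
# supervise/daemontools then simply restarts the process after a segfault.
client = reverb.Client('localhost:8008')
client.checkpoint()
```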
Hello,
I am using Reverb 0.4.0 with tf-agents 0.9.0 through ReverbReplayBuffer and ReverbAddTrajectoryObserver. Ray actors push experience to the Reverb server. Currently, though, it is configured to have just one actor, and experience pushing is completed in a blocking manner before the agent is trained in the main loop. I am seeing segmentation faults at random times, always within the main process that samples from Reverb to train the agent.
I was wondering if you have some hints about why this might be happening, or where I can start debugging?
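For context, the actor-side writing looks roughly like this (simplified sketch; the server address, table name, and sequence length are placeholders for my actual configuration):

```python
import reverb
from tf_agents.replay_buffers import reverb_utils

# Inside the Ray actor: a Reverb client plus an observer that writes each
# collected trajectory into the replay table.
py_client = reverb.Client('localhost:8008')  # placeholder address
observer = reverb_utils.ReverbAddTrajectoryObserver(
    py_client,
    table_name='uniform_table',  # placeholder
    sequence_length=2,           # placeholder
    stride_length=1,
)

# The observer is then passed to the collection driver, e.g.
#   driver = py_driver.PyDriver(env, collect_policy, observers=[observer], ...)
# and pushing completes (blocking) before the main process samples to train.
```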