No source key being generated 1.1.0-rc2 #4909

Closed · kushaldas opened this issue Oct 14, 2019 · 25 comments

@kushaldas (Contributor) commented Oct 14, 2019

Description

Encryption keys are not being generated for any source.

Upgraded from the 1.0.0 package to 1.1.0-rc2.

Steps to Reproduce

  • Submit a message as a new source.
  • Log in as a journalist and flag the new source so that a new key is generated.
  • Log back in as the source.
  • Check the source's submission as the journalist: still no key has been generated, and there is no way to reply to the source.

Expected Behavior

The journalist should be able to reply to any new source.

Actual Behavior

There is no option to reply to the source, and no key has been generated for the source.

@emkll (Contributor) commented Oct 15, 2019

I could not reproduce the error described above with a cron-apt upgrade from 1.0.0 to 1.1.0 in VMs, following the STR provided above. When I submitted a file through the source interface, I did not get the "flag for reply" flow.

The upgrade works as expected: the journalist can reply to the source, and I can see the source's private keys in /var/lib/securedrop/keys in 1.1.0~rc2.

Are your VMs still available? If so, could you dig up the application server logs?

@redshiftzero (Contributor)

Two other questions for @kushaldas regarding this scenario:

  1. was the source created very shortly after the system booted?
  2. if you leave the VM up for ~10 minutes and create a new source, is this reproducible (i.e. do you still get the flagged for reply flow)?

@kushaldas (Contributor, Author)

  1. was the source created very shortly after the system booted?

No, I created more sources to test.

  2. if you leave the VM up for ~10 minutes and create a new source, is this reproducible (i.e. do you still get the flagged for reply flow)?

Yes. I am wondering if I should remove that VM in the morning and try again via the upgrade path; I have a snapshot of the 1.0.0 release state.

@redshiftzero (Contributor)

Adding source error logging (to the source Apache config) so that we can inspect what is happening via this log line would be a good next step. Don't delete the VM, as we might not be able to understand this situation if it is a rare event that is hard to reproduce.
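
As an illustrative aside (not the Apache-level change described above): one way to capture errors from the source application itself would be to attach a file handler to the Flask app's logger. Everything below, including the log path and the helper name, is an assumption for the sketch, not SecureDrop code.

# Sketch only: attach an ERROR-level file handler to the source app's logger.
# The log path and function name are illustrative assumptions.
import logging
from logging.handlers import RotatingFileHandler

def attach_error_logging(app, logfile="/var/log/securedrop/source_error.log"):
    handler = RotatingFileHandler(logfile, maxBytes=1000000, backupCount=3)
    handler.setLevel(logging.ERROR)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    app.logger.addHandler(handler)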

@kushaldas (Contributor, Author)

While submitting as a source, I can see the following log. I will dig in further.

[Wed Oct 16 14:27:09.435324 2019] [wsgi:info] [pid 4096:tid 3866797000448] mod_wsgi (pid=4096): Adding '/var/www/securedrop' to path.
[Wed Oct 16 14:27:09.443408 2019] [wsgi:info] [pid 4096:tid 3866797000448] [remote 127.0.0.1:59040] mod_wsgi (pid=4096, process='source', application='localhost|'): Loading Python script file '/var/www/source.wsgi'.
[Wed Oct 16 14:27:35.622421 2019] [wsgi:error] [pid 4096:tid 3866371401472] Exception in thread Thread-11:
[Wed Oct 16 14:27:35.622955 2019] [wsgi:error] [pid 4096:tid 3866371401472] Traceback (most recent call last):
[Wed Oct 16 14:27:35.623271 2019] [wsgi:error] [pid 4096:tid 3866371401472]   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
[Wed Oct 16 14:27:35.623528 2019] [wsgi:error] [pid 4096:tid 3866371401472]     self.run()
[Wed Oct 16 14:27:35.623818 2019] [wsgi:error] [pid 4096:tid 3866371401472]   File "/usr/lib/python3.5/threading.py", line 862, in run
[Wed Oct 16 14:27:35.624163 2019] [wsgi:error] [pid 4096:tid 3866371401472]     self._target(*self._args, **self._kwargs)
[Wed Oct 16 14:27:35.624554 2019] [wsgi:error] [pid 4096:tid 3866371401472]   File "/opt/venvs/securedrop-app-code/lib/python3.5/site-packages/pretty_bad_protocol/_meta.py", line 670, in _read_response
[Wed Oct 16 14:27:35.624644 2019] [wsgi:error] [pid 4096:tid 3866371401472]     result._handle_status(keyword, value)
[Wed Oct 16 14:27:35.625178 2019] [wsgi:error] [pid 4096:tid 3866371401472]   File "/opt/venvs/securedrop-app-code/lib/python3.5/site-packages/pretty_bad_protocol/_parsers.py", line 995, in _handle_status
[Wed Oct 16 14:27:35.625267 2019] [wsgi:error] [pid 4096:tid 3866371401472]     raise ValueError("Unknown status message: %r" % key)
[Wed Oct 16 14:27:35.625327 2019] [wsgi:error] [pid 4096:tid 3866371401472] ValueError: Unknown status message: 'ERROR'
[Wed Oct 16 14:27:35.625408 2019] [wsgi:error] [pid 4096:tid 3866371401472] 


@redshiftzero (Contributor)

Interesting, that is coming from the handler for the GenKey object in python-gnupg; it looks like this is upstream issue isislovecruft/python-gnupg#265.

@redshiftzero (Contributor)

Can you confirm the version of gnupg being used here?

@redshiftzero (Contributor) commented Oct 16, 2019

Another debugging step would be to confirm by inspecting the gpg keyring directly: is the source key actually not being created?

You could try applying this patch, https://github.com/isislovecruft/python-gnupg/pull/264/files, and see if it resolves the issue (it handles the 'ERROR' key so a ValueError doesn't get raised).
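
For context, here is a rough sketch of the kind of change the linked patch makes, assuming pretty_bad_protocol's GenKey parser behaves as the traceback above suggests; this is not the actual upstream diff.

# Sketch only: tolerate the 'ERROR' status keyword instead of raising
# ValueError from the reader thread. Class and module names follow the
# traceback above; the real patch may differ.
import logging
from pretty_bad_protocol._parsers import GenKey

log = logging.getLogger(__name__)

class TolerantGenKey(GenKey):
    def _handle_status(self, key, value):
        if key == "ERROR":
            log.warning("gpg reported an error during key generation: %s", value)
            return
        super()._handle_status(key, value)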

@kushaldas (Contributor, Author)

Even with that upstream patch, I am still getting the ValueError.

gnupg                             1.4.20-1ubuntu3.3

@redshiftzero (Contributor)

Hm, since crypto_util.py should be using the gpg2 binary on your system, I would expect to see gnupg 2.1 or greater. Since we're no longer testing against gnupg 1.x (as of removing the trusty test job), it's entirely possible that errors will occur with gnupg 1.x.
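
A quick, illustrative way to confirm which binary and version the wrapper actually resolves on the app server is sketched below; the homedir path matches the command shown later in this thread, and the constructor arguments are an assumption about how crypto_util.py sets things up, not a quote of it.

# Sketch only: check which gpg executable pretty_bad_protocol resolves and
# what version it reports. Run inside the app's virtualenv on the app server.
import subprocess
from pretty_bad_protocol import gnupg

gpg = gnupg.GPG(binary="gpg2", homedir="/var/lib/securedrop/keys")
print("binary:", gpg.binary)
print(subprocess.check_output([gpg.binary, "--version"]).decode().splitlines()[0])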

@kushaldas (Contributor, Author)

Oh yes, gpg2 is also present on the system.

gnupg2                            2.1.11-6ubuntu2.1

@kushaldas (Contributor, Author)

Finally, some error log output. The following command is causing the error while generating the source keys.

['/usr/bin/gpg2', '--no-options', '--no-emit-version', '--no-tty', '--status-fd', '2', '--homedir', '/var/lib/securedrop/keys', '--no-default-keyring', '--keyring', '/var/lib/securedrop/keys/pubring.gpg', '--secret-keyring', '/var/lib/securedrop/keys/secring.gpg', '--pinentry-mode', 'loopback', '--gen-key', '--batch', '--cert-digest-algo', 'SHA512']
gpg: invalid size of lockfile '/var/lib/securedrop/keys/gnupg_spawn_agent_sentinel.lock'
gpg: cannot read lockfile
gpg: can't connect to the agent: Invalid argument
gpg: agent_genkey failed: No agent running
gpg: key generation failed: No agent running
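
A small, illustrative check of the sentinel file named in that error, to see why gpg considers its size invalid; the path comes from the error output above, the rest is just a sketch.

# Sketch only: print the size and raw contents of the sentinel lockfile
# referenced by the "invalid size of lockfile" error above.
import os

lock = "/var/lib/securedrop/keys/gnupg_spawn_agent_sentinel.lock"
if os.path.exists(lock):
    print("size:", os.path.getsize(lock), "bytes")
    with open(lock, "rb") as f:
        print("contents:", f.read())
else:
    print("lockfile not present")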

@redshiftzero (Contributor)

Interesting... do you get this same error for other gpg actions? (I assume not, and that it's just key generation.) What happens if you delete the lockfile?

@kushaldas (Contributor, Author)

interesting.. do you get this same error for other gpg actions?

Only for that one key generation command. I will test in the morning (after the VM restore) what happens when I remove the lock file.

@redshiftzero (Contributor)

I tried making a lockfile with an invalid size (from the gnupg 2.1 source code, it looks like a valid lockfile is 11 bytes) and killing gpg-agent, but this did not reproduce the issue: I was still able to generate keys.

@kushaldas (Contributor, Author)

Moved to rc4 from the 1.0.0 snapshot; the error still persists. But if I remove the lock file, the problem is solved.
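
A possible stopgap along the lines of the workaround described above (an assumption-laden sketch, not a fix the project adopted) would be to remove the sentinel lockfile before key generation whenever its size looks invalid.

# Sketch only: clear a stale or invalid agent-sentinel lockfile before key
# generation. The 11-byte "valid" size is taken from the observation earlier
# in this thread; the helper name is illustrative.
import os

SENTINEL = "/var/lib/securedrop/keys/gnupg_spawn_agent_sentinel.lock"
VALID_DOTLOCK_SIZE = 11

def clear_stale_sentinel(path=SENTINEL):
    try:
        if os.path.getsize(path) != VALID_DOTLOCK_SIZE:
            os.remove(path)
            return True
    except FileNotFoundError:
        pass
    return False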

@redshiftzero (Contributor)

OK, this must be an upstream gnupg bug (though I didn't see anything currently filed), specifically in assuan, which is where the logic handling this lock file lives. Maybe it is an error due to interrupted gnupg initialization at some point, e.g. what happens if the writing of the lock file is interrupted?

I'm going to close this for now; we can reopen if we see it again, and file upstream if/when we have more info. If any SecureDrop users see this issue, let us know (as a temporary stopgap, you can try deleting the lock).
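
One way to test the interrupted-write hypothesis above in a staging VM, purely as a sketch (the partial contents written are an arbitrary assumption), would be to leave a truncated sentinel behind and then attempt a source key generation.

# Sketch only: simulate a writer that died mid-write by leaving a truncated
# sentinel file, then attempt key generation and check whether gpg reports
# the same "invalid size of lockfile" error.
SENTINEL = "/var/lib/securedrop/keys/gnupg_spawn_agent_sentinel.lock"

with open(SENTINEL, "wb") as f:
    f.write(b"409")  # partial PID, as if the write was interrupted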

@redshiftzero (Contributor)

I've now seen this for myself in a staging environment while testing something unrelated, so I'm reopening.

@eloquence (Member)

Bumping priority until we have more clarity on whether this can arise in prod or not.

@eloquence (Member)

Next steps:

  • Review gnupg lists/tracker for issues that could help identify root cause
  • Look at the gpg source code to determine when the lock is being generated (timeboxed)

@zenmonkeykstop will investigate today

@redshiftzero (Contributor)

BTW, yesterday I scanned through the bug tracker and the only ticket I found was https://dev.gnupg.org/T4146, which doesn't seem directly related.

@zenmonkeykstop (Contributor) commented Nov 5, 2019

Haven't had any success in reproducing this, or in finding similar cases in the bug tracker. The relevant code (checking for the lock, not creating it per se) is here. There is a FIXME note indicating that the dotlock_take() function needs a timeout added.
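
For context on the 11-byte size check mentioned earlier: judging from that observation and gnupg's dotlock behaviour, a well-formed dotlock file appears to hold the owner PID right-aligned in a 10-character field plus a newline. The check below only illustrates that assumption; it is not gnupg's actual code.

# Sketch only: what an "11-byte valid" dotlock file plausibly looks like.
def looks_like_valid_dotlock(data):
    if len(data) != 11 or not data.endswith(b"\n"):
        return False
    pid_field = data[:-1].decode("ascii", "replace").strip()
    return pid_field.isdigit()

# e.g. b"      4096\n" would pass; an empty or truncated file would not.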

@eloquence (Member)

For the 11/6-11/20 sprint, we've agreed that @zenmonkeykstop will continue to attempt to get a clean repro (time-boxed to 3-4 hours), and @redshiftzero will help with upstream coordination as needed.

@eloquence (Member)

We've not done a formal investigation of this since, but we'll consider it as part of QA for 1.2.0. Keeping this on the board for visibility for now.

@eloquence (Member)

This was part of the test plan for 1.2.0, but we've not been able to reproduce the issue. Closing for now.
