UnicodeDecodeError with 1.3.1 and python2.7 #816

Closed
bf opened this issue Jun 1, 2014 · 9 comments · May be fixed by #988

bf commented Jun 1, 2014

When running nosetests on failing tests whose output contains non-ASCII characters, I get the following error:

  File "/usr/bin/nosetests-2.7", line 9, in <module>
    load_entry_point('nose==1.3.1', 'console_scripts', 'nosetests-2.7')()
  File "/usr/lib/python2.7/site-packages/nose/core.py", line 121, in __init__
    **extra_args)
  File "/usr/lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "/usr/lib/python2.7/site-packages/nose/core.py", line 207, in runTests
    result = self.testRunner.run(self.test)
  File "/usr/lib/python2.7/site-packages/nose/core.py", line 62, in run
    test(result)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 176, in __call__
    return self.run(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 223, in run
    test(orig)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 176, in __call__
    return self.run(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/suite.py", line 223, in run
    test(orig)
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 45, in __call__
    return self.run(*arg, **kwarg)
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 138, in run
    result.addError(self, err)
  File "/usr/lib/python2.7/site-packages/nose/proxy.py", line 128, in addError
    formatted = plugins.formatError(self.test, err)
  File "/usr/lib/python2.7/site-packages/nose/plugins/manager.py", line 99, in __call__
    return self.call(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/plugins/manager.py", line 141, in chain
    result = meth(*arg, **kw)
  File "/usr/lib/python2.7/site-packages/nose/plugins/capture.py", line 74, in formatError
    test.capturedOutput = output = self.buffer
  File "/usr/lib/python2.7/site-packages/nose/plugins/capture.py", line 112, in _get_buffer
    return self._buf.getvalue()
  File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 308: ordinal not in range(128)
@jszakmeister
Contributor

Does this still exist in 1.3.3? Do you have a small test case that can reproduce this?

@haavikko

I've experienced the same issue. Here's a minimal test case that triggers the problem for me.

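# -*- coding: utf-8 -*-
from unittest import TestCase

class NoseEncodingTestCase(TestCase):
    def test_crash_nose(self):
        print u'äää'   # unicode object
        print '\xe4'   # non-ASCII 8-bit str
        self.fail()    # the failure makes nose collect the captured output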

So the problem is caused by mixing unicode and 8-bit str output in a test that fails.

StringIO code contains this comment:
The StringIO object can accept either Unicode or 8-bit strings,
but mixing the two may take some care. If both are used, 8-bit
strings that cannot be interpreted as 7-bit ASCII (that use the
8th bit) will cause a UnicodeError to be raised when getvalue()
is called.
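
To make the failure mode concrete, here is a small sketch that models what Python 2's `''.join()` does inside `getvalue()` when the buffer list holds both kinds of string. It is written so that it also runs on Python 3. The helper name `join_like_py2` is made up for illustration; it is a model of the behaviour, not code taken from StringIO or nose.

```python
TEXT = type(u'')  # unicode on Python 2, str on Python 3


def join_like_py2(chunks):
    """Model of Python 2's ''.join() over a list of str (bytes) and unicode chunks."""
    if not any(isinstance(c, TEXT) for c in chunks):
        # Only 8-bit strings: bytes are concatenated untouched, no decoding happens.
        return b''.join(chunks)
    # At least one unicode chunk: every 8-bit chunk is implicitly decoded as ASCII,
    # which raises UnicodeDecodeError for any byte >= 0x80.
    return u''.join(c if isinstance(c, TEXT) else c.decode('ascii') for c in chunks)


try:
    join_like_py2([u'\xe4\xe4\xe4\n', b'\xe4\n'])  # the two prints from the test case
    failed = False
except UnicodeDecodeError:
    failed = True
```

In this model a list of only 8-bit strings, or a unicode string mixed with pure-ASCII 8-bit strings, joins without error; only the combination of a unicode chunk and a non-ASCII 8-bit chunk fails.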

Versions: nose==1.3.4 and django-nose==1.2

@jszakmeister
Contributor

What do you propose the solution to be? What should be captured? How should it be coerced? Yes, mixing non-unicode and unicode is a problem, but what do you think the correct behavior should be and why?

@jszakmeister
Contributor

BTW, thanks for the test case!

@haavikko

I guess that a test case (especially a failing one) might write anything to stdout. Often such output comes from inside 3rd party code, so nose should be able to accept any mix of unicode and 8-bit str output.

I'm not very knowledgeable on this issue, but one (maybe totally unworkable) idea:

  • Implement a subclass of io.TextIOBase that uses io.BytesIO to store the data. Use this to wrap stdout.
  • Make the TextIOBase subclass accept both str and unicode instances. Unicode is encoded with sys.stdout.encoding before being added to the BytesIO.
  • When the contents need to be displayed, try to decode the contents of the buffer as sys.stdout.encoding, but use errors=replace so that decoding doesn't choke on invalid byte sequences.
  • Maybe have a way for the end user to configure the encoding/decoding method used.

The io library was added in Python 2.6. If older versions need to be supported, the current StringIO solution can be preserved for them.
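
The bullet points above could be sketched roughly as follows. This is only an illustration of the idea, not nose code: the class name `MixedCapture` and its constructor parameters are hypothetical, and the fallback to UTF-8 when `sys.stdout.encoding` is unavailable is an assumption of this sketch.

```python
import io
import sys


class MixedCapture(io.TextIOBase):
    """Hypothetical capture buffer: stores bytes, accepts both bytes and text."""

    def __init__(self, encoding=None, errors='replace'):
        super(MixedCapture, self).__init__()
        self._bytes = io.BytesIO()
        # Assumption: fall back to UTF-8 if stdout reports no encoding.
        self._encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
        self._errors = errors

    def writable(self):
        return True

    def write(self, data):
        if isinstance(data, bytes):
            self._bytes.write(data)  # 8-bit data is stored as-is
        else:
            self._bytes.write(data.encode(self._encoding, self._errors))
        return len(data)

    def getvalue(self):
        # errors='replace' substitutes U+FFFD for byte sequences that do not decode.
        return self._bytes.getvalue().decode(self._encoding, self._errors)


buf = MixedCapture(encoding='utf-8')
buf.write(u'\xe4\xe4\xe4\n')  # unicode, like print u'äää'
buf.write(b'\xe4\n')          # non-ASCII 8-bit data, like print '\xe4'
value = buf.getvalue()
```

With UTF-8 as the configured encoding, the lone `0xe4` byte does not decode and comes back as the replacement character, while the unicode text round-trips intact.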

bf closed this as completed Mar 17, 2015

@jszakmeister
Contributor

None of what you proposed works with Python 2.5 or 2.4. And I'm not real interested in maintaining yet another place where everything differs. Is there a solution that works across the board?

@haavikko

The problem could be solved by implementing a subclass of StringIO that overrides just getvalue().
The idea is to preserve the current functionality, except that when an encoding error is detected, the contents of the buffer are forced into ASCII.

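Something like this (Python 2 only):

from StringIO import StringIO

class SafeStringIO(StringIO):  # hypothetical name
    def getvalue(self):
        try:
            # StringIO is an old-style class on Python 2, so super() cannot be used.
            return StringIO.getvalue(self)
        except UnicodeDecodeError:
            # Rebuild the value chunk by chunk, replacing anything non-ASCII with '?'.
            def to_ascii(chunk):
                if isinstance(chunk, unicode):
                    return chunk.encode('ascii', 'replace')
                return chunk.decode('ascii', 'replace').encode('ascii', 'replace')
            return ''.join([to_ascii(c) for c in [self.buf] + self.buflist])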

This is not a nice and general solution (as using io.TextIOBase would be), but it should be workable in all Python versions. Note: I've only checked the Python 2.7 StringIO code; if the internal implementation of StringIO is very different in earlier Python versions, this will fail. I also don't know about Python 3; possibly this bug does not even occur there.

@jszakmeister
Contributor

Doesn't this approach corrupt the output?

@haavikko

Yes, it does. If you have a list of buffers in who-knows-what encoding and you combine them into one string, that's what happens (but output containing some question marks is preferable to tests not running at all).

But there may be another option. The problem is caused by forcing a list of strings to use the same encoding, so don't do that. Skip calling getvalue() altogether and implement another method that goes through the internal StringIO buffer list and prints each buffer in turn. There is no need to force any encoding on them. Even then, depending on the value of sys.stdout.encoding, the output will sometimes be corrupted anyway, but I don't see a way around that.
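
A rough sketch of that second option, in a form that also runs on Python 3: the chunks are replayed one at a time to a byte stream instead of being joined. The function name `replay_chunks` and its parameters are hypothetical, and a plain list stands in for StringIO's internal buffer list.

```python
import io


def replay_chunks(chunks, raw_stream, encoding='utf-8'):
    """Write each captured chunk on its own instead of joining them first."""
    for chunk in chunks:
        if isinstance(chunk, bytes):
            raw_stream.write(chunk)  # 8-bit data passes through unchanged
        else:
            # errors='replace' keeps an unencodable character from aborting the replay.
            raw_stream.write(chunk.encode(encoding, 'replace'))


out = io.BytesIO()
replay_chunks([u'\xe4\xe4\xe4\n', b'\xe4\n'], out, encoding='utf-8')
```

Nothing is ever decoded here, so no UnicodeDecodeError can occur. Whether the raw bytes display correctly still depends on the terminal's encoding, as noted above.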

jmoldow added a commit to jmoldow/nose that referenced this issue Mar 23, 2016
On Python 2, `sys.stdout` and `print` can normally handle any
combination of `str` and `unicode` objects. However,
`StringIO.StringIO` can only safely handle one or the other. If
the program writes both a non-ASCII `unicode` string, and a
non-ASCII `str` string, then the `getvalue()` method will fail
with `UnicodeDecodeError` [1].

In nose, that causes the script to suddenly abort, with the
cryptic `UnicodeDecodeError`.

This fix catches `UnicodeError` when trying to get the captured
output, and will replace the captured output with a warning
message.

Fixes nose-devs#816

[1] <https://github.com/python/cpython/blob/2.7/Lib/StringIO.py#L258>
jmoldow added a commit to jmoldow/nose that referenced this issue Mar 23, 2016
On Python 2, `sys.stdout` and `print` can normally handle any
combination of `str` and `unicode` objects. However,
`StringIO.StringIO` can only safely handle one or the other. If
the program writes both a `unicode` string, and a non-ASCII
`str` string, then the `getvalue()` method will fail with
`UnicodeDecodeError` [1].

In nose, that causes the script to suddenly abort, with the
cryptic `UnicodeDecodeError`.

This fix catches `UnicodeError` when trying to get the captured
output, and will replace the captured output with a warning
message.

Fixes nose-devs#816

[1] <https://github.com/python/cpython/blob/2.7/Lib/StringIO.py#L258>
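
The approach described in the commit message can be illustrated with a short sketch. This is not the actual patch: the function names and the wording of the warning are hypothetical, and `broken_getvalue` merely stands in for a `getvalue()` call that fails on mixed content.

```python
def captured_or_warning(get_value):
    """Return get_value(), or a warning string if it raises UnicodeError."""
    try:
        return get_value()
    except UnicodeError:  # UnicodeDecodeError is a subclass of UnicodeError
        return u'[captured output unavailable: mixed unicode and non-ASCII str output]'


def broken_getvalue():
    # Stand-in for StringIO.getvalue() failing on mixed content.
    return b'\xc3'.decode('ascii')


result = captured_or_warning(broken_getvalue)
```

The test run continues and the failure report shows the warning in place of the captured output, so the output itself is lost but the run no longer aborts.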
Sdrkun pushed a commit to openEuler-BaseService/nose that referenced this issue Dec 10, 2020