TypeError (str/bytes) in warc.py error path #23

Open
bnewbold opened this issue Sep 4, 2018 · 0 comments

In production at IA, probably because of petabox downtime or a network error, I got the following exception and stack trace:

TypeError: sequence item 0: expected str instance, bytes found
  File "extraction_ungrobided.py", line 272, in <module>
    MRExtractUnGrobided.run()
  File "mrjob/job.py", line 424, in run
    mr_job.execute()
  File "mrjob/job.py", line 433, in execute
    self.run_mapper(self.options.step_num)
  File "mrjob/job.py", line 517, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "extraction_ungrobided.py", line 228, in mapper
    info, status = self.extract(info)
  File "extraction_ungrobided.py", line 143, in extract
    info['file:cdx']['c_size'])
  File "extraction_ungrobided.py", line 126, in fetch_warc_content
    gwb_record = rstore.load_resource(warc_uri, offset, c_size)
  File "wayback/resourcestore.py", line 65, in load_resource
    return create_resource(loader.load_block(bstart, blen))
  File "wayback/resource.py", line 583, in create_resource
    record, errors, offset = parser.parse(rs, 0, line)
  File "hanzo/warctools/warc.py", line 223, in parse
    % (",".join(self.KNOWN_VERSIONS)),

self.KNOWN_VERSIONS holds bytes values (see https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/warc.py#L177), but the error path joins them with a str separator (","), which raises the TypeError above on Python 3.
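The failure can be reproduced outside warctools with a few lines. This is only a sketch: the `KNOWN_VERSIONS` tuple and the `format_versions_buggy` helper are hypothetical stand-ins for the class attribute and the join in the error path, and the version values are assumed for illustration rather than copied from warc.py.

```python
# Hypothetical stand-in for self.KNOWN_VERSIONS; the values are assumed
# for illustration and are not copied from warc.py.
KNOWN_VERSIONS = (b"1.0", b"0.17", b"0.18")


def format_versions_buggy(versions):
    # Mirrors the shape of the failing expression: a str separator
    # joining bytes items.
    return ",".join(versions)


try:
    format_versions_buggy(KNOWN_VERSIONS)
    error_message = None
except TypeError as exc:
    # On Python 3 this is the same TypeError reported in the trace above.
    error_message = str(exc)

print(error_message)
```

On Python 2.7 the same join succeeds, because `bytes` is an alias for `str` there, which is presumably why the bug only surfaces in the Python 3 error path.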

One fix, though I'm not sure it would work in Python 2.7, would be:

(",".join([s.decode('utf-8') for s in self.KNOWN_VERSIONS]))

There's probably a more idiomatic way, but I can submit a patch for that.

While we're at it, might want to make it a join on ", ", not ","?
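A minimal sketch of a version-agnostic variant of that idea, using the ", " separator suggested here. The `format_versions` helper and the sample values are hypothetical and not part of warctools; the actual patch may look different.

```python
def format_versions(versions, sep=", "):
    """Join version tokens for an error message, accepting bytes or str items."""
    return sep.join(
        # Decode only bytes items. On Python 2.7, bytes is str, so plain
        # str items are decoded to unicode and the join still works.
        v.decode("utf-8") if isinstance(v, bytes) else v
        for v in versions
    )


# Sample values assumed for illustration only.
message = format_versions((b"1.0", b"0.17", b"0.18"))
print(message)
```

If the real attribute turns out to be a set, wrapping it in `sorted(...)` before joining would keep the error message stable between runs.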
