error parsing TSV: 'utf-8' codec can't decode byte 0xed in position 855: invalid continuation byte #109

gfairchild · 2015-04-15T19:02:41Z

I have some really simple code to search the tweets for ngrams I've built (using Python 3.4):

#!/usr/bin/env python

import sys
import argparse
import json
import re

argparser = argparse.ArgumentParser(description='Pull relevant tweets out of a file')
argparser.add_argument('ngrams', type=argparse.FileType('r'), help='file containing keywords for which to search')
argparser.add_argument('tweets', type=argparse.FileType('r'), help='file containing tweets to use when searching')
args = argparser.parse_args()

# read ngrams, alter for regex search
with args.ngrams as ngrams_file:
    ngrams = json.loads(ngrams_file.read().lower())
ngrams = [r'\s+'.join(x) for x in ngrams]

# now search through each tweet and print out matches
try:
    with args.tweets as tweets_file:
        for tweet in tweets_file:
            tweet = tweet.strip()
            tweet_parts = tweet.split('\t')
            tweet_text = tweet_parts[2].lower()

            for ngram in ngrams:
                if re.search(ngram, tweet_text): # I could use re.IGNORECASE here, but lower() is much faster
                    print(tweet + '\t' + json.dumps(ngram.split(r'\s+')))
                    break
except Exception as e:
    print('file %s raised an exception: %s' % (args.tweets, e), file=sys.stderr)

This works for the vast majority of files, but it fails for certain files. Here's an example error I'm seeing:

file <_io.TextIOWrapper name='/data/tw/pre/2010-12-27.all.tsv' mode='r' encoding='UTF-8'> raised an exception: 'utf-8' codec can't decode byte 0xed in position 5004: invalid continuation byte

I've never seen this before. Looking at /data/tw/pre/2010-12-27.all.tsv, I don't see anything visually wrong with the file, and the Unix file command doesn't report a problem either:

> file /data/tw/pre/2010-12-27.all.tsv
/data/tw/pre/2010-12-27.all.tsv: UTF-8 Unicode text, with very long lines

I'm not done exploring what might be wrong here, but I wanted to open this issue in case anyone else has seen this or might know what's going on.
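One way to narrow this down is to read the file in binary mode and try to decode each line separately, so the offending bytes can be inspected directly. The following is only a sketch: the helper name find_undecodable_lines and the 20-byte context window are illustrative assumptions, not part of the script above, and it reports only the first undecodable byte on each line.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Yield (line_number, offset, bad_byte, context) for each line that fails to decode.

    Hypothetical helper for diagnosis only. `offset` is the byte offset of the
    first undecodable byte within its line, not within the whole file.
    """
    # Open in binary mode so that no decoding happens until we ask for it.
    with open(path, "rb") as raw:
        for line_number, raw_line in enumerate(raw, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as err:
                # Keep a few bytes on either side of the failure for inspection.
                context = raw_line[max(err.start - 20, 0):err.end + 20]
                yield line_number, err.start, raw_line[err.start], context
```

It could be used along the lines of `for n, off, b, ctx in find_undecodable_lines('/data/tw/pre/2010-12-27.all.tsv'): print(n, off, hex(b), ctx)`. If the bad bytes turn out to be something that can safely be dropped or substituted, the built-in open() accepts an errors argument (for example errors='replace'), and argparse.FileType accepts encoding and errors arguments as of Python 3.4. Whether lossy decoding is acceptable for this data is a separate question.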

ChristosChristofidis · 2015-08-03T11:10:35Z

fixed?
