error parsing TSV: 'utf-8' codec can't decode byte 0xed in position 855: invalid continuation byte #109

Open
gfairchild opened this issue Apr 15, 2015 · 1 comment

@gfairchild
Collaborator

I have some really simple code to search the tweets for ngrams I've built (using Python 3.4):

#!/usr/bin/env python

import sys
import argparse
import json
import re

argparser = argparse.ArgumentParser(description='Pull relevant tweets out of a file')
argparser.add_argument('ngrams', type=argparse.FileType('r'), help='file containing keywords for which to search')
argparser.add_argument('tweets', type=argparse.FileType('r'), help='file containing tweets to use when searching')
args = argparser.parse_args()

# read ngrams, alter for regex search
with args.ngrams as ngrams_file:
    ngrams = json.loads(ngrams_file.read().lower())
ngrams = [r'\s+'.join(x) for x in ngrams]

# now search through each tweet and print out matches
try:
    with args.tweets as tweets_file:
        for tweet in tweets_file:
            tweet = tweet.strip()
            tweet_parts = tweet.split('\t')
            tweet_text = tweet_parts[2].lower()

            for ngram in ngrams:
                if re.search(ngram, tweet_text): # I could use re.IGNORECASE here, but lower() is much faster
                    print(tweet + '\t' + json.dumps(ngram.split(r'\s+')))
                    break
except Exception as e:
    print('file %s raised an exception: %s' % (args.tweets, e), file=sys.stderr)

This works for the vast majority of files, but it fails on a few. Here's an example of the error I'm seeing:

file <_io.TextIOWrapper name='/data/tw/pre/2010-12-27.all.tsv' mode='r' encoding='UTF-8'> raised an exception: 'utf-8' codec can't decode byte 0xed in position 5004: invalid continuation byte

I've never seen this before. Looking at /data/tw/pre/2010-12-27.all.tsv, I don't see anything visually wrong with the file, and the Unix file command doesn't report a problem either:

> file /data/tw/pre/2010-12-27.all.tsv
/data/tw/pre/2010-12-27.all.tsv: UTF-8 Unicode text, with very long lines
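
A clean report from file may not be conclusive: as far as I know it inspects only a leading portion of the file by default, so a stray byte further in could go unnoticed. The position in the Python error may also be relative to the chunk being decoded rather than to the start of the file. A minimal sketch for pinning down the offending bytes, assuming the file can be read in binary mode and decoded one line at a time (find_invalid_utf8 is a hypothetical helper name, not part of the script above):

```python
def find_invalid_utf8(path):
    """Yield (line_number, byte_offset, bad_byte) for each line that is not valid UTF-8.

    Only the first undecodable byte on each line is reported.
    byte_offset is counted from the start of the file.
    """
    offset = 0  # number of bytes read before the current line
    with open(path, 'rb') as f:  # binary mode, so nothing is decoded implicitly
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as e:
                # e.start is the index of the bad byte within this line
                yield line_number, offset + e.start, raw[e.start]
            offset += len(raw)


if __name__ == '__main__':
    import sys
    for line_number, byte_offset, bad_byte in find_invalid_utf8(sys.argv[1]):
        print('line %d, byte offset %d: 0x%02x' % (line_number, byte_offset, bad_byte))
```

Running it with the failing file's path as the only argument lists the lines worth inspecting in a hex viewer.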

I'm not done exploring what might be wrong here, but I wanted to open this issue in case anyone else has seen this or might know what's going on.
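
In the meantime, one possible workaround is to keep the search running past undecodable bytes instead of aborting on the first one. The sketch below assumes it is acceptable for invalid bytes to show up as U+FFFD replacement characters in the tweet text; iter_tweet_lines is a hypothetical helper, and the script above would need to take a path rather than an already opened file to use it.

```python
def iter_tweet_lines(path):
    """Yield lines from a UTF-8 TSV file, tolerating invalid bytes.

    errors='replace' substitutes U+FFFD for each undecodable byte
    instead of raising UnicodeDecodeError.
    """
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            yield line.rstrip('\n')
```

This hides the problem rather than explaining it, so it is probably still worth finding out where the bad bytes come from.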

@ChristosChristofidis

fixed?
