I have some really simple code to search the tweets for ngrams I've built (using Python 3.4):
#!/usr/bin/env python

import sys
import argparse
import json
import re

argparser = argparse.ArgumentParser(description='Pull relevant tweets out of a file')
argparser.add_argument('ngrams', type=argparse.FileType('r'), help='file containing keywords for which to search')
argparser.add_argument('tweets', type=argparse.FileType('r'), help='file containing tweets to use when searching')
args = argparser.parse_args()

# read ngrams, alter for regex search
with args.ngrams as ngrams_file:
    ngrams = json.loads(ngrams_file.read().lower())
    ngrams = [r'\s+'.join(x) for x in ngrams]

# now search through each tweet and print out matches
try:
    with args.tweets as tweets_file:
        for tweet in tweets_file:
            tweet = tweet.strip()
            tweet_parts = tweet.split('\t')
            tweet_text = tweet_parts[2].lower()
            for ngram in ngrams:
                if re.search(ngram, tweet_text):  # I could use re.IGNORECASE here, but lower() is much faster
                    print(tweet + '\t' + json.dumps(ngram.split(r'\s+')))
                    break
except Exception as e:
    print('file %s raised an exception: %s' % (args.tweets, e), file=sys.stderr)
This works for the vast majority of files, but it fails for certain files. Here's an example error I'm seeing:
file <_io.TextIOWrapper name='/data/tw/pre/2010-12-27.all.tsv' mode='r' encoding='UTF-8'> raised an exception: 'utf-8' codec can't decode byte 0xed in position 5004: invalid continuation byte
I've never seen this before. Looking at /data/tw/pre/2010-12-27.all.tsv, I don't see anything visually wrong with the file. Additionally, the Unix file command doesn't report a problem either:
> file /data/tw/pre/2010-12-27.all.tsv
/data/tw/pre/2010-12-27.all.tsv: UTF-8 Unicode text, with very long lines
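One way to narrow this down, as a rough sketch rather than a confirmed diagnosis: read the file in binary mode and try to decode each line separately, so the offending line and byte offset can be pinned down. The helper name `find_invalid_utf8` and the sample data are hypothetical. My working assumptions are that `file` may only inspect part of the file, and that the position in the error message may be relative to the chunk being decoded rather than to the whole file.

```python
def find_invalid_utf8(lines):
    """Yield (line_number, byte_offset_in_line, reason) for each line
    that cannot be decoded as UTF-8. `lines` is any iterable of bytes."""
    for line_number, raw in enumerate(lines, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as e:
            # e.start is the offset of the first bad byte within this line
            yield line_number, e.start, e.reason


# Hypothetical sample data standing in for the real file.
# With a real file, pass open(path, 'rb') instead of this list.
sample = [
    b'1\tuser\tgood tweet\n',
    b'2\tuser\tbad \xed tweet\n',  # 0xed starts a multi-byte sequence that never completes
]

bad_lines = list(find_invalid_utf8(sample))
for line_number, offset, reason in bad_lines:
    print('line %d, byte %d: %s' % (line_number, offset, reason))
```

Because a file opened with `'rb'` iterates line by line as bytes, the same function should work on the real file without loading it all into memory.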
I'm not done exploring what might be wrong here, but I wanted to open this issue in case anyone else has seen this or might know what's going on.
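In the meantime, a possible stopgap (not a fix for whatever corrupted the data) is to open the file with `errors='replace'`, so undecodable bytes become U+FFFD and the search can continue past them. This is only a sketch: `read_lines_tolerant` is a hypothetical helper, the temporary file stands in for the real one, and it assumes that losing the damaged characters is acceptable for ngram matching.

```python
import os
import tempfile


def read_lines_tolerant(path):
    """Yield text lines from path, replacing invalid UTF-8 bytes with U+FFFD."""
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            yield line.rstrip('\n')


# Write a small hypothetical file containing one invalid byte (0xed).
with tempfile.NamedTemporaryFile(delete=False, suffix='.tsv') as tmp:
    tmp.write(b'1\tuser\tbad \xed tweet\n')
    tmp_path = tmp.name

lines = list(read_lines_tolerant(tmp_path))
os.remove(tmp_path)

for line in lines:
    # The invalid byte shows up as the replacement character
    print(repr(line))
```

The trade-off is that bad bytes are silently rewritten, so it is probably worth counting the lines that contain `'\ufffd'` to see how much of the data is affected.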