Assignment 2 Issues/Questions #12

Open
neuralinfo opened this issue Jun 8, 2015 · 36 comments

@neuralinfo
Owner

Write your issues/questions about assignment 2 here

@nickhamlin

At office hours tonight a question came up that Luis asked us to share via this forum. Specifically, if we need to gather tweets occurring within a week, does that mean that A.) we should use the REST API and search within a given week, accepting that this will give us only a subset of all the possible tweets with our hashtags, OR B.) we need to use the streaming API to literally gather all the tweets that occur with either hashtag? The latter approach would imply that we need to set it up and let it run for the full week. Could you please clarify which paradigm we should focus on, since that will have major implications for the other design decisions? Thanks!

Answer: The assignment requires you to do B. However, as mentioned in class, you can also use the REST API, but you need to mention this in your submitted readme file.

@jamesgray007

I am able to get my Twitter query to return results manually using https://twitter.com/search and in the Twitter REST API testing console using:

https://api.twitter.com/1.1/search/tweets.json?q=%23Warriors%20since%3A2015-06-04%20until%3A2015-06-11

but it does not return any results using the Tweepy API so I was curious if Tweepy uses the same exact "q" or not. I also may have something incorrect in my code but it uses the base example from the Data Acquisition activity a few weeks ago.

q = "%23Warriors%20since%3A2015-06-04%20until%3A2015-06-11"

for tweet in tweepy.Cursor(api.search, q=q).items(200):
    # FYI: JSON is in tweet._json
    print tweet._json

Answer: Tweepy uses the same query format. However, as mentioned in class, you can't query data older than one week through the Search API. You always need to know the limitations of the data source you are querying. For more information about other limitations of the Search API, check this: https://dev.twitter.com/docs/using-search
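
Because of the one-week limit mentioned in the answer, it can help to check a requested start date before sending the query. A minimal sketch, assuming a 7-day window; `window_is_searchable` and `SEARCH_WINDOW_DAYS` are hypothetical names, not part of Tweepy:

```python
from datetime import date, timedelta

# Assumption: the Search API reaches back roughly 7 days, per the answer above.
SEARCH_WINDOW_DAYS = 7

def window_is_searchable(since, today=None, window_days=SEARCH_WINDOW_DAYS):
    """Return True if `since` (a date) is recent enough for the Search API."""
    today = today or date.today()
    oldest_allowed = today - timedelta(days=window_days)
    return oldest_allowed <= since <= today

# Example: on 2015-06-12, a query starting 2015-06-04 is already too old.
print(window_is_searchable(date(2015, 6, 4), today=date(2015, 6, 12)))  # False
print(window_is_searchable(date(2015, 6, 7), today=date(2015, 6, 12)))  # True
```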

@vincentchio

To produce the histogram of words, should hashtags (#abc) and handles (@abc) be taken into account?

Answer: Only hashtags (#abc).

@sarmehta88

Another question:
Should we collect tweets with only the hashtag #Warriors and NOT #NBAFinals2015 for the part where we collect tweets for each of the hashtags? I did not do it that way because the instructions did not say that when you search for one of the hashtags, the results should not include the other one... this is my assumption.

However, there may be repeat tweets for the part where we collect both hashtags together and either hashtag. ( I hope I am making sense when explaining this scenario?)

Answer: As mentioned in class, you need to create 3 output files: 1- tweets with only the hashtags #Warriors and NOT #NBAFinals2015, 2- tweets with only the hashtags #NBAFinals2015 and NOT #Warriors, and 3- tweets with both hashtags together.
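
One way to get the three result sets directly is to send three separate queries. A minimal sketch, assuming Twitter's standard search operators (a space means AND, a leading minus excludes a term); `build_queries` and the dictionary keys are hypothetical names chosen for illustration:

```python
# Assumption: Twitter's standard search operators, where a space means AND
# and a leading minus excludes a term. Verify against the search docs.
TAG_A = "#Warriors"
TAG_B = "#NBAFinals2015"

def build_queries(tag_a=TAG_A, tag_b=TAG_B):
    """Return one query string per required output file."""
    return {
        "only_a": "{0} -{1}".format(tag_a, tag_b),
        "only_b": "{0} -{1}".format(tag_b, tag_a),
        "both": "{0} {1}".format(tag_a, tag_b),
    }

for name, query in sorted(build_queries().items()):
    print("{0}: {1}".format(name, query))
```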

@vincentchio

@sarmehta88 you can use OR in the search term https://dev.twitter.com/rest/public/search

@bspooner

Just wondering: when we make a "histogram of words", what exactly are we graphing? Search term word counts by day?

Answer: Search term word counts across all the data you acquired.

@cu8blank

I'm not sure if my acquisition code is running properly. Approximately how many tweets should we be capturing for this assignment? 10K+? 100K+? more?

Answer: The number of tweets varies depending on factors such as the dates. Your code should gather tweets based on the assignment instructions, and you should not worry about the exact number.

@bspooner

When I do a 1 week pull for #NBAfinals2015 I get about 1900 tweets, but I think that's because it gives a random sample.

@vincentchio

I am getting more than 100K between the week Jun 7 to 14. If it is a game day, you are gonna have a tough time waiting for the program to pull it all because of the rate limit. 8th Jun alone has more than 80k.

@bspooner

So I'm getting this error. It looks like Tweepy reaches its rate limit and sleeps. But then, when Twitter fails to receive a request within a certain period, it forces the connection closed. Is there some way to keep the connection open or paused? Or maybe something else is going on.

@bspooner

Rate limit reached. Sleeping for: 866
Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
    debugger.run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Users\Benjamin\Documents\Assignments_Berkeley\W205\Assignment2\search.py", line 1, in <module>
    import sys
  File "C:\Python27\lib\site-packages\tweepy\cursor.py", line 197, in next
    self.current_page = self.page_iterator.next()
  File "C:\Python27\lib\site-packages\tweepy\cursor.py", line 108, in next
    data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kargs)
  File "C:\Python27\lib\site-packages\tweepy\binder.py", line 239, in _call
    return method.execute()
  File "C:\Python27\lib\site-packages\tweepy\binder.py", line 189, in execute
    raise TweepError('Failed to send request: %s' % e)
TweepError: Failed to send request: ('Connection aborted.', error(10054, 'An existing connection was forcibly closed by the remote host'))

Answer: A forcibly closed connection can have many causes. Your internet/network connection may be dropping (possibly frequently), or you may be sending malformed query data that Twitter rejects. It can also happen when Tweepy crashes due to a lack of system resources.

@JaimeVL

JaimeVL commented Jun 15, 2015

I'm having some issues with Tweepy when I try to use the max_id query parameter (an exception is thrown), which is suggested here: https://dev.twitter.com/rest/public/timelines. Also, I'm not getting any results back when I URL encode my query (using either urllib.quote_plus() or quote() function).

Let me first elaborate on the 2nd issue which is easier to explain. When I specify my query in plain text (1), I get results back. If I URL encode the query (2), I get zero results. Take a look:

query = "#NBAFinals2015 since:2015-06-11"

# (1) Returns 10 results
for tweet in tweepy.Cursor(api.search, q = query).items(10): count += 1
# (2) Returns 0 results
for tweet in tweepy.Cursor(api.search, q = urllib.quote_plus(query)).items(10): count += 1

For this reason I'm not doing any URL encoding on my queries, and I'm mostly getting results as expected, except when I use the max_id parameter. I only get up to 10 results when I use the max_id parameter, if I ask for more an exception gets thrown. Here's the code I use:

query = "#NBAFinals2015 since:2015-06-11 max_id:609510488829853696"

# (3) Returns 10 results
for tweet in tweepy.Cursor(api.search, q = query).items(10): count += 1
# (4) Throws exception...
for tweet in tweepy.Cursor(api.search, q = query).items(50): count += 1

And here's the exception I get:

Traceback (most recent call last):
  ....
  File "/home/jaime/Documents/Git/MIDS/W205/MIDS-W205-Assignment2/main.py", line 146, in testTweepy
    for tweet in tweepy.Cursor(api.search, q = query2).items(50): count += 1
  File "/usr/local/lib/python2.7/dist-packages/tweepy/cursor.py", line 197, in next
    self.current_page = self.page_iterator.next()
  File "/usr/local/lib/python2.7/dist-packages/tweepy/cursor.py", line 108, in next
    data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kargs)
  File "/usr/local/lib/python2.7/dist-packages/tweepy/binder.py", line 239, in _call
    return method.execute()
  File "/usr/local/lib/python2.7/dist-packages/tweepy/binder.py", line 223, in execute
    raise TweepError(error_msg, resp)
tweepy.error.TweepError: {"errors":[{"code":195,"message":"Missing or invalid url parameter."}]}
Process finished with exit code 1

I tried to URL encode the query, but as you can see above, that is not working for me regardless of whether I use max_id or not. I've already checked that I have the latest version of Python and Tweepy. Any ideas what could be going on? Is anyone else getting this?

P.S. I've changed my environment to Linux (Ubuntu) so this can't be a Windows issue!
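
A possible explanation for the zero results in case (2), offered as an assumption that nobody in this thread confirms: Tweepy's HTTP layer percent-encodes the parameters itself, so a query that is already encoded gets encoded a second time and Twitter ends up searching for the literal encoded text. The sketch below only shows what double encoding does to the string; it makes no request:

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2, as used in the thread

query = "#NBAFinals2015 since:2015-06-11"

encoded_once = quote_plus(query)
# What the server would see if an already-encoded string is encoded again.
encoded_twice = quote_plus(encoded_once)

print(encoded_once)
print(encoded_twice)
```

In the second string every `%` from the first pass has itself become `%25`, so the hashtag and the `since:` operator are no longer recognisable.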

@JaimeVL

JaimeVL commented Jun 15, 2015

I found the issue, which is that you shouldn't specify max_id in the query section (q), but rather separately, like this:

query = "#NBAFinals2015 since:2015-06-11"
for tweet in tweepy.Cursor(api.search, q = query, max_id='609510488829853696').items(1000): count += 1

The documentation doesn't even mention it, but I found it by debugging the Tweepy modules. :$
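
To page backwards through older results with this fix, the next `max_id` can be derived from the oldest tweet already received. A minimal sketch, assuming tweets are dicts shaped like `tweet._json` and that `max_id` is inclusive, as described in the timelines guide linked above; `next_max_id` and `search_kwargs` are hypothetical helper names:

```python
def next_max_id(tweets):
    """Return the max_id for the next (older) page, or None if no tweets.

    Assumes each tweet is a dict with an integer "id", as in tweet._json.
    Subtracting 1 avoids receiving the oldest tweet again, since max_id
    is inclusive.
    """
    if not tweets:
        return None
    return min(tweet["id"] for tweet in tweets) - 1

def search_kwargs(query, max_id=None):
    """Build keyword arguments for api.search, keeping max_id out of q."""
    kwargs = {"q": query}
    if max_id is not None:
        kwargs["max_id"] = max_id
    return kwargs

page = [{"id": 609510488829853696}, {"id": 609510488829853000}]
print(search_kwargs("#NBAFinals2015 since:2015-06-11", next_max_id(page)))
```

The resulting dict could then be unpacked into the search call, for example `api.search(**kwargs)`.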

@saffrydaffry

I'm getting an error 32 message: [{"code":32,"message":"Could not authenticate you."}]. The weird thing is, my credentials work fine with the hello-twitter.py test. The only major difference that I think might be related to this error is that we are using tweepy.Cursor wrapped around api.search in search.py.

Has anyone come across this issue?

Follow-up: Were you accessing the Twitter website through a web browser (or running other scripts with your credentials) at the same time?

Yeah I was able to use the hello-twitter.py. I added a bunch of extra stuff from the example files and now I'm going through and adding them bit by bit to see where the error might occur as per Luis' recommendation.

@bspooner

Note on my previous error: TweepError: Failed to send request: ('Connection aborted.', error(10054, 'An existing connection was forcibly closed by the remote host'))

This ended up being a rate limit problem. I solved it by catching rate limits with the tweepy.TweepError exception. Then I slept for 15 minutes and reconnected, starting after the last received tweet ID using "since_id=lasttweetid" in the Cursor method.

Answer: Typically, you should get [{'message': 'Rate limit exceeded', 'code': 88}] when you exceed your limit. Also, you do not need to wait manually, since Tweepy can do that for you. Simply set the wait_on_rate_limit argument to True as follows:

api = tweepy.API(auth_handler=auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)

Note: I was still getting the error even with the wait_on_rate_limit=True attribute. The problem wasn't resolved until I reconnected after the wait.
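
The wait-then-reconnect approach described above can be sketched as a small retry wrapper. This is a generic illustration, not Tweepy code: `call_with_retry` is a hypothetical helper, and with Tweepy you would presumably pass `retry_on=(tweepy.TweepError,)` and an `action` that rebuilds the API object before searching.

```python
import time

def call_with_retry(action, retries=3, wait_seconds=15 * 60,
                    sleep=time.sleep, retry_on=(Exception,)):
    """Call action(); on failure wait and try again, up to `retries` times.

    `action` should rebuild the connection (for example, create a fresh
    API object) before requesting data, so each attempt starts clean.
    """
    attempt = 0
    while True:
        try:
            return action()
        except retry_on:
            attempt += 1
            if attempt > retries:
                raise
            sleep(wait_seconds)

# Demonstration with a fake action that fails once, then succeeds.
calls = []

def flaky():
    calls.append(1)
    if len(calls) == 1:
        raise IOError("connection aborted")
    return "ok"

# The sleep function is replaced so the demonstration does not wait.
result = call_with_retry(flaky, sleep=lambda seconds: None)
print(result)
```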

@brk21

brk21 commented Jun 17, 2015

How do I get the while statement to end the loop when the current date is past the stop date? Code snippet:

track = [u'#Warriors', u'#NBAFinals']
current = datetime.datetime.now()
stopDate = datetime.datetime(2015, 6, 21, 22, 50, 00)
listen = SListener(api, 'Warriors')
stream = tweepy.Stream(auth, listen)

print "Streaming started..."

try:
    while current <= stopDate:
        stream.filter(track = track)
        current = datetime.datetime.now()

Followup: The loop should break once the current date is greater than the stopDate. Is it not breaking out of the loop?
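
One likely reason the loop never exits: `stream.filter` blocks until the stream disconnects, so the `while` condition is only re-checked after it returns. A minimal sketch of checking the deadline inside the listener instead. `Deadline` and `DeadlineListenerMixin` are hypothetical names, and the sketch assumes that returning False from `on_status` makes Tweepy disconnect the stream (believed to hold for Tweepy 3.x; verify for your version):

```python
import datetime

class Deadline(object):
    """Tracks whether a stop time has passed; the clock is injectable."""

    def __init__(self, stop_at, now=datetime.datetime.now):
        self.stop_at = stop_at
        self._now = now

    def expired(self):
        return self._now() > self.stop_at

class DeadlineListenerMixin(object):
    """Hypothetical mixin: combine with your own listener class.

    Assumption: returning False from on_status asks Tweepy to
    disconnect the stream.
    """

    deadline = None  # set to a Deadline instance before streaming

    def on_status(self, status):
        if self.deadline is not None and self.deadline.expired():
            return False  # ask the stream to stop
        self.handle_status(status)
        return True

    def handle_status(self, status):
        """Override to store the tweet; a no-op by default."""
        pass

# Demonstration with a fake clock set ten minutes after the stop date.
stop = datetime.datetime(2015, 6, 21, 22, 50, 0)
listener = DeadlineListenerMixin()
listener.deadline = Deadline(
    stop, now=lambda: datetime.datetime(2015, 6, 21, 23, 0, 0))
print(listener.on_status(None))  # False
```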

@brk21

brk21 commented Jun 17, 2015

How do I run this for 7 straight days without relying on my computer being on the whole time? How do we keep the code running when our computer is off or disconnected from the Internet?

Answer: Note that you need to gather tweets from within a week, not run a program for a week. You can schedule a task on your local machine using a task scheduler, or you can use AWS EC2 instances if you need a reliable online machine.

@brk21

brk21 commented Jun 17, 2015

Can I just pull the status and created_at date? Or do I really need to pull all the JSON as well for each tweet?

Answer: It is recommended to gather all the JSON (raw data), as you may need these tweets for future assignments. However, if you decide to filter the JSON based on some attributes, make sure that you indicate this in the readme file as part of your submission.
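
Keeping the raw data can be as simple as writing one JSON document per line. A minimal sketch, assuming each tweet is available as a dict (with Tweepy, `tweet._json` as noted earlier in the thread); `append_tweets` is a hypothetical helper name and the sample tweets are made up:

```python
import io
import json

def append_tweets(raw_tweets, out):
    """Write each tweet dict as one JSON document per line; return count."""
    count = 0
    for raw in raw_tweets:
        out.write(json.dumps(raw) + u"\n")
        count += 1
    return count

# Demonstration with made-up tweets and an in-memory buffer.
sample = [{"id": 1, "text": "Go #Warriors"}, {"id": 2, "text": "#NBAFinals2015"}]
buffer = io.StringIO()
written = append_tweets(sample, buffer)
print(written)
```

In a real run, `out` would be a file opened in append mode, so that an interrupted collection can continue in the same file.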

@brk21

brk21 commented Jun 17, 2015

Will down periods during debugging be held against us? If we have to shut down our computers? Changing the hashtags?

Answer: check the answer given for your first post above.

@brk21

brk21 commented Jun 17, 2015

Do we need to separate the pulls for each hashtag or can we pull both hashtags simultaneously?

Answer: As mentioned in class, you need to create 3 output files: 1- tweets with only the hashtags #Warriors and NOT #NBAFinals2015, 2- tweets with only the hashtags #NBAFinals2015 and NOT #Warriors, and 3- tweets with both hashtags together.

@presquepartout

I'm using:
wait_on_rate_limit_notify=True
and want to grep my output for any notification messages. But I don't see any. Does anyone know the text of such messages? (I just wrote to stdout and piped to a file, so I'd expect to see any notifications in the file).

Answer: You can get the messages by extracting them from tweepy exception instance as follows:

except tweepy.TweepError as ex:
    print ex.response.status
    print ex.message[0]['code']
    print ex.args[0][0]['code']

@saffrydaffry

Has anyone gotten the additional q parameters to work? I'm trying to run this code (below) and I get no tweets

q = q + urllib.quote_plus(" since 2015-06-07 until 2015-06-07")

There was a commented hint in the example that I tried to follow, but maybe I don't understand it.

Additional query parameters:
since: {date}
until: {date}
Just add them to the 'q' variable: q+" since: 2014-01-01 until: 2014-01-02"

@bspooner

here is what I am using: tweepy.Cursor(api.search, querystring, since = "2015-06-07", until="2015-06-14", since_id = index, lang="en", monitor_rate_limit=True).items()

where querystring is my search query string and index is the tweet_id of where I left off last
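
For anyone who prefers to keep the dates inside `q`, here is a minimal sketch of building the string. It assumes the operators are written as `since:YYYY-MM-DD` and `until:YYYY-MM-DD` with no space after the colon, that the query is passed unencoded, and that `until` is exclusive, so identical dates would match nothing. `with_date_range` is a hypothetical helper name:

```python
from datetime import date

def with_date_range(query, since, until):
    """Append since:/until: operators (no space after the colon)."""
    if until <= since:
        # Assumption: until is exclusive, so an equal date gives an empty range.
        raise ValueError("until must be later than since")
    return "{0} since:{1} until:{2}".format(
        query, since.isoformat(), until.isoformat())

print(with_date_range("#Warriors", date(2015, 6, 7), date(2015, 6, 14)))
```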

@saffrydaffry

@bspooner thank you!

@dunmireg

In that vein, I've gotten the "since" and "until" parameter working. I just tried including hour, has anyone gotten that to work?

since = "2015-06-17 00:00:00"
until = "2015-06-17 12:00:00"

I was going to use that for my chunking . . .

Answer: According to Luis and the documentation, there isn't a native way to do this. You would have to mine the created_at field.
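
Mining `created_at` could look roughly like this. A minimal sketch, assuming tweets are dicts shaped like `tweet._json` and that `created_at` uses the REST API's UTC format shown in the comment; `parse_created_at` and `in_window` are hypothetical helper names and the sample tweets are made up:

```python
from datetime import datetime

# Format of the created_at field in the REST API JSON, always in UTC,
# for example "Wed Jun 17 11:59:59 +0000 2015".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"

def parse_created_at(value):
    """Parse a created_at string into a naive UTC datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)

def in_window(tweet, start, end):
    """True if the tweet was created in the half-open window [start, end)."""
    created = parse_created_at(tweet["created_at"])
    return start <= created < end

tweets = [
    {"id": 1, "created_at": "Wed Jun 17 11:59:59 +0000 2015"},
    {"id": 2, "created_at": "Wed Jun 17 12:00:00 +0000 2015"},
]
start = datetime(2015, 6, 17, 0, 0, 0)
end = datetime(2015, 6, 17, 12, 0, 0)
print([t["id"] for t in tweets if in_window(t, start, end)])  # [1]
```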

@neuralinfo
Owner Author

For the histogram, you need to consider all the words in the tweets (i.e., the text of the tweets) that you gather. However, for the graph you can filter out the least frequent words and only show the top 30 most frequent words and their counts.
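
Counting all words and then keeping the top 30 for the graph could be sketched as follows. The tokenization (lower-casing and splitting on whitespace) is an assumption, not a course requirement, and `word_counts` is a hypothetical helper name; hashtags stay in the counts as tokens such as `#warriors`:

```python
import collections

def word_counts(texts):
    """Count whitespace-separated, lower-cased words across all tweets."""
    counts = collections.Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

texts = ["Go #Warriors go", "#Warriors win"]
counts = word_counts(texts)
# Only the graph is limited to 30 words; `counts` still holds every word.
top = counts.most_common(30)
print(top)
```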

@neuralinfo
Owner Author

Answers to some questions:

  1. Do we need to query Twitter three times (3 cases: one for "#NBAFinals2015", one for "#Warriors", and one for "#NBAFinals2015 AND #Warriors") and store three different files, or can we just query once using "#NBAFinals2015 OR #Warriors", store the tweets, and filter the results afterwards to create the three output files associated with the three cases?

Answer: Both approaches are fine. However, the latter approach imposes more overhead, since you have to do additional post-processing to filter all the gathered tweets into the three cases.

  2. Do we need to create the histogram using matplotlib or can we use Excel?

Answer: It is not required to use matplotlib. You can find more information on how to create a histogram using matplotlib at http://matplotlib.org/examples/pylab_examples/histogram_demo_extended.html

You cannot use Excel. If you do not want to use matplotlib, you can create a simple CSV (called histogram.csv) containing words in one column and their frequencies in another column. Make sure that you mention this in your readme file as part of your submission.

  3. How many histograms do we need to create?

Answer: Three different histograms for the three different cases:
One for tweets containing "#NBAFinals2015" but not "#Warriors", one for tweets containing "#Warriors" but not "#NBAFinals2015", and one for tweets containing both "#NBAFinals2015" and "#Warriors".
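
If the tweets were gathered with a single OR query, the three cases can be separated afterwards by looking at each tweet's hashtags. A minimal sketch, assuming tweets are dicts shaped like `tweet._json`, where `entities.hashtags` is a list of objects whose `text` omits the leading `#`; `hashtags_of` and `classify` are hypothetical helper names:

```python
def hashtags_of(tweet):
    """Lower-cased hashtags from a tweet dict shaped like tweet._json."""
    tags = tweet.get("entities", {}).get("hashtags", [])
    return set(tag["text"].lower() for tag in tags)

def classify(tweet, tag_a="warriors", tag_b="nbafinals2015"):
    """Return 'only_a', 'only_b', 'both', or None if neither tag is present."""
    tags = hashtags_of(tweet)
    has_a, has_b = tag_a in tags, tag_b in tags
    if has_a and has_b:
        return "both"
    if has_a:
        return "only_a"
    if has_b:
        return "only_b"
    return None

sample = {"entities": {"hashtags": [{"text": "Warriors"},
                                    {"text": "NBAFinals2015"}]}}
print(classify(sample))  # both
```

Comparing lower-cased tags is a design choice made here so that `#warriors` and `#Warriors` land in the same file.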

@rocket-ron

Are we "turning in" the results of Assignment #2 via pull request the same as we did for Assignment #1? In any case, in what format should we post the histogram - as a .png file? Apologies if this was covered elsewhere...

@neuralinfo
Owner Author

All the assignments should be submitted via pull request. Though the preferred method is a png file, there is no specific required format as long as the file can be opened using a standard viewer.

@hdanish

hdanish commented Jun 21, 2015

Some of the comments above reference the histogram as a graph or as just a csv with the words in one column and the frequency in the other. Is it fine to do the latter with the 30 most common words as long as we specify our logic in the readme?

Answer: The 30 limit was given to avoid having an unreadable graph. If you choose to go with the CSV version, you need to include all the words for each histogram of the three cases.
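
Writing the full word list to a CSV could be sketched like this. The code is written for Python 3; the column names, the sort order, and the helper name `write_histogram` are assumptions for illustration, and the counts are made up:

```python
import csv
import io

def write_histogram(counts, out):
    """Write word,frequency rows, most frequent first; ties sorted by word."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "frequency"])
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, freq])

# Demonstration with made-up counts and an in-memory buffer.
buffer = io.StringIO()
write_histogram({"warriors": 3, "finals": 5, "game": 3}, buffer)
print(buffer.getvalue())
```

For the real file, `out` would be something like `open("histogram.csv", "w", newline="")`, with one file per case.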

@hdanish

hdanish commented Jun 21, 2015

As a brief follow up to the previous question, is it ok to use stopwords to filter out certain words?

Answer: If you choose to do so, make sure you mention this in your readme file.

@neuralinfo
Owner Author

Quick Assignment 2 Recap:

  1. Only submit assignment in a private repo
  2. All the assignments should be submitted via pull request to your own repository. Please follow the instructions here: https://github.com/MIDS-W205/Assignments/blob/master/README.md
  3. Please follow Table 1: Grading Standard on page 8 of the Course Syllabus
  4. You must create three different histograms or simple CSVs for the three different cases:
    One for tweets containing "#NBAFinals2015" but not "#Warriors", one for tweets containing "#Warriors" but not "#NBAFinals2015", and one for tweets containing both "#NBAFinals2015" and "#Warriors".
  5. For the histogram, you need to consider all the words in the tweets (i.e., the text of the tweets) that you gather. However, for the graph you can filter out the least frequent words and only show the top 30 most frequent words and their counts.

@hdanish

hdanish commented Jun 25, 2015

What is the expectation for the link to the S3 bucket? I see that individual files have working URLs; is there a way to get a URL for an entire bucket that would resolve properly? I was able to add a policy to make the bucket public, but I'm not sure what to do for the link.

Answer: There is no specific expectation for the link as there might be various structures such as having all the files in the root of bucket or having them inside different folders within a bucket. However, you should have correct permissions (public access) in place.
The URL of your bucket is http://s3.amazonaws.com/YOURBUCKETNAME/

@d8davies

I signed up for the package which includes GIT level with ability to create private repositories, but it is still asking me for money when I go to create it.

Answer: You should first get your education account and then you would be able to create private repositories without paying.

@jamesgray007

I am using the nltk freqdist.plot to create a plot. What is the best method to export this to disk? From the documentation it appears that it may be a matplotlib object, but my code says otherwise.

Answer: If you are using matplotlib, you can use the savefig() method to save the figure to disk.

@jamesgray007

@d8davies -> I did the Education request a number of weeks ago and they never responded so I emailed support and they activated it immediately.
