Assignment 3 Issues/Questions #20
Comments
The way the question is posed implies that we should calculate the lexical diversity for each individual tweet and store that information in Mongo. Is that correct and, if so, what information should the plot of lexical diversity contain? A histogram representing the lexical diversities of all the tweets in the corpus? |
More information is added to the instructions related to your question. If you still need further clarifications, please post your question here. |
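For readers following along, here is a minimal sketch of a per-tweet lexical diversity score, assuming the usual definition (unique tokens divided by total tokens). The tokenizer regex and the function name `lexical_diversity` are illustrative assumptions, not part of the assignment instructions.

```python
import re


def lexical_diversity(text):
    """Return unique tokens / total tokens for one tweet (0.0 if it has no tokens)."""
    # Assumed tokenizer: lowercase words, keeping a leading # or @ attached.
    tokens = re.findall(r"[#@]?\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up sample texts, only to show the call.
tweets = ["Warriors win Warriors win", "What a game tonight"]
scores = [lexical_diversity(t) for t in tweets]
```

The resulting list of scores could be stored alongside each tweet document and then fed to a histogram, which is one way to read the plot mentioned in the question.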
Thanks, that explanation helps. Another question that came up in office hours with Luis just now: for those of us who chose to use the firehose for assignment 2 and didn't necessarily store all the metadata for each tweet, how should we think about part 1? My inclination would be to use the REST API to go back and re-gather some tweets, store them in S3 and MongoDB, and then use that corpus for the rest of the assignment. However, since we're now over a week removed from the NBA finals and the REST API won't allow us to go back in time beyond that, there are going to be very few tweets with those hashtags. If that's the case, should we use another hashtag about a more current event (I've been using #pride to test things out thus far) to ensure we get a good amount of data? |
In that case, you can use two different related hashtags about a more current event and use that corpus for the rest of the assignment. Make sure you indicate these changes in your readme file as part of your submission. |
For parts 1-1 and 2-2, does it mean we need to continuously obtain tweets, store them and do the analysis as the tweets come in? How long should we run this? |
You do need to gather all the available tweets associated with the hashtags as indicated in the instructions. In most cases you can get one week's worth of tweets. |
Two clarifying questions regarding part 2.3: 2.) In the event that our users from 2.1 contain celebrities that have millions of followers, are we expected to store them all? For example, if @BarackObama supplies one of our top 30 retweets, do we need to store all 61.4 million of his followers? |
I am a bit confused about the difference between 1.1 and 1.2. In 1.1, are we storing the raw JSON returned from the Twitter API to MongoDB, whereas in 1.2 we are storing just the tweet text to MongoDB? Also, in 1.1, do we need to fetch the data again by calling the Twitter API, or can we simply reuse the raw data that we gathered for Assignment 2? |
1. In 1.1 you load the data from the source as you retrieve it (real-time storage), whereas in 1.2 you load the data from a gathered/chunked dataset (offline storage). If you look at 2.1, you should be able to work out what needs to be stored in 1.2. 2. In 1.1, you need to fetch the data again by calling the Twitter API. |
Follow-up to @vincentchio's question: Task 1.1 -> I don't see anything in the instructions that sets a requirement for real-time storage, given that it states the use of the REST API. My interpretation of this requirement was loading the JSON I stored locally on my device into MongoDB. Please confirm whether loading locally stored JSON into MongoDB meets the expectation. Task 1.2 -> My interpretation is loading data from S3 into MongoDB. I do not see an answer to @vincentchio's question on whether this is the entire JSON for each tweet or just the tweet "text". The instructions also state we can re-use our JSONs from Assignment 2. |
In task 1.1, the instruction says: "write a python program to automatically retrieve and store the JSON files returned by the twitter REST api." If you are using the JSONs that you gathered for Assignment 2, you only need to load the locally stored JSON into MongoDB, since we do not want you to repeat the same task. However, if you are using different hashtags, you will probably want to store the JSONs directly, as it is not a good design decision to store them locally and load them later. For task 1.2, you need to load the data from S3 into MongoDB. |
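As an illustration of the "load locally stored JSON into MongoDB" case, here is a rough sketch that assumes the tweets were saved one JSON object per line. The function name `load_json_lines`, the batch size, and the file layout are assumptions; `collection` can be any object with an `insert_many` method, such as a pymongo collection.

```python
import json


def load_json_lines(path, collection, batch_size=500):
    """Insert one JSON document per non-empty line of `path`; return the count."""
    batch, total = [], 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                collection.insert_many(batch)
                total += len(batch)
                batch = []
    # Flush whatever is left after the last full batch.
    if batch:
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With pymongo installed and a local mongod running, a call might look like `load_json_lines("tweets.json", pymongo.MongoClient()["assignment3"]["tweets"])`, where the file, database, and collection names are placeholders.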
For question 2.3, since the tweets are based on question 2.1, which is in turn based on 1.2 (assignment 2), will it be fine to essentially re-run a similar process at this point in time and compare the followers? It may be longer than a week's difference, but I would think the logic should be the same whether it's a one-week difference or a multiple-week difference? |
That is fine. Make sure that you mention this in your readme file. |
When we are retrieving the 30 top retweets, how are we supposed to handle items that have been retweeted by many different users? It seems as though the retweet_count property of the tweet refers to whatever the original tweet was. So for example, if LeBron had a tweet that was retweeted 5000 times, including by users A, B and C, then the retweet count will show up as 5000 for each of the retweets by those users. If it's a particularly popular tweet, it's possible that the 30 retweets all reference the same original tweet. How are we meant to handle a situation such as that, or is it fine that we will have the same tweet text over and over again but for different users? |
It is not acceptable to have the same tweet text over and over again. You need to find a way to check/filter such tweets. You may want to look for RT at the front of a tweet or at other features that Twitter provides. |
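One possible way to filter the duplicates, sketched under the assumption that each stored document is a REST API tweet object in which a retweet embeds the original tweet under `retweeted_status`. The function name `top_unique_retweets` is hypothetical, and this is only one of several reasonable designs.

```python
def top_unique_retweets(tweets, n=30):
    """Return the n most-retweeted original tweets, one entry per original."""
    originals = {}
    for tweet in tweets:
        # For a retweet, rank the embedded original; otherwise the tweet itself.
        source = tweet.get("retweeted_status", tweet)
        key = source["id_str"]
        best = originals.get(key)
        # Keep the copy with the highest count seen for this original.
        if best is None or source["retweet_count"] > best["retweet_count"]:
            originals[key] = source
    ranked = sorted(originals.values(),
                    key=lambda t: t["retweet_count"], reverse=True)
    return ranked[:n]
```

Keying on the original tweet's ID rather than on the text avoids treating two distinct tweets with identical wording as one.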
For 2-2, are we just looking at the tweets we have collected or do we have to go to twitter and get all the tweets for each user? |
The instruction says "you need to find all the tweets of a particular user". If you have already stored them during the acquisition phase, you can use them. Otherwise, you need to get them from Twitter (tweets that are available to you through the REST API). |
Do we need to include retweets as part of a user's tweets, or is this a design decision we can indicate in the readme? |
I guess I should clarify the question. Does "all the tweets of a user" mean all the tweets of that user with hashtags "NBAFinals2015" and "Warriors" (i.e. the ones we have already collected) or ALL tweets by that user (i.e. do a new search with the user as the only search param)? |
@hdanish: it is your design decision. |
@kchoi01: ALL tweets by that user. The majority of the users only have 1 or 2 tweets with hashtags "NBAFinals2015" and "Warriors". |
Just to confirm on the last point, if a user has 50k tweets then we would need to retrieve all of them? Lots of users have tens of thousands of tweets so we might be retrieving millions of tweets for all our users combined? |
Adding to @hdanish's comment: I have 200000 unique users in my db_restT. If I retrieve 1000 tweets per user, it will take about 34 days to complete the search at the current max rate limit (60000 tweets/15 mins). I hope that the number of tweets per user that we fetch can be adjusted according to one's situation. In my case, I only plan to fetch 100 tweets max per user, which would take about 3.5 days. I hope the number of tweets per user is a soft requirement that can vary by student rather than a hard one. |
@hdanish and @vincentchio: Though your program should be able to retrieve all of the users' tweets, the number of tweets per user that you pull is not a hard requirement. You can come up with a reasonable upper bound for the maximum number of tweets that you pull. Make sure you mention this in your readme. |
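The estimate quoted above can be reproduced with a few lines of arithmetic, which may help when choosing an upper bound. This sketch takes the 60000 tweets per 15 minutes figure from the comment as given; the function name `days_to_fetch` is made up for illustration.

```python
def days_to_fetch(users, tweets_per_user,
                  tweets_per_window=60000, window_minutes=15):
    """Estimate the days needed to pull tweets_per_user tweets for every user."""
    total_tweets = users * tweets_per_user
    windows = total_tweets / tweets_per_window  # number of rate-limit windows
    return windows * window_minutes / (60 * 24)  # minutes converted to days


full_pull = days_to_fetch(200000, 1000)   # roughly 34.7 days
capped_pull = days_to_fetch(200000, 100)  # roughly 3.5 days
```

Because the time scales linearly with the per-user cap, halving the cap halves the collection time.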
In a similar vein, is it OK to limit the followers for Task 2.3 to, say, the first 10,000 followers retrieved? |
This has already been answered. Check the answers to @nickhamlin's question in one of the above posts. |
Running mongod gives an error 'MongoDB Insufficient free space for journal files' |
Try running MongoDB without journaling: ./bin/mongod --config=myconf.conf --nojournal & |
For 2.3, some of us will be unable to complete the data collection in time to compare across a week's time lag. Is the "one week" a hard requirement, or could the task be implemented with the best time lag we can manage within the time limitations? |
You can implement 2.3 with the best time lag you can manage. Make sure you indicate this in your readme file. |
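Whatever time lag is used, the comparison itself can be a set difference over follower IDs. The sketch below assumes task 2.3 amounts to comparing two follower snapshots for the same user, which is an interpretation of the thread rather than a quote from the instructions; the function name and the sample IDs are invented.

```python
def follower_changes(earlier_ids, later_ids):
    """Compare two follower snapshots and report who was gained, lost, and kept."""
    earlier, later = set(earlier_ids), set(later_ids)
    return {
        "gained": sorted(later - earlier),   # present only in the later snapshot
        "lost": sorted(earlier - later),     # present only in the earlier snapshot
        "kept": len(earlier & later),        # count present in both snapshots
    }


# Made-up IDs, only to show the shape of the result.
changes = follower_changes([101, 102, 103], [102, 103, 104, 105])
```

If the follower lists were capped during collection, the gained and lost sets only describe the sampled portion, which is worth noting in the readme.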
Write your issues/questions about assignment 3 here