TwitterCleaner

Solution for the InsightDataEngineering coding-challenge

Main.py contains the main program, while tweetprocessor.py contains the subroutines that process the tweets (feature 1) and compute the rolling average of the graph nodes (feature 2).

When run.sh is called, the program asks whether to use the example file [ex] (tweets.txt in the tweet_input folder) or live stream data from Twitter [st], fetched through the Twitter API.

A .twitter-example file containing the credentials needs to exist in the src folder for the Twitter API to work.

Use streaming data [st], or example file [ex]? [st/ex] :

If live streaming data is chosen, the program then asks whether to append the streamed tweets to the tweets.txt file:

Store tweet streams to tweets.txt? [y/n]:

Modules Imported:

Except for tweepy, these modules are part of the Python standard library.

re
datetime
time
json
collections
copy
os
tweepy [please refer to the tweepy website for installation]

Solution

Feature 1

The solution for cleaning the tweets [feature 1] employs two strategies:

Locating the tweet text using Python's built-in string operations
Searching for Unicode characters, hashtags and escape sequences using regular expressions (re module)

While one could use the json module to extract the text, I believe that plain string searching gives better performance.
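
As a rough illustration of the two strategies above, the sketch below finds the "text" field with plain string searches and then cleans it with the re module. The helper names (extract_text, clean_text, hashtags), the compact `"text":"` key layout, and the exact cleaning rules are assumptions made for this example. They are not taken from tweetprocessor.py.

```python
import re

# Hypothetical helpers: names and cleaning rules are assumptions.
_TEXT_KEY = '"text":"'                            # assumes compact JSON, no space after the colon
_UNICODE_ESCAPE = re.compile(r'\\u[0-9a-fA-F]{4}')
_WHITESPACE_ESCAPE = re.compile(r'\\[nrt]')
_HASHTAG = re.compile(r'#(\w+)')


def extract_text(raw_line):
    """Return the still-escaped value of the first "text" field, or None."""
    start = raw_line.find(_TEXT_KEY)
    if start == -1:
        return None
    start += len(_TEXT_KEY)
    end = start
    # Scan forward to the first double quote that is not escaped.
    while end < len(raw_line):
        ch = raw_line[end]
        if ch == '\\':
            end += 2          # skip the backslash and the character it escapes
            continue
        if ch == '"':
            return raw_line[start:end]
        end += 1
    return None


def clean_text(escaped):
    """Drop \\uXXXX escapes and turn \\n, \\r, \\t escapes into spaces."""
    text = _UNICODE_ESCAPE.sub('', escaped)
    text = _WHITESPACE_ESCAPE.sub(' ', text)
    # Unescape the remaining common JSON sequences.
    return text.replace('\\/', '/').replace('\\"', '"').replace('\\\\', '\\')


def hashtags(text):
    """Return the distinct hashtags in the text, lower-cased and sorted."""
    return sorted({tag.lower() for tag in _HASHTAG.findall(text)})


raw = r'{"created_at":"Thu Oct 29 17:51:01 +0000 2015","text":"Spark \u2764 #Apache\n#Hadoop"}'
cleaned = clean_text(extract_text(raw))
print(cleaned)
print(hashtags(cleaned))
```

This sketch does not handle every JSON edge case (for example, an escaped backslash directly followed by "u"). The json module would be the safer, if possibly slower, choice when correctness matters more than speed.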

Feature 2

The graph is stored as a Python dictionary. In addition, I keep a counter of the hashtags associated with each connected node in the dictionary. Adding or removing hashtags increments or decrements that counter, and once it reaches zero the connected node is removed.

Another counter keeps track of the number of graph nodes and the total sum of the connected vertices.

This way, the subroutine avoids recalculating the graph for every tweet.
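
The bookkeeping described above could look roughly like the following sketch. The class name HashtagGraph, its method names, and the choice to report the average number of connections per node are assumptions for illustration. The logic that decides when a tweet leaves the rolling window is left out, and this is not the code in tweetprocessor.py.

```python
from itertools import combinations


class HashtagGraph:
    """Hypothetical sketch: dictionary graph with reference-counted edges."""

    def __init__(self):
        self.edges = {}       # node -> {neighbour: number of tweets linking the pair}
        self.node_count = 0   # nodes that currently have at least one neighbour
        self.degree_sum = 0   # sum of the neighbour counts over all nodes

    def _add_half_edge(self, a, b):
        neighbours = self.edges.get(a)
        if neighbours is None:
            neighbours = self.edges[a] = {}
            self.node_count += 1
        if b not in neighbours:
            neighbours[b] = 0
            self.degree_sum += 1
        neighbours[b] += 1

    def _remove_half_edge(self, a, b):
        neighbours = self.edges[a]
        neighbours[b] -= 1
        if neighbours[b] == 0:
            # Last tweet linking a and b is gone: drop the connection.
            del neighbours[b]
            self.degree_sum -= 1
            if not neighbours:
                del self.edges[a]
                self.node_count -= 1

    def add_tweet(self, tags):
        """Connect every pair of distinct hashtags in one tweet."""
        for a, b in combinations(sorted(set(tags)), 2):
            self._add_half_edge(a, b)
            self._add_half_edge(b, a)

    def remove_tweet(self, tags):
        """Undo add_tweet for a tweet that was previously added."""
        for a, b in combinations(sorted(set(tags)), 2):
            self._remove_half_edge(a, b)
            self._remove_half_edge(b, a)

    def average_degree(self):
        """Read the average from the counters without walking the graph."""
        return self.degree_sum / self.node_count if self.node_count else 0.0


graph = HashtagGraph()
graph.add_tweet(['spark', 'apache'])
graph.add_tweet(['apache', 'hadoop'])
print(graph.average_degree())
graph.remove_tweet(['spark', 'apache'])
print(graph.average_degree())
```

Because both counters are updated as edges appear and disappear, reading the average is a constant-time division, whatever the size of the graph.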

