The ultimate Twitter streaming data collector
At Unnati we're a bunch of Data Scientists solving important business problems.
The crux of solving these data related problems is being able to collect the data itself. In some cases, data might already be available in the way we want it, but in most cases, it has to either be procured or transformed to fit our needs.
Social media analytics is becoming a very important part of running a business. End users are quick to praise, and even quicker to shame, a brand or a product on social media. As a result, businesses are investing rapidly in the ability to monitor and act on what they hear there.
But to begin, we need data.
Consider Twitter. Its REST APIs are quite stringent in terms of rate limits. What we really want to use is the streaming API, which is not subject to the same per-request rate limits and lets us process tweets in near real time.
The problem is reinventing the wheel: most of the time, we end up rewriting the data collection layer again and again with only minor changes to the codebase. The crux of the collection layer, though, largely remains the same.
To solve this problem, we built Tweezer. With Tweezer, you can start collecting data in under 5 minutes. All you need is an authorized Twitter app created at your end and an instance of MongoDB.
A handy configuration file manages the workings of the app. It holds the authorized app credentials, the data store credentials and the keywords/hashtags to track.
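To make the three parts of the configuration concrete, here is a rough sketch that writes out an example file. Every key name in it (`twitter`, `mongo`, `track` and their fields) is an assumption made for illustration, as are the `CONF_FILE` variable and the `application.conf.example` file name; the real names are whatever the sample configuration shipped with Tweezer uses.

```shell
#!/bin/sh
# Hypothetical layout: all key names below are assumptions for illustration,
# not Tweezer's actual configuration keys.
CONF_FILE="${CONF_FILE:-./application.conf.example}"

cat > "$CONF_FILE" <<'EOF'
# Authorized Twitter app credentials (placeholder values)
twitter {
  consumer-key = "YOUR_CONSUMER_KEY"
  consumer-secret = "YOUR_CONSUMER_SECRET"
  access-token = "YOUR_ACCESS_TOKEN"
  access-token-secret = "YOUR_ACCESS_TOKEN_SECRET"
}

# MongoDB data store (placeholder values)
mongo {
  host = "localhost"
  port = 27017
  database = "tweets"
}

# Keywords and hashtags to track
track = ["#jsfoo", "jsfoo"]
EOF

echo "wrote $CONF_FILE"
```

The values are placeholders: the four Twitter fields correspond to the consumer key/secret and access token/secret that Twitter issues for an authorized app.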
Very recently, HasGeek conducted their annual conference on JavaScript in India - JsFoo. To test run Tweezer, we left it running for 3 days monitoring the hashtags and keywords relevant to the event.
Using this data, we have even put together a dashboard visualizing JsFoo from various angles.
Use the Dockerfile to build the Docker image. The image comes with JDK 8 and MongoDB.
$ sudo docker build -t mytwitterstream .
Once the image is built, make sure you add the credentials in application.conf:
- specify the Twitter API credentials
- specify the MongoDB credentials
Run the docker image.
$ sudo docker run -t -i mytwitterstream
Internally, this starts MongoDB and the Twitter streamer app, which writes tweets to the local database.
To build and run from source with sbt instead:
$ sbt build
Set the required credentials in application.conf
$ sbt run
If you have Java 7+ and MongoDB 3 installed and do not want the Docker setup, you can pick up the jar from Dropbox and run Tweezer. Make sure you configure application.conf (here is a sample) and set an environment variable HARATE_CONF pointing to the location of the configuration file.
$ export HARATE_CONF=/path/to/application.conf
Once we have the path configured, we are ready to run the jar
$ java -jar tweezer-0.2.0.jar
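A missing or mistyped HARATE_CONF is an easy mistake to make, so it can help to check it before launching. The sketch below is one possible wrapper, not something that ships with Tweezer: `check_conf`, `start_tweezer` and the `TWEEZER_JAR` variable are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch of a launcher wrapper. check_conf, start_tweezer and TWEEZER_JAR
# are hypothetical names, not part of Tweezer.

# Succeeds only if HARATE_CONF is set and points to a readable file.
check_conf() {
  if [ -z "${HARATE_CONF:-}" ]; then
    echo "HARATE_CONF is not set" >&2
    return 1
  fi
  if [ ! -r "$HARATE_CONF" ]; then
    echo "cannot read config file: $HARATE_CONF" >&2
    return 1
  fi
  return 0
}

# Runs the jar only when the configuration check passes.
start_tweezer() {
  check_conf && java -jar "${TWEEZER_JAR:-tweezer-0.2.0.jar}"
}
```

After sourcing the file, calling `start_tweezer` either starts the collector or prints which part of the setup is missing.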