Pulling Large GDELT Data
This quick tutorial shows how to use gdeltPyR to pull GDELT data that is large or covers long periods of time.
We need a tutorial on pulling large queries because a single day's worth of GDELT data can consume 400+ MB of RAM; it's easy to see how a query spanning dozens of days (not to mention months or years) will exhaust your RAM.
First, let's cover the tools you will need:
- concurrent.futures (included in the Python 3 standard library; Python 2 users need to install the backport)
- gdeltPyR
- dask
- pandas
concurrent.futures will help us run parallel processes for our gdeltPyR queries. gdeltPyR is how we query the data. And, once our queries are complete, we can use dask to load all of the data into an out-of-core dataframe and perform pandas-like operations on it.
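Before starting, it can help to confirm that everything imports. This is just a sketch of a sanity check, not part of the query workflow itself:

```python
from concurrent.futures import ProcessPoolExecutor  # parallel gdeltPyR queries
import gdelt                                         # the GDELT query client
import pandas as pd                                  # date ranges and csv export
import dask.dataframe as dd                          # out-of-core dataframe

# if these imports succeed, the tools below are ready to use
print(pd.__version__)
```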
Because the data can be large, our plan is to pull a single day's worth of GDELT data at a time and write that data to disk. dask can then load the multiple files from disk into one dataframe. In short, this allows us to do operations on data that is larger than RAM (a real big data problem).
Version 2 GDELT data is more extensive, but it only covers February 2015 to the present. Therefore, set the version to 1 if you need data before February 2015. The gdeltPyR query would look like this:
```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltpyr for version 1
gd = gdelt.gdelt(version=1)

# multiprocess the query
e = ProcessPoolExecutor()

# generic function to pull and write data to disk based on date
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception:
        # skip dates that return no data or fail on the GDELT server side
        pass

# now pull the data; this will take a long time
results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))
```
In the first steps, we're just importing the libraries mentioned earlier. The line e = ProcessPoolExecutor() sets up the multiprocessing job so that we can map operations to each core. If we have 8 cores, we can run 8 gdeltPyR queries simultaneously, so having more cores makes a long query return faster. I suggest using AWS or Google Compute Engine instances with multiple cores if you want to save time. A machine with 90+ cores can run 90 days' worth of gdeltPyR queries at one time. That will save you time!
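On the other hand, if your machine has limited RAM, you may want to cap how many queries run at once. ProcessPoolExecutor accepts a max_workers argument for this; the cap of 4 below is purely illustrative:

```python
import os
from concurrent.futures import ProcessPoolExecutor

# each worker can hold a full day of GDELT data in memory, so cap the pool
# below the core count on memory-constrained machines (the 4 here is arbitrary)
e = ProcessPoolExecutor(max_workers=min(4, os.cpu_count() or 1))
```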
getter is a utility function that pulls the data and fails gracefully if no data is returned or something is corrupted on the GDELT server side. We then use the pandas.DataFrame.to_csv method to export the returned data to a csv file on disk. Because the date is converted to a string with date = x.strftime('%Y%m%d'), that string is used to build the file name.
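If you would rather know which dates failed instead of silently skipping them, a hypothetical variant of getter (not part of the original tutorial, and assuming the gd object defined in the block above) could return a status for each date:

```python
# hypothetical variant: report outcomes instead of silently passing
def getter_verbose(x):
    date = x.strftime('%Y%m%d')
    try:
        d = gd.Search(date)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
        return date, 'ok'
    except Exception as exc:
        return date, 'failed: {}'.format(exc)
```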
Finally, pd.date_range('2015 Apr 21','2018 Apr 21') uses the pandas.date_range function to build an array of dates between the two endpoints. The results line maps our getter function to each date in that array, and each day's file is written to disk!
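As a quick illustration of what date_range produces (a shorter range is used here just for the example):

```python
import pandas as pd

# daily dates between the two endpoints, inclusive
dates = pd.date_range('2015 Apr 21', '2015 Apr 24')
print([d.strftime('%Y%m%d') for d in dates])
# ['20150421', '20150422', '20150423', '20150424']
```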
If you need version 2 data (which covers February 2015 to the present), you can modify the code above slightly. Specifically, pass version=2 to the gdelt object and add coverage=True to the gdelt.gdelt.Search method. Note that coverage=True pulls all of the 15-minute updates for each day rather than only the most recent one, so each day returns considerably more data. The modified code will look like this:
```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltpyr for version 2
gd = gdelt.gdelt(version=2)

# multiprocess the query
e = ProcessPoolExecutor()

# generic function to pull and write data to disk based on date
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date, coverage=True)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception:
        # skip dates that return no data or fail on the GDELT server side
        pass

# now pull the data; this will take a long time
results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))
```
If you need to write the data to disk because it's larger than RAM, you'll also likely need a convenient workflow for analyzing data that is larger than RAM. dask makes it possible to handle inconveniently large data sets, whether the data resides on a single laptop or across a cluster of hundreds or thousands of machines. We use dask to load the multiple csv files of gdeltPyR data into a single dataframe, and we can perform pandas-like operations on that larger data set as well! So, the code to load the data:
```python
import dask.dataframe as dd

# read all the gdelt csvs into one dataframe
df = dd.read_csv('*_gdeltdata.csv')
```
That's it! Whether you have hundreds or thousands of csv files in your target folder, this will read them all.
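From here you can work with df much like a pandas dataframe; dask operations are lazy, so call .compute() to materialize a result. A minimal sketch, assuming the csvs keep GDELT's standard column names (SQLDATE is used here only as an example column):

```python
# inspect the schema (read from the csv headers, no compute needed)
print(df.columns)

# pandas-like operations stay lazy until .compute() is called
daily_counts = df.groupby('SQLDATE').size().compute()
print(daily_counts.head())
```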
This wiki provided a tutorial on how to pull and process large GDELT data using gdeltPyR. To learn more about dask, visit its documentation at https://docs.dask.org.