Spatio-temporal clustering methods developed as part of Deliverable 3.3.1 of the [Trendminer FP7 project](http://www.trendminer-project.eu).
Requirements:
- Java v1.6+
- Mallet v2.0.7+
This application implements three main functionalities:
- transformation of the data from the Trendminer format to Mallet format
- creation of train and test instance files
- execution of spatio-temporal clustering using the Dirichlet-multinomial regression model (DMR)
The first stage transforms data from the Trendminer format to the Mallet format. The input data is expected in a folder containing the following files:
- dictionary: the vocabulary, in the format 'word_id word', one word per line, ids starting at 0
- dates: the map from day ids to calendar days, in the format 'day_id YYYY-MM-DD', one date per line, ids starting at 0
- cities: the map from city ids to city names, in the format 'city_id city', one city per line, ids starting at 0
- GEO: the details of each city, in the format 'city_id city latitude longitude country', one city per line, ids starting at 0
- sora_vs: the word counts, in the format 'word_id day_id city_id[TAB]frequency', one entry per line, giving the frequency of word_id on day_id in city_id
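To make the file formats above concrete, here is a minimal Python sketch of how the id maps and `sora_vs` lines could be parsed. It is not part of the tool. The function and class names are hypothetical, and it assumes that ids are separated by single spaces and that frequencies are integer counts.

```python
from typing import Dict, Iterable, NamedTuple


class Count(NamedTuple):
    """One sora_vs entry: frequency of a word on a day in a city."""
    word_id: int
    day_id: int
    city_id: int
    frequency: int  # assumption: frequencies are integer counts


def parse_sora_vs_line(line: str) -> Count:
    # The key and the frequency are separated by a TAB;
    # the three ids inside the key are separated by whitespace.
    key, frequency = line.rstrip("\n").split("\t")
    word_id, day_id, city_id = key.split()
    return Count(int(word_id), int(day_id), int(city_id), int(frequency))


def parse_id_map(lines: Iterable[str]) -> Dict[int, str]:
    """Parse 'id value' lines (dictionary, dates, cities) into a dict."""
    mapping = {}
    for line in lines:
        item_id, value = line.rstrip("\n").split(" ", 1)
        mapping[int(item_id)] = value
    return mapping
```

For example, `parse_sora_vs_line("12 3 7\t5")` yields an entry stating that word 12 occurred 5 times on day 3 in city 7.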
Several setups are available for transforming the data for clustering, based on the methods described in D3.3.1:
- Monthly indicator features (Mid):
`java -Xmx6G -jar dist/trendminer-sptempclustering-importer.jar --startMonth 06 --endMonth 06 --startYear 2012 --endYear 2013 --useMonthlyIndicatorFeatures true --mainDir data/`
Here mainDir is the data directory, and startMonth, startYear, endMonth and endYear delimit the time interval of the data to be processed.
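As an illustration of what a monthly indicator feature is, the following Python sketch builds a one-hot vector with one position per month of the interval. It is not the importer's code. The function name is hypothetical, and it assumes that both the start month and the end month are included in the interval.

```python
from datetime import date
from typing import List


def monthly_indicator(day: date, start_year: int, start_month: int,
                      end_year: int, end_month: int) -> List[int]:
    """Return a one-hot vector marking the month that `day` falls in."""
    # Assumption: the interval includes both its start and its end month.
    n_months = (end_year - start_year) * 12 + (end_month - start_month) + 1
    index = (day.year - start_year) * 12 + (day.month - start_month)
    if not 0 <= index < n_months:
        raise ValueError("day lies outside the configured interval")
    features = [0] * n_months
    features[index] = 1
    return features
```

Under these assumptions, the interval from 06/2012 to 06/2013 used in the command above gives 13 indicator features per document.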
- Temporal smoothing with RBF kernels (TimeRBF):
`java -Xmx6G -jar dist/trendminer-sptempclustering-importer.jar --sigma 30 --startMonth 06 --endMonth 06 --startYear 2012 --endYear 2013 --mainDir data/`
One RBF kernel is centered at the middle of each month, so the kernels are approximately equidistant. The sigma parameter sets the RBF width.
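The temporal smoothing idea can be sketched in Python as follows. This is an illustration, not the importer's implementation. It assumes a Gaussian kernel of the form exp(-d^2 / (2 sigma^2)), with the distance d and sigma both measured in days. The function names are hypothetical.

```python
import calendar
import math
from datetime import date
from typing import List


def month_centers(start_year: int, start_month: int,
                  end_year: int, end_month: int) -> List[date]:
    """Return the middle day of every month in the (inclusive) interval."""
    centers = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        days_in_month = calendar.monthrange(year, month)[1]
        centers.append(date(year, month, (days_in_month + 1) // 2))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return centers


def time_rbf_features(day: date, centers: List[date], sigma: float) -> List[float]:
    """One feature per kernel; it decays with the distance in days to the center."""
    # Assumption: Gaussian kernel with sigma expressed in days.
    return [math.exp(-((day - c).days ** 2) / (2.0 * sigma ** 2)) for c in centers]
```

Unlike the monthly indicators, a document dated near a month boundary gets non-zero weight on both neighbouring kernels, which is what produces the smoothing.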
- To use city indicator features, add the parameter:
`--useCityFeatures true`
- To use country indicator features, add the parameter:
`--useCountryFeatures true`
- To use spatial smoothing features, add the parameters:
`--geokernel true --sigma_GEO 2`
The sigma_GEO parameter sets the width of the spatial RBF kernels. One kernel is centered on each city in the city list.
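A spatial RBF feature vector could be computed along the following lines. This Python sketch is illustrative only. It assumes a Gaussian kernel over the plain Euclidean distance between latitude/longitude pairs, with sigma_GEO expressed in degrees. The distance measure and units actually used by the importer are not stated here, and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def geo_rbf_features(lat: float, lon: float,
                     city_coords: List[Tuple[float, float]],
                     sigma_geo: float) -> List[float]:
    """One feature per city kernel, decaying with distance to that city."""
    features = []
    for city_lat, city_lon in city_coords:
        # Assumption: squared Euclidean distance in degrees of latitude/longitude.
        squared_distance = (lat - city_lat) ** 2 + (lon - city_lon) ** 2
        features.append(math.exp(-squared_distance / (2.0 * sigma_geo ** 2)))
    return features
```

A document located exactly in one of the listed cities gets the value 1 for that city's kernel and smaller values for the kernels of the other cities, so nearby cities share statistical strength.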
The second stage splits the data into two disjoint files, one for training and one for testing. Its inputs are the Mallet file created in the previous stage and the proportion of the data to use for training.
`java -Xmx6G -jar dist/trendminer-sptempclustering-instancecreator.jar --instancesMalletFile data/mallet_file --trainingportion 0.7`
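The effect of the training proportion can be illustrated with a small Python sketch. It assumes a seeded random shuffle followed by a cut at the requested proportion. How the instance creator actually selects instances may differ, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_instances(instances: Sequence[T], training_portion: float,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Split instances into disjoint training and test lists."""
    shuffled = list(instances)
    # Assumption: a seeded random shuffle, so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(training_portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

With a training portion of 0.7, 10 instances are divided into 7 training and 3 test instances, and no instance appears in both lists.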
The third stage runs the spatio-temporal clustering, using the files generated in the previous stage as input.
`java -Xmx6G -jar dist/trendminer-sptempclustering.jar --trainInstanceList data/mallet_train_file --testInstanceList data/mallet_test_file --outputFolder output/ --nrTopics 100 --topWords 10`
Here mallet_train_file and mallet_test_file are the paths to the training and test files, output/ is the folder where the model output is written, nrTopics is the number of topics, and topWords (optional) is the number of top words listed for each topic in the output files.
The output folder contains:
- perplexity.txt: the perplexity on the held-out test data
- topics.txt: the top topWords words in each topic, one topic/line
- _parameters_: the coefficients (weights) for each temporal/spatial feature, one file/topic
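To help interpret these outputs, the Python sketch below shows the standard definition of perplexity and the way a Dirichlet-multinomial regression model turns feature weights into per-document topic priors: the prior for each topic is the exponential of the dot product between that topic's weights and the document's features. The function names and the input layout are hypothetical, and the exact evaluation procedure used by the tool may differ.

```python
import math
from typing import List, Sequence


def perplexity(token_log_likelihoods: Sequence[float]) -> float:
    """exp of the negative mean log-likelihood (natural log) per held-out token."""
    n_tokens = len(token_log_likelihoods)
    return math.exp(-sum(token_log_likelihoods) / n_tokens)


def dmr_topic_prior(features: Sequence[float],
                    topic_weights: Sequence[Sequence[float]]) -> List[float]:
    """Dirichlet prior over topics for one document in a DMR model.

    topic_weights[k][j] is the coefficient of feature j for topic k.
    """
    return [
        math.exp(sum(w * x for w, x in zip(weights, features)))
        for weights in topic_weights
    ]
```

Lower perplexity means the model predicts the held-out data better. A large positive coefficient for a feature means that the topic becomes more likely in documents where that temporal or spatial feature is active.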
Daniel Preotiuc-Pietro, Sina Samangooei, Andrea Varga, Douwe Gelling, Trevor Cohn, Mahesan Niranjan. Tools for mining non-stationary data - v2. Clustering models for discovery of regional and demographic variation - v2. Public Deliverable for Trendminer Project, 2014.