spatio-temporal clustering methods for the D3.3.1 (Trendminer FP7 project)

fk3/trendminer-sptempclustering

 
 

Trendminer Spatio-temporal clustering

Spatio-temporal clustering methods developed as part of Deliverable 3.3.1 of the Trendminer FP7 project (http://www.trendminer-project.eu).

Prerequisites

This application implements three main functionalities:

  1. transformation of the data from the Trendminer format to Mallet format
  2. creation of train and test instance files
  3. execution of spatio-temporal clustering using the Dirichlet-multinomial regression model (DMR)
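As background for item 3: in the Dirichlet-multinomial regression formulation, the Dirichlet prior over topics for a document is a log-linear function of that document's features. The sketch below illustrates that relation only; the function name, the toy feature vector and the weights are hypothetical, and the code in this repository may organise the computation differently.

```python
import math

def dmr_topic_priors(features, weights):
    """Return one Dirichlet prior value per topic for a single document.

    features: feature values of the document (e.g. temporal/spatial features).
    weights:  one weight vector per topic, each as long as `features`.
    """
    priors = []
    for topic_weights in weights:
        # Log-linear link: prior = exp(dot(features, topic_weights)).
        score = sum(w * x for w, x in zip(topic_weights, features))
        priors.append(math.exp(score))
    return priors

# Hypothetical document: a default feature (always 1.0) plus one indicator.
features = [1.0, 1.0]
# Hypothetical weights for two topics.
weights = [[0.0, 0.0], [-1.0, 2.0]]
priors = dmr_topic_priors(features, weights)
```

A topic with a positive weight on an active feature gets a larger prior for documents carrying that feature, which is how the features influence the clustering.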

1) Data transformation

Transforms data from the Trendminer format to the Mallet format. The data is in a folder containing the following files:

  • dictionary: the dictionary used, in the format 'word_id word', one word/line, ids start at 0
  • dates: the map to the actual calendar days, in the format 'day_id YYYY-MM-DD', one date/line, ids start at 0
  • cities: the map to the city names, in the format 'city_id city', one city/line, ids start at 0
  • GEO: the map with details of each city, in the format 'city_id city latitude longitude country', one city/line, ids start at 0
  • sora_vs: the data, in the format 'word_id day_id city_id[TAB]frequency', one entry/line, giving the frequency of word_id on day_id in city_id
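The file formats listed above can be read with a few lines of code. This is a minimal sketch, assuming the three ids are separated by single spaces and that the frequency is numeric; the function names are hypothetical and not part of this repository.

```python
def parse_frequency_line(line):
    """Parse one sora_vs line: 'word_id day_id city_id[TAB]frequency'."""
    key, frequency = line.rstrip("\n").split("\t")
    word_id, day_id, city_id = (int(part) for part in key.split(" "))
    return word_id, day_id, city_id, float(frequency)

def parse_id_map(lines):
    """Parse 'id value' lines (dictionary, dates, cities) into a dict."""
    mapping = {}
    for line in lines:
        # Split on the first space only, so values may contain spaces.
        item_id, value = line.rstrip("\n").split(" ", 1)
        mapping[int(item_id)] = value
    return mapping

entry = parse_frequency_line("12 3 7\t5")
dates = parse_id_map(["0 2012-06-01", "1 2012-06-02"])
```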

Several setups are available for transforming the data for clustering, based on the methods described in D3.3.1:

  1. Monthly indicator features (Mid):

    java -Xmx6G -jar dist/trendminer-sptempclustering-importer.jar --startMonth 06 --endMonth 06 --startYear 2012 --endYear 2013 --useMonthlyIndicatorFeatures true --mainDir data/

mainDir represents the data directory; startMonth, startYear, endMonth, endYear indicate the time interval of the data to be processed
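To illustrate what a monthly indicator feature looks like, the sketch below builds one slot per month in the interval and marks the slot a given day falls in. It assumes the end month is inclusive, which this README does not state; the helper names are hypothetical.

```python
from datetime import date

def month_slots(start_year, start_month, end_year, end_month):
    """List every (year, month) in the interval, end month included (assumption)."""
    slots = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        slots.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return slots

def monthly_indicator(day, slots):
    """Return a 0/1 vector with a single 1 at the month containing `day`."""
    return [1 if (day.year, day.month) == slot else 0 for slot in slots]

# Same interval as the command above: 06/2012 to 06/2013.
slots = month_slots(2012, 6, 2013, 6)
vector = monthly_indicator(date(2012, 8, 15), slots)
```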

  2. Temporal smoothing with RBF kernels (TimeRBF):

    java -Xmx6G -jar dist/trendminer-sptempclustering-importer.jar --sigma 30 --startMonth 06 --endMonth 06 --startYear 2012 --endYear 2013 --mainDir data/

One RBF kernel is centered at the middle of every month in the interval. The sigma parameter sets the width of the RBF kernels.
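The temporal smoothing can be pictured as follows: each day gets one feature per kernel, with a value that decays with the distance to the kernel center. The sketch assumes a Gaussian kernel, a width measured in days, and day 15 as the middle of the month; none of these details are confirmed by this README.

```python
import math
from datetime import date

def temporal_rbf(day, center, sigma):
    """Gaussian RBF value for `day`, given a kernel centered at `center`."""
    delta_days = (day - center).days
    return math.exp(-(delta_days ** 2) / (2.0 * sigma ** 2))

# Hypothetical kernel centers: the 15th of three consecutive months.
centers = [date(2012, 6, 15), date(2012, 7, 15), date(2012, 8, 15)]
features = [temporal_rbf(date(2012, 7, 15), c, sigma=30) for c in centers]
```

A day lying on a kernel center gets the value 1.0 for that kernel and smaller, non-zero values for neighbouring months, so adjacent months share statistical strength instead of being treated independently.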

  1. To use city indicator features add the parameter:

    --useCityFeatures true

  2. To use country indicator features add the parameter:

    --useCountryFeatures true

  3. To use spatial smoothing features add the parameters:

    --geokernel true --sigma_GEO 2

The sigma_GEO parameter sets the width of the spatial RBF kernels. One RBF kernel is centered on each city in the city list.
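A spatial RBF feature can be sketched the same way as a temporal one, with one kernel per city. The example assumes a Gaussian kernel over plain Euclidean distance in latitude/longitude degrees; the distance measure and the unit of sigma_GEO used by the importer are not stated here, and the city coordinates are made up.

```python
import math

def spatial_rbf(point, center, sigma_geo):
    """Gaussian RBF value for `point` given a kernel centered on a city.

    point, center: (latitude, longitude) pairs in degrees.
    """
    d_lat = point[0] - center[0]
    d_lon = point[1] - center[1]
    squared_distance = d_lat ** 2 + d_lon ** 2
    return math.exp(-squared_distance / (2.0 * sigma_geo ** 2))

# Hypothetical cities, two degrees of latitude apart.
cities = {"A": (48.0, 16.0), "B": (50.0, 16.0)}
features = {name: spatial_rbf(cities["A"], center, 2)
            for name, center in cities.items()}
```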

2) Splitting the data into training and test set

This stage splits the data into two disjoint files, one for training and one for testing. The inputs are the file created in the previous stage and the proportion of the data to use for training.

    java -Xmx6G -jar dist/trendminer-sptempclustering-instancecreator.jar --instancesMalletFile data/mallet_file --trainingportion 0.7
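The effect of the training proportion can be shown with a small sketch. It assumes a random shuffle followed by a cut at the given proportion; how the instance creator actually selects instances is not described here, and the function name and seed are hypothetical.

```python
import random

def split_instances(instances, training_portion, seed=0):
    """Split instances into disjoint training and test lists."""
    shuffled = list(instances)
    # Fixed seed so the illustrative split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# With 10 instances and a proportion of 0.7, 7 go to training and 3 to test.
train, test = split_instances(range(10), 0.7)
```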

3) Spatio-temporal clustering

The spatio-temporal clustering can be run using the files generated at the previous step as input.

    java -Xmx6G -jar dist/trendminer-sptempclustering.jar --trainInstanceList data/mallet_train_file --testInstanceList data/mallet_test_file --outputFolder output/ --nrTopics 100 --topWords 10

Here trainInstanceList and testInstanceList are the paths to the training and test files, outputFolder is the folder where the model output is written, nrTopics is the number of topics, and topWords (optional) is the number of top words used to describe each topic in the output files.

The output folder contains:

  • perplexity.txt: the perplexity on the held-out test data
  • topics.txt: the top topWords words in each topic, one topic/line
  • _parameters_: the coefficients (weights) for each temporal/spatial feature, one file/topic
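For readers unfamiliar with the measure reported in perplexity.txt, the sketch below uses the standard definition: the exponential of the negative average log-likelihood per token. It assumes natural logarithms; the exact estimator used by this tool is not documented here, and the numbers are invented for illustration.

```python
import math

def perplexity(log_likelihood, num_tokens):
    """Standard perplexity: exp(-log_likelihood / num_tokens)."""
    return math.exp(-log_likelihood / num_tokens)

# Sanity check with a hypothetical model that assigns every token
# probability 1/50: its perplexity equals the vocabulary size, 50.
uniform_log_likelihood = 1000 * math.log(1.0 / 50)
value = perplexity(uniform_log_likelihood, 1000)
```

Lower values mean the model assigns higher probability to the held-out data.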

References

Daniel Preotiuc-Pietro, Sina Samangooei, Andrea Varga, Douwe Gelling, Trevor Cohn, Mahesan Niranjan. Tools for mining non-stationary data - v2. Clustering models for discovery of regional and demographic variation - v2. Public Deliverable for Trendminer Project, 2014.
