Adding activity patterns to synthetic population

1. Matching - Household level

1.1 Categorical matching

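This is exact matching using a join: an SPC household is matched to every NTS household that has identical values in all of the matching columns.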

Columns to match on

Which columns should we match on? The aim is to minimise the number of SPC households that are left unmatched. I am currently using the following columns:

Variable	Name (NTS)	Name (SPC)	Transformation (NTS)	Transformation (SPC)
Household income	`HHIncome2002_B02ID`	`salary_yearly`	NA	Group by household ID and sum
Number of adults	`HHoldNumAdults`	`age_years`	NA	Group by household ID and count
Number of children	`HHoldNumChildren`	`age_years`	NA	Group by household ID and count
Employment status	`HHoldEmploy_B01ID`	`pwkstat`	NA	a) match to NTS categories. b) group by household ID
Car ownership	`NumCar`	`num_cars`	SPC is capped at 2. We change all entries > 2 to 2	NA
Type of tenancy	`Ten1_B02ID`	`tenure`	??	??
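
The transformations in the table can be sketched with pandas as below. This is a minimal illustration on toy data, not the notebook code: the ID column names (`household`, `HouseholdID`), the adult age threshold of 16, and the harmonised key names (`num_adults`, `num_children`) are assumptions made for the example.

```python
import pandas as pd

# Toy SPC data, one row per person. "household" is an assumed ID column name.
spc = pd.DataFrame({
    "household": [1, 1, 1, 2, 2, 3],
    "age_years": [40, 38, 10, 70, 68, 25],
    "num_cars": [2, 2, 2, 1, 1, 0],
})

# Toy NTS data, one row per household. "HouseholdID" is an assumed ID column name.
nts = pd.DataFrame({
    "HouseholdID": [101, 102, 103, 104],
    "HHoldNumAdults": [2, 2, 2, 1],
    "HHoldNumChildren": [1, 1, 0, 2],
    "NumCar": [3, 2, 1, 0],
})

ADULT_AGE = 16  # assumed threshold separating children from adults

# SPC: group by household ID and count adults and children from age_years.
spc = spc.assign(
    is_adult=(spc["age_years"] >= ADULT_AGE).astype(int),
    is_child=(spc["age_years"] < ADULT_AGE).astype(int),
)
spc_hh = (
    spc.groupby("household")
    .agg(
        num_adults=("is_adult", "sum"),
        num_children=("is_child", "sum"),
        num_cars=("num_cars", "first"),
    )
    .reset_index()
)

# NTS: rename to the shared key names and cap car ownership at 2, as in the SPC.
nts_hh = nts.rename(columns={
    "HHoldNumAdults": "num_adults",
    "HHoldNumChildren": "num_children",
    "NumCar": "num_cars",
})
nts_hh["num_cars"] = nts_hh["num_cars"].clip(upper=2)

# Exact (categorical) matching is a left join on the shared keys.
keys = ["num_adults", "num_children", "num_cars"]
matched = spc_hh.merge(nts_hh, on=keys, how="left", indicator=True)

# Rows flagged "left_only" are SPC households with no NTS match.
unmatched = int((matched["_merge"] == "left_only").sum())
print(f"Unmatched SPC households: {unmatched} of {len(spc_hh)}")
```

In this toy data, SPC household 1 matches two NTS households and household 3 matches none, which are the two situations the rest of this page has to deal with.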

Other columns to match in the future

Variable	Name (NTS)	Name (SPC)	Transformation (NTS)	Transformation (SPC)
Urban-Rural classification of residence	`Settlement2011EW_B04ID`	NA	NA	Spatial join between layer and SPC

I have tried a number of different column combinations in the statistical matching notebook.
Household income does not appear to be usable, because `salary_yearly` has many NA values in the SPC.

Options for reducing the number of unmatched households

I can experiment with using all combinations of columns and seeing which provides best results (brute force approach)
Match based on a smaller number of variables
Propensity Score Matching: each household in the SPC will be assigned to at least one household in the NTS.
- PSM can keep the n closest matches (for each household). It can also limit matches to those within a certain distance
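
The brute force option could look roughly like the sketch below: try every subset of candidate columns and record the share of SPC households that find no exact match. The data and column names are toy values chosen for illustration.

```python
from itertools import combinations

# Toy household-level records; all values are illustrative.
spc_hh = [
    {"household": 1, "num_adults": 2, "num_children": 1, "num_cars": 2},
    {"household": 2, "num_adults": 2, "num_children": 0, "num_cars": 1},
    {"household": 3, "num_adults": 1, "num_children": 0, "num_cars": 0},
    {"household": 4, "num_adults": 1, "num_children": 2, "num_cars": 1},
]
nts_hh = [
    {"HouseholdID": 101, "num_adults": 2, "num_children": 1, "num_cars": 2},
    {"HouseholdID": 102, "num_adults": 2, "num_children": 0, "num_cars": 2},
    {"HouseholdID": 103, "num_adults": 1, "num_children": 0, "num_cars": 0},
]
candidate_cols = ["num_adults", "num_children", "num_cars"]


def unmatched_share(spc_rows, nts_rows, cols):
    """Share of SPC households whose key tuple does not occur in the NTS."""
    nts_keys = {tuple(row[c] for c in cols) for row in nts_rows}
    unmatched = sum(
        tuple(row[c] for c in cols) not in nts_keys for row in spc_rows
    )
    return unmatched / len(spc_rows)


# Evaluate every non-empty subset of the candidate columns.
results = {
    cols: unmatched_share(spc_hh, nts_hh, cols)
    for r in range(1, len(candidate_cols) + 1)
    for cols in combinations(candidate_cols, r)
}

# Lowest unmatched share first; among ties, prefer more columns.
for cols, share in sorted(results.items(), key=lambda kv: (kv[1], -len(kv[0]))):
    print(f"{share:.2f}  {', '.join(cols)}")
```

Fewer columns always match at least as many households, so the unmatched share alone cannot pick the best set; it has to be weighed against how informative the remaining columns are.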

Validation

In categorical matching, a household in the SPC can be matched to multiple households in the NTS. By extension, an individual in the SPC can be matched to multiple activity chains from the NTS. How do we decide which activity chain is the most accurate for an individual?

Sequence Alignment

One approach is to check how closely the breakdown of the matched population follows that of the travel survey (i.e. is the proportion of people making x trips the same? Is the proportion of people with similar activity chains the same?)

Sequence alignment methods (SAMs) can extract patterns of behaviour from large spatiotemporal datasets. These patterns can then be used to group data by similarity.
Two types of analysis are common using SAMs (Shoval and Isaacson 2007):
- Construct groups based on their overall activity patterns. Clustering algorithms produce trees (hierarchical clustering)
- Detect patterns of behaviour in the sequences
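
Grouping by similarity needs a distance between activity chains. The sketch below uses plain edit distance (Levenshtein) with unit costs as one simple alignment measure. This is an assumption for illustration: the SAM literature uses a range of cost schemes, and the chains here are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two activity sequences (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1       # drop x from a
            insertion = curr[j - 1] + 1  # add y to a
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Invented activity chains, one per survey respondent.
chains = {
    "p1": ["home", "work", "home"],
    "p2": ["home", "work", "shop", "home"],
    "p3": ["home", "school", "home"],
    "p4": ["home", "shop", "home"],
}

# Pairwise distance table; a clustering routine would take this as input.
ids = list(chains)
dist = {(i, j): edit_distance(chains[i], chains[j]) for i in ids for j in ids}

for i in ids:
    print(i, [dist[(i, j)] for j in ids])
```

A distance table like this can be passed to a hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`, which expects the condensed form) to produce the trees mentioned above.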

1. Run Sequence Alignment on activity chains in NTS → get clusters of activity participation / time use
   a. Each activity chain is assigned to a cluster
   b. Get relative size of clusters: How big is each cluster (as % of total)? This assumes that we have a representative population
2. Individuals in SPC are assigned to a cluster based on the activity chain matched to them
   a. With categorical matching, an SPC individual can be matched to multiple activity chains, so they are in a different cluster depending on the activity chain we choose for them.
3. For each SPC individual, we randomly assign an activity chain from the pool of activity chains matched to it
   a. Get relative size of clusters. Do they match with the results in the NTS (1b)?
   b. Should we do a brute force analysis where we run (all) different combinations and then find the one where the cluster composition most closely matches with the NTS? Is there a clever way to do this?
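
A rough sketch of this validation loop, assuming the NTS chains have already been assigned to clusters. The cluster labels, the matched pools, and the use of total variation distance to compare cluster shares are all illustrative assumptions.

```python
import random
from collections import Counter
from itertools import product

# Hypothetical cluster label for each NTS activity chain.
chain_cluster = {
    "c1": "worker", "c2": "worker", "c3": "student",
    "c4": "home", "c5": "home", "c6": "home",
}

# Hypothetical pool of NTS chains matched to each SPC individual.
matched_pool = {
    "spc_1": ["c1", "c2"],
    "spc_2": ["c3"],
    "spc_3": ["c4", "c1"],
    "spc_4": ["c5"],
}


def shares(labels):
    """Relative size of each cluster as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two share dictionaries (0 = identical)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def score(assignment, target):
    """Distance between the cluster shares implied by an assignment and the target."""
    return total_variation(shares(chain_cluster[c] for c in assignment.values()), target)


nts_shares = shares(chain_cluster.values())

# One random assignment: pick a chain from each individual's pool.
rng = random.Random(0)
random_assignment = {p: rng.choice(pool) for p, pool in matched_pool.items()}
print("random draw:", round(score(random_assignment, nts_shares), 3))

# Brute force: enumerate every combination and keep the closest one.
people = list(matched_pool)
all_assignments = [
    dict(zip(people, combo)) for combo in product(*matched_pool.values())
]
best_assignment = min(all_assignments, key=lambda a: score(a, nts_shares))
best_distance = score(best_assignment, nts_shares)
print("best of", len(all_assignments), "combinations:", round(best_distance, 3))
```

Full enumeration grows as the product of the pool sizes, so it is only feasible on tiny examples like this one. On real data, repeated random draws or a targeted sampling scheme would be needed.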

1.2 Propensity Score Matching (TODO)

PSM can match an individual in dataset A to the closest matching individual in dataset B. The match does not have to be exact.

Pros

PSM can be carried out to match households to ensure that each household in the SPC is matched to a household in the NTS
PSM can return the closest n matches, not just 1
It could allow us to use more variables for matching (e.g. household income), but it's unclear how to handle NA values
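
The matching step of PSM can be sketched as a nearest-neighbour lookup on a score. The sketch below assumes the propensity scores have already been estimated (for example with a logistic regression) and uses invented values. The function name `nearest_matches` and the `caliper` parameter are hypothetical, not taken from any library.

```python
def nearest_matches(spc_scores, nts_scores, k=2, caliper=None):
    """For each SPC household, return up to k NTS households with the closest score.

    If caliper is given, candidates further away than caliper are dropped,
    so a household can end up with fewer than k matches (or none).
    """
    matches = {}
    for spc_id, score in spc_scores.items():
        ranked = sorted(nts_scores.items(), key=lambda kv: abs(kv[1] - score))
        if caliper is not None:
            ranked = [kv for kv in ranked if abs(kv[1] - score) <= caliper]
        matches[spc_id] = [nts_id for nts_id, _ in ranked[:k]]
    return matches


# Invented propensity scores, assumed to be estimated in an earlier step.
spc_scores = {"spc_a": 0.20, "spc_b": 0.55, "spc_c": 0.90}
nts_scores = {"nts_1": 0.18, "nts_2": 0.25, "nts_3": 0.50, "nts_4": 0.65}

no_limit = nearest_matches(spc_scores, nts_scores, k=2)
with_caliper = nearest_matches(spc_scores, nts_scores, k=2, caliper=0.15)

print(no_limit)
print(with_caliper)
```

Without a caliper every SPC household receives matches, however distant. With one, `spc_c` is left unmatched in this example, which is the trade-off between coverage and match quality.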

Cons

A household with 5 people in the SPC could be matched to a household with 3, 4, 7, etc. people in the NTS. If that is the case, how do we then do matching at the individual level?
One option: nearest-neighbour matching with replacement.

2. Matching - Individual level

After matching each household in the synthetic population (SPC) to a household in the travel survey (NTS), the next step is to match at the individual level. If our household-level matching matches on the number of individuals, then for each matched household we have n people in the SPC and n people in the NTS. I am currently matching on `age_group` and `sex`.

We can use different approaches for matching

2.1 Nearest neighbour (propensity score matching)

2.2 Nearest neighbour without replacement

R approach: the MatchIt library.
Is there a Python equivalent? I am using the functions in acbm/matching.py for now.
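
For reference, a greedy nearest-neighbour match without replacement inside one matched household pair could look like the sketch below. It is not the code in acbm/matching.py. The distance function (age difference plus a fixed penalty for a sex mismatch) and the toy household members are assumptions.

```python
def greedy_match(spc_people, nts_people, distance):
    """Match each SPC person to the closest unused NTS person.

    Each NTS person is used at most once. The result depends on the
    order in which SPC people are processed.
    """
    available = dict(nts_people)  # copy, so the input is left untouched
    matches = {}
    for spc_id, spc_attrs in spc_people.items():
        if not available:
            break  # more SPC people than NTS people: the rest stay unmatched
        best = min(available, key=lambda nid: distance(spc_attrs, available[nid]))
        matches[spc_id] = best
        del available[best]
    return matches


def distance(a, b):
    """Assumed distance: age gap in years plus 10 if sex differs."""
    return abs(a["age"] - b["age"]) + (10 if a["sex"] != b["sex"] else 0)


spc_household = {
    "spc_1": {"age": 41, "sex": "F"},
    "spc_2": {"age": 43, "sex": "M"},
    "spc_3": {"age": 9, "sex": "F"},
}
nts_household = {
    "nts_1": {"age": 45, "sex": "M"},
    "nts_2": {"age": 39, "sex": "F"},
    "nts_3": {"age": 11, "sex": "M"},
}

matches = greedy_match(spc_household, nts_household, distance)
print(matches)
```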

2.3 Bipartite matching (TODO)

Used to solve the assignment problem.
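
The assignment problem asks for the one-to-one pairing with the lowest total cost. Households are small, so the sketch below simply tries every pairing. The cost (absolute age difference) and the ages are illustrative assumptions.

```python
from itertools import permutations


def optimal_assignment(spc_ages, nts_ages):
    """Exhaustively find the one-to-one pairing with the lowest total age gap.

    Requires at least as many NTS people as SPC people; otherwise no
    pairing exists and (None, None) is returned.
    """
    spc_ids = list(spc_ages)
    best_cost, best_pairs = None, None
    for perm in permutations(nts_ages, len(spc_ids)):
        cost = sum(abs(spc_ages[s] - nts_ages[n]) for s, n in zip(spc_ids, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, dict(zip(spc_ids, perm))
    return best_cost, best_pairs


spc_ages = {"spc_1": 40, "spc_2": 30}
nts_ages = {"nts_1": 36, "nts_2": 45}

cost, pairs = optimal_assignment(spc_ages, nts_ages)
print(cost, pairs)
```

Here a greedy pass would give `spc_1` its nearest neighbour `nts_1` and leave `spc_2` with `nts_2`, for a total gap of 19 years. The optimal pairing swaps them for a total of 11. For larger problems, `scipy.optimize.linear_sum_assignment` solves the same problem on a cost matrix without enumerating permutations.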

2.4 Improvements

Propensity score matching can match a child to an adult or vice versa. It can also match a 40-year-old to a 70-year-old.
Treat adults and children differently. It does not matter whether a child is matched to a male or a female respondent, as their diaries will probably be similar. This may not be the case for adults.
Pensioners: these should be a separate category from adults, as most do not commute.
Use matching with replacement if the household from the population df has fewer rows than the household from the survey df.
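
The first three points amount to restricting the candidate pool before any nearest-neighbour step. The sketch below shows one way to do that. The age thresholds (16 and 66) and the helper names are assumptions for illustration only.

```python
def life_stage(age, adult_age=16, pension_age=66):
    """Assign a coarse life stage; both thresholds are assumed values."""
    if age < adult_age:
        return "child"
    if age >= pension_age:
        return "pensioner"
    return "adult"


def candidates(spc_person, nts_people):
    """NTS people an SPC person may be matched to.

    Candidates must share the life stage. Sex must also match,
    except for children, where it is ignored.
    """
    stage = life_stage(spc_person["age"])
    same_stage = [p for p in nts_people if life_stage(p["age"]) == stage]
    if stage == "child":
        return same_stage
    return [p for p in same_stage if p["sex"] == spc_person["sex"]]


nts_people = [
    {"id": "nts_1", "age": 70, "sex": "F"},
    {"id": "nts_2", "age": 38, "sex": "F"},
    {"id": "nts_3", "age": 12, "sex": "F"},
    {"id": "nts_4", "age": 45, "sex": "M"},
]

adult_pool = candidates({"age": 40, "sex": "F"}, nts_people)
child_pool = candidates({"age": 9, "sex": "M"}, nts_people)
print([p["id"] for p in adult_pool], [p["id"] for p in child_pool])
```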
