Adding activity patterns to synthetic population

1. Matching - Household level

1.1 Categorical matching

This is exact matching using a join

Columns to match on

Which columns should we match on? We want the matching to minimise the number of SPC households that are left unmatched. I am currently using the following columns:

Variable	Name (NTS)	Name (SPC)	Transformation (NTS)	Transformation (SPC)
Household income	`HHIncome2002_BO2ID`	`salary_yearly`	NA	Group by household ID and sum
Number of adults	`HHoldNumAdults`	`age_years`	NA	Group by household ID and count
Number of children	`HHoldNumChildren`	`age_years`	NA	Group by household ID and count
Employment status	`HHoldEmploy_B01ID`	`pwkstat`	NA	a) match to NTS categories. b) group by household ID
Car ownership	`NumCar`	`num_cars`	SPC is capped at 2. We change all entries > 2 to 2	NA
Type of tenancy	`Ten1_B02ID`	`tenure`	??	??
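The SPC-side transformations in the table (group by household ID, then sum or count) can be sketched as below. This is only an illustration using plain dictionaries: the `household` key, the adult age cut-off of 16, and the `n_salary_missing` counter are assumptions made for the example, not something taken from the SPC or NTS documentation.

```python
from collections import defaultdict

ADULT_AGE = 16  # assumption: age from which a person counts as an adult
CAR_CAP = 2     # SPC caps num_cars at 2, so NTS NumCar is capped to match


def aggregate_spc(individuals):
    """Collapse SPC individuals (one dict each) into one record per household."""
    households = defaultdict(lambda: {
        "salary_yearly": 0.0, "n_salary_missing": 0,
        "num_adults": 0, "num_children": 0,
    })
    for person in individuals:
        hh = households[person["household"]]
        if person["salary_yearly"] is None:
            hh["n_salary_missing"] += 1  # keep NA visible instead of summing it as 0
        else:
            hh["salary_yearly"] += person["salary_yearly"]
        if person["age_years"] >= ADULT_AGE:
            hh["num_adults"] += 1
        else:
            hh["num_children"] += 1
    return dict(households)


def cap_cars(num_car):
    """Cap an NTS NumCar value so it is comparable with SPC num_cars."""
    return min(num_car, CAR_CAP)


people = [
    {"household": 1, "age_years": 41, "salary_yearly": 30000.0},
    {"household": 1, "age_years": 39, "salary_yearly": None},
    {"household": 1, "age_years": 7, "salary_yearly": None},
    {"household": 2, "age_years": 70, "salary_yearly": None},
]
households = aggregate_spc(people)
print(households[1])
print(cap_cars(3))
```

Counting missing salaries separately, rather than treating them as zero, makes it easy to see how many households would have an unreliable income total.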

Other columns to match in the future

Variable	Name (NTS)	Name (SPC)	Transformation (NTS)	Transformation (SPC)
Urban-Rural classification of residence	`Settlement2011EW_B04ID`	NA	NA	Spatial join between layer and SPC

I have tried a number of different combinations of columns in the statistical matching notebook.
It seems that household income cannot be used, as there are many NA values for `salary_yearly` in the SPC.
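As a rough illustration of what exact matching with a join does, the sketch below builds a lookup of NTS households by their matching key and then looks up each SPC household. The `id` key, the column subset and the toy rows are hypothetical. With pandas, the same thing would be a left `merge` on the matching columns.

```python
from collections import defaultdict

MATCH_COLS = ["num_adults", "num_children", "num_cars"]  # illustrative subset


def key(row, cols):
    """Tuple of the values a household is matched on."""
    return tuple(row[c] for c in cols)


def categorical_match(spc, nts, cols):
    """Exact join: map each SPC household ID to the NTS IDs sharing its key."""
    nts_by_key = defaultdict(list)
    for row in nts:
        nts_by_key[key(row, cols)].append(row["id"])
    return {row["id"]: nts_by_key.get(key(row, cols), []) for row in spc}


spc = [
    {"id": "s1", "num_adults": 2, "num_children": 1, "num_cars": 1},
    {"id": "s2", "num_adults": 1, "num_children": 0, "num_cars": 0},
    {"id": "s3", "num_adults": 4, "num_children": 3, "num_cars": 2},
]
nts = [
    {"id": "n1", "num_adults": 2, "num_children": 1, "num_cars": 1},
    {"id": "n2", "num_adults": 2, "num_children": 1, "num_cars": 1},
    {"id": "n3", "num_adults": 1, "num_children": 0, "num_cars": 0},
]
matches = categorical_match(spc, nts, MATCH_COLS)
unmatched = [hid for hid, pool in matches.items() if not pool]
print(matches)    # s1 has two candidate NTS households, s3 has none
print(unmatched)
```

The result is one-to-many: an SPC household keeps the whole pool of NTS households that share its key, and an empty pool marks an unmatched household.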

Options for reducing the number of unmatched households

Experiment with all combinations of columns and see which gives the best results (brute force approach)
Match on a smaller number of variables
Propensity Score Matching (PSM): each household in the SPC will be assigned to at least one household in the NTS.
- PSM can keep the n closest matches for each household. It can also limit matches to those within a certain distance
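The brute force option could look something like the sketch below, which scores every combination of candidate columns by the share of SPC households left unmatched. The function names and toy data are hypothetical, and the tie-break (prefer more columns when the unmatched share is equal) is one possible choice, not a settled one. The number of combinations grows as 2^n - 1, so this is only practical for a handful of columns.

```python
from itertools import combinations


def unmatched_share(spc, nts, cols):
    """Share of SPC households with no NTS household sharing their values on cols."""
    nts_keys = {tuple(row[c] for c in cols) for row in nts}
    misses = sum(tuple(row[c] for c in cols) not in nts_keys for row in spc)
    return misses / len(spc)


def rank_column_sets(spc, nts, candidates, min_cols=1):
    """Score every combination of candidate columns; lowest unmatched share first."""
    results = []
    for k in range(min_cols, len(candidates) + 1):
        for cols in combinations(candidates, k):
            results.append((unmatched_share(spc, nts, cols), cols))
    # Prefer fewer unmatched households, then more columns (a stricter match).
    return sorted(results, key=lambda r: (r[0], -len(r[1])))


spc = [
    {"num_adults": 2, "num_children": 1, "num_cars": 1},
    {"num_adults": 1, "num_children": 0, "num_cars": 0},
    {"num_adults": 2, "num_children": 0, "num_cars": 2},
]
nts = [
    {"num_adults": 2, "num_children": 1, "num_cars": 1},
    {"num_adults": 1, "num_children": 0, "num_cars": 1},
    {"num_adults": 2, "num_children": 0, "num_cars": 2},
]
ranked = rank_column_sets(spc, nts, ["num_adults", "num_children", "num_cars"])
for share, cols in ranked:
    print(f"{share:.2f}  {cols}")
```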

Validation

In categorical matching, a household in the SPC can be matched to multiple households in the NTS. By extension, an individual in the SPC can be matched to multiple activity chains from the NTS. How do we decide which activity chain is the most accurate for an individual?

Sequence Alignment

One approach is to check how close the breakdown in the matched population is to that in the travel survey (i.e. is the proportion of people doing x trips the same? Is the proportion of people doing similar activity chains the same?)

Sequence alignment methods (SAMs) are able to extract patterns of behaviour from large spatiotemporal datasets. These patterns can then be used to group data by similarity.
Two types of analysis are common using SAMs (Shoval and Isaacson 2007):
- Construct groups based on their overall activity patterns. Clustering algorithms produce trees (hierarchical clustering)
- Detect patterns of behaviour in the sequences
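To make the idea of similarity between activity chains concrete, here is a minimal sketch using the Levenshtein (edit) distance, one simple alignment-based measure. It is not necessarily the measure used by Shoval and Isaacson, and the activity labels are made up. The resulting pairwise distances are the kind of input a hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`) would take.

```python
def edit_distance(a, b):
    """Levenshtein distance between two activity sequences (lists of labels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


chains = {
    "p1": ["home", "work", "home"],
    "p2": ["home", "work", "shop", "home"],
    "p3": ["home", "school", "home"],
}

# Pairwise distances between all activity chains.
ids = sorted(chains)
distances = {
    (i, j): edit_distance(chains[i], chains[j])
    for i in ids for j in ids if i < j
}
print(distances)
```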

1. Run Sequence Alignment on activity chains in NTS → get clusters of activity participation / time use
   a. Each activity chain is assigned to a cluster
   b. Get relative size of clusters: how big is each cluster (as % of total)? This assumes that we have a representative population
2. Individuals in SPC are assigned to a cluster based on the activity chain matched to them
   a. With categorical matching, an SPC individual can be matched to multiple activity chains, so they are in a different cluster depending on the activity chain we choose for them.
3. For each SPC individual, we randomly assign an activity chain from the pool of activity chains matched to it
   a. Get relative size of clusters. Do they match the results in the NTS (1b)?
   b. Should we do a brute force analysis where we run (all) different combinations and then find the one where the cluster composition most closely matches the NTS? Is there a clever way to do this?
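The steps above can be sketched as follows, assuming the clustering has already been done and each NTS activity chain carries a cluster label. Everything here is illustrative: the cluster names, the pools, and the choice of total variation distance as the comparison metric are assumptions. The loop is a random search over assignments rather than an exhaustive brute force.

```python
import random
from collections import Counter


def cluster_shares(labels):
    """Relative size of each cluster as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share dicts (0 = identical)."""
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


def assign_and_score(pools, chain_cluster, nts_shares, rng):
    """Pick one chain per SPC individual at random and compare shares to the NTS."""
    chosen = {pid: rng.choice(pool) for pid, pool in pools.items()}
    spc_shares = cluster_shares(chain_cluster[c] for c in chosen.values())
    return chosen, total_variation(spc_shares, nts_shares)


# Cluster label of each NTS activity chain (hypothetical output of step 1).
chain_cluster = {"c1": "commute", "c2": "commute", "c3": "education", "c4": "home"}
nts_shares = cluster_shares(chain_cluster.values())

# Pool of matched activity chains for each SPC individual (step 2).
pools = {"a": ["c1", "c2"], "b": ["c3"], "c": ["c2", "c4"], "d": ["c4"]}

# Step 3: repeat the random assignment and keep the draw closest to the NTS.
rng = random.Random(0)
draws = [assign_and_score(pools, chain_cluster, nts_shares, rng) for _ in range(20)]
best_assignment, best_score = min(draws, key=lambda d: d[1])
print(best_assignment, best_score)
```

A random search like this scales to large populations where enumerating every combination of assignments is not feasible, at the cost of not guaranteeing the best possible composition.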

Other notes:

Cluster on activity chains AND household properties
We need evaluation metrics
Output should match other datasets (e.g. the number of GP visits should be similar to reported numbers, similar to the constraints on commuting)

1.2 Propensity Score Matching (TODO)

PSM can match an individual in dataset A to the closest matching individual in dataset B. The match does not have to be exact.

Pros

PSM can be carried out to match households to ensure that each household in the SPC is matched to a household in the NTS
PSM can return the closest n matches, not just 1
It could allow us to use more variables for matching (e.g. household income), but it's unclear how to handle NA values

Cons

A household with 5 people in the SPC could be matched to a household with 3, 4, 7, etc. people in the NTS. If that is the case, how do we then do matching at the individual level?
One option: nearest neighbour matching with replacement
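The sketch below illustrates both the pros and the con listed above. Note that it is plain nearest neighbour matching on standardised covariates, swapped in for simplicity. A real PSM would first estimate a propensity score and match on that. The column names, `n` and `max_distance` parameters are hypothetical.

```python
import math
from statistics import pstdev


def column_stds(rows, cols):
    """Standard deviation per column, pooled over all rows (1.0 if constant)."""
    return {c: pstdev([r[c] for r in rows]) or 1.0 for c in cols}


def distance(a, b, stds):
    """Euclidean distance between two households on standardised covariates."""
    return math.sqrt(sum(((a[c] - b[c]) / stds[c]) ** 2 for c in stds))


def nearest_matches(spc, nts, cols, n=1, max_distance=None):
    """For each SPC household, the n closest NTS households (with replacement)."""
    stds = column_stds(spc + nts, cols)
    out = {}
    for s in spc:
        ranked = sorted((distance(s, t, stds), t["id"]) for t in nts)
        if max_distance is not None:
            ranked = [m for m in ranked if m[0] <= max_distance]
        out[s["id"]] = ranked[:n]
    return out


spc = [{"id": "s1", "size": 5, "cars": 1}, {"id": "s2", "size": 2, "cars": 0}]
nts = [
    {"id": "n1", "size": 4, "cars": 1},
    {"id": "n2", "size": 7, "cars": 2},
    {"id": "n3", "size": 2, "cars": 0},
]

loose = nearest_matches(spc, nts, ["size", "cars"], n=2)
strict = nearest_matches(spc, nts, ["size", "cars"], n=2, max_distance=0.5)
print(loose)   # every SPC household gets matches, even if sizes differ
print(strict)  # distant matches are dropped, so s1 ends up unmatched
```

In the loose run the 5-person household `s1` is matched to the 4-person household `n1`, which is exactly the situation that makes individual-level matching awkward afterwards.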

1.3 Other

Multilevel regression with poststratification (MRP). Explanation of method here, here, here

2. Matching - Individual level

After matching each household in the synthetic population (SPC) to a household in the travel survey (NTS), the next step is to match at the individual level. If our household level matching matches on number of individuals, then for each matched household we have n people in the SPC and n people in the NTS. I am currently matching on `age_group` and `sex`.

We can use different approaches for matching

2.1 Nearest neighbour (propensity score matching)

2.2 Nearest neighbour without replacement

R approach: the MatchIt library
Is there a Python library? I am using the functions in acbm/matching.py for now
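For illustration, greedy nearest neighbour matching without replacement inside one matched household pair might look like the sketch below. This is not the code in acbm/matching.py. The distance function, the sex mismatch penalty of 10 and the toy individuals are all assumptions.

```python
def age_sex_distance(a, b):
    """Age gap in years plus a fixed penalty for a sex mismatch (assumed value)."""
    return abs(a["age"] - b["age"]) + (0 if a["sex"] == b["sex"] else 10)


def greedy_match(spc_people, nts_people, distance):
    """Greedy 1:1 matching without replacement: each NTS person is used once."""
    available = list(nts_people)
    pairs = {}
    for person in spc_people:
        if not available:
            break  # more SPC than NTS individuals: the rest stay unmatched
        best = min(available, key=lambda cand: distance(person, cand))
        pairs[person["id"]] = best["id"]
        available.remove(best)
    return pairs


spc_household = [
    {"id": "p1", "age": 40, "sex": "F"},
    {"id": "p2", "age": 42, "sex": "M"},
    {"id": "p3", "age": 8, "sex": "M"},
]
nts_household = [
    {"id": "q1", "age": 41, "sex": "M"},
    {"id": "q2", "age": 39, "sex": "F"},
    {"id": "q3", "age": 10, "sex": "F"},
]
pairs = greedy_match(spc_household, nts_household, age_sex_distance)
print(pairs)
```

Greedy matching depends on the order in which SPC individuals are processed, so it does not always give the lowest total distance.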

2.3 Bipartite matching (TODO)

Used to solve the assignment problem
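As a sketch of what solving the assignment problem means here: given a cost for pairing each SPC individual with each NTS individual, find the one-to-one pairing with the lowest total cost. The example below does this by brute force over permutations, which is only workable for household-sized problems. `scipy.optimize.linear_sum_assignment` solves the same problem efficiently for larger inputs. The ages and the use of age gap as cost are made up.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Minimum-cost one-to-one assignment for a square cost matrix (brute force)."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


spc_ages = [40, 30]
nts_ages = [35, 50]
# cost[i][j] = age gap between SPC person i and NTS person j
cost = [[abs(s - t) for t in nts_ages] for s in spc_ages]

assignment, total = optimal_assignment(cost)
print(assignment, total)
```

Here a greedy pass in row order would pair the 40 year old with the 35 year old first and leave the 30 year old with the 50 year old (total gap 25), while the optimal assignment pairs 40 with 50 and 30 with 35 (total gap 15).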

2.4 Improvements

Propensity score matching can match a child to an adult or vice versa. It can also match a 40 year old to a 70 year old
Treat adults and children differently. I don't care if a child is matched to M or F, as their diary will probably be the same. This may not be the case for adults
Pensioners: these should be a separate category to adults as most do not commute
Match with replacement if the household from the population df has fewer rows than the household from the survey df
Current matching does not ensure that commuting numbers match census data. Some employed people may not be matched to an NTS individual who is employed, and vice versa. How do we fix this?
PSM: it would be good to match only if the row in dataset 2 is within a threshold distance of the row in dataset 1. Distance should not be an issue for most cases, but for corner cases that have no close match, it may be better to discard them than to match them to a distant household. Traditional PSM has a parameter called a caliper (link1, link2) that restricts matching to those within a certain distance. The distance is based on the overall propensity score. Can I use this? How do I define a threshold distance? If we are restricting to a fraction of the standard deviation, what is the standard deviation in this case?
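One possible reading of the caliper question is sketched below: take the standard deviation of the score pooled over both datasets, set the caliper to a fraction of it, and discard any match whose score gap exceeds the caliper. The scores are invented, and the fraction of 0.2 is a rule of thumb often quoted in the PSM literature (usually for the logit of the propensity score), used here purely as an assumed default.

```python
from statistics import pstdev


def caliper_width(scores_a, scores_b, fraction=0.2):
    """Caliper as a fraction of the pooled standard deviation of the scores."""
    return fraction * pstdev(list(scores_a) + list(scores_b))


def match_within_caliper(scores_a, scores_b, caliper):
    """Nearest neighbour on a 1-D score, with replacement; None if too far."""
    matches = {}
    for a_id, a in scores_a.items():
        b_id, b = min(scores_b.items(), key=lambda kv: abs(kv[1] - a))
        matches[a_id] = b_id if abs(b - a) <= caliper else None
    return matches


# Hypothetical propensity scores for SPC and NTS households.
spc_scores = {"s1": 0.30, "s2": 0.52, "s3": 0.95}
nts_scores = {"n1": 0.32, "n2": 0.50, "n3": 0.60}

caliper = caliper_width(spc_scores.values(), nts_scores.values())
matches = match_within_caliper(spc_scores, nts_scores, caliper)
print(round(caliper, 3), matches)
```

The corner case `s3` has no NTS household within the caliper, so it is left unmatched rather than being paired with a distant household.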

3. Notes

The approach in this paper uses psm to:

Step 1: match each household from the population to households in the sample (multiple households from the sample can be matched to a population household).
Step 2: match individuals from the population household to the closest individual from all matched households. They do not match all individuals from the population household to individuals from one sample household. This means that they do not maintain household integrity. Different individuals in the same population household could be matched to individuals from different sample households

Our approach maintains household integrity

4. References

Sequence Alignment

Propensity Score Matching

This YouTube video gives a nice overview of PSM. Check from 13:42 for:

greedy vs optimal algorithms
caliper vs nearest neighbor
One to one vs one to many

Packages

R

The MatchIt R package is very comprehensive. It has different matching algorithms, and also allows you to specify different calipers for each covariate. This is very handy because we might want to be stricter on some covariates than others (e.g. for households, we may want the household size to match exactly, but be more forgiving on household income).

Python

I didn't find a python library that has the same functionality as MatchIt.

In psmpy, you can only provide one caliper based on the overall distance
Dame-Flame - Useful blogpost

Other resources

Different psm algorithms: https://pubmed.ncbi.nlm.nih.gov/24123228/
https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html
https://dimewiki.worldbank.org/Propensity_Score_Matching#:~:text=Propensity%20score%20matching%20(PSM)%20is,the%20impact%20of%20an%20intervention.
https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching
From Sam: "Regarding matching and Propensity Score Matching that we discussed today, a resource I found helpful when looking at this previously was Chapter 20 of this book (code with examples). It's from a causal inference perspective (i.e. matching control and treatment groups to measure some treatment effect of a study) but the same approach can be applied to matching between datasets (NTS and SPC). In the chapter there is some discussion on imbalance and overlap, as well as diagnostics/plots (e.g. Figure 20.9) that can be used to evaluate."
