Adding activity patterns to synthetic population
This is exact matching using a join
What columns should be matched on? We want to do matching in a way that minimises the number of households in the SPC that are not matched. I am using the following columns
Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
---|---|---|---|---|
Household income | HHIncome2002_BO2ID | `salary_yearly` | NA | Group by household ID and sum |
Number of adults | HHoldNumAdults | `age_years` | NA | Group by household ID and count |
Number of children | HHoldNumChildren | `age_years` | NA | Group by household ID and count |
Employment status | HHoldEmploy_B01ID | `pwkstat` | NA | a) match to NTS categories. b) group by household ID |
Car ownership | NumCar | `num_cars` | SPC is capped at 2. We change all entries > 2 to 2 | NA |
Type of tenancy | Ten1_B02ID | `tenure` | ?? | ?? |
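A minimal sketch of the exact-matching join described above, using pandas on toy data. The ID columns (`household`, `HouseholdID`), the adult age threshold of 16 and all values are assumptions for illustration only; the transformations follow the table (count adults and children per SPC household, cap NTS car ownership at 2).

```python
import pandas as pd

# Toy SPC individuals (one row per person). Values are made up for illustration.
spc = pd.DataFrame({
    "household": [1, 1, 1, 2, 2, 3],
    "age_years": [40, 38, 9, 70, 68, 30],
    "num_cars": [1, 1, 1, 2, 2, 0],
})

# Toy NTS households (one row per household).
nts = pd.DataFrame({
    "HouseholdID": [101, 102, 103],
    "HHoldNumAdults": [2, 2, 1],
    "HHoldNumChildren": [1, 0, 2],
    "NumCar": [1, 3, 0],
})

ADULT_AGE = 16  # assumption: adults are aged 16 and over

# SPC: group by household ID and count adults and children.
spc_hh = (
    spc.assign(is_adult=spc["age_years"] >= ADULT_AGE)
    .groupby("household")
    .agg(
        num_adults=("is_adult", "sum"),
        num_children=("is_adult", lambda s: int((~s).sum())),
        num_cars=("num_cars", "first"),
    )
    .reset_index()
)

# NTS: SPC car ownership is capped at 2, so cap the NTS values to match.
nts["NumCar"] = nts["NumCar"].clip(upper=2)

# Left join keeps every SPC household; `_merge` flags the ones with no NTS match.
matched = spc_hh.merge(
    nts,
    left_on=["num_adults", "num_children", "num_cars"],
    right_on=["HHoldNumAdults", "HHoldNumChildren", "NumCar"],
    how="left",
    indicator=True,
)
unmatched = matched.loc[matched["_merge"] == "left_only", "household"].tolist()
```

Because the join is one-to-many, an SPC household that matches several NTS households appears once per match in `matched`.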
Other columns to match in the future
Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
---|---|---|---|---|
Urban-Rural classification of residence | Settlement2011EW_B04ID | NA | NA | Spatial join between layer and SPC |
- I have tried a number of different combinations of columns in the statistical matching notebook.
- It seems like household income cannot be used, as there are many NA values in the SPC for `salary_yearly`
Options for reducing the number of unmatched households
- I can experiment with using all combinations of columns and seeing which provides best results (brute force approach)
- Match based on a smaller number of variables
- Propensity Score Matching: each household in the SPC will be assigned to at least one household in the NTS.
- PSM can keep the n closest matches (for each household). It can also limit matches to those within a certain distance
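The brute force option could look something like the sketch below: try every combination of matching columns and count how many SPC households have no exact NTS match. It uses plain Python on toy records; the column names and values are hypothetical, and the ranking rule (fewest unmatched first, then more columns) is just one possible choice.

```python
from itertools import combinations

# Toy household-level records; keys and values are illustrative only.
spc_households = [
    {"id": 1, "num_adults": 2, "num_children": 1, "num_cars": 1, "tenure": "owned"},
    {"id": 2, "num_adults": 2, "num_children": 0, "num_cars": 2, "tenure": "rented"},
    {"id": 3, "num_adults": 1, "num_children": 0, "num_cars": 0, "tenure": "rented"},
]
nts_households = [
    {"num_adults": 2, "num_children": 1, "num_cars": 1, "tenure": "rented"},
    {"num_adults": 2, "num_children": 0, "num_cars": 2, "tenure": "rented"},
    {"num_adults": 1, "num_children": 2, "num_cars": 0, "tenure": "owned"},
]
columns = ["num_adults", "num_children", "num_cars", "tenure"]


def count_unmatched(cols):
    """Number of SPC households with no exact NTS match on the given columns."""
    nts_keys = {tuple(h[c] for c in cols) for h in nts_households}
    return sum(tuple(h[c] for c in cols) not in nts_keys for h in spc_households)


# Evaluate every non-empty combination of columns.
results = []
for r in range(1, len(columns) + 1):
    for cols in combinations(columns, r):
        results.append((cols, count_unmatched(cols)))

# Rank: fewest unmatched households first, then prefer combinations with more columns.
results.sort(key=lambda item: (item[1], -len(item[0])))
```

Fewer columns will always give fewer unmatched households, so the unmatched count alone cannot pick the best combination; it has to be traded off against how much information the match retains.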
In categorical matching, a household in the SPC can be matched to multiple households in the NTS. By extension, an individual in the SPC can be matched to multiple activity chains from the NTS. How do we decide which activity chain is the most accurate for an individual?
One approach is to see how close the breakdown in the matched population is to that in the travel survey (i.e. is the proportion of people doing x trips the same? Is the proportion of people doing similar activity chains the same?)
- Sequence alignment methods (SAMs) can extract patterns of behaviour from large spatiotemporal datasets. These patterns can then be used to group data by similarity
- Two types of analysis are common using SAMs (Shoval and Isaacson 2007)
- Construct groups based on their overall activity patterns. Clustering algorithms produce trees (hierarchical clustering)
- Detect patterns of behaviour in the sequences
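As a rough illustration of what sequence alignment measures, the sketch below computes a Levenshtein (edit) distance between activity chains, with unit costs for insertion, deletion and substitution. The unit costs and the toy chains are assumptions; real SAM applications often tune these costs. The resulting pairwise distances are the kind of input a hierarchical clustering algorithm would take.

```python
def edit_distance(a, b):
    """Levenshtein distance between two activity chains (lists of labels)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or keep if equal)
            ))
        prev = curr
    return prev[-1]


# Toy activity chains; labels are illustrative only.
chains = {
    "p1": ["home", "work", "home"],
    "p2": ["home", "work", "shop", "home"],
    "p3": ["home", "school", "home"],
}

# Pairwise distance table, e.g. as input to hierarchical clustering.
names = sorted(chains)
distances = {(m, n): edit_distance(chains[m], chains[n]) for m in names for n in names}
```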
- Run Sequence Alignment on activity chains in NTS → get clusters of activity participation / time use
- Each activity chain is assigned to a cluster
- Get relative size of clusters: How big is each cluster (as % of total)? This assumes that we have a representative population
- Individuals in SPC are assigned to a cluster based on the activity chain matched to them
- With categorical matching, an SPC individual can be matched to multiple activity chains, so they are in a different cluster depending on the activity chain we choose for them.
- For each SPC individual, we randomly assign an activity chain from the pool of activity chains matched to it
- Get relative size of clusters. Do they match the cluster sizes obtained from the NTS?
- Should we do a brute force analysis where we run (all) different combinations and then find the one where the cluster composition most closely matches with the NTS? Is there a clever way to do this?
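The comparison step could be sketched as follows: randomly pick one matched activity chain per SPC individual, compute the cluster shares, and measure how far they are from the NTS shares. The cluster labels, the candidate pools and the use of total variation distance as the metric are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical cluster label for each NTS activity chain (output of the clustering step).
chain_cluster = {"c1": "commute", "c2": "commute", "c3": "education", "c4": "home", "c5": "home"}

# Hypothetical pool of NTS activity chains matched to each SPC individual.
spc_candidates = {
    "a": ["c1", "c4"],
    "b": ["c2"],
    "c": ["c3", "c5"],
    "d": ["c4", "c5"],
}


def cluster_shares(labels):
    """Relative size of each cluster as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share dictionaries (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


nts_shares = cluster_shares(chain_cluster.values())

# Randomly assign one chain per SPC individual from its matched pool.
rng = random.Random(0)
assignment = {person: rng.choice(pool) for person, pool in spc_candidates.items()}
spc_shares = cluster_shares(chain_cluster[c] for c in assignment.values())

distance = total_variation(nts_shares, spc_shares)
```

Repeating the random assignment with different seeds and keeping the draw with the smallest distance would be a cheaper alternative to enumerating all combinations.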
Other notes:
- cluster on activity chains AND household properties
- You need evaluation metrics
- Output should match other datasets (e.g. number of GP visits should be similar to reported numbers - similar to constraints on commuting)
PSM can match an individual in dataset A to the closest matching individual in dataset B. The match does not have to be exact.
Pros
- PSM can be carried out to match households to ensure that each household in the SPC is matched to a household in the NTS
- PSM can return the closest n matches, not just 1
- It could allow us to use more variables for matching (e.g. household income), but it's unclear how to handle NA values
Cons
- A household with 5 people in the SPC could be matched to a household with 3, 4, 7 etc people in the NTS. If that is the case, how do we then do matching at the individual level?
- One option, NearestNeighbor with replacement
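To make the "closest n matches, with replacement" idea concrete, here is a small sketch. Note that it ranks NTS households by Euclidean distance on raw covariates rather than on a fitted propensity score, so it is a simplification of PSM; the covariates and values are hypothetical.

```python
import math

# Hypothetical household vectors: id -> (adults, children, cars).
spc_households = {1: (2, 1, 1), 2: (5, 0, 2)}
nts_households = {101: (2, 1, 1), 102: (2, 0, 2), 103: (4, 1, 2), 104: (1, 0, 0)}


def nearest_matches(target, pool, n=2):
    """Return the ids of the n pool households closest to target (ties broken by id)."""
    ranked = sorted(pool, key=lambda k: (math.dist(target, pool[k]), k))
    return ranked[:n]


# Matching with replacement: every SPC household searches the full NTS pool,
# so the same NTS household can be returned for several SPC households.
matches = {hid: nearest_matches(vec, nts_households) for hid, vec in spc_households.items()}
```

In this toy example every SPC household gets a match, but SPC household 2 (five adults) is paired with NTS households of a different composition, which is exactly the issue raised under Cons.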
Multilevel regression with poststratification (MRP). Explanation of method here, here, here
After matching each household in the synthetic population (SPC) to a household in the travel survey (NTS), the next step is to match at the individual level. If our household level matching matches on number of individuals, then for each matched household we have n people in the SPC and n people in the NTS. I am currently matching on `age_group` and `sex`.
We can use different approaches for matching
- R approach: matchit library
- Is there a Python library? I am using the functions in acbm/matching.py for now
used to solve the assignment problem
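Matching n SPC individuals to n NTS individuals within a matched household can be framed as an assignment problem. The sketch below is a brute-force version for small households, with a hypothetical cost of one point per mismatch in `age_group` and `sex`. It is not necessarily what acbm/matching.py does, and it assumes both households have the same number of people.

```python
from itertools import permutations


def mismatch_cost(spc_person, nts_person):
    """One point for each of age_group and sex that differs."""
    cost = 0 if spc_person["age_group"] == nts_person["age_group"] else 1
    cost += 0 if spc_person["sex"] == nts_person["sex"] else 1
    return cost


def best_assignment(spc_people, nts_people):
    """Return (total_cost, pairs) minimising total mismatch over all one-to-one pairings."""
    best = None
    for perm in permutations(range(len(nts_people))):
        total = sum(mismatch_cost(s, nts_people[j]) for s, j in zip(spc_people, perm))
        if best is None or total < best[0]:
            best = (total, perm)
    total, perm = best
    return total, [(s["id"], nts_people[j]["id"]) for s, j in zip(spc_people, perm)]


# Toy matched household pair; ids and attributes are illustrative only.
spc_people = [
    {"id": "s1", "age_group": "adult", "sex": "F"},
    {"id": "s2", "age_group": "adult", "sex": "M"},
    {"id": "s3", "age_group": "child", "sex": "F"},
]
nts_people = [
    {"id": "n1", "age_group": "child", "sex": "M"},
    {"id": "n2", "age_group": "adult", "sex": "M"},
    {"id": "n3", "age_group": "adult", "sex": "F"},
]

total_cost, pairs = best_assignment(spc_people, nts_people)
```

Enumerating permutations only scales to small households; for larger problems, `scipy.optimize.linear_sum_assignment` solves the same problem from a cost matrix.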
- Propensity score matching can match a child to an adult or vice versa. It can also match a 40-year-old to a 70-year-old
- Treat adults and children differently. I don't care if a child is matched to M or F as their diary will probably be the same. This may not be the case for adults
- Pensioners: these should be a separate category to adults as most do not commute
- Matching with replacement if the household from the population df has fewer rows than the household from the survey df
- Current matching does not ensure that commuting numbers match census data. Some employed people may not be matched to an NTS individual who is employed, and vice versa. How do we fix this?
- PSM: It would be good to match only if the row in dataset 2 is within a threshold distance of dataset 1. Distance should not be an issue for most cases, but for corner cases that have no close match, it may be better to discard them than match them to a distant household. In traditional PSM, we have a parameter called a caliper (link1, link2) that restricts matching to those within a certain distance. The distance is based on the overall propensity score. Can I use this? How do I define a threshold distance? If we are restricting to a fraction of the std, what is the std in this case?
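A hedged sketch of caliper matching on hypothetical propensity scores. It uses a commonly cited rule of thumb from the PSM literature, a caliper of 0.2 standard deviations of the logit of the propensity score, with the standard deviation taken over both datasets pooled; whether that rule suits this problem is an open question. The scores themselves are made up; in practice they would come from a model such as a logistic regression.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


# Hypothetical propensity scores for households in each dataset.
spc_scores = {"spc_1": 0.30, "spc_2": 0.55, "spc_3": 0.95}
nts_scores = {"nts_1": 0.28, "nts_2": 0.50, "nts_3": 0.65}

spc_logit = {k: logit(v) for k, v in spc_scores.items()}
nts_logit = {k: logit(v) for k, v in nts_scores.items()}

# Assumed rule of thumb: caliper = 0.2 x std of the logit score over the pooled sample.
pooled = list(spc_logit.values()) + list(nts_logit.values())
caliper = 0.2 * statistics.stdev(pooled)

# Nearest neighbour with replacement; discard matches that fall outside the caliper.
matches = {}
for spc_id, score in spc_logit.items():
    nearest = min(nts_logit, key=lambda k: abs(nts_logit[k] - score))
    gap = abs(nts_logit[nearest] - score)
    matches[spc_id] = nearest if gap <= caliper else None
```

Here `spc_3` has no NTS household within the caliper, so it is left unmatched (`None`) rather than paired with a distant household.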
The approach in this paper uses PSM to:
- Step 1: match each household from the population to households in the sample (multiple households from the sample can be matched to a population household).
- Step 2: match individuals from the population household to the closest individual from all matched households. They do not match all individuals from the population household to individuals from one sample household. This means that they do not maintain household integrity. Different individuals in the same population household could be matched to individuals from different sample households.
Our approach maintains household integrity
This YouTube video gives a nice overview of PSM. Check from 13:42 for:
- greedy vs optimal algorithms
- caliper vs nearest neighbor
- One to one vs one to many
- The MatchIt R package is very comprehensive. It has different matching algorithms, and also allows you to specify different calipers for each covariate. This is very handy because we might want to be stricter on some covariates than others (e.g. for households, we may want the household size to match exactly, but be more forgiving on household income)
I didn't find a Python library that has the same functionality as MatchIt.
- In psmpy, you can only provide one caliper based on the overall distance
- Dame-Flame - Useful blogpost
- Different psm algorithms: https://pubmed.ncbi.nlm.nih.gov/24123228/
- https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html
- https://dimewiki.worldbank.org/Propensity_Score_Matching#:~:text=Propensity%20score%20matching%20(PSM)%20is,the%20impact%20of%20an%20intervention.
- https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching
- From Sam: "Regarding matching and Propensity Score Matching that we discussed today, a resource I found helpful when looking at this previously was Chapter 20 of this book (code with examples). It's from a causal inference perspective (i.e. matching control and treatment groups to measure some treatment effect of a study) but the same approach can be applied to matching between datasets (NTS and SPC). In the chapter there is some discussion on imbalance and overlap, as well as diagnostics/plots (e.g. Figure 20.9) that can be used to evaluate."