Adding activity patterns to synthetic population
This is exact matching using a join
Which columns should we match on? We want to match in a way that minimises the number of households in the SPC that are not matched. I am using the following columns:
Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
---|---|---|---|---|
Household income | HHIncome2002_BO2ID | salary_yearly | NA | Group by household ID and sum |
Number of adults | HHoldNumAdults | age_years | NA | Group by household ID and count |
Number of children | HHoldNumChildren | age_years | NA | Group by household ID and count |
Employment status | HHoldEmploy_B01ID | pwkstat | NA | a) match to NTS categories. b) group by household ID |
Car ownership | NumCar | num_cars | SPC is capped at 2. We change all entries > 2 to 2 | NA |
Type of tenancy | Ten1_B02ID | tenure | ?? | ?? |
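
The household-level transformations and the exact join can be sketched as below. This is a minimal illustration in plain Python, not the project's implementation: the ID columns `household` and `HouseholdID`, the adult age threshold of 16, and the sample rows are all assumptions made for the example. Only `age_years`, `num_cars`, `HHoldNumAdults`, `HHoldNumChildren` and `NumCar` come from the table above.

```python
from collections import defaultdict

ADULT_AGE = 16  # assumed adult threshold; check against the NTS definition
CAR_CAP = 2     # SPC num_cars is capped at 2, so NTS NumCar is capped to match

# Hypothetical SPC individuals ("household" is an assumed ID column name).
spc_people = [
    {"household": 1, "age_years": 40, "num_cars": 2},
    {"household": 1, "age_years": 38, "num_cars": 2},
    {"household": 1, "age_years": 7, "num_cars": 2},
    {"household": 2, "age_years": 70, "num_cars": 0},
    {"household": 3, "age_years": 30, "num_cars": 1},
]

# Hypothetical NTS households ("HouseholdID" is an assumed ID column name).
nts_households = [
    {"HouseholdID": "A", "HHoldNumAdults": 2, "HHoldNumChildren": 1, "NumCar": 3},
    {"HouseholdID": "B", "HHoldNumAdults": 1, "HHoldNumChildren": 0, "NumCar": 0},
    {"HouseholdID": "C", "HHoldNumAdults": 2, "HHoldNumChildren": 1, "NumCar": 2},
]


def spc_to_households(people, adult_age=ADULT_AGE):
    """Group SPC individuals by household ID and count adults and children."""
    households = {}
    for person in people:
        hh = households.setdefault(
            person["household"],
            {"num_adults": 0, "num_children": 0, "num_cars": person["num_cars"]},
        )
        if person["age_years"] >= adult_age:
            hh["num_adults"] += 1
        else:
            hh["num_children"] += 1
    return households


def exact_match(spc_households, nts_rows, car_cap=CAR_CAP):
    """Return, for each SPC household, every NTS household with an identical key."""
    index = defaultdict(list)
    for row in nts_rows:
        key = (row["HHoldNumAdults"], row["HHoldNumChildren"], min(row["NumCar"], car_cap))
        index[key].append(row["HouseholdID"])
    return {
        hid: index.get((hh["num_adults"], hh["num_children"], hh["num_cars"]), [])
        for hid, hh in spc_households.items()
    }


matches = exact_match(spc_to_households(spc_people), nts_households)
unmatched_share = sum(1 for m in matches.values() if not m) / len(matches)
```

In this toy data, household 1 matches two NTS households and household 3 matches none, which is the situation the rest of this page tries to deal with.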
Other columns to match in the future
Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
---|---|---|---|---|
Urban-Rural classification of residence | Settlement2011EW_B04ID | NA | NA | Spatial join between layer and SPC |
- I have tried a number of different combinations of columns in the statistical matching notebook.
- It seems that household income cannot be used, as there are many NA values in the SPC for `salary_yearly`
Options for reducing the number of unmatched households
- I can experiment with using all combinations of columns and seeing which provides best results (brute force approach)
- Match based on a smaller number of variables
- Propensity Score Matching: each household in the SPC will be assigned to at least one household in the NTS.
- PSM can keep the n closest matches (for each household). It can also limit matches to those within a certain distance
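
The brute force option could look roughly like this: try every subset of the candidate columns and record the share of SPC households that find at least one exact NTS match. The sketch assumes both datasets have already been harmonised to the same column names and categories; the column names and rows below are made up for illustration.

```python
from itertools import combinations


def match_rate(spc_rows, nts_rows, columns):
    """Share of SPC households with at least one exact NTS match on `columns`."""
    nts_keys = {tuple(row[c] for c in columns) for row in nts_rows}
    matched = sum(tuple(row[c] for c in columns) in nts_keys for row in spc_rows)
    return matched / len(spc_rows)


def rank_column_sets(spc_rows, nts_rows, columns):
    """Evaluate every non-empty subset of columns, best match rate first.

    Ties are broken in favour of larger subsets, since matching on more
    columns gives more specific matches.
    """
    results = []
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            results.append((subset, match_rate(spc_rows, nts_rows, subset)))
    return sorted(results, key=lambda r: (-r[1], -len(r[0])))


# Hypothetical harmonised household tables.
spc = [
    {"num_adults": 2, "num_children": 1, "num_cars": 2},
    {"num_adults": 1, "num_children": 0, "num_cars": 0},
    {"num_adults": 3, "num_children": 2, "num_cars": 1},
    {"num_adults": 2, "num_children": 0, "num_cars": 1},
]
nts = [
    {"num_adults": 2, "num_children": 1, "num_cars": 2},
    {"num_adults": 1, "num_children": 0, "num_cars": 1},
    {"num_adults": 3, "num_children": 0, "num_cars": 1},
    {"num_adults": 2, "num_children": 2, "num_cars": 0},
]

columns = ["num_adults", "num_children", "num_cars"]
ranking = rank_column_sets(spc, nts, columns)
```

The number of subsets grows as 2^k for k columns, so this is only practical for a small set of candidate columns. A high match rate on very few columns is also not necessarily a good result, because the matches become less specific.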
In categorical matching, a household in the SPC can be matched to multiple households in the NTS. By extension, an individual in the SPC can be matched to multiple activity chains from the NTS. How do we decide which activity chain is the most accurate for an individual?
One approach is to check how close the breakdown in the matched population is to that in the travel survey (i.e. is the proportion of people making x trips the same? Is the proportion of people with similar activity chains the same?)
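
One way to put a number on "how close" is the total variation distance between the two distributions: 0 means identical shares, 1 means no overlap. This is only one possible measure, chosen here for illustration, and the trip counts below are invented. Survey weights are ignored in this sketch.

```python
from collections import Counter


def shares(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation_distance(a, b):
    """Half the sum of absolute differences between two share distributions."""
    pa, pb = shares(a), shares(b)
    return 0.5 * sum(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in set(pa) | set(pb))


# Hypothetical number of trips per person per day.
nts_trips = [0, 2, 2, 2, 2, 4, 4, 4]
matched_spc_trips = [2, 2, 2, 2, 4, 4, 4, 4]

gap = total_variation_distance(nts_trips, matched_spc_trips)
```

The same function works for any categorical label, for example an activity chain type instead of a trip count.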
- Sequence alignment methods (SAMs) can extract patterns of behaviour from large spatiotemporal datasets. These patterns can then be used to group data by similarity
- Two types of analysis are common using SAMs (Shoval and Isaacson 2007)
- Construct groups based on their overall activity patterns. Clustering algorithms produce trees (hierarchical clustering)
- Detect patterns of behaviour in the sequences
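
To make the grouping idea concrete, here is a small sketch that computes an edit distance between activity chains and then clusters them hierarchically. It is a simplification under stated assumptions: chains are encoded as strings with one letter per time slot (H = home, W = work, E = education, S = shop, all invented codes), insertions, deletions and substitutions all cost 1, and clusters are merged by average linkage. SAM applications typically tune these costs.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def agglomerative_clusters(chains, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Returns a list of clusters, each a list of indices into `chains`.
    """
    dist = [[edit_distance(a, b) for b in chains] for a in chains]
    clusters = [[i] for i in range(len(chains))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical activity chains, one letter per time slot.
chains = ["HWWWH", "HWWSH", "HEEEH", "HEEHH", "HHHHH"]
clusters = agglomerative_clusters(chains, n_clusters=3)
```

Here the two work-based chains end up together, as do the two education-based chains, and the stay-at-home chain forms its own cluster. This naive loop recomputes all pairwise linkages at every merge, so it is only suitable for small examples.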
- Run Sequence Alignment on activity chains in NTS → get clusters of activity participation / time use
- Each activity chain is assigned to a cluster
- Get relative size of clusters: How big is each cluster (as % of total)? This assumes that we have a representative population
- Individuals in SPC are assigned to a cluster based on the activity chain matched to them
- With categorical matching, an SPC individual can be matched to multiple activity chains, so they are in a different cluster depending on the activity chain we choose for them.
- For each SPC individual, we randomly assign an activity chain from the pool of activity chains matched to it
- Get the relative size of clusters. Do they match the relative cluster sizes obtained for the NTS above?
- Should we do a brute force analysis where we run (all) different combinations and then find the one where the cluster composition most closely matches with the NTS? Is there a clever way to do this?
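
A cheaper alternative to enumerating all combinations is a random search: draw a random activity chain for each SPC individual many times and keep the draw whose cluster shares are closest to the NTS. The sketch below assumes the cluster label of each NTS chain is already known; the chain IDs, cluster names and candidate pools are invented, and `n_draws` and `seed` are arbitrary choices.

```python
import random
from collections import Counter


def cluster_shares(labels):
    """Relative size of each cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def share_gap(shares_a, shares_b):
    """Sum of absolute differences in cluster shares."""
    return sum(
        abs(shares_a.get(c, 0.0) - shares_b.get(c, 0.0))
        for c in set(shares_a) | set(shares_b)
    )


def best_random_assignment(candidates, chain_cluster, target_shares, n_draws=200, seed=0):
    """Repeat the random assignment and keep the draw closest to the target shares."""
    rng = random.Random(seed)
    best_gap, best_assignment = None, None
    for _ in range(n_draws):
        assignment = {person: rng.choice(pool) for person, pool in candidates.items()}
        drawn = cluster_shares(chain_cluster[c] for c in assignment.values())
        gap = share_gap(drawn, target_shares)
        if best_gap is None or gap < best_gap:
            best_gap, best_assignment = gap, assignment
    return best_gap, best_assignment


# Hypothetical cluster label for each NTS activity chain.
chain_cluster = {"c1": "work", "c2": "work", "c3": "school", "c4": "home"}
nts_shares = cluster_shares(chain_cluster.values())

# Hypothetical pools of NTS chains matched to each SPC individual.
candidates = {
    "p1": ["c1", "c4"],
    "p2": ["c2"],
    "p3": ["c3", "c4"],
    "p4": ["c1", "c3", "c4"],
}

best_gap, best_assignment = best_random_assignment(candidates, chain_cluster, nts_shares)
```

Random search gives no guarantee of finding the best combination, so this is a baseline rather than an answer to the "clever way" question.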
PSM can match an individual in dataset A to the closest matching individual in dataset B. The match does not have to be exact.
Pros
- PSM can be carried out to match households to ensure that each household in the SPC is matched to a household in the NTS
- PSM can return the closest n matches, not just 1
- It could allow us to use more variables for matching (e.g. household income), but it's unclear how to handle NA values
Cons
- A household with 5 people in the SPC could be matched to a household with 3, 4, 7, etc. people in the NTS. If that is the case, how do we then do matching at the individual level?
- One option: NearestNeighbor matching with replacement
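
The "closest n matches" and "maximum distance" ideas can be sketched as a nearest-neighbour search with replacement. Note the simplification: this example measures Euclidean distance directly on the household covariates, whereas PSM would first estimate a propensity score (for example with a logistic regression) and measure distance on that score. Household IDs, columns and values are hypothetical.

```python
import math


def nearest_matches(spc_households, nts_households, columns, n=2, max_distance=None):
    """For each SPC household, return up to n closest NTS households.

    Matching is with replacement: the same NTS household can be returned
    for several SPC households. Matches further away than max_distance
    (if given) are dropped.
    """
    matches = {}
    for spc_id, spc_row in spc_households.items():
        spc_point = [spc_row[c] for c in columns]
        scored = sorted(
            (math.dist(spc_point, [nts_row[c] for c in columns]), nts_id)
            for nts_id, nts_row in nts_households.items()
        )
        if max_distance is not None:
            scored = [(d, nts_id) for d, nts_id in scored if d <= max_distance]
        matches[spc_id] = [nts_id for _, nts_id in scored[:n]]
    return matches


columns = ["num_adults", "num_children", "num_cars"]

# Hypothetical household tables.
spc = {
    "s1": {"num_adults": 2, "num_children": 3, "num_cars": 1},
    "s2": {"num_adults": 1, "num_children": 0, "num_cars": 0},
}
nts = {
    "n1": {"num_adults": 2, "num_children": 2, "num_cars": 1},
    "n2": {"num_adults": 2, "num_children": 1, "num_cars": 1},
    "n3": {"num_adults": 1, "num_children": 0, "num_cars": 1},
    "n4": {"num_adults": 4, "num_children": 3, "num_cars": 2},
}

two_closest = nearest_matches(spc, nts, columns, n=2)
within_limit = nearest_matches(spc, nts, columns, n=2, max_distance=1.5)
```

In this toy data `s1` has no exact match (no NTS household has 3 children) but still receives its two nearest neighbours, which illustrates both the pro (every household is matched) and the con (household sizes can differ) listed above.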
After matching each household in the synthetic population (SPC) to a household in the travel survey (NTS), the next step is to match at the individual level. If the household-level matching includes the number of individuals, then each matched pair of households has n people in the SPC and n people in the NTS. I am currently matching on `age_group` and `sex`.
We can use different approaches for matching
- R approach: the MatchIt package
- Is there a Python library? I am using the functions in acbm/matching.py for now
used to solve the assignment problem
- Propensity score matching can match a child to an adult or vice versa. It can also match a 40-year-old to a 70-year-old
- Treat adults and children differently. I don't care if a child is matched to M or F as their diary will probably be the same. This may not be the case for adults
- Pensioners: these should be a separate category from adults, as most do not commute
- Matching with replacement if the household from the population df has fewer rows than the household from the survey df
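
For two matched households of the same size, the individual-level step can be framed as an assignment problem: pair each SPC person with a distinct NTS person so that the total mismatch is as small as possible. The sketch below is not the code in acbm/matching.py. It assumes `age_group` is an ordinal integer, that group 1 is the only child group, and that a sex mismatch costs 1 for adults and nothing between two children. It finds the best pairing by trying every permutation, which is fine for household-sized inputs.

```python
from itertools import permutations

CHILD_GROUPS = {1}  # assumed: age_group 1 is the only child group


def mismatch_cost(spc_person, nts_person):
    """Age-group gap plus a sex penalty that is waived when both are children."""
    age_gap = abs(spc_person["age_group"] - nts_person["age_group"])
    both_children = (
        spc_person["age_group"] in CHILD_GROUPS
        and nts_person["age_group"] in CHILD_GROUPS
    )
    sex_gap = 0 if both_children or spc_person["sex"] == nts_person["sex"] else 1
    return age_gap + sex_gap


def assign_individuals(spc_people, nts_people):
    """Return the one-to-one pairing with the lowest total mismatch cost."""
    if len(spc_people) != len(nts_people):
        raise ValueError("households must have the same number of people")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(nts_people))):
        cost = sum(mismatch_cost(spc_people[i], nts_people[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = [(spc_people[i]["id"], nts_people[j]["id"]) for i, j in enumerate(best_perm)]
    return pairs, best_cost


# Hypothetical matched households.
spc_household = [
    {"id": "s1", "age_group": 4, "sex": "F"},
    {"id": "s2", "age_group": 4, "sex": "M"},
    {"id": "s3", "age_group": 1, "sex": "F"},
]
nts_household = [
    {"id": "n1", "age_group": 1, "sex": "M"},
    {"id": "n2", "age_group": 5, "sex": "M"},
    {"id": "n3", "age_group": 4, "sex": "F"},
]

pairs, total_cost = assign_individuals(spc_household, nts_household)
```

Because the age-group gap enters the cost, a child is only paired with an adult when nothing better is available. For larger groups, `scipy.optimize.linear_sum_assignment` solves the same problem on a cost matrix without enumerating permutations. Households of different sizes would need a separate rule, such as the matching with replacement mentioned above.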
- Different psm algorithms: https://pubmed.ncbi.nlm.nih.gov/24123228/
- https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html
- https://dimewiki.worldbank.org/Propensity_Score_Matching#:~:text=Propensity%20score%20matching%20(PSM)%20is,the%20impact%20of%20an%20intervention.
- https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching