Adding activity patterns to synthetic population

Hussein Mahfouz edited this page Mar 22, 2024 · 8 revisions

1. Matching - Household level

1.1 Categorical matching

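This is exact matching using a join: a household in the SPC is matched to every household in the NTS that has identical values in all of the matching columns.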

Columns to match on

Which columns should we match on? The aim is to minimise the number of households in the synthetic population (SPC) that are left unmatched. I am currently using the following columns:

| Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
|---|---|---|---|---|
| Household income | HHIncome2002_BO2ID | salary_yearly | NA | Group by household ID and sum |
| Number of adults | HHoldNumAdults | age_years | NA | Group by household ID and count |
| Number of children | HHoldNumChildren | age_years | NA | Group by household ID and count |
| Employment status | HHoldEmploy_B01ID | pwkstat | NA | a) match to NTS categories. b) group by household ID |
| Car ownership | NumCar | num_cars | SPC is capped at 2. We change all entries > 2 to 2 | NA |
| Type of tenancy | Ten1_B02ID | tenure | ?? | ?? |
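As a rough illustration of the transformations in the table, the sketch below builds household-level keys from SPC individuals, caps NumCar in the NTS at 2, and then performs the exact match. It uses plain Python dictionaries rather than a dataframe join so that it runs on its own; a dataframe merge on the same key columns would presumably be the equivalent in the notebook. The field names `hid` and `HouseholdID`, the sample records, and the adult age threshold of 16 are assumptions made for the example.

```python
from collections import defaultdict

ADULT_AGE = 16  # assumption: age from which a person counts as an adult

# Hypothetical SPC individuals (one record per person)
spc_people = [
    {"hid": 1, "age_years": 40, "num_cars": 1},
    {"hid": 1, "age_years": 38, "num_cars": 1},
    {"hid": 1, "age_years": 7, "num_cars": 1},
    {"hid": 2, "age_years": 70, "num_cars": 0},
]

# Hypothetical NTS households (already one record per household)
nts_households = [
    {"HouseholdID": "a", "HHoldNumAdults": 2, "HHoldNumChildren": 1, "NumCar": 1},
    {"HouseholdID": "b", "HHoldNumAdults": 2, "HHoldNumChildren": 1, "NumCar": 1},
    {"HouseholdID": "c", "HHoldNumAdults": 1, "HHoldNumChildren": 0, "NumCar": 3},
]


def spc_household_keys(people):
    """Group SPC individuals by household ID and count adults and children."""
    counts = defaultdict(lambda: {"adults": 0, "children": 0, "cars": 0})
    for p in people:
        hh = counts[p["hid"]]
        hh["adults" if p["age_years"] >= ADULT_AGE else "children"] += 1
        hh["cars"] = p["num_cars"]  # household attribute repeated on each person
    return {hid: (c["adults"], c["children"], c["cars"]) for hid, c in counts.items()}


def nts_household_index(households, car_cap=2):
    """Index NTS households by matching key, capping NumCar to mirror the SPC."""
    index = defaultdict(list)
    for h in households:
        key = (h["HHoldNumAdults"], h["HHoldNumChildren"], min(h["NumCar"], car_cap))
        index[key].append(h["HouseholdID"])
    return index


def categorical_match(people, households):
    """Return, for each SPC household, the list of NTS households with the same key."""
    index = nts_household_index(households)
    return {hid: index.get(key, []) for hid, key in spc_household_keys(people).items()}


matches = categorical_match(spc_people, nts_households)
unmatched = [hid for hid, nts_ids in matches.items() if not nts_ids]
```

Here household 1 is matched to two NTS households, while household 2 has no exact match and ends up in `unmatched`.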

Other columns to match in the future

| Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
|---|---|---|---|---|
| Urban-Rural classification of residence | Settlement2011EW_B04ID | NA | NA | Spatial join between layer and SPC |
  • I have tried a number of different combinations of columns in the statistical matching notebook.
  • It seems like household income cannot be used, as there are many NA values in the SPC for salary_yearly

Options for reducing the number of unmatched households

  • I can experiment with all combinations of columns and see which gives the fewest unmatched households (brute force approach)
  • Match based on a smaller number of variables
  • Propensity Score Matching: each household in the SPC will be assigned to at least one household in the NTS.
    • PSM can keep the n closest matches (for each household). It can also limit matches to those within a certain distance
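The brute force option could look something like the sketch below, which tries every subset of candidate columns and counts how many SPC households have no exact NTS match. It assumes both datasets have already been harmonised to shared column names; the names `num_adults`, `num_children`, and `num_cars` and the sample rows are hypothetical.

```python
from itertools import combinations


def unmatched_count(spc, nts, columns):
    """Count SPC households whose key on `columns` does not occur in the NTS."""
    nts_keys = {tuple(h[c] for c in columns) for h in nts}
    return sum(tuple(h[c] for c in columns) not in nts_keys for h in spc)


def rank_column_sets(spc, nts, columns, min_size=1):
    """Evaluate every subset of `columns`, ordered by fewest unmatched households.

    Ties are broken in favour of larger subsets, since matching on more
    variables is preferable when it costs nothing in coverage.
    """
    results = []
    for size in range(min_size, len(columns) + 1):
        for subset in combinations(columns, size):
            results.append((subset, unmatched_count(spc, nts, subset)))
    return sorted(results, key=lambda r: (r[1], -len(r[0])))


# Hypothetical harmonised household tables
columns = ["num_adults", "num_children", "num_cars"]
spc = [
    {"num_adults": 2, "num_children": 1, "num_cars": 1},
    {"num_adults": 1, "num_children": 0, "num_cars": 0},
    {"num_adults": 3, "num_children": 2, "num_cars": 2},
]
nts = [
    {"num_adults": 2, "num_children": 1, "num_cars": 1},
    {"num_adults": 1, "num_children": 0, "num_cars": 2},
    {"num_adults": 3, "num_children": 0, "num_cars": 2},
]

ranked = rank_column_sets(spc, nts, columns)
```

Removing a column can never increase the number of unmatched households, so the unmatched count alone will always favour the smallest column sets. Setting `min_size` to the minimum number of variables we are willing to match on is one way to keep the search meaningful.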

Validation

In categorical matching, a household in the SPC can be matched to multiple households in the NTS. By extension, an individual in the SPC can be matched to multiple activity chains from the NTS. How do we decide which activity chain is the most accurate for an individual?

Sequence Alignment

One approach is to see how close the breakdown in the matched population is to that in the travel survey (i.e. is the proportion of people making x trips the same? Is the proportion of people with similar activity chains the same?)

  • Sequence alignment methods (SAMs) are able to extract patterns of behaviour from large spatiotemporal datasets. These patterns can then be used to group data by similarity
  • Two types of analysis are common using SAMs (Shoval and Isaacson 2007)
    • Construct groups based on their overall activity patterns. Clustering algorithms produce trees (hierarchical clustering)
    • Detect patterns of behaviour in the sequences
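To make the idea of similarity between activity chains concrete, here is a minimal sketch that encodes each chain as a string of activity codes and compares chains using unit-cost edit (Levenshtein) distance. This is only one simple alignment score and is an assumption on my part; the SAM literature uses a range of cost schemes. The activity codes (H home, W work, S shop, E education) and the sample chains are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def distance_matrix(chains):
    """Pairwise edit distances between all activity chains."""
    return [[edit_distance(a, b) for b in chains] for a in chains]


# Hypothetical activity chains: H = home, W = work, S = shop, E = education
chains = ["HWH", "HWSH", "HEH", "HWH"]
distances = distance_matrix(chains)
```

A pairwise distance matrix like this is the usual input to a hierarchical clustering routine, which would then produce the tree of activity-pattern groups.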
  1. Run Sequence Alignment on activity chains in NTS → get clusters of activity participation / time use
    1. Each activity chain is assigned to a cluster
    2. Get relative size of clusters: How big is each cluster (as % of total)? This assumes that we have a representative population
  2. Individuals in SPC are assigned to a cluster based on the activity chain matched to them
    1. With categorical matching, an SPC individual can be matched to multiple activity chains, so they are in a different cluster depending on the activity chain we choose for them.
  3. For each SPC individual, we randomly assign an activity chain from the pool of activity chains matched to it
    1. Get relative size of clusters. Do they match the results in the NTS (step 1.2)?
    2. Should we do a brute force analysis where we run (all) different combinations and then find the one where the cluster composition most closely matches with the NTS? Is there a clever way to do this?
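Short of enumerating every combination, one possibility is to draw many random assignments and keep the one whose cluster composition is closest to the NTS. The sketch below does this using total variation distance between the two sets of cluster shares; the choice of metric, the number of draws, and all identifiers and sample data are assumptions made for illustration.

```python
import random
from collections import Counter


def cluster_shares(cluster_labels):
    """Relative size of each cluster as a fraction of the total."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share dictionaries (0 = identical)."""
    clusters = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in clusters)


def assign_random_chains(candidates, rng):
    """Pick one activity chain at random from each individual's matched pool."""
    return {person: rng.choice(chains) for person, chains in candidates.items()}


def best_of_random_draws(candidates, chain_cluster, nts_shares, n_draws=100, seed=0):
    """Repeat the random assignment and keep the draw closest to the NTS shares."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_draws):
        assignment = assign_random_chains(candidates, rng)
        shares = cluster_shares(chain_cluster[c] for c in assignment.values())
        score = total_variation(shares, nts_shares)
        if best is None or score < best[0]:
            best = (score, assignment)
    return best


# Hypothetical data: cluster of each NTS activity chain, and the pool of
# chains matched to each SPC individual
chain_cluster = {"c1": "commute", "c2": "commute", "c3": "home", "c4": "school"}
nts_shares = cluster_shares(chain_cluster.values())
candidates = {"p1": ["c1", "c3"], "p2": ["c2"], "p3": ["c3", "c4"], "p4": ["c4"]}

score, assignment = best_of_random_draws(candidates, chain_cluster, nts_shares)
```

Random search gives no guarantee of finding the best combination, but the spread of scores across draws also indicates how sensitive the cluster composition is to the choice of activity chain.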

Other notes:

  • cluster on activity chains AND household properties
  • You need evaluation metrics
  • Output should match other datasets (e.g. number of GP visits should be similar to reported numbers - similar to constraints on commuting)

1.2 Propensity Score Matching (TODO)

PSM can match an individual in dataset A to the closest matching individual in dataset B. The match does not have to be exact.

Pros

  • PSM can be carried out to match households to ensure that each household in the SPC is matched to a household in the NTS
  • PSM can return the closest n matches, not just 1
  • It could allow us to use more variables for matching (e.g. household income), but it's unclear how to handle NA values

Cons

  • A household with 5 people in the SPC could be matched to a household with 3, 4, 7, etc. people in the NTS. If that is the case, how do we then do matching at the individual level?
  • One option: nearest-neighbour matching with replacement

2. Matching - Individual level

After matching each household in the synthetic population (SPC) to a household in the travel survey (NTS), the next step is to match at the individual level. If the household-level matching matches on the number of individuals, then for each matched household we have n people in the SPC and n people in the NTS. I am currently matching on age_group and sex.

We can use different approaches for matching:

2.1 Nearest neighbour (propensity score matching)

2.2 Nearest neighbour without replacement
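A minimal sketch of greedy nearest-neighbour matching without replacement inside one matched household pair is given below. The distance function (difference in age_group index plus a penalty for a sex mismatch), the `id` field, and the sample individuals are assumptions for illustration, not the implementation used in the notebook.

```python
def person_distance(a, b, sex_penalty=1.0):
    """Assumed distance: gap in age_group index plus a penalty if sex differs."""
    age_gap = abs(a["age_group"] - b["age_group"])
    return age_gap + (sex_penalty if a["sex"] != b["sex"] else 0.0)


def greedy_match(spc_members, nts_members):
    """Match each SPC person to an NTS person, using each NTS person at most once.

    All candidate pairs are sorted by distance and accepted in that order
    whenever both individuals are still free.
    """
    pairs = sorted(
        (person_distance(s, n), s["id"], n["id"])
        for s in spc_members
        for n in nts_members
    )
    used_spc, used_nts, matches = set(), set(), {}
    for _, sid, nid in pairs:
        if sid not in used_spc and nid not in used_nts:
            matches[sid] = nid
            used_spc.add(sid)
            used_nts.add(nid)
    return matches


# Hypothetical members of one matched SPC / NTS household pair
spc_members = [
    {"id": "s1", "age_group": 3, "sex": "M"},
    {"id": "s2", "age_group": 3, "sex": "F"},
    {"id": "s3", "age_group": 1, "sex": "F"},
]
nts_members = [
    {"id": "n1", "age_group": 3, "sex": "F"},
    {"id": "n2", "age_group": 4, "sex": "M"},
    {"id": "n3", "age_group": 1, "sex": "M"},
]

matches = greedy_match(spc_members, nts_members)
```

Because the closest pairs are accepted first, the individuals matched last can end up with poor matches; the greedy order does not minimise the total distance across the household.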

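2.3 Bipartite matching (TODO)

Used to solve the assignment problem: pair the individuals in the two households one-to-one so that the total matching distance is minimised.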
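As a sketch of what solving the assignment problem means here, the example below finds the one-to-one pairing with the lowest total cost by trying every permutation. This is only practical because households are small; for larger problems a dedicated solver such as `scipy.optimize.linear_sum_assignment` would normally be used. The cost matrix is hypothetical, with rows standing for SPC individuals, columns for NTS individuals, and both households assumed to have the same size.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Return (assignment, total) minimising the summed cost of a square matrix.

    assignment[i] is the column (NTS individual) paired with row i
    (SPC individual). Exhaustive search, so only suitable for small n.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total


# Hypothetical distances between 2 SPC individuals (rows) and 2 NTS individuals (columns)
cost = [
    [1, 2],
    [1, 100],
]

assignment, total = optimal_assignment(cost)
```

In this example a greedy matcher that takes the cheapest pair first would pair row 0 with column 0 and be forced into the pair costing 100, whereas the optimal assignment pairs row 0 with column 1 and row 1 with column 0 for a total of 3.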

2.4 Improvements

  • Propensity score matching can match a child to an adult or vice versa. It can also match a 40 year old to a 70 year old
  • Treat adults and children differently. I don't care if a child is matched to M or F as their diary will probably be the same. This may not be the case for adults
  • Pensioners: these should be a separate category to adults as most do not commute
  • Match with replacement if the household from the population df has fewer rows than the household from the survey df
  • Current matching does not ensure that commuting numbers match census data. Some employed people may not be matched to an NTS individual who is employed, and vice versa. How do we fix this?
  • PSM: It would be good to match only if the row in dataset 2 is within a threshold distance of the row in dataset 1. Distance should not be an issue for most cases, but for corner cases that have no close match, it may be better to discard them than to match them to a distant household. In traditional PSM, a parameter called a caliper (link1, link2) restricts matching to those within a certain distance. The distance is based on the overall propensity score. Can I use this? How do I define a threshold distance? If we are restricting to a fraction of the std, what is the std in this case?
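On the caliper question: in the PSM literature the caliper is often expressed as a fraction of the standard deviation of the propensity score (or of its logit), with 0.2 a frequently quoted value. The sketch below applies that idea to nearest-neighbour matching with replacement. It assumes the propensity scores have already been estimated, and pooling the scores of both datasets to compute the standard deviation is my assumption; the identifiers and scores are hypothetical.

```python
from statistics import pstdev


def match_with_caliper(spc_scores, nts_scores, caliper_sd=0.2):
    """Nearest-neighbour matching on propensity score, with replacement.

    A match is kept only if the score difference is within
    caliper_sd * (standard deviation of the pooled scores);
    otherwise the SPC unit is left unmatched (None).
    """
    pooled = list(spc_scores.values()) + list(nts_scores.values())
    caliper = caliper_sd * pstdev(pooled)
    matches = {}
    for sid, s in spc_scores.items():
        nid, n = min(nts_scores.items(), key=lambda kv: abs(kv[1] - s))
        matches[sid] = nid if abs(n - s) <= caliper else None
    return matches, caliper


# Hypothetical, pre-computed propensity scores
spc_scores = {"s1": 0.30, "s2": 0.90}
nts_scores = {"n1": 0.32, "n2": 0.50}

matches, caliper = match_with_caliper(spc_scores, nts_scores)
```

Here `s1` is matched to `n1`, while `s2` has no NTS unit within the caliper and is left unmatched rather than being paired with a distant household.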

3. Notes

The approach in this paper uses PSM to:

  • Step 1: match each household from the population to households in the sample (multiple households from the sample can be matched to a population household).
  • Step 2: match individuals from the population household to the closest individual from all matched households. They do not match all individuals from the population household to individuals from one sample household. This means that they do not maintain household integrity: different individuals in the same population household could be matched to individuals from different sample households.

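Our approach maintains household integrity: all individuals in a population household are matched to individuals from the same sample household.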

4. References

Sequence Alignment

Propensity Score Matching