Handling of Mexico City survey data for scenario generation #26
base: master
Conversation
@rakow What do you think about merging this branch into master? To be able to handle the Mexican dataset I had to make some changes to the general scripts (preparation.py, __init__.py, ...). So it would take us / me some more work to make those changes configurable, or better said, to make the general data handling scripts more flexible, i.e. able to handle a wider spectrum of specific datasets which do not assume the application of German law (like MiD and SrV).
Thank you, I really like the idea of making the scripts more generally applicable. I will take a look at what you did in the coming weeks.
Whenever you find the time, feel free to contact me about this, as I already have some ideas about which segments to generalize.
@@ -17,11 +17,13 @@ def prepare_persons(hh, pp, tt, augment=5, max_hh_size=5, core_weekday=False, re

    # Augment data using p_weight
    if augment > 1:
        df = augment_persons(df, augment)
        # in the cdmx case we do not need to do p_weight * augment = 5 (see method augment_persons)
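For context, weight-based augmentation duplicates each survey respondent roughly in proportion to their person weight, so that the expanded sample approximates the population. The following is only an illustrative sketch of that idea; the real augment_persons implementation may differ, and only the column name p_weight is taken from the diff above.

    import numpy as np
    import pandas as pd

    def augment_by_weight(df: pd.DataFrame, augment: int = 5) -> pd.DataFrame:
        # Repeat each row about p_weight * augment times (at least once), so the
        # expanded table reflects the survey weights. Illustrative sketch only.
        repeats = np.maximum(1, np.rint(df["p_weight"] * augment)).astype(int).to_numpy()
        return df.loc[df.index.repeat(repeats)].reset_index(drop=True)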
prepare_persons should probably be split into multiple functions, so that you can use only the parts you need in your scenario.
Would you do this by defining sub-methods / -functions inside prepare_persons? I can try to do that if that's the way you want to go.
I will try to do it, as it requires changing the API and design a little bit.
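As an illustration of such a split (a hypothetical sketch only; the helper names, the hh_id column and the weekday filter are invented and do not reflect the actual refactoring), prepare_persons could delegate to smaller, composable steps so that a scenario like the CDMX one calls only the steps it needs:

    import pandas as pd

    def join_households(pp: pd.DataFrame, hh: pd.DataFrame) -> pd.DataFrame:
        # Attach household attributes to each person record (join column is an assumption).
        return pp.merge(hh, on="hh_id", how="left", suffixes=("", "_hh"))

    def filter_core_weekday(df: pd.DataFrame) -> pd.DataFrame:
        # Keep only persons reporting on a core weekday (Mon-Thu assumed here).
        return df[df["reporting_day"].between(1, 4)]

    def prepare_persons(hh, pp, tt, augment=5, max_hh_size=5, core_weekday=False):
        # Thin wrapper composing the individual steps; tt and max_hh_size are kept from
        # the original signature, their handling is omitted in this sketch.
        df = join_households(pp, hh)
        if core_weekday:
            df = filter_core_weekday(df)
        if augment > 1:
            df = augment_persons(df, augment)  # existing helper, see the diff above
        return df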
matsim/scenariogen/data/__init__.py (Outdated)
@@ -309,6 +313,7 @@ class Person:
    present_on_day: bool
    reporting_day: int
    n_trips: int
    home_district: str = ""
Shouldn't this belong to the household?
Household already has location and geometry. Is an additional attribute needed?
You are right, BUT this information is needed for the simple routing in the next activity-sampling step (because the survey data does not provide leg lengths). It is added to the persons because I do not want to read the whole households.csv in the next step just for one parameter (the persons / activities datasets are already huge files).
I see the problem, but I generally don't like duplicating information. CSV reading should be super fast; is it really a concern?
Yes, we are already talking about 4 GB combined just for persons.csv and activities.csv. Therefore I cannot run it on my hardware and have to run it on the math cluster, which is annoying for debugging and testing. You have to take into account that we are dealing with an area of about 20 million inhabitants, which is way above what we usually handle (e.g. Berlin-Brandenburg).
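For illustration, one way to avoid storing home_district on every Person would be to merge it in from households.csv only where it is needed, reading just the relevant columns and streaming the large persons file in chunks. This is a sketch under the assumption that both files share a household id column; the column name hh_id and the output file name are invented.

    import pandas as pd

    # Read only the two columns that are actually needed from the household file.
    hh_districts = pd.read_csv("households.csv", usecols=["hh_id", "home_district"])

    # Stream the large persons file in chunks and attach the district via a merge;
    # each enriched chunk is appended to an output file instead of kept in memory.
    first = True
    for chunk in pd.read_csv("persons.csv", chunksize=1_000_000):
        enriched = chunk.merge(hh_districts, on="hh_id", how="left")
        enriched.to_csv("persons_with_district.csv", mode="w" if first else "a",
                        header=first, index=False)
        first = False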
With this PR a new dataformat for surveys "eodmx" is added. The code, which uses the data formats is adapted, such that it can handle the new data format. The survey EOD2017 (Encuesta Origen Destino) is undertaken for the metropolitan area of Mexico City (ZMVM) by INEGI (Instituto Nacional de Estadística y Geografía), the mexican secretary for statistics and geography.
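As a rough illustration of what adding such a survey format typically involves (a hypothetical sketch only; the function name, the raw column names and the harmonized schema are invented and are not the actual eodmx implementation), a reader converts the raw EOD2017 tables into a shared household / person / trip layout so that downstream scripts can stay format-agnostic:

    import pandas as pd

    def read_eodmx(hh_path: str, pp_path: str, tt_path: str):
        # Load the raw survey tables (households, persons, trips).
        hh = pd.read_csv(hh_path)
        pp = pd.read_csv(pp_path)
        tt = pd.read_csv(tt_path)

        # Rename survey-specific columns to a shared schema; all names here are
        # placeholders and would have to match the real EOD2017 column names.
        hh = hh.rename(columns={"id_hogar": "hh_id", "distrito": "home_district"})
        pp = pp.rename(columns={"id_persona": "p_id", "id_hogar": "hh_id"})
        tt = tt.rename(columns={"id_viaje": "t_id", "id_persona": "p_id"})

        return hh, pp, tt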