Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Id<Stop> instead of Arc<Stop>: Id as string #126

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from
Draft

Conversation

antoine-de
Copy link
Collaborator

@antoine-de antoine-de commented Mar 23, 2022

Same as #123 but without generational_indexes.

I think the indexes are really great for performance since:

  • all structures are then Vector (great random access and cache friendly iterations)
  • the ids are short (integer vs string)

I don't think we need those performance in this crate though. Having only typed index (a think wrapper over the real gtfs identifier) would be more convenient I think.
It would:

  • ease debug (easy to print the gtfs identifier, instead of having a meaningless integer)
  • ease serialisation (same, we can serialize the string right away)
  • be a more more close to the gtfs representation

both approached would need benchmark/real use case to see if they are worth it.

Note: I quite like this approach.

If we want to go there, there is still work to:

  • change all gtfs access to support getter by ID instead of by &str.
  • rework the readme
  • better doc

Note: One canvas of the approach is that this would enforce ALL ids to be valid. I think this could be quite a good thing, but this will break the reading of many datasets out there.

Same as #123 but
without generational_indexes.

I think the indexes are really great for performance since:
* all structures are then `Vector` (great random access and cache friendly
iterations)
* the ids are short (integer vs string)

I don't think we need those performance in this crate though. Having
only typed index (a think wrapper over the real gtfs identifier) would
be more convenient I think.
It would:
* ease debug (easy to print the gtfs identifier, instead of having a
  meaningless integer)
* ease serialisation (same, we can serialize the string right away)
* be a more more close to the gtfs representation

This is still *very* early WIP, I'm not convinced at all by the
ergonomics, I'd like to keep the property of the Index in
#123 that the `Id`
have at least existed at one point (even if I don't plan to ensure this
if an object is deleted).
@antoine-de antoine-de marked this pull request as draft March 23, 2022 16:55
@antoine-de
Copy link
Collaborator Author

I found that I could decrease drastically the number of String creation by implementing Borrow<str>, now I find the ergonomics not that bad

@antoine-de
Copy link
Collaborator Author

I think the API is quite nice now. I updated an example to show how we can use the stop_id now:

    // you can access a stop by a &str
    let _ = gtfs.get_stop_by_raw_id("stop1")?;

    let trip = gtfs.trips.get("trip1")?;
    let stop_id: &gtfs_structures::Id<gtfs_structures::Stop> =
        &trip.stop_times.first()?.stop;

    // or with a typed id if you have one

    // if no stops have been removed from the gtfs, you can safely access the stops by it's id
    let s = &gtfs.stops[stop_id];
    println!("stop name: {}", &s.name);

    // if some removal have been done, you can also you those method to get an Option<Stop>
    let s = gtfs.get_stop(stop_id)?;
    println!("stop description: {}", &s.description);

    // or you can access it via `stops.get`
    let s = gtfs.stops.get(stop_id)?;
    println!("stop location type: {:?}", &s.location_type);

    let mut s = gtfs.stops.get_mut(stop_id)?;
    s.code = Some("code".into());

@antoine-de
Copy link
Collaborator Author

Note: One canvas of the approach is that this would enforce ALL ids to be valid. I think this could be quite a good thing, but this will break the reading of many datasets out there.

RawGTFS reading would still be working, but not GTFS anymore.

Do anybody have some thoughts about this?

@fchabouis
Copy link
Contributor

Does it mean that if an ID is invalid, the gtfs validator would output a fatal error ?
I would not like it, because all other errors and warnings would be hidden by this fatal error. I prefer an error on Ids to be a simple error.

@antoine-de
Copy link
Collaborator Author

hum, there it's a library to read a GTFS. the change of id would however lead to GTFS not loading (since the ID<Stop> cannot be invalid). It would still be possible to read RawGTFS though, since they would still have raw String Ids.

For transport.data.gouv.fr's validator usecase, you'd keep all the warnings/error on the rawGTFS, but loose the error needing the GTFS (and that's a lot 😱 https://github.com/etalab/transport-validator/blob/9a04e6/src/validate.rs#L65-L78)

I agree it's a pity since it would make a lot of dataset unreadable, but at the same time foreign keys are quite important, and I don't feel good about having Option<Id<T>> (or bake an invalid state into the ID) everywhere to express the fact that they can be invalid.

Any ideas / thoughts on this?

@fchabouis
Copy link
Contributor

In our case, the business requirements are to make the validator "tolerant" (see this comment ) so that we get as many validation information as possible, and not just a single fatal error...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants