Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move database.bin generation to an xtask #62

Closed
bheylin opened this issue Nov 6, 2023 · 11 comments
Closed

Move database.bin generation to an xtask #62

bheylin opened this issue Nov 6, 2023 · 11 comments

Comments

@bheylin
Copy link

bheylin commented Nov 6, 2023

The database.bin is currently generated in the build.rs.
This is unnecessary and has a few downsides.

  • The database.bin only needs to be generated when the assets/PhoneNumberMetadata.xml is updated.
  • build.rs files can be seen as suspicious. A build.rs file executing on my local machine may be outside of the risks I'm willing to take
  • The dependencies used in build.rs are required to build the project.

I propose that the generation of database.bin is moved into an xtask.

cargo-xtask is way to add free-form automation to a Rust project, a-la make, npm run or bespoke bash scripts.

The two distinguishing features of xtask are:

It doesn't require any other binaries besides cargo and rustc, it fully bootstraps from them
Unlike bash, it can more easily be cross platform, as it doesn't use the shell.

This means restructuring the project into a workspace where one of the members is the xtask package.
But the benefits are:

  • The xtask is independent from the published package
    • When it's run is independent, it only needs to be run manually when assets/PhoneNumberMetadata.xml is updated.
    • it's dependencies are independent, the published package has fewer dependencies
    • There's no need for a build.rs anymore
@fabricedesre
Copy link
Contributor

  • build.rs files can be seen as suspicious. A build.rs file executing on my local machine may be outside of the risks I'm willing to take

Maybe playing devil's advocate (I'm not familiar with xtask) but how is that different from running the task from a risk perspective?

@bheylin
Copy link
Author

bheylin commented Nov 6, 2023

  • build.rs files can be seen as suspicious. A build.rs file executing on my local machine may be outside of the risks I'm willing to take

Maybe playing devil's advocate (I'm not familiar with xtask) but how is that different from running the task from a risk perspective?

Fair question. From a risk perspective...

  • The build.rs is run automatically at least once on every machine that downloads the phonenumber package.
  • The xtask is run manually by one of the project maintainers when the assets/PhoneNumberMetadata.xml is updated. The bincode generated by the xtask would then be checked in to source control. As an added bonus the commit record shows when the assets/PhoneNumberMetadata.xml was updated and when the bincode was updated.

@rubdos
Copy link
Member

rubdos commented Nov 6, 2023

I think I disagree with your security model. If you mistrust a rather simple and straightforward build.rs file supplied by the rust-phonenumber developers, then I would also mistrust a checked in binary file that is manually updated (although verifiable), and I would also mistrust anything that remotely touched proc-macros.

I really don't think the benefit in terms of your security model and build time outweighs the mental overhead required for us to maintain an xtask package and to run the xtask whenever the xml changes. If I need to weigh "manually running and maintaining a separate package for the sake of loading a file" against "38 lines of build.rs", I think I know which one wins.

That said, I agree with the sentiment that a bincode-formatted file that is used for nearly all users is not optimal, which you report in #63. I'll comment on that tomorrow!

@bheylin
Copy link
Author

bheylin commented Nov 6, 2023

Hi @rubdos

I get that the build.rs reduces the maintenance work, and I take what you say about this seriously.
I appreciate the work you've put into developing this library and my aim is not to trivialize your time
or make your life harder 😸

My issue with the build.rs is more that it's work that's needlessly happening on my machine,
and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

We can talk back and forth about security, but that's more a symptom of the issue.
My aim here is to only reference the critical dependencies in the published lib.

This ties in with issue #64 too. As depending on the outcome of that conversation, maybe we can shift the dependencies needed for creating the database into another package. We certainly don't require quick_xml to parse a phonenumber once the database exists. 😸

In short, I would like to split the existing lib code into two parts;

  • database/bincode generation (maybe it's Rust, maybe it isn't)
  • phone-number parsing

Let's withhold talking about the implementation details of the data processing task
and instead imagine it as a task much like dependabot.
That's automatically triggered when the source xml is out of date.
It makes a PR and generates a bincode file for you to review and merge.

In that case I think your maintenance burden is not drastically increased.

  • The dependencies for data generation can be defined in a package that runs on Github's servers (and also locally to create custom databases?).
  • And the published rust-phonenumber package can be stripped down to it's simplest form.

What do you think?

@fabricedesre
Copy link
Contributor

My issue with the build.rs is more that it's work that's needlessly happening on my machine,
and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

Can't that be fixed by outputting cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml and splitting out the meta-data loader to be only part of the build stage?

@bheylin
Copy link
Author

bheylin commented Nov 6, 2023

My issue with the build.rs is more that it's work that's needlessly happening on my machine,
and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

Can't that be fixed by outputting cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml and splitting out the meta-data loader to be only part of the build stage?

From what I understand about build.rs, it will always be run once wherever it's being built. Using cargo:rerun-if-changed will trigger if to be run again if the given file changes.

I just noticed the assets/upgrade.sh script. I presume one of the maintainers is running this every few months.

Is this not the perfect place to trigger the generation of the database?

@rubdos
Copy link
Member

rubdos commented Nov 7, 2023

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

I still find your argumentation ill-posed. Compilation of the library itself is also work that's needlessly happening on your machine. Very many libraries have build.rs scripts.

I still feel like we have an x-y problem here. Do you take issue with the build time that the library poses? Because that, in my eyes, would be a rather good argument to look closely at how we can further optimize the build process. If you indeed take issue with the build time, then I would also like to see numbers and a benchmark, such that we can track the build script performance over time.

As an aside: after #63, I imagine we get rid of the bincode outputted file altogether. Maybe the database can even be put behind a feature flag.

We certainly don't require quick_xml to parse a phonenumber once the database exists. 😸

I did not know we still depended on quick_xml at runtime, that's indeed something to look at. That too could become a feature flag.

Can't that be fixed by outputting cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml and splitting out the meta-data loader to be only part of the build stage?

That fixes reruns, but those shouldn't really happen once the crate has been build as a dependency for your own project anyway. It would probably be a good thing to implement either way, because it speeds up the development of rust-phonenumber by reducing the number of build.rs runs.

Is this not the perfect place to trigger the generation of the database?

It would be the correct place to trigger it. But still:

  • Checking in binary files needlessly into version control feels wrong.
  • We create a security issue because checking in binary files into version control obfuscates the source tree for easy verification. This could be partially remedied by having CI check the consistency between the XML and the bincode file.

My suggested approach to clean things up:

  • Evaluate whether loading a bincode file is the most optimal way to load the database (it probably is not), remove the bincode dependency altogether, especially at runtime.
  • We find a nice way to fix Refactoring Database: const and faster loading, dropping bincode-loading #63: refactor the Database structure, make sure it can be loadable in const time and at compile time, without the need for quick_xml.
  • Evaluate whether Database loading (and thus the runtime quick_xml dependency) is used by the community (it probably is, let's not make any assumptions), and consider putting this behind a feature flag
  • Add cargo:rerun-if-changed to the build.rs (Add rerun-if-changed clause to build.rs on Metadata #64)

@bheylin
Copy link
Author

bheylin commented Nov 7, 2023

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

I still find your argumentation ill-posed. Compilation of the library itself is also work that's needlessly happening on your machine. Very many libraries have build.rs scripts.

I still feel like we have an x-y problem here. Do you take issue with the build time that the library poses? Because that, in my eyes, would be a rather good argument to look closely at how we can further optimize the build process. If you indeed take issue with the build time, then I would also like to see numbers and a benchmark, such that we can track the build script performance over time.

My problem is the amount of dependencies being brought into a project I maintain. Not build times per se, but of course more dependencies mean longer build times.

We review our dependencies periodically, and the rust-phonenumber came up because it actually caused an issue in a script I use to upgrade deps. This is the only dependency we have the uses metadata in it's semver. But that's another story 😆
And my script was at fault there.

As an aside: after #63, I imagine we get rid of the bincode outputted file altogether. Maybe the database can even be put behind a feature flag.

Yea I find the Database as part of the public API to be a strange use case. Neither the C++ or Java impl have this feature from what I can see.

I think it's a good idea to at least feature gate it, and consider deprecating it and seeing if you get feedback on usage in the wild.

We certainly don't require quick_xml to parse a phonenumber once the database exists. 😸

I did not know we still depended on quick_xml at runtime, that's indeed something to look at. That too could become a feature flag.

👍 much appreciated!

Is this not the perfect place to trigger the generation of the database?

It would be the correct place to trigger it. But still:

* Checking in binary files needlessly into version control feels wrong.

* We create a security issue because checking in binary files into version control obfuscates the source tree for easy verification. This could be partially remedied by having CI check the consistency between the XML and the bincode file.

Yes I agree that checking in a binary is opaque. The C++ impl uses protobufs from what I see, which has the sames downsides.
I think the most transparent solution is generating Rust code. You could even move the runtime processing of the Vec<Metadata> into a Database into build time. This would presumably solve #26.
As the default database is simply embedded into the binary in it's final form.

My suggested approach to clean things up:

* [ ]  Evaluate whether loading a `bincode` file is the most optimal way to load the database (it probably is not), remove the `bincode` dependency altogether, especially at runtime.

👍

* [ ]  We find a nice way to fix [Make database static #63](https://github.com/whisperfish/rust-phonenumber/issues/63): refactor the `Database` structure, make sure it can be loadable in `const` time and at compile time, without the need for `quick_xml`.

👍 If you remove the Database from the API you can simplify the Database even further, by not requiring lazy_static or std::sync::Once. And I think the Database may not even need Arcs or Mutexes if restructured a little.
But that's a different issue as such.

* [ ]  Evaluate whether `Database` loading (and thus the runtime `quick_xml` dependency) is used by the community (it probably is, let's not make any assumptions), and consider putting this behind a feature flag

👍 Like I said above, I would go so far as to deprecate it and gather feedback on resistance to that.
If the database loads quickly then I don't see a use case for providing a custom database.

* [ ]  Add `cargo:rerun-if-changed` to the `build.rs`

👍

@rubdos
Copy link
Member

rubdos commented Nov 7, 2023

Cool, I think we reached an agreement about basically all points then. I'm in favour of getting the dependency tree smaller, but it should really not go at the cost of a maintenance burden.

I suggest I close this specific issue, and I create issues for the tasks we discussed. Would you agree with that?

@bheylin
Copy link
Author

bheylin commented Nov 7, 2023

Cool, I think we reached an agreement about basically all points then. I'm in favour of getting the dependency tree smaller, but it should really not go at the cost of a maintenance burden.

I suggest I close this specific issue, and I create issues for the tasks we discussed. Would you agree with that?

Yep let's close this issue, the issues you outlined are more focused.

Thanks for your openness 😉

@rubdos
Copy link
Member

rubdos commented Nov 8, 2023

Closing in favour of #64 and #63

@rubdos rubdos closed this as completed Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants