
feat: allow importing complete datasets #783

Merged: 2 commits into trustification:main on Sep 12, 2024

Conversation

@ctron ctron (Contributor) commented Sep 11, 2024

This adds an endpoint for ingesting an archive of documents. The idea is to have a simpler way of ingesting "datasets", like the "dataset 1" that we had in Trustification.

This also adds a "dataset 3", which consists of "dataset 1" plus the matching CVE files, so that all information is present in the UI.

The original idea was to ingest this using the existing "detect format" code path. However, it doesn't fit that pattern:

  • Because the archive itself is not stored as a document, so the result isn't an IngestResult and has no ID or digest
  • Because it actually produces multiple IngestResults, one for each file in the archive, and that information should be reported back

That's why this PR adds a new endpoint for this specific use case. In the future, if we get more endpoints with a similar pattern, we could make this more generic.
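
For a rough picture of what "one result per file" could mean in practice, here is a minimal, self-contained Rust sketch that walks a ZIP archive and ingests each entry on its own, collecting a result per file. The `ingest_document` function and `DatasetIngestResult` type are made-up stand-ins for illustration, not the actual API added by this PR:

use std::io::{Cursor, Read};

use zip::ZipArchive;

// one (file name, document id) pair per file in the archive
struct DatasetIngestResult {
    files: Vec<(String, String)>,
}

// hypothetical stand-in for the real ingestion call (format detection + storage)
fn ingest_document(_name: &str, _data: &[u8]) -> anyhow::Result<String> {
    Ok("generated-document-id".to_string())
}

fn ingest_dataset(archive_bytes: &[u8]) -> anyhow::Result<DatasetIngestResult> {
    let mut archive = ZipArchive::new(Cursor::new(archive_bytes))?;
    let mut files = Vec::new();

    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        if !entry.is_file() {
            continue;
        }
        let name = entry.name().to_string();
        let mut data = Vec::new();
        entry.read_to_end(&mut data)?;
        // ingest each document individually and keep its result for the response
        let id = ingest_document(&name, &data)?;
        files.push((name, id));
    }

    Ok(DatasetIngestResult { files })
}
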

@ctron ctron (Contributor, Author) commented Sep 11, 2024

@carlosthe19916 It might make sense to provide a UI for that too. So that a user interested in testing the system can be instructed to just upload ds1.zip and play with it.

@ctron ctron force-pushed the feature/dataset_import_1 branch 4 times, most recently from 7f984aa to 2baecaf on September 11, 2024 10:36
@ctron ctron force-pushed the feature/dataset_import_1 branch 2 times, most recently from 13d2a0a to 5bc3ced on September 11, 2024 13:52
@@ -302,7 +302,7 @@ impl InitData {
 fn configure(
     svc: &mut web::ServiceConfig,
     db: db::Database,
-    storage: impl Into<DispatchBackend>,
+    storage: impl Into<DispatchBackend> + Clone,
A Collaborator commented:
oh interesting, I did not know one could do this
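
In case it's useful for others: an `impl Trait` argument can carry several bounds at once, which is what `impl Into<DispatchBackend> + Clone` does here. Below is a tiny standalone sketch of the same shape, using a made-up `Backend` type instead of the real `DispatchBackend`:

#[derive(Clone, Debug)]
struct Backend(String);

impl From<&str> for Backend {
    fn from(value: &str) -> Self {
        Backend(value.to_string())
    }
}

// the extra `Clone` bound lets the function hand the same storage value
// to more than one consumer before converting it
fn configure(storage: impl Into<Backend> + Clone) {
    let for_ingest: Backend = storage.clone().into();
    let for_dataset: Backend = storage.into();
    println!("{for_ingest:?} / {for_dataset:?}");
}

fn main() {
    configure("s3");
}
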

@JimFuller-RedHat JimFuller-RedHat (Collaborator) left a comment:

I did not find any tests, though I suspect this feature is mostly ad hoc and helps with data movement for testing purposes ... LGTM

@ctron ctron added this pull request to the merge queue Sep 12, 2024
Merged via the queue into trustification:main with commit 9722fdc Sep 12, 2024
4 checks passed
@ctron ctron deleted the feature/dataset_import_1 branch September 12, 2024 15:15
@jcrossley3 jcrossley3 (Contributor) left a comment:


I wish you had broken out your ingestion changes into a separate commit, or even a separate PR. Those are pretty significant changes to bury in a "datasets" feature PR, I think.

Comment on lines -169 to -175
let stream = self
    .storage
    .retrieve(result.key())
    .await
    .map_err(Error::Storage)?
    .ok_or_else(|| Error::Storage(anyhow!("file went missing during upload")))?;

A Contributor commented:

We were intentionally retrieving the doc after storing it to ensure that we could. If we don't find out now, we'll never know. Are we sure we want to eliminate this failsafe?

ctron (Contributor, Author) replied:

The reason for retrieving it was actually that the stream was consumed at that point. From my side, it was never intended as a check.

Since none of the internal APIs actually uses that stream, it didn't seem reasonable to keep the pattern. We load the full document into memory anyway. So why store megabytes of SBOMs, and then load them again? If we don't trust S3 when it says "ok, I stored it", I think we have a different problem.
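
To make that trade-off concrete, here is a small sketch of the pattern described above, with a dummy `Storage` type standing in for the real backend; this is just an illustration of the reasoning, not the actual code. The document is already buffered in memory, so after storing it the same buffer can be reused instead of streaming the document back out of storage:

// dummy storage type, standing in for the real (e.g. S3-backed) backend
struct Storage;

impl Storage {
    fn store(&self, _data: &[u8]) -> anyhow::Result<String> {
        // pretend the backend acknowledged the write
        Ok("storage-key".to_string())
    }
}

fn ingest(storage: &Storage, data: Vec<u8>) -> anyhow::Result<String> {
    // store the document and trust the backend's acknowledgement ...
    let key = storage.store(&data)?;
    // ... then keep working with the buffer we already hold,
    // instead of retrieving the document again just to read it back
    let _text = std::str::from_utf8(&data)?;
    Ok(key)
}
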

@ctron ctron mentioned this pull request Sep 16, 2024