
feat: allow importing complete datasets #783

Merged: 2 commits into trustification:main on Sep 12, 2024

Conversation

@ctron ctron (Contributor) commented Sep 11, 2024

This adds an endpoint for ingesting an archive of documents. The idea is to have a simpler way of ingesting "datasets", like the "dataset 1" that we had in Trustification.

This also adds a "dataset 3", which consists of "dataset 1" plus the matching CVE files, so that all information is present in the UI.

The original idea was to ingest this using the existing "detect format" code path. However, it doesn't fit that pattern:

  • Because the archive itself is not stored as a document, so the result isn't an IngestResult and has no ID or digest
  • Because it actually produces multiple IngestResults, one for each file in the archive, and that information should be reported back

That's why this PR adds a new endpoint for this specific use case. In the future, if we get more endpoints with a similar pattern, we could make this more generic.
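
For a rough picture of what "one result per file" could mean in practice, here is a minimal, self-contained Rust sketch that walks a ZIP archive and ingests each entry on its own, collecting a result per file. The `ingest_document` function and `DatasetIngestResult` type are made-up stand-ins for illustration, not the actual API added by this PR:

use std::io::{Cursor, Read};

use zip::ZipArchive;

// one (file name, document id) pair per file in the archive
struct DatasetIngestResult {
    files: Vec<(String, String)>,
}

// hypothetical stand-in for the real ingestion call (format detection + storage)
fn ingest_document(_name: &str, _data: &[u8]) -> anyhow::Result<String> {
    Ok("generated-document-id".to_string())
}

fn ingest_dataset(archive_bytes: &[u8]) -> anyhow::Result<DatasetIngestResult> {
    let mut archive = ZipArchive::new(Cursor::new(archive_bytes))?;
    let mut files = Vec::new();

    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        if !entry.is_file() {
            continue;
        }
        let name = entry.name().to_string();
        let mut data = Vec::new();
        entry.read_to_end(&mut data)?;
        // ingest each document individually and keep its result for the response
        let id = ingest_document(&name, &data)?;
        files.push((name, id));
    }

    Ok(DatasetIngestResult { files })
}
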

@ctron ctron (Contributor, Author) commented Sep 11, 2024

@carlosthe19916 It might make sense to provide a UI for that too. So that a user interested in testing the system can be instructed to just upload ds1.zip and play with it.

@ctron ctron force-pushed the feature/dataset_import_1 branch 4 times, most recently from 7f984aa to 2baecaf on September 11, 2024 10:36
@ctron ctron force-pushed the feature/dataset_import_1 branch 2 times, most recently from 13d2a0a to 5bc3ced on September 11, 2024 13:52
@@ -302,7 +302,7 @@ impl InitData {
 fn configure(
     svc: &mut web::ServiceConfig,
     db: db::Database,
-    storage: impl Into<DispatchBackend>,
+    storage: impl Into<DispatchBackend> + Clone,
A Collaborator commented:
oh interesting, I did not know one could do this
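
In case it's useful for others: an `impl Trait` argument can carry several bounds at once, which is what `impl Into<DispatchBackend> + Clone` does here. Below is a tiny standalone sketch of the same shape, using a made-up `Backend` type instead of the real `DispatchBackend`:

#[derive(Clone, Debug)]
struct Backend(String);

impl From<&str> for Backend {
    fn from(value: &str) -> Self {
        Backend(value.to_string())
    }
}

// the extra `Clone` bound lets the function hand the same storage value
// to more than one consumer before converting it
fn configure(storage: impl Into<Backend> + Clone) {
    let for_ingest: Backend = storage.clone().into();
    let for_dataset: Backend = storage.into();
    println!("{for_ingest:?} / {for_dataset:?}");
}

fn main() {
    configure("s3");
}
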

@JimFuller-RedHat JimFuller-RedHat (Collaborator) left a comment:

I did not find any tests, though I suspect this feature is mostly ad hoc and helps with data movement for testing purposes ... LGTM

@ctron ctron added this pull request to the merge queue Sep 12, 2024
Merged via the queue into trustification:main with commit 9722fdc Sep 12, 2024
4 checks passed
@ctron ctron deleted the feature/dataset_import_1 branch September 12, 2024 15:15
@jcrossley3 jcrossley3 (Contributor) left a comment:


I wish you had broken out your ingestion changes into a separate commit, or even a separate PR. Those are pretty significant changes to bury in a "datasets" feature PR, I think.

Comment on lines -169 to -175
let stream = self
    .storage
    .retrieve(result.key())
    .await
    .map_err(Error::Storage)?
    .ok_or_else(|| Error::Storage(anyhow!("file went missing during upload")))?;

A Contributor commented:

We were intentionally retrieving the doc after storing it to ensure that we could. If we don't find out now, we'll never know. Are we sure we want to eliminate this failsafe?

ctron (Contributor, Author) replied:

The reason for retrieving it was actually that the stream was consumed at that point. From my side, it was never intended as a check.

Since none of the internal APIs actually uses that stream, it didn't seem reasonable to keep the pattern. We load the full document into memory anyway. So why store megabytes of SBOMs, and then load them again? If we don't trust S3 when it says "ok, I stored it", I think we have a different problem.
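
To make that trade-off concrete, here is a small sketch of the pattern described above, with a dummy `Storage` type standing in for the real backend; this is just an illustration of the reasoning, not the actual code. The document is already buffered in memory, so after storing it the same buffer can be reused instead of streaming the document back out of storage:

// dummy storage type, standing in for the real (e.g. S3-backed) backend
struct Storage;

impl Storage {
    fn store(&self, _data: &[u8]) -> anyhow::Result<String> {
        // pretend the backend acknowledged the write
        Ok("storage-key".to_string())
    }
}

fn ingest(storage: &Storage, data: Vec<u8>) -> anyhow::Result<String> {
    // store the document and trust the backend's acknowledgement ...
    let key = storage.store(&data)?;
    // ... then keep working with the buffer we already hold,
    // instead of retrieving the document again just to read it back
    let _text = std::str::from_utf8(&data)?;
    Ok(key)
}
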

@ctron ctron mentioned this pull request Sep 16, 2024