[discuss] Support (fairly large) sample data set package #346

Open
8 tasks
majagrubic opened this issue May 25, 2022 · 32 comments
Labels
discuss Issue needs discussion Team:Ecosystem Label for the Packages Ecosystem team

Comments

@majagrubic

majagrubic commented May 25, 2022

Context:

The shared-ux team is re-architecting the way sample data works inside Kibana. We want to support larger datasets (1GB+ in size), and bundling them with the Kibana distributable does not scale. Our immediate goal is to support a large Observability data set. We have considered a few options so far:

  • host data externally, either make it a full-blown remote service (similar to what Maps does today) or simply load from a remote endpoint (like an S3 bucket)
  • dynamically generate as much of the data as we can on install
  • load the sample data from a side-loaded plugin (i.e. not in the distro, available only on Cloud)
  • have this as part of EPR

Is EPR a viable solution here?

Chatting with @joshdover, it seems that EPR has already solved some of the problems we'd encounter with hosting data externally (scalability / latency / monitoring). Also, it seems like downloading a zip file with all the assets is exactly what we need here. However, it was pointed out that adding a new package of 1GB+ would cause serious performance concerns with the current Docker image. One thing to keep in mind is we'd likely like to add more datasets in the future (perhaps not every one will be as large though).

Naming scheme

One dataset we'd like to support immediately is the Observability data set, which is a combination of filebeat + metricbeat indices and data views. Its naming scheme does not comply with the recommended elastic data stream naming scheme. Also, I am not sure if every sample data set we'd like to add in the future would need to follow this naming scheme?

Opening this issue so we can discuss whether leveraging EPR here is an option in the first place, and how much effort it would take to work around those problems.

Implementation plan

@majagrubic majagrubic added the discuss Issue needs discussion label May 25, 2022
@jlind23 jlind23 added the Team:Ecosystem Label for the Packages Ecosystem team label May 25, 2022
@joshdover
Contributor

However, it was pointed out that adding a new package of 1GB+ would cause serious performance concerns with the current Docker image. One thing to keep in mind is we'd likely like to add more datasets in the future (perhaps not every one will be as large though).

Since we're not targeting air-gapped customers with this sample data UX, I think we should consider creating a way to exclude these packages from the Docker image we ship for on-prem EPR. This would allow us to ship these packages in the near term in the production registry and leverage v2 storage as a long-term solution without creating a poor experience for on-prem EPR.

Its naming scheme does not comply with the recommended elastic data stream naming scheme. Also, I am not sure if every sample data set we'd like to add in the future would need to follow this naming scheme?

I do think we need to update this data set regardless to use the new naming scheme, though I understand why that may not be an option in the short-term. In general, we're building a lot of tooling and application features around this new naming scheme and I think it would be best to not allow exceptions in the long-term. I can't think of any reason why we couldn't extend this naming scheme to accommodate future data sets as well.

If we want to move forward with the current data set to install index templates and ingest data for metricbeat and filebeat indices, I think we should only do this as a one-time temporary exception that will not be supported once we've updated this dataset for the data stream naming scheme.


More generally, I think we need to create a new package type for this use case that is separate from integration and input packages. This would allow us to offer a different UX in the Integrations UI in Kibana, such as:

  • Installing and uninstalling data
  • Hiding all of the integration policy UX (policy editor, policies tab, etc.)
  • Excluding this package from UX that doesn't apply
  • Dependencies on other packages (eg. nginx package)
  • Providing standard naming conventions around sample data sets (eg. logs-nginx.access-sample.observability)
  • Provide standard UI elements for including or excluding sample data based on this naming scheme
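To make the naming-convention bullets above concrete, here is a minimal TypeScript sketch of how a UI could tell sample data streams apart by name. It assumes the `<type>-<dataset>-<namespace>` data stream naming scheme and assumes that sample data is marked by a namespace of `sample` or `sample.<something>`, as in the `logs-nginx.access-sample.observability` example. The helper names, the `SAMPLE_NAMESPACE` constant, and the list of types are hypothetical and not part of Fleet or the package spec.

```typescript
// Illustrative, possibly incomplete list of data stream types (assumption).
const KNOWN_TYPES = ['logs', 'metrics', 'traces', 'synthetics'];

// Hypothetical marker for sample data namespaces.
const SAMPLE_NAMESPACE = 'sample';

interface DataStreamName {
  type: string;
  dataset: string;
  namespace: string;
}

// Splits "<type>-<dataset>-<namespace>". The dataset and namespace cannot
// contain "-", so a conforming name has exactly three parts.
function parseDataStreamName(name: string): DataStreamName | undefined {
  const parts = name.split('-');
  if (parts.length !== 3) return undefined;
  const [type, dataset, namespace] = parts;
  if (KNOWN_TYPES.indexOf(type) === -1) return undefined;
  if (dataset === '' || namespace === '') return undefined;
  return { type, dataset, namespace };
}

// True when the name conforms to the scheme and its namespace is "sample"
// or starts with "sample." (e.g. "sample.observability").
function isSampleDataStream(name: string): boolean {
  const parsed = parseDataStreamName(name);
  if (parsed === undefined) return false;
  return (
    parsed.namespace === SAMPLE_NAMESPACE ||
    parsed.namespace.indexOf(SAMPLE_NAMESPACE + '.') === 0
  );
}
```

With this sketch, a legacy index name such as `filebeat-8.2.0` does not parse at all, which is one way to see why the current Observability data set would need renaming before it could use such a convention.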

@ruflin
Contributor

ruflin commented May 31, 2022

Its naming scheme does not comply with the recommended elastic data stream naming scheme.

I consider it a must to convert any observability data in packages to the data stream naming scheme. Can you elaborate on why we would not convert it now?

For the large data, I think there is also the option of having the package, with the data itself pulled down separately. The data is referenced in the package and still stored in EPR somehow. In the early days of packages we considered adding short movies to each package to explain the integration; in that scenario they would also not have been part of the package directly, only referenced. I'm sure we can come up with good ideas here.

@ruflin
Contributor

ruflin commented May 31, 2022

For reference, data packages were discussed in the past here: #37

@LucaWintergerst

We did not want to convert an old dataset into the new schema, as we worried that this would lead to issues and be more work. But we already agreed that we will just re-record the dataset on the latest version, so we will have data streams for all sources as well as an up-to-date mapping for everything.

@ruflin
Contributor

ruflin commented Jun 1, 2022

There is a difference between having data streams and following the data stream naming scheme. As this is new data and we want to teach our users about the new and more efficient way of managing data, it should not just use data streams but also follow the data stream naming scheme.

@majagrubic
Author

majagrubic commented Jun 2, 2022

If I'm understanding this correctly:

  1. with some work, the Observability dataset would be reorganized to use data streams, and those data streams would conform to Elastic's naming policy. with this work, naming would no longer be an issue
  2. sample data would be a package, but the actual data would live in another package, to avoid building the data in a Docker image. a new package type would need to be added to support this.

@joshdover
Contributor

2. sample data would be a package, but the actual data would live in another package, to avoid building the data in a Docker image. a new package type would need to be added to support this.

I believe the idea is there would still be a single package for each sample data set, but the data itself would not be contained in the package's zip file, but instead it'd be served on a separate endpoint from EPR. I'm not sure what the benefits of this would be. It seems to me it would make distributing the package more complicated and having everything contained in a single file is quite useful.

@jsoriano do you have any input here? How could we optimize the memory overhead of serving such packages from the registry?

@jsoriano
Member

jsoriano commented Jun 3, 2022

  2. sample data would be a package, but the actual data would live in another package, to avoid building the data in a Docker image. a new package type would need to be added to support this.

I believe the idea is there would still be a single package for each sample data set, but the data itself would not be contained in the package's zip file, but instead it'd be served on a separate endpoint from EPR. I'm not sure what the benefits of this would be. It seems to me it would make distributing the package more complicated and having everything contained in a single file is quite useful.

@jsoriano do you have any input here? How could we optimize the memory overhead of serving such packages from the registry?

With the package storage v2 design, serving large files from current EPR endpoints shouldn't be a problem; at some point the package zip files will be served directly from CDNs and/or redirected to S3-like object storages.

I also don't see any advantage in serving these datasets from a different endpoint, when we are already designing EPR to support large files. We would only need to define a new package type for these assets, and we could reuse everything we are building to publish and distribute packages.
There can be an advantage during development: not needing to keep the large files in the same repository as the package code. But this could be solved with build-time dependencies, as we do with fields imported from ECS.

The only [temporary] blocker I see with including large files in packages, is that we would need to wait to have the storage v2 up and running. If serving datasets through packages is required before that, then serving the large files from the registry is not an option. But I would recommend waiting for storage v2, so we have a simpler solution for this (simpler because it doesn't require deploying another service to resolve install-time dependencies).

Regarding the package-registry distributions in Docker images, we know that at some point we have to start selecting the packages included; it doesn't scale to include everything. We could exclude the dataset packages from the beginning, which I think should be acceptable for air-gapped production environments, the main use case of these Docker images.
If someone in an air-gapped environment wants to try some dataset, there can still be options for them (such as downloading the zip from the public EPR and using elastic/kibana#70582).

Btw, I opened #348 for different use cases of sample data sets, more specifically coupled to integrations, but the solution may be the same for both issues at the end.

@jsoriano
Member

jsoriano commented Jun 3, 2022

Summarizing, I think that this could be done with these overall efforts:

@majagrubic
Author

The only [temporary] blocker I see with including large files in packages, is that we would need to wait to have the storage v2 up and running

What is your timeline for v2?

@jsoriano
Member

jsoriano commented Jun 3, 2022

The only [temporary] blocker I see with including large files in packages, is that we would need to wait to have the storage v2 up and running

What is your timeline for v2?

This is currently one of our priorities. We expect it to be ready "soon", but it can still take some weeks. We track progress in this (internal) issue: https://github.com/elastic/ingest-dev/issues/1040.

@majagrubic
Author

If it's weeks, does that mean you are aiming for 8.4.0 then?

@mtojek
Contributor

mtojek commented Jun 3, 2022

Not exactly, as we have other tasks with higher priority. We will try to make 8.4.0, but I would rather aim for the next iteration.

@joshdover
Contributor

@jsoriano Thanks for the summary, could you update the issue description with the full implementation plan?

  • Add support for this new package type [This would be blocked till storage v2 is ready].

We don't need to block working on the spec changes, but I agree we can't ship any new packages until storage v2 is done. @majagrubic if you plan to leverage EPR for this feature (which I think you should), we can likely guide you on how to contribute to these spec changes.

Let's also note that we need to make changes in Kibana's Fleet plugin to make this work too, which should be included in the implementation plan. I suspect this is an area that @majagrubic and her team will be able to contribute to easily.

@jsoriano
Member

jsoriano commented Jun 9, 2022

Added implementation plan to the description, including support in Fleet and elastic-package.

@majagrubic
Author

majagrubic commented Jun 10, 2022

We are definitely planning to leverage EPR and the timeframe described by @mtojek works for us. We will wait until v2 is ready.

@majagrubic
Author

Just checking in: what is the status of storage v2? When can we expect to move forward with this?

@mtojek
Contributor

mtojek commented Jul 12, 2022

We're preparing for the switch and have started working with package developers to adapt their pipelines. The timeline strongly depends on team cooperation and any unexpected issues.

Technically speaking, the new endpoint is ready, but we need to adapt existing users. Based on their timelines and availability, we should launch it during 8.5.

Keep in mind that for us the top priority is to not break any existing packages and harm the user experience, even if we need to postpone the integration of new packages.

@mtojek
Contributor

mtojek commented Jul 13, 2022

@jsoriano Side question: do you think that we should prepare a special Docker distribution without large packages or we should be fine with distribution:production? Elastic-hosted Package Registry will be package-less, but I'm concerned about air-gapped customers forced to download relatively big images.

@jsoriano
Member

@jsoriano Side question: do you think that we should prepare a special Docker distribution without large packages or we should be fine with distribution:production? Elastic-hosted Package Registry will be package-less, but I'm concerned about air-gapped customers forced to download relatively big images.

We will have to think about the different use cases.

For air-gapped environments, for now, I think that it would be fine to have full images; hopefully there are going to be users of big packages in these environments. In the future, when the image grows more, we may also think about different ways of mirroring production, so only new packages need to be downloaded. Or offer something to build custom images with selected sets of packages, similar to what you did for the lite registry image.

For development environments, and getting-started use cases we probably want small images, but maybe we still want some big package included, as something to give ML a try.

@mtojek
Contributor

mtojek commented Jul 14, 2022

For air-gapped environments, for now, I think that it would be fine to have full images; hopefully there are going to be users of big packages in these environments. In the future, when the image grows more, we may also think about different ways of mirroring production, so only new packages need to be downloaded. Or offer something to build custom images with selected sets of packages, similar to what you did for the lite registry image.

Yes, I had a similar idea. Maybe we can offer another distribution standard with packages < 50MB? Do you think it's the moment or we should wait a bit?
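As a rough illustration of the size-limited distribution idea, the selection could look like the TypeScript sketch below. The `PackageEntry` shape and the function name are hypothetical; the registry does not necessarily expose package sizes this way, and only the 50MB threshold comes from the suggestion above.

```typescript
// Hypothetical description of a package as seen by a distribution build step.
interface PackageEntry {
  name: string;
  version: string;
  sizeBytes: number;
}

const MB = 1024 * 1024;

// Keeps only packages strictly smaller than the limit (50 MB by default),
// so large sample data packages would be left out of the image.
function selectForSizeLimitedDistribution(
  packages: PackageEntry[],
  maxSizeBytes: number = 50 * MB
): PackageEntry[] {
  return packages.filter((pkg) => pkg.sizeBytes < maxSizeBytes);
}
```

A build step could pass the full package list through this filter before copying packages into the image, leaving the production distribution untouched.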

For development environments, and getting-started use cases we probably want small images, but maybe we still want some big package included, as something to give ML a try.

Development environments can go with the bare Package Registry image plus the feature to combine local search results with Elastic-hosted ones. I think that's one of the tasks in the next milestone.

@jsoriano
Member

Yes, I had a similar idea. Maybe we can offer another distribution standard with packages < 50MB? Do you think it's the moment or we should wait a bit?

Let's wait to see if we need it.

@jsoriano
Member

As discussed in the meeting today, and given that storage v2 is close to being ready, the implementation plan in the description could be unblocked. Cc'ing @jen-huang @kpollich @nimarezainia for prioritization.

I have added a reference to #406 in the implementation plan.

@jen-huang
Contributor

@jsoriano, is #351 fully applicable here? As the sample data use case here is not a "DLC" but rather standalone.

Another question that comes to mind is how uninstalling and upgrading of a sample data package will work (or not work!). I know that currently we have some "special" handling for certain types of assets (ML jobs, security rules?) where the old version of the assets do not get removed/overwritten on upgrade.. possibly even on uninstall? @kpollich Could you verify?

@jsoriano
Member

@jsoriano, is #351 fully applicable here? As the sample data use case here is not a "DLC" but rather standalone.

I guess that we could allow these packages with additional content to be standalone too. I don't see a problem with relaxing the requirement of this dependency.

But as #351 is being defined now, these "DLC" packages would be basically like integration packages, but without anything related to data streams or ingestion. If data streams need to be defined for this data, it could make sense to define both, a normal integration package with the definition of the data streams, that maybe can also be used to ingest additional data, and the "DLC" package with the example data. It could be also possible to have different sets of sample data for the same data streams, with multiple "DLC" packages for one integration package.

Another question that comes to mind is how uninstalling and upgrading of a sample data package will work (or not work!). I know that currently we have some "special" handling for certain types of assets (ML jobs, security rules?) where the old version of the assets do not get removed/overwritten on upgrade.. possibly even on uninstall?

As "DLC" packages are being described now, they are mostly like integration packages, but without data streams. I guess that assets defined there could be installed, uninstalled and upgraded the same way as the assets of any other package.

@joshdover
Contributor

+1 on supporting both options, where a package can contain data stream definitions and actual data as well as being able to depend on another package's data streams to ingest data into. I think we will need both.

It's not clear to me if these need to be separate package types or if they should just be enhancements to the integration package type. Especially if we want to support both use cases, I think starting with enhancements to the existing integration package may be a simpler place to start.

@kpollich
Member

Another question that comes to mind is how uninstalling and upgrading of a sample data package will work (or not work!). I know that currently we have some "special" handling for certain types of assets (ML jobs, security rules?) where the old version of the assets do not get removed/overwritten on upgrade.. possibly even on uninstall? @kpollich Could you verify?

For ML jobs, we install them via putTrainedModel and swallow any "resource already exists" errors. So Fleet does attempt to overwrite the model on update; it's just that Elasticsearch doesn't support that operation if the model was already created.

https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/epm/elasticsearch/ml_model/install.ts#L71-L110
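The "swallow the already-exists error" pattern described above can be sketched in TypeScript as follows. This is not the code from the linked file: the `EsLikeError` shape, with the error type flattened onto a `type` field, and both helper names are hypothetical, and it assumes the failure surfaces with the type `resource_already_exists_exception`.

```typescript
// Hypothetical, simplified error shape (assumption): a real client error
// nests the error type more deeply in the response body.
interface EsLikeError {
  type?: string;
}

// Assumed error type for "the resource was already created".
const ALREADY_EXISTS_TYPE = 'resource_already_exists_exception';

// True only for errors that report an already-existing resource.
function isAlreadyExistsError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  return (err as EsLikeError).type === ALREADY_EXISTS_TYPE;
}

// Error handler: ignores "already exists" and rethrows everything else,
// so genuine installation failures are still reported.
function swallowAlreadyExists(err: unknown): void {
  if (isAlreadyExistsError(err)) return;
  throw err;
}
```

A handler like this could be attached with `.catch(swallowAlreadyExists)` to the promise returned by the install call, which makes re-running an installation idempotent without hiding other errors.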

For security rules, I think we do overwrite them on update.

https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/epm/kibana/assets/install.ts#L289-L294

We also run a cleanup process on update to remove package assets that reference the now-outdated previous version of a given package. Any asset that still exists on the new version of the package will have been created during the installation process, so the old ones are essentially orphaned.

https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/epm/packages/cleanup.ts

For sample data, we could define a new "asset type" that allows us to run a cleanup process that deletes the sample documents from Elasticsearch during uninstalls/updates.
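A minimal TypeScript sketch of how such a cleanup step might be planned, assuming a hypothetical `sample_data` asset type whose asset id is the name of the data stream holding the sample documents. The `InstalledAsset` shape and the action names are invented for illustration; Fleet's real installation records look different.

```typescript
// Hypothetical record of an asset installed by a package version.
interface InstalledAsset {
  id: string;
  type: string;
  packageVersion: string;
}

// Hypothetical asset type for sample documents (assumption).
const SAMPLE_DATA_TYPE = 'sample_data';

type CleanupAction =
  | { kind: 'delete_asset'; id: string }
  | { kind: 'delete_documents'; dataStream: string };

// Assets recorded under any version other than the current one are treated
// as orphaned. Sample data is cleaned up by deleting its documents; every
// other asset type is cleaned up by deleting the asset itself.
function planCleanup(assets: InstalledAsset[], currentVersion: string): CleanupAction[] {
  return assets
    .filter((asset) => asset.packageVersion !== currentVersion)
    .map((asset): CleanupAction =>
      asset.type === SAMPLE_DATA_TYPE
        ? { kind: 'delete_documents', dataStream: asset.id }
        : { kind: 'delete_asset', id: asset.id }
    );
}
```

Keeping the planning step separate from execution would make it easy to log or dry-run the cleanup before any documents are deleted.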

@jsoriano
Member

+1 on supporting both options, where a package can contain data stream definitions and actual data as well as being able to depend on another package's data streams to ingest data into. I think we will need both.

It's not clear to me if these need to be separate package types or if they should just be enhancements to the integration package type. Especially if we want to support both use cases, I think starting with enhancements to the existing integration package may be a simpler place to start.

It would actually be good to confirm the use cases. In the conversation that originated #351, I think we discussed that in the future we want more sample data sets for real integrations, as opposed to other sample data that we had, like the flight data and so on.

The need for separate package types comes from this use case. If one wants to provide sample data sets, for example for the Apache package, including them in the same package would increase its size unnecessarily for the majority of users.
If we only include in packages references to data sets available in other endpoints, we have to solve the problem of hosting these data sets.

An additional package type solves these problems. It can be installed and uninstalled separately, and it can be distributed through the registry. It also serves as a building block to solve additional problems, such as having multiple data sets for the same integration, or distributing additional optional content, such as dashboards or ML jobs.

If we have to support the use case of data sets like the flight data, which is not linked to any integration, I also think that it'd be better to start from something like the "DLC" package type, as it is closer to that. An integration package is intended to configure policies for agents, but the flight data set wouldn't have that.

@spong
Member

spong commented Jan 13, 2024

Would anyone be so kind to provide an update with regard to including sample data as described here and in #351?

We would love to ship Knowledge Base content for the Elastic Assistant this way, and having a package that can install a data stream and be primed with static data would be extremely powerful for us. We could make things work with the above requirements more so than with the #351 'DLC' variant, which proposes a sub-package and doesn't include data streams (we'd want a single package with data stream + data).

The initial use case would be shipping embeddings of our ES|QL docs to be used for the Assistant's query generation capabilities, but I did an internal POC demoing how they could be used to create 'little bundles of RAG', essentially mirroring custom GPTs. So as I said above, really powerful. 🙂

We're willing to help however we can to expedite whatever may be left of this project, just let us know!

@jsoriano
Member

Would anyone be so kind to provide an update with regard to including sample data as described here and in #351?

I think there hasn't been further progress on these issues.

having a package that can install a data stream and be primed with static data would be extremely powerful for us

Would this mean to have a data stream with some initial data, and then later the user can ingest more data in the same data stream? Would this initial data be eventually removed or it would need to be always there?

@spong
Member

spong commented Jan 17, 2024

Would this mean to have a data stream with some initial data, and then later the user can ingest more data in the same data stream?

I can see utility in both allowing users to ingest more data into the data stream, or locking it to prevent additions or mutation of the initial data. No hard requirement here though, whatever is the default/easiest is fine for an MVP.

Would this initial data be eventually removed or it would need to be always there?

The initial data should stick around for as long as the package is installed. If the package is removed, the data stream + initial data should be removed as well. When the package is updated, it would be fine to wipe the data stream/initial data and treat it as a fresh install. Again, whatever is easiest/most resilient would be fine for the first iteration here. No need to worry about appending new data on upgrade, or dealing with mapping changes, just delete the data streams and re-install/re-ingest the initial data.
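The lifecycle described above can be written down as a small TypeScript sketch. The event and step names are hypothetical labels for illustration, not existing Fleet operations. It follows the "simplest option" from this comment: an upgrade wipes the data streams and is then treated as a fresh install.

```typescript
type PackageEvent = 'install' | 'upgrade' | 'uninstall';

// Hypothetical step names (assumption), in the order they would run.
type Step = 'delete_data_streams' | 'create_data_streams' | 'ingest_initial_data';

// Maps a package lifecycle event to the ordered steps for its data streams.
function planSteps(event: PackageEvent): Step[] {
  switch (event) {
    case 'install':
      return ['create_data_streams', 'ingest_initial_data'];
    case 'upgrade':
      // No appending and no mapping migration: wipe, then reinstall.
      return ['delete_data_streams', 'create_data_streams', 'ingest_initial_data'];
    case 'uninstall':
      // The initial data lives only as long as the package is installed.
      return ['delete_data_streams'];
  }
}
```

Locking the data stream against later additions, mentioned as an optional behaviour, is deliberately left out of this sketch.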


If it's best to open a new issue with detailed requirements I can go ahead and do that, but to be honest the requirements here for supporting sample data appear to be more than sufficient to provide the base needs of this use case, so I see no reason to diverge efforts. Both Security and Observability Assistants are currently bundling our ES|QL docs with the Kibana distribution each release, so being able to deliver those via a stack-versioned package would be a win right out of the gate!

FWIW, I've been eager to ship security app sample data (internal) for some time now, so it would be great to help see this feature to fruition 🙂

@jsoriano
Member

@spong I think it'd be better to open a different issue for this. It is not only about big data, but also about the additional behaviour of installing another data stream whose data should persist during the life of the package.

10 participants