[filebeat] Elasticsearch state storage for httpjson and cel inputs #41446

Open
wants to merge 12 commits into
base: main

Conversation

aleksmaus
Member

@aleksmaus aleksmaus commented Oct 24, 2024

Proposed commit message

[filebeat] Elasticsearch state storage for httpjson input

This is a POC of Elasticsearch as a state store backend for security integrations in the agentless solution.

The scope of this change was narrowed to httpjson inputs only, in order to support the Okta integration for the initial release. All other integration inputs still use file storage as before.
This is a short-term solution for state storage in the k8s environment.

This is the first cut and the details can change depending on the feedback.

The feature can currently be enabled with the AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED environment variable; how it will be configured in k8s is still to be decided.

This change currently contains a hacky AGENTLESS_ELASTICSEARCH_APIKEY override. It lets the user provide an ApiKey with the elevated permissions required to create, write, and read the per-input state index. THIS IS FOR DEVELOPMENT/TESTING ONLY. REMOVE BEFORE THE MERGE.

The existing code relied on the input state storage being fully configured before the main beat manager runs. This change delays the configuration of the httpjson input until the actual configuration is received from the Agent.

The index template for the state storage indices is assumed to be in place before the storage is used:

PUT _index_template/agentless_state_template
{
  "index_patterns": [
    "agentless-state-*"
  ],
  "priority": 300,
  "template": {
    "mappings": {
      "properties": {
        "v": {
          "type": "object",
          "enabled": false
        },
        "updated_at": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    },
    "settings": {
      "number_of_shards": 1
    }
  }
}

Example of the state storage index content for Okta integration:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "agentless-state-httpjson-okta.system-028ecf4b-babe-44c6-939e-9e3096af6959",
        "_id": "httpjson::httpjson-okta.system-028ecf4b-babe-44c6-939e-9e3096af6959::https://dev-36006609.okta.com/api/v1/logs",
        "_seq_no": 39,
        "_primary_term": 1,
        "_score": 1,
        "_source": {
          "v": {
            "ttl": 1800000000000,
            "updated": "2024-10-24T20:21:22.032Z",
            "cursor": {
              "published": "2024-10-24T20:19:53.542Z"
            }
          }
        }
      }
    ]
  }
}
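For reviewers who want to inspect such a document programmatically, here is a minimal Go sketch that decodes the "_source" shown above. The struct and field types are assumptions inferred from the sample values (for instance, that ttl is a duration in nanoseconds); they are not the types used in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// stateDoc mirrors the "_source" of the example hit. The field types are
// assumptions inferred from the sample values, not the PR's actual types.
type stateDoc struct {
	V struct {
		TTL     time.Duration   `json:"ttl"`     // assumed to be nanoseconds
		Updated time.Time       `json:"updated"` // RFC 3339 timestamp
		Cursor  json.RawMessage `json:"cursor"`  // input-specific, kept opaque
	} `json:"v"`
}

// decodeState parses the "_source" JSON of a state document.
func decodeState(source []byte) (stateDoc, error) {
	var doc stateDoc
	err := json.Unmarshal(source, &doc)
	return doc, err
}

func main() {
	src := []byte(`{"v":{"ttl":1800000000000,"updated":"2024-10-24T20:21:22.032Z","cursor":{"published":"2024-10-24T20:19:53.542Z"}}}`)
	doc, err := decodeState(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc.V.TTL, doc.V.Updated.UTC(), string(doc.V.Cursor))
}
```

Under the nanosecond assumption, the sample ttl of 1800000000000 corresponds to 30 minutes. The cursor is kept as raw JSON because its shape depends on the input.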

The naming convention for every state store index is agentless-state-<input id>, since for agentless we expect only one agent per policy and the agents are ephemeral.
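As a rough illustration of the convention, the sketch below builds the index name and a document ID. The helper names are hypothetical, and the "::"-joined ID layout is only inferred from the sample "_id" in the search result above; the PR's code may build these differently.

```go
package main

import (
	"fmt"
	"strings"
)

const statePrefix = "agentless-state-"

// stateIndexName returns the per-input state index name, following the
// agentless-state-<input id> convention.
func stateIndexName(inputID string) string {
	return statePrefix + inputID
}

// stateDocID joins input type, input ID, and the input-specific key with "::".
// This layout is an assumption inferred from the sample "_id".
func stateDocID(inputType, inputID, key string) string {
	return strings.Join([]string{inputType, inputID, key}, "::")
}

func main() {
	id := "httpjson-okta.system-028ecf4b-babe-44c6-939e-9e3096af6959"
	fmt.Println(stateIndexName(id))
	fmt.Println(stateDocID("httpjson", id, "https://dev-36006609.okta.com/api/v1/logs"))
}
```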

Currently, running the agent with Elasticsearch state storage requires a couple of environment variables:

sudo AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED=1 AGENTLESS_ELASTICSEARCH_APIKEY=xxxxxxxx-xvpDXfB:jVMRsW7SRIxxxxxxxxx ./elastic-agent -e

where the ApiKey in the

DEPENDENCIES / TODOS:

  • Approval of teams for this approach
  • A Kibana (?) side change is required for bootstrapping the agentless-state index template
  • A change in Kibana or the integration package (or both) is required to include the permissions for agentless-state- in the Elasticsearch ApiKey (removing the hack). I suspect that the Kibana Fleet code could be modified to recognize an integration that supports agentless and include the proper agentless-state index name in the ApiKey permissions.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

The change should have no impact; without the feature enabled, filebeat should work as before, using file system storage for the state.

@aleksmaus aleksmaus self-assigned this Oct 24, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 24, 2024
@aleksmaus aleksmaus added the Team:Security-Deployment and Devices Deployment and Devices Team in Security Solution label Oct 24, 2024
Contributor

mergify bot commented Oct 24, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @aleksmaus? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Oct 24, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 24, 2024
@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 24, 2024
// Injecting the ApiKey that has enough permissions to write to the index
// TODO: need to figure out how add permissions for the state index
// agentless-state-<input id>, for example httpjson-okta.system-028ecf4b-babe-44c6-939e-9e3096af6959
apiKey := os.Getenv("AGENTLESS_ELASTICSEARCH_APIKEY")
Member Author

will collaborate with agentless team on addressing this part

Member

When running under Elastic agent, every change of the output configuration results in a restart of the Beat process, in case that simplifies anything here for you.

@aleksmaus aleksmaus changed the title [filebeat] Elasticsearch state storage for httpjson input [filebeat] Elasticsearch state storage for httpjson and cel inputs Oct 30, 2024
@aleksmaus
Member Author

@belimawr @cmacknz (or whoever wants to be involved and has the time)
I need your feedback on this draft: is this approach something that we could eventually merge? (The ApiKey workaround will be removed once we adjust Kibana Fleet.)
I think this is an OK solution given the circumstances:

  1. This is fully backwards compatible. If the feature is not enabled, everything works as before.
  2. The only inputs enabled for Elasticsearch-backed state storage are httpjson and cel. We are only enabling a limited number of integrations that rely on httpjson or cel inputs for the first release.
  3. The state initialization for the inputs is delayed until we get the configuration only if the feature is enabled for the input.
  4. The agent logs monitoring will still use local storage, since we agreed that losing the agent log when the pod is relocated is acceptable.

@cmacknz
Member

cmacknz commented Nov 1, 2024

@leehinman I'd appreciate a review here to make sure this can co-exist with Beats receivers in agent since that would be the long term way we plan to run agentless inputs.


// TODO: REMOVE THIS HACK BEFORE MERGE. LEAVING FOR TESTING FOR DRAFT
// Injecting the ApiKey that has enough permissions to write to the index
// TODO: need to figure out how add permissions for the state index
Member

Fleet knows when something is an agentless package and that is probably what would hook into this to generate the key.

We could add a new state storage section to an agent policy (agent.storage?) that Fleet knows how to template when this happens.

Agent could then send it down as another output unit with a new type (or we could define a new type of unit but that is even more work).

This would allow the key to update on the fly through Fleet and control protocol.

Member

This could also possibly be handled in the agentless api / controller and hidden from Fleet if we just inject it in as an env var. No opposition to that either really.

Member Author

> This could also possibly be handled in the agentless api / controller and hidden from Fleet if we just inject it in as an env var. No opposition to that either really.

I brought this up during the meeting today as an option. IMHO it's just one thing to manage, might be cleaner if all in one place in the policy.

Member

A couple of details we need to think about with respect to these keys is what the process should be for rotating and/or revoking them.


// List of input types Elasticsearch state store is enabled for
var esTypesEnabled = map[string]void{
"httpjson": {},
Member

Can this be configuration instead of in the code, maybe another env var?

Member Author

Sure can do. Something like this?
AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES=httpjson,cel

}

func (s *store) get(key string, to interface{}) error {
status, data, err := s.cli.Request("GET", fmt.Sprintf("/%s/%s/%s", s.index, docType, url.QueryEscape(key)), "", nil, nil)
Member

These requests should all be tied to a context.

Also, they probably need some minimum amount of retries.

The biggest design difference with ES is now the requests can fail. A file on disk doesn't give us 429 errors.
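To make the retry suggestion concrete, here is a minimal Go sketch of a context-aware retry loop with exponential backoff. It is not the PR's code: withRetry, classify, and errRetryable are hypothetical names, and treating 429 and 5xx as retryable (and everything else at or above 300 as permanent) is an assumption chosen for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errRetryable marks failures worth retrying (assumed here: 429 and 5xx).
var errRetryable = errors.New("retryable status")

// classify maps an HTTP status to nil, a retryable error, or a permanent one.
func classify(status int) error {
	switch {
	case status < 300:
		return nil
	case status == http.StatusTooManyRequests || status >= 500:
		return fmt.Errorf("%w: %d", errRetryable, status)
	default:
		return fmt.Errorf("permanent status: %d", status)
	}
}

// withRetry calls op up to attempts times, doubling the delay between tries.
// It stops early on success, on a permanent error, or when ctx is cancelled.
func withRetry(ctx context.Context, attempts int, delay time.Duration, op func(context.Context) (int, error)) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		status, err := op(ctx)
		if err == nil {
			err = classify(status)
			if err == nil || !errors.Is(err, errRetryable) {
				return err
			}
		}
		lastErr = err // transport errors are treated as retryable in this sketch
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return lastErr
}

// demo simulates two 429 responses followed by a 200 and reports the call count.
func demo() (int, error) {
	calls := 0
	err := withRetry(context.Background(), 5, time.Millisecond, func(context.Context) (int, error) {
		calls++
		if calls < 3 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusOK, nil
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println("calls:", calls, "err:", err)
}
```

A real store would probably also need to treat a 404 on get as "key not found" rather than as a permanent failure; that case is left out here.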

Member

At a very high level, it feels like the way we deal with this is:

  1. Don't start or allow the input to progress until it has successfully initialized the state at least once to avoid massively duplicating data.
  2. Writes are asynchronous from the caller's perspective and the latest state is continuously retried.
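The second point could look roughly like the Go sketch below: Set returns immediately, only the latest value per key is kept, and a background loop retries until the write succeeds. All names (asyncStore, persistFunc) are hypothetical, and the persist callback stands in for an Elasticsearch write. The first point, blocking the input until the initial state load succeeds, is not covered here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// persistFunc stores one key/value pair; it stands in for an Elasticsearch write.
type persistFunc func(ctx context.Context, key string, value []byte) error

// asyncStore keeps only the latest value per key and writes it in the background.
type asyncStore struct {
	mu      sync.Mutex
	pending map[string][]byte
	wake    chan struct{}
	persist persistFunc
}

func newAsyncStore(p persistFunc) *asyncStore {
	return &asyncStore{pending: map[string][]byte{}, wake: make(chan struct{}, 1), persist: p}
}

// Set records the newest value and returns at once; older unsent values are dropped.
func (s *asyncStore) Set(key string, value []byte) {
	s.mu.Lock()
	s.pending[key] = value
	s.mu.Unlock()
	select {
	case s.wake <- struct{}{}:
	default: // a wake-up is already queued
	}
}

// run flushes pending values until ctx is cancelled, retrying after retryDelay.
func (s *asyncStore) run(ctx context.Context, retryDelay time.Duration) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-s.wake:
		}
		for !s.flush(ctx) {
			select {
			case <-ctx.Done():
				return
			case <-time.After(retryDelay):
			}
		}
	}
}

// flush tries to persist every pending key and reports whether all writes succeeded.
func (s *asyncStore) flush(ctx context.Context) bool {
	s.mu.Lock()
	batch := s.pending
	s.pending = map[string][]byte{}
	s.mu.Unlock()
	ok := true
	for k, v := range batch {
		if err := s.persist(ctx, k, v); err != nil {
			s.mu.Lock()
			if _, newer := s.pending[k]; !newer {
				s.pending[k] = v // keep for retry unless a newer value arrived
			}
			s.mu.Unlock()
			ok = false
		}
	}
	return ok
}

// demo fails the first two writes and returns the value that is finally saved.
func demo() string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	var mu sync.Mutex
	failures, saved := 2, ""
	done := make(chan struct{})
	s := newAsyncStore(func(_ context.Context, _ string, v []byte) error {
		mu.Lock()
		defer mu.Unlock()
		if failures > 0 {
			failures--
			return errors.New("simulated 429")
		}
		saved = string(v)
		close(done)
		return nil
	})
	s.Set("cursor", []byte("a"))
	s.Set("cursor", []byte("b")) // replaces "a" before anything is sent
	go s.run(ctx, time.Millisecond)
	<-done
	mu.Lock()
	defer mu.Unlock()
	return saved
}

func main() {
	fmt.Println("saved:", demo())
}
```

The trade-off in this sketch is that intermediate cursor values are never written if a newer one arrives first, which seems acceptable for cursor-style state but would need checking against each input's expectations.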

Member Author

@aleksmaus aleksmaus Nov 4, 2024

> These requests should all be tied to a context.

Looks like the current implementation of the client uses the context

req, err := http.NewRequestWithContext(conn.reqsContext, method, url, body)

that is set when the client is constructed

for _, client := range clients {

conn.reqsContext = ctx

@cmacknz
Member

cmacknz commented Nov 1, 2024

> The state initialization for the inputs is delayed until we get the configuration only if the feature is enabled for the input.

To simplify the PR, is there any simplification in pulling this part out and/or just always delaying the store initialization when run under Elastic agent?

@cmacknz
Member

cmacknz commented Nov 1, 2024

For the rest of the PR, I think reviewing this would be easier if we had a design doc that addressed the following questions:

  1. Where the API key and ES configuration are going to come from. I imagine we are going to need things like rate limit configuration eventually in addition to the basics of a host+API key.
  2. How we are going to deal with the fact that the store operations are much more likely to fail or could experience brief or prolonged unavailability.
  3. How we expect this to integrate with the Beats receivers work. Probably the agent team is best positioned to help with this.

@leehinman
Contributor

> @leehinman I'd appreciate a review here to make sure this can co-exist with Beats receivers in agent since that would be the long term way we plan to run agentless inputs.

Still reviewing, but I wanted to point out that this won't work at all for a beat receiver. For a beat receiver the output (in the beat configuration part) will always be otelconsumer. The beat receiver never "sees" any of the exporter configuration (elasticsearch, kafka, redis, etc). I think for a beat receiver we would want to use the otel storage extension and pass that in.

@cmacknz
Member

cmacknz commented Nov 1, 2024

Yes an explicit storage extension in Beats itself would make this much easier to do. Unfortunately we don't have that.

@leehinman
Contributor

> Yes an explicit storage extension in Beats itself would make this much easier to do. Unfortunately we don't have that.

It would. But I was more thinking that we could modify the signature of NewBeatReceiver, so we could pass in a storage extension and store it in the beat.Info like we do for the LogConsumer. The filebeat Run function would then have access to this, so if it was present it could use it.

This would make the state store more like logging and the consumer, where configuration is handled at the otel level.

@aleksmaus
Member Author

Added AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES as requested in PR review.

Example: AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES="httpjson,cel"

Now no input types are enabled by default for Elasticsearch state storage.
Example of how to run the agent with the new flag:

sudo AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED=1 AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES="httpjson,cel" AGENTLESS_ELASTICSEARCH_APIKEY=fsOitZIBVlcA-mvxxxxx:jVMRsW7SRIOc-U6VHxxxxx ./elastic-agent -e
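For illustration, the two flags could be interpreted along the lines of the Go sketch below. This is not the code in the PR: the function names are hypothetical, and the parsing rules (strconv.ParseBool for the enable flag, trimming and lowercasing each input type) are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	envEnabled    = "AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED"
	envInputTypes = "AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES"
)

// parseInputTypes turns "httpjson, cel" into a set; an empty value enables nothing.
func parseInputTypes(raw string) map[string]struct{} {
	types := map[string]struct{}{}
	for _, t := range strings.Split(raw, ",") {
		if t = strings.ToLower(strings.TrimSpace(t)); t != "" {
			types[t] = struct{}{}
		}
	}
	return types
}

// esStoreEnabledFor reports whether inputType should use the Elasticsearch store.
// getenv is injected (os.Getenv in production) so the logic can be tested.
func esStoreEnabledFor(inputType string, getenv func(string) string) bool {
	on, err := strconv.ParseBool(getenv(envEnabled))
	if err != nil || !on {
		return false
	}
	_, ok := parseInputTypes(getenv(envInputTypes))[inputType]
	return ok
}

func main() {
	for _, t := range []string{"httpjson", "cel", "filestream"} {
		fmt.Printf("%s: %v\n", t, esStoreEnabledFor(t, os.Getenv))
	}
}
```

With neither variable set, every input type reports false, which matches the "no input types are enabled by default" behavior described above.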

Switching this PR from draft.

@aleksmaus aleksmaus marked this pull request as ready for review November 5, 2024 17:35
@aleksmaus aleksmaus requested review from a team as code owners November 5, 2024 17:35
@elasticmachine
Collaborator

Pinging @elastic/sec-deployment-and-devices (Team:Security-Deployment and Devices)

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 5, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Labels
backport-8.x Automated backport to the 8.x branch with mergify enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:Security-Deployment and Devices Deployment and Devices Team in Security Solution

5 participants