test: EKS e2e test using eksctl #667
base: main
Conversation
linters are flagging the exec cmds. IMO shelling out commands is not ideal here. I know that AWS has their own SDK that is able to interact with EKS https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/eks so maybe this could be worth looking into as an alternative? |
Yeah, to do it for real we will want to use aws-sdk, but shelling out to eksctl is fine while we're just trying to say hey, Retina E2E could work on EKS |
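For a sense of what the aws-sdk-go-v2 route would look like, here is a minimal, hedged sketch that only describes an existing cluster (the cluster name is a placeholder; full provisioning through the SDK would also require VPC, IAM and nodegroup plumbing, which is exactly what eksctl wraps):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/eks"
)

func main() {
	ctx := context.Background()

	// Load credentials and region from the environment (AWS_REGION, etc.).
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}

	client := eks.NewFromConfig(cfg)

	// "retina-e2e" is a placeholder cluster name for illustration.
	out, err := client.DescribeCluster(ctx, &eks.DescribeClusterInput{
		Name: aws.String("retina-e2e"),
	})
	if err != nil {
		log.Fatalf("describing cluster: %v", err)
	}
	fmt.Println("cluster status:", out.Cluster.Status)
}
```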
@whatnick this is great, thanks for putting it together so fast! |
The aim overall is to bolster Retina's **ANY CNI** claim with a public
demonstration.
I just wanted to start the ball rolling, and from past e2e-building
experience I expect this to be slow. It is just a POC that this is possible,
while also flagging corner cases and other setup necessary:
- ECR repo for pulling into cluster OR managing GHCR Credentials in cluster
- AWS-Github OIDC pairing for securely logging into account for GHA,
setting up policies and roles on AWS side.
- Setting up OIDC in EKS during cluster provisioning to hook AWS VPC CNI
and demonstrate using that + retina.
- Possible issues with parallel execution of Azure and AWS tests and
clobbering of the kubeconfig.
The slowness will also give me an opportunity to figure out pulling in
eksctl go code via a `require` + `import` to avoid shelling out.
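For reference, the shell-out approach the PoC currently relies on boils down to something like the sketch below (an illustrative helper, not the PR's actual code; it assumes eksctl is on the PATH, and the flags and cluster name are placeholders):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// createCluster is an illustrative helper, not the PR's actual code:
// it shells out to eksctl the way the current PoC does.
func createCluster(name, region string) error {
	cmd := exec.Command("eksctl", "create", "cluster",
		"--name", name,
		"--region", region,
		"--nodes", "2",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("eksctl create cluster: %w", err)
	}
	return nil
}

func main() {
	if err := createCluster("retina-e2e", "us-west-2"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```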
…On Wed, Aug 28, 2024, 07:37 Evan Baker ***@***.***> wrote:
@whatnick <https://github.com/whatnick> this is great, thanks for putting
it together so fast!
While we review/discuss I do want to set the expectation appropriately
that us getting an AWS account provisioned will likely be the slow/hard
part of this 😓
|
good stuff, thanks for taking a look into this @whatnick |
This has added a lot of require entries, but I have updated the PoC to consume eksctl as a package and run its cobra commands. It can be slimmed down to remove fancy things like coloured logging, which are not really relevant for this use case. |
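The general shape of running a cobra command in-process, rather than via os/exec, is roughly the following; the command constructor here is a hypothetical stand-in, since the exact package path and constructor exported by eksctl are not shown in this thread:

```go
package main

import (
	"log"

	"github.com/spf13/cobra"
)

// newCreateClusterCmd is a hypothetical stand-in for whichever cobra
// command constructor eksctl exposes; the real package path and function
// name would come from the eksctl module.
func newCreateClusterCmd() *cobra.Command {
	var name, region string
	cmd := &cobra.Command{
		Use: "create-cluster",
		RunE: func(cmd *cobra.Command, args []string) error {
			log.Printf("would create cluster %q in %q", name, region)
			return nil
		},
	}
	cmd.Flags().StringVar(&name, "name", "", "cluster name")
	cmd.Flags().StringVar(&region, "region", "", "AWS region")
	return cmd
}

func main() {
	cmd := newCreateClusterCmd()
	// Drive the command programmatically: SetArgs replaces os.Args,
	// Execute parses flags and runs the command.
	cmd.SetArgs([]string{"--name", "retina-e2e", "--region", "us-west-2"})
	if err := cmd.Execute(); err != nil {
		log.Fatal(err)
	}
}
```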
With proper environment variable settings, e2e cluster creation and deletion work as expected. Some test scenarios are failing and I will investigate further. Will pause work here while we wait for the account provisioning to take place. AWS VPC CNI is also provisioned as a managed addon during cluster creation and can be tested against.
export AWS_ACCOUNT_ID=XXXXXXXXXXX
export AWS_REGION=us-west-2
export TAG=v0.0.16
export IMAGE_REGISTRY=ghcr.io
export IMAGE_NAMESPACE=microsoft/retina
go test -run TestE2ERetinaAWS ./test/e2e/ -timeout 30m
2024/09/14 17:00:08 found chart at /home/whatnick/dev/retina/deploy/legacy/manifests/controller/helm/retina
#################### InstallHelmChart ###################################################################
2024/09/14 17:00:11 creating 1 resource(s)
2024/09/14 17:00:11 creating 1 resource(s)
2024/09/14 17:00:12 creating 1 resource(s)
2024/09/14 17:00:12 creating 1 resource(s)
2024/09/14 17:00:12 beginning wait for 4 resources with timeout of 1m0s
2024/09/14 17:00:14 Clearing REST mapper cache
2024/09/14 17:00:19 creating 8 resource(s)
2024/09/14 17:00:21 beginning wait for 8 resources with timeout of 4m0s
2024/09/14 17:00:21 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:23 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:25 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:27 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:29 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:31 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:33 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:35 DaemonSet is not ready: kube-system/retina-agent. 0 out of 2 expected pods are ready
2024/09/14 17:00:38 installed chart from path: retina in namespace: kube-system
2024/09/14 17:00:38 chart values: map[agent:map[name:retina-agent] agent_win:map[name:retina-agent-win] apiServer:map[host:0.0.0.0 port:10093] azure:map[appinsights:map[instrumentation_key:app-insights-instrumentation-key]] bypassLookupIPOfInterest:false capture:map[aadClientId: aadClientSecret: debug:true enableManagedStorageAccount:false jobNumLimit:0 location: managedIdentityClientId: resourceGroup: subscriptionId: tenantId:] daemonset:map[container:map[retina:map[args:[--config /retina/config/config.yaml] command:[/retina/controller] healthProbeBindAddress::18081 metricsBindAddress::18080 ports:map[containerPort:10093]]]] dataAggregationLevel:low enableAnnotations:false enablePodLevel:false enableTelemetry:false enabledPlugin_linux:["dropreason","packetforward","linuxutil","dns"] enabledPlugin_win:["hnsstats"] fullnameOverride:retina-svc image:map[initRepository:ghcr.io/microsoft/retina/retina-init pullPolicy:Always repository:ghcr.io/microsoft/retina/retina-agent tag:v0.0.16] imagePullSecrets:[map[name:acr-credentials]] logLevel:debug metrics:map[podMonitor:map[additionalLabels:map[] enabled:false interval:30s namespace:<nil> relabelings:[] scheme:http scrapeTimeout:30s tlsConfig:map[]] serviceMonitor:map[additionalLabels:map[] enabled:false interval:30s metricRelabelings:[] namespace:<nil> relabelings:[] scheme:http scrapeTimeout:30s tlsConfig:map[]]] metricsIntervalDuration:10s nameOverride:retina namespace:kube-system nodeSelector:map[] operator:map[container:map[args:[--config /retina/operator-config.yaml] command:[/retina-operator]] enableRetinaEndpoint:false enabled:false installCRDs:true repository:ghcr.io/microsoft/retina/retina-operator resources:map[limits:map[cpu:500m memory:128Mi] requests:map[cpu:10m memory:128Mi]] tag:v0.0.16] os:map[linux:true windows:true] remoteContext:false resources:map[limits:map[cpu:500m memory:300Mi] requests:map[cpu:500m memory:300Mi]] retinaPort:10093 securityContext:map[capabilities:map[add:[SYS_ADMIN SYS_RESOURCE NET_ADMIN IPC_LOCK]] privileged:false windowsOptions:map[runAsUserName:NT AUTHORITY\SYSTEM]] service:map[name:retina port:10093 targetPort:10093 type:ClusterIP] serviceAccount:map[annotations:map[] name:retina-agent] tolerations:[] volumeMounts:map[bpf:/sys/fs/bpf cgroup:/sys/fs/cgroup config:/retina/config debug:/sys/kernel/debug tmp:/tmp trace:/sys/kernel/tracing] volumeMounts_win:map[retina-config-win:retina]]
#################### CreateDenyAllNetworkPolicy (scenario: Drop Metrics) ################################
2024/09/14 17:00:38 Creating/Updating NetworkPolicy "deny-all" in namespace "kube-system"...
#################### CreateAgnhostStatefulSet (scenario: Drop Metrics) ##################################
2024/09/14 17:00:38 Creating/Updating StatefulSet "agnhost-a" in namespace "kube-system"...
2024/09/14 17:00:39 pod "agnhost-a-0" is not in Running state yet. Waiting...
2024/09/14 17:00:44 pod "agnhost-a-0" is in Running state
2024/09/14 17:00:44 all pods in namespace "kube-system" with label "app=agnhost-a" are in Running state
#################### ExecInPod (scenario: Drop Metrics) #################################################
2024/09/14 17:00:44 executing command "curl -s -m 5 bing.com" on pod "agnhost-a-0" in namespace "kube-system"...
#################### Sleep (scenario: Drop Metrics) #####################################################
2024/09/14 17:00:52 sleeping for 5s...
#################### ExecInPod (scenario: Drop Metrics) #################################################
2024/09/14 17:00:57 executing command "curl -s -m 5 bing.com" on pod "agnhost-a-0" in namespace "kube-system"...
#################### PortForward (scenario: Drop Metrics) ###############################################
2024/09/14 17:01:03 attempting to find pod with label "k8s-app=retina", on a node with a pod with label "app=agnhost-a"
2024/09/14 17:01:04 attempting port forward to pod name "retina-agent-zwvdl" with label "k8s-app=retina", in namespace "kube-system"...
2024/09/14 17:01:06 port forward validation HTTP request to "http://localhost:10093" succeeded, response: 200 OK
2024/09/14 17:01:06 successfully port forwarded to "http://localhost:10093"
#################### ValidateRetinaDropMetric (scenario: Drop Metrics) ##################################
2024/09/14 17:01:06 checking for metrics on http://localhost:10093/metrics
2024/09/14 17:01:07 failed to find metric matching networkobservability_drop_count: map[direction:unknown reason:IPTABLE_RULE_DROP]
2024/09/14 17:01:12 checking for metrics on http://localhost:10093/metrics
2024/09/14 17:01:12 failed to find metric matching networkobservability_drop_count: map[direction:unknown reason:IPTABLE_RULE_DROP]
...
2024-09-14 17:19:26 [✔] all cluster resources were deleted
2024/09/14 17:19:26 Cluster deleted successfully!
--- FAIL: TestE2ERetinaAWS (2109.29s)
runner.go:27:
Error Trace: /home/whatnick/dev/retina/test/e2e/framework/types/runner.go:27
/home/whatnick/dev/retina/test/e2e/retina_e2e_test.go:107
Error: Received unexpected error:
did not expect error from step ValidateRetinaDropMetric but got error: failed to verify prometheus metrics networkobservability_drop_count: failed to get prometheus metrics: no metric found
Test: TestE2ERetinaAWS
FAIL
FAIL github.com/microsoft/retina/test/e2e 2109.530s
FAIL |
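For context, the failing ValidateRetinaDropMetric step scrapes the port-forwarded /metrics endpoint and looks for a networkobservability_drop_count sample with matching labels. A standalone sketch of that kind of check (not the framework's actual code, assuming github.com/prometheus/common/expfmt for parsing):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

func main() {
	// The e2e framework port-forwards the retina agent to localhost:10093.
	resp, err := http.Get("http://localhost:10093/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	mf, ok := families["networkobservability_drop_count"]
	if !ok {
		log.Fatal("metric networkobservability_drop_count not exposed")
	}
	for _, m := range mf.GetMetric() {
		if hasLabel(m, "reason", "IPTABLE_RULE_DROP") {
			fmt.Println("found drop metric:", m)
		}
	}
}

// hasLabel reports whether a metric carries the given label name/value pair.
func hasLabel(m *dto.Metric, name, value string) bool {
	for _, lp := range m.GetLabel() {
		if lp.GetName() == name && lp.GetValue() == value {
			return true
		}
	}
	return false
}
```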
More progress by enabling AWS VPC-CNI in Network Policy enforcement mode.
kubectl --kubeconfig=/home/whatnick/dev/retina/test/e2e/test.pem get pods -n kube-system
NAME READY STATUS RESTARTS AGE
agnhost-a-0 1/1 Running 0 16s
aws-node-77sfp 2/2 Running 0 2m19s
aws-node-ssjsk 2/2 Running 0 2m15s
aws-node-xs2kv 2/2 Running 0 2m17s
coredns-787cb67946-lrxxh 1/1 Running 0 6m34s
coredns-787cb67946-qr4xx 1/1 Running 0 6m34s
kube-proxy-7h7vk 1/1 Running 0 2m15s
kube-proxy-qxwb7 1/1 Running 0 2m19s
kube-proxy-xcbcf 1/1 Running 0 2m17s
retina-agent-22mcl 1/1 Running 0 34s
retina-agent-cc8b4 1/1 Running 0 34s
retina-agent-hwj5t 1/1 Running 0 34s
Network policy is enabled:
kubectl --kubeconfig=/home/whatnick/dev/retina/test/e2e/test.pem get networkpolicy -n kube-system
NAME POD-SELECTOR AGE
deny-all app=agnhost-a 66s |
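For reference, the same network policy enforcement toggle on the managed vpc-cni addon can also be applied programmatically; a hedged sketch using aws-sdk-go-v2 (the cluster name is a placeholder, and the enableNetworkPolicy configuration key is assumed to match the VPC CNI addon's configuration schema):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/eks"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	client := eks.NewFromConfig(cfg)

	// Request network policy enforcement on the managed VPC CNI addon.
	// "retina-e2e" is a placeholder cluster name.
	_, err = client.UpdateAddon(ctx, &eks.UpdateAddonInput{
		ClusterName:         aws.String("retina-e2e"),
		AddonName:           aws.String("vpc-cni"),
		ConfigurationValues: aws.String(`{"enableNetworkPolicy":"true"}`),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("vpc-cni addon update requested")
}
```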
force-pushed from 223e957 to db6d7b0
This PR will be closed in 7 days due to inactivity. |
Will merge to upstream soon. |
force-pushed from 4bb5b0f to 4c01faa
Windows tests are currently disabled for AWS; they can be enabled once Windows cluster setup via eksctl is tested. |
Description
NOTE: Since this will take a bit of CI and other account provisioning, I plan to keep this synced with upstream once a week until it is across the line or I run out of juice.
Add EKS-based e2e tests by exec'ing eksctl to provision and delete a temporary cluster. This is currently at the POC stage, since account setup etc. is needed to run it in practice, in conjunction with the secrets and variables associated with this repository.
The AWS integration should be set up via OIDC as shown here: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services
with roles relevant to eksctl as shown here:
https://eksctl.io/usage/minimum-iam-policies/
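Once the OIDC role assumption is wired up in the workflow, a quick sanity check from Go is to confirm the assumed identity before kicking off eksctl; a minimal sketch (assuming aws-actions/configure-aws-credentials has already exported the temporary credentials to the job environment):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	// In a GitHub Actions job, aws-actions/configure-aws-credentials
	// exchanges the OIDC token for role credentials and exports them to
	// the environment; LoadDefaultConfig picks them up without extra code.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	out, err := sts.NewFromConfig(cfg).GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("running as:", *out.Arn)
}
```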
Related Issue
Partially addresses #451
Checklist
Commits are signed (git commit -S -s ...). See this documentation on signing commits.
Screenshots (if applicable) or Testing Completed
With the drop packet metrics scenario disabled as per #746, the AWS e2e test suite runs successfully.
go test -run TestE2ERetinaAWS ./test/e2e/ -timeout 40m
ok github.com/microsoft/retina/test/e2e 1866.467s
For failing test runs, cluster creation and teardown are as shown below.
Additional Notes
The Helm chart install portion of this test fails in practice, presumably due to an unreachable image registry. We may need to push images to a corresponding ECR repo or debug GHCR access.
Opening this PR for feedback and discussion on the AWS e2e testing approach. In practice I have successfully deployed Retina legacy charts in EKS.
Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.