Litmus exporter not exporting metrics #141

Open
anuraagrijal3138 opened this issue Jul 26, 2023 · 0 comments
Comments


anuraagrijal3138 commented Jul 26, 2023

## BUG REPORT

**What happened**:
I am using the litmus-exporter image `litmuschaos/chaos-exporter:3.0.0-beta5`.
When running a chaos experiment in GKE, there is a gap in the metrics sent by the exporter: sometimes the metrics for a run are exported, while other times the exporter does not pick up the run at all.
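For context, this is roughly how we check whether the exporter is exposing anything for a run. The `chaos-monitor` deployment name is inferred from the pod name below, and the `/metrics` path and `litmuschaos` metric prefix are the defaults as we understand them, so treat this as a sketch:

```
# Port-forward the exporter and inspect its scrape endpoint directly
# (deployment name inferred from the pod name below; adjust if yours differs).
kubectl -n litmus port-forward deploy/chaos-monitor 8080:8080 &

# The exporter serves on :8080 (see logs below); look for the chaos metrics.
curl -s http://localhost:8080/metrics | grep -i litmuschaos
```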

Logs not showing any metrics after a run (`% kl litmus chaos-monitor-788f87f99-4vqdl`):

```
W0726 11:39:16.209768 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-07-26T11:39:16Z" level=info msg="Started creating Metrics"
time="2023-07-26T11:39:16Z" level=info msg="Registering Fixed Metrics"
time="2023-07-26T11:39:16Z" level=info msg="Beginning to serve on port :8080"
time="2023-07-26T11:41:26Z" level=info msg="[Wait]: Hold on, no active chaosengine found ... "
```

Logs showing metrics after a run:

```
time="2023-07-26T11:39:20Z" level=info msg="Beginning to serve on port :8080"
time="2023-07-26T11:39:20Z" level=info msg="Started creating Metrics"
time="2023-07-26T11:39:20Z" level=info msg="Registering Fixed Metrics"
time="2023-07-26T11:41:51Z" level=info msg="[Wait]: Hold on, no active chaosengine found ... "
time="2023-07-26T12:20:39Z" level=info msg="The chaos metrics are as follows" ProbeSuccessPercentage=0 EndTime=0 FaultName=pod-delete PassedExperiments=0 FailedExperiments=0 AwaitedExperiments=1 StartTime=1.69037394e+09 ChaosInjectTime=1690374019 TotalDuration=0 ResultName=ambassador-pod-delete-1690373933-pod-delete ResultNamespace=litmus ResultVerdict=Awaited
```
The run is successful in both cases (`% kl litmus pod-delete-xxq0f6-fpvqg`):

```
...
time="2023-07-26T11:52:07Z" level=info msg="[Status]: The status of Pods are as follows" Pod=pay-dummy-dev-676577958d-jjlnh Status=Running
time="2023-07-26T11:52:11Z" level=info msg="[Probe]: check-http-probe-success probe has been Passed 😄 " ProbeName=check-http-probe-success ProbeType=httpProbe ProbeInstance=PostChaos ProbeStatus=Passed
time="2023-07-26T11:52:11Z" level=info msg="[The End]: Updating the chaos result of pod-delete experiment (EOT)"
```
The chaosResult is present as well:

```
% kubectl describe -n litmus chaosresult {service-name}-pod-delete-1690372234-pod-delete
...
Events:
  Type    Reason   Age  From                     Message
  ----    ------   ---  ----                     -------
  Normal  Awaited  47m  pod-delete-xxq0f6-fpvqg  experiment: pod-delete, Result: Awaited
  Normal  Pass     46m  pod-delete-xxq0f6-fpvqg  experiment: pod-delete, Result: Pass
```
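For reference, a quick cross-check that can be run while the exporter is logging `no active chaosengine found`; the engine name here is inferred from the ChaosResult name in the metrics log above, so treat it as illustrative:

```
# List the ChaosEngines created by the schedule and check the state of the one
# behind the result above (name inferred from the ChaosResult name).
kubectl -n litmus get chaosengines
kubectl -n litmus get chaosengine ambassador-pod-delete-1690373933 -o jsonpath='{.spec.engineState}{"\n"}'
```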
**What you expected to happen**:
Metrics from all the chaos engines are exported.

**How to reproduce it (as minimally and precisely as possible)**:
It happens often. We have 4 GKE clusters where we are running the experiments as schedules.
Yaml manifest used to create the experiment:

```
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  namespace: litmus
  name: "{service name}-pod-delete"
  labels:
    app: {service name}-chaos
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: '{namespace}'
      applabel: 'app.kubernetes.io/instance={serviceName}'
      appkind: 'deployment'
    engineState: 'active'
    chaosServiceAccount: litmus-runner
    jobCleanUpPolicy: "delete"
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: '60'
              - name: CHAOS_INTERVAL
                value: '10'
              - name: FORCE
                value: 'true'
              - name: PODS_AFFECTED_PERC
                value: '70'
          probe:
            - name: 'check-http-probe-success'
              type: 'httpProbe'
              httpProbe/inputs:
                url: "http://{servicename.namespace}.svc.cluster.local/"
                insecureSkipVerify: true
                responseTimeout: 1000
                method:
                  get:
                    criteria: "=="
                    responseCode: '200'
              mode: "Continuous"
              runProperties:
                probeTimeout: 10000
                interval: 5000
                retry: 2
                probePollingInterval: 5000
```
**Anything else we need to know?**:
We usually schedule on repeat. To debug the issue, we are running the schedule as `now: true`.
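For completeness, the repeat form we normally run looks roughly like the following; the repeat fields are written from the ChaosSchedule scheduler docs as we understand them and the values are illustrative, not our exact settings:

```
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  namespace: litmus
  name: "{service name}-pod-delete"
spec:
  schedule:
    repeat:
      properties:
        minChaosInterval: "30m"   # illustrative value, not our exact interval
      workHours:
        includedHours: 0-23
  engineTemplateSpec:
    # identical to the engineTemplateSpec in the manifest above
```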
We have installed litmus-core and kubernetes-chaos version 2.14.0.
For the chaos-exporter deployment, we are using the image version `3.0.0-beta5`, mainly to get the `fault_name` label in the metrics.
The resources (memory/CPU) for the litmus-exporter pod are adequate; less than 50% of the resource requests are being used.
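(We verify that with something like the following; the pod name is the exporter pod from the logs above.)

```
# Live usage of the exporter pod vs. its requests/limits.
kubectl -n litmus top pod chaos-monitor-788f87f99-4vqdl
kubectl -n litmus describe pod chaos-monitor-788f87f99-4vqdl | grep -i -A 4 "Requests"
```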

Thank you!
