**What happened**:
I am using the litmus-exporter image `litmuschaos/chaos-exporter:3.0.0-beta5`.
When running a chaos experiment in GKE, there is a gap in metrics sent by the exporter. Sometimes the metrics are exported, while other times the exporter does not even detect the metrics.
Logs not showing any metrics after a run:
```
% kl litmus chaos-monitor-788f87f99-4vqdl
W0726 11:39:16.209768 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-07-26T11:39:16Z" level=info msg="Started creating Metrics"
time="2023-07-26T11:39:16Z" level=info msg="Registering Fixed Metrics"
time="2023-07-26T11:39:16Z" level=info msg="Beginning to serve on port :8080"
time="2023-07-26T11:41:26Z" level=info msg="[Wait]: Hold on, no active chaosengine found ... "
```
Logs showing metrics after a run:
```
time="2023-07-26T11:39:20Z" level=info msg="Beginning to serve on port :8080"
time="2023-07-26T11:39:20Z" level=info msg="Started creating Metrics"
time="2023-07-26T11:39:20Z" level=info msg="Registering Fixed Metrics"
time="2023-07-26T11:41:51Z" level=info msg="[Wait]: Hold on, no active chaosengine found ... "
time="2023-07-26T12:20:39Z" level=info msg="The chaos metrics are as follows" ProbeSuccessPercentage=0 EndTime=0 FaultName=pod-delete PassedExperiments=0 FailedExperiments=0 AwaitedExperiments=1 StartTime=1.69037394e+09 ChaosInjectTime=1690374019 TotalDuration=0 ResultName=ambassador-pod-delete-1690373933-pod-delete ResultNamespace=litmus ResultVerdict=Awaited
```
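To tell the two cases apart without waiting on the exporter's logs, we scrape its `/metrics` endpoint and look for samples carrying the `fault_name` label. A minimal sketch; the metric names in the sample bodies below are illustrative placeholders, not taken from the exporter's source:

```python
def fault_metric_lines(metrics_text: str) -> list[str]:
    """Return non-comment Prometheus samples that carry a fault_name label."""
    return [
        line for line in metrics_text.splitlines()
        if line and not line.startswith("#") and 'fault_name="' in line
    ]

# Example scrape bodies (metric names are illustrative placeholders):
with_metrics = """\
# HELP litmuschaos_awaited_experiments Awaited experiments
litmuschaos_awaited_experiments{fault_name="pod-delete",chaosresult_name="ambassador-pod-delete"} 1
"""
without_metrics = """\
# HELP litmuschaos_fixed_total Fixed metric
litmuschaos_fixed_total 0
"""

print(len(fault_metric_lines(with_metrics)))     # 1 -> per-result metrics present
print(len(fault_metric_lines(without_metrics)))  # 0 -> the gap we observe
```

In practice the scrape body would come from `kubectl port-forward` plus an HTTP GET against `:8080/metrics`, matching the port shown in the logs above.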
The run is successful in both cases:
```
% kl litmus pod-delete-xxq0f6-fpvqg
...
time="2023-07-26T11:52:07Z" level=info msg="[Status]: The status of Pods are as follows" Pod=pay-dummy-dev-676577958d-jjlnh Status=Running
time="2023-07-26T11:52:11Z" level=info msg="[Probe]: check-http-probe-success probe has been Passed 😄 " ProbeName=check-http-probe-success ProbeType=httpProbe ProbeInstance=PostChaos ProbeStatus=Passed
time="2023-07-26T11:52:11Z" level=info msg="[The End]: Updating the chaos result of pod-delete experiment (EOT)"
```
The chaosResult is present as well:
```
% kubectl describe -n litmus chaosresult {service-name}-pod-delete-1690372234-pod-delete
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Awaited 47m pod-delete-xxq0f6-fpvqg experiment: pod-delete, Result: Awaited
Normal Pass 46m pod-delete-xxq0f6-fpvqg experiment: pod-delete, Result: Pass
```
**What you expected to happen**:
Metrics from all chaos engines are exported.
**How to reproduce it (as minimally and precisely as possible)**:
It happens often. We have 4 GKE clusters where we run the experiments as schedules.
YAML manifest used to create the experiment:
```
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  namespace: litmus
  name: "{service name}-pod-delete"
  labels:
    app: {service name}-chaos
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: '{namespace}'
      applabel: 'app.kubernetes.io/instance={serviceName}'
      appkind: 'deployment'
    engineState: 'active'
    chaosServiceAccount: litmus-runner
    jobCleanUpPolicy: "delete"
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: '60'
              - name: CHAOS_INTERVAL
                value: '10'
              - name: FORCE
                value: 'true'
              - name: PODS_AFFECTED_PERC
                value: '70'
          probe:
            - name: 'check-http-probe-success'
              type: 'httpProbe'
              httpProbe/inputs:
                url: "http://{servicename.namespace}.svc.cluster.local/"
                insecureSkipVerify: true
                responseTimeout: 1000
                method:
                  get:
                    criteria: "=="
                    responseCode: '200'
              mode: "Continuous"
              runProperties:
                probeTimeout: 10000
                interval: 5000
                retry: 2
                probePollingInterval: 5000
```
**Anything else we need to know?**:
We usually schedule on repeat. To debug the issue, we are running the schedule as `now: true`.
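The "[Wait]: Hold on, no active chaosengine found" log line suggests the exporter only emits chaos metrics while it sees an active ChaosEngine. A simplified model of that condition, as we read it from the logs (an assumption on our part, not the exporter's actual code):

```python
def should_emit_metrics(engines: list[dict]) -> bool:
    """Exporter-style check: emit chaos metrics only if some engine is active."""
    return any(e.get("engineState") == "active" for e in engines)

# With jobCleanUpPolicy: "delete", the engine's resources may disappear soon
# after the run, so a poll landing in that window would see no active engine.
print(should_emit_metrics([{"engineState": "active"}]))  # True
print(should_emit_metrics([]))                           # False
```

If this reading is right, the gap could come from the engine being cleaned up between the exporter's poll intervals; we have not confirmed this against the exporter's source.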
We have installed litmus-core and kubernetes-chaos version 2.14.0.
For the chaos-exporter deployment, we are using the image version `3.0.0-beta5`, mainly to get the `fault_name` label in the metrics.
The resources (memory/CPU) for the litmus-exporter pod are adequate; less than 50% of the resource requests are being used.
Thank you!