Discussion: How to implement exclusion filters for (queue) metrics #59
One idea I came up with does not solve the "too many queue metrics" issue, but it could at least prevent overloading the broker by introducing rate limiting on outgoing SEMP requests. Digging deeper into potential request limits, I found two different numbers that are a bit confusing:
1.) SEMP v1 Polling Frequency Guidelines
2.) SEMP v1 Request Over HTTP Service
Does that mean there could be up to 500 simultaneous requests open, but not more than 10 newly opened per second?
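
A client-side limiter in the exporter could enforce the "roughly 10 new SEMP requests per second" guideline independently of how many endpoints are scraped. Here is a minimal sketch in Go, assuming a hypothetical sempClient wrapper (not part of the exporter today) and using golang.org/x/time/rate:

```go
package semp

import (
	"context"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

// sempClient wraps the HTTP client used for SEMP calls with a client-side
// rate limiter, so no more than ~10 new requests per second are opened,
// no matter how many endpoints or paging loops run at the same time.
type sempClient struct {
	httpClient *http.Client
	limiter    *rate.Limiter
}

func newSempClient(timeout time.Duration) *sempClient {
	return &sempClient{
		httpClient: &http.Client{Timeout: timeout},
		// 10 requests per second, burst of 1 keeps paging loops smooth.
		limiter: rate.NewLimiter(rate.Limit(10), 1),
	}
}

// Do blocks until the limiter grants a token, then forwards the request.
func (c *sempClient) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
	if err := c.limiter.Wait(ctx); err != nil {
		return nil, err // context cancelled or deadline hit while waiting
	}
	return c.httpClient.Do(req.WithContext(ctx))
}
```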
The SEMP v1 API only returns 100 items at once, and paging is required to get all results. But paging at Solace is implemented with a result pointer, not with page numbers. Therefore, parallelizing the paged requests is not as easy as expected. But you can use Prometheus to parallelize.
PS: I asked Solace to check whether they can provide a better filter possibility for SEMP v1 or SEMP v2 at this point.
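
Because the pointer for page N+1 only comes back with page N, the pages of one SEMP v1 query have to be fetched one after another. A rough Go sketch of that loop follows; the helpers postSemp, extractMoreCookie and handlePage are hypothetical placeholders, and the exact shape of the continuation element depends on the broker:

```go
package semp

import "context"

// fetchAllPages runs one SEMP v1 query and follows the continuation
// ("more-cookie") returned by the broker until the last page is reached.
func fetchAllPages(
	ctx context.Context,
	firstRequest string,
	postSemp func(ctx context.Context, body string) (string, error),
	extractMoreCookie func(response string) (next string, ok bool),
	handlePage func(response string) error,
) error {
	request := firstRequest
	for {
		response, err := postSemp(ctx, request)
		if err != nil {
			return err
		}
		if err := handlePage(response); err != nil {
			return err
		}
		next, ok := extractMoreCookie(response)
		if !ok {
			return nil // no continuation: this was the last page
		}
		// The cursor for the next page only exists inside this response,
		// so the pages of one query cannot be fetched in parallel.
		request = next
	}
}
```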
Would the SEMP v2 filter maybe fit your needs?
Yes, sure. I think the
Yes, SEMP v2 was slower last time I reviewed it. But I know Solace has improved it in the meantime. Next question: would you like to provide an endpoint using v2, or do you want to wait for me to do this (maybe I will find the time this year)?
I would not complain if I can just grab a new exporter version with the new endpoint, but of course we're willing to contribute as well.

[endpoint.new-sempv2]
VpnStatsV2=default|*
QueueStatsV2=default|queueName!=exclude/those/queues/*

This option would be the most flexible, I guess, and would allow a slow transition for users. In theory this would even allow defining which metrics should be exported, by extending the filtering options with a select part:

[endpoint.new-sempv2-with-select]
QueueStatsV2=default|queueName!=exclude/those/queues/*|bindRequestRate,bindSuccessCount

Another option would be a global "useSempV2" flag that automatically uses SEMP v2 for every datasource where it is available:

func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
    ...
    case "QueueStats":
        if e.config.useSempV2 {
            up, err = e.semp.GetQueueStatsSemp2(ch, dataSource.VpnFilter, dataSource.ItemFilter)
        } else {
            up, err = e.semp.GetQueueStatsSemp1(ch, dataSource.VpnFilter, dataSource.ItemFilter)
        }
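
For reference, here is a sketch of how the three-part value (vpnFilter|itemFilter|metricSelect) from the proposed [endpoint.new-sempv2-with-select] example could be split. The type and function names are hypothetical, not the exporter's actual code:

```go
package semp

import (
	"fmt"
	"strings"
)

// dataSourceV2 models one configuration line such as
//   QueueStatsV2=default|queueName!=exclude/those/queues/*|bindRequestRate,bindSuccessCount
// VpnFilter and ItemFilter are handed to SEMP v2; MetricFilter is an optional
// allow-list of metrics to export.
type dataSourceV2 struct {
	VpnFilter    string
	ItemFilter   string
	MetricFilter []string
}

// parseDataSourceV2 splits the "|"-separated parts of such a line.
func parseDataSourceV2(value string) (dataSourceV2, error) {
	parts := strings.Split(value, "|")
	if len(parts) < 2 {
		return dataSourceV2{}, fmt.Errorf("expected at least vpnFilter|itemFilter, got %q", value)
	}
	ds := dataSourceV2{VpnFilter: parts[0], ItemFilter: parts[1]}
	if len(parts) > 2 && parts[2] != "" {
		ds.MetricFilter = strings.Split(parts[2], ",")
	}
	return ds, nil
}
```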
I like the new v2 datasources more, because I doubt that SEMP v2 and SEMP v1 powered endpoints will provide exactly the same metrics. The idea with parametrizable select options is very good, but here we should be able to enter the metric name and map it in the exporter back to the SEMP v2 field name. To coordinate: whoever of us has time to work on this issue first, just post here to avoid duplicated work.
I started work here: #60
I implemented queue stats with the v2 API.
Hi @GreenRover, thanks for implementing this so fast 💯 I tried to test it today but sadly failed. Pretty sure I am doing something wrong on my end.
(I also changed and tried with / without URL encoding.) But this sends the pod into a CrashLoopBackOff with the following error:
I fixed this issue now. Please test again. I performed a performance test as well.

I started the exporter like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: solace-exporter-config
data:
  solace_prometheus_exporter.ini: |
    [solace]
    listenAddr = 0.0.0.0:1920
    enableTLS = false
    scrapeUri = https://mr-connection-j6em7yuyi7k.messaging.solace.cloud:943
    username = monitoring
    password = monitoring
    defaultVpn = AaaBbbCcc
    timeout = 60s
    sslVerify = false
    [endpoint.solace-v1-test]
    QueueStats = AaaBbbCcc|*
    [endpoint.solace-v2-test]
    QueueStatsV2 = AaaBbbCcc|queueName!=not/interesting*|solace_queue_msg_shutdown_discarded,solace_queue_msg_max_redelivered_discarded
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: solace-exporter
  labels:
    app: solace-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: solace-exporter
  template:
    metadata:
      labels:
        app: solace-exporter
      annotations:
        collectord.io/logs-output: devnull
    spec:
      volumes:
        - name: solace-exporter-config-volume
          configMap:
            name: solace-exporter-config
            defaultMode: 420
      containers:
        - resources:
            limits:
              cpu: 300m
              memory: 800M
            requests:
              cpu: 150m
              memory: 500M
          readinessProbe:
            httpGet:
              path: /metrics
              port: 1920
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 7
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 2
          terminationMessagePath: /dev/termination-log
          name: solace-exporter
          command:
            - /solace_prometheus_exporter
            - '--config-file=/etc/solace/solace_prometheus_exporter.ini'
          livenessProbe:
            httpGet:
              path: /metrics
              port: 1920
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 7
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 6
          env:
            - name: SOLACE_LISTEN_ADDR
              value: '0.0.0.0:1920'
            - name: SOLACE_LISTEN_TLS
              value: 'false'
            - name: TZ
              value: UTC
          ports:
            - containerPort: 1920
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - name: solace-exporter-config-volume
              readOnly: true
              mountPath: /etc/solace
          image: 'docker.bin.sbb.ch/solacecommunity/solace-prometheus-exporter:latest'
      restartPolicy: Always
---
kind: Service
apiVersion: v1
metadata:
  name: solace-exporter
spec:
  ipFamilies:
    - IPv4
  ports:
    - name: 1920-tcp
      protocol: TCP
      port: 1920
      targetPort: 1920
  internalTrafficPolicy: Cluster
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: solace-exporter
---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: solace-exporter
spec:
  to:
    kind: Service
    name: solace-exporter
  tls:
    termination: edge
  host: solace-exporter.apps.trs01t.sbb-aws-test.net
  port:
    targetPort: 1920-tcp

Result: Query "solace-v2-test" took 13.84 sec.

What you need to know about Solace SEMP v2.
For queue stats I got another idea, but it makes the exporter stateful, which I tried to avoid because I use it centralized and horizontally scaled. But here is the idea (I am not sure I like it): have a thread fetch the list of all queues, for example every 5 minutes, and filter this list by regex for the queues of interest. On a metrics call, ask SEMP for the stats of those particular queues only. But this feature is so specialized that it should only be used under these conditions:
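
A minimal sketch of that stateful variant, assuming hypothetical names (queueCache, listQueueNames) rather than anything that exists in the exporter today:

```go
package semp

import (
	"context"
	"regexp"
	"sync"
	"time"
)

// queueCache holds queue names refreshed by a background goroutine, so a
// scrape only has to ask SEMP for the stats of the queues of interest.
type queueCache struct {
	mu     sync.RWMutex
	queues []string
}

// run refreshes the cached queue list every interval and keeps only the names
// matching filter. listQueueNames is a hypothetical helper that pages through
// the broker's full queue list.
func (c *queueCache) run(ctx context.Context, interval time.Duration, filter *regexp.Regexp,
	listQueueNames func(ctx context.Context) ([]string, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if names, err := listQueueNames(ctx); err == nil {
			matched := make([]string, 0, len(names))
			for _, name := range names {
				if filter.MatchString(name) {
					matched = append(matched, name)
				}
			}
			c.mu.Lock()
			c.queues = matched
			c.mu.Unlock()
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// snapshot returns a copy of the cached names for use during a scrape.
func (c *queueCache) snapshot() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.queues...)
}
```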
Thank you for the updated version. The endpoint is now working correctly, but we are still running into a timeout before the request finishes. I increased the exporter timeout to 60s and also increased the timeout on the route in OpenShift to 60s, but in the end it seems to take even longer. I am not sure how far I should push this timeout; it seems like the broker is already under heavy load if it cannot finish the request within 60s. We will try to reduce the number of queues and check whether something on the brokers is running wild, which might also slow down the requests.
In my load test I used a 10k Solace Cloud broker (4 cores) with no load on it. Can you provide:

Another possible solution would be to have a config file like this:

[solace]
listenAddr = 0.0.0.0:1920
enableTLS = false
scrapeUri = https://mr-connection-j6em7yuyi7k.messaging.solace.cloud:943
username = monitoring
password = monitoring
defaultVpn = AaaBbbCcc
timeout = 60s
sslVerify = false
# NEW option: the sleep time between prefetches
prefetchInterval = 60s
[endpoint.solace-v2-test]
QueueStats = AaaBbbCcc|queueName!=not/interesting*

As soon as this option is configured:

will then be:

Then Prometheus should be configured to use "honor_timestamps". But be aware that Prometheus may warn if it sees the same values twice. This is not a problem, but it can result in noisy Prometheus logs.
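
On the Prometheus side, a scrape job with honor_timestamps enabled keeps the timestamps the exporter attached at prefetch time. The job name, target and metrics_path below are placeholders for your setup; honor_timestamps and scrape_interval are standard Prometheus scrape_config fields:

```yaml
scrape_configs:
  - job_name: solace-exporter-prefetched
    honor_timestamps: true          # keep the timestamps set by the exporter at prefetch time
    scrape_interval: 60s
    metrics_path: /solace-v2-test   # placeholder: the endpoint configured above
    static_configs:
      - targets: ['solace-exporter:1920']
```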
We are using a 100k broker (8 cores max) with a lot of load on it :-)
Around 4500
Around 1500
It was 60s in the past, but since we do not even finish requests within that time now, it might go up to 300s
Just tried that - it took a whopping 105 seconds to finish. I assume that 1500 is still too big an amount. And if I understood you correctly, even the where-filter from SEMP v2 does not help much here, as it still has to go through all 4500 queues first and always filters on the groups of 100 it fetches at a time.
OK, that is an average of 2.1 sec per SEMP call. Do you use a dedicated exporter per broker? If yes, I would try to provide async scraping.
Please take the newest "testing" build:

[solace]
timeout=5s
# 0s means disabled. When an interval is set, all well-configured endpoints will be fetched asynchronously.
# This may help you to deal with a slower broker or an extreme amount of results.
prefetchInterval=30s
# Solace advises having fewer than 10 SEMP requests per second. This is a scarce resource that needs to be taken care of.
parallelSempConnections=1
[endpoint.solace-esb-sempv2]
QueueStatsV2=pksolocp01|queueName%21%3Desb%2Fq%2Fkfl%2Fsib%2Fdata%2Fin*|solace_queue_msg_shutdown_discarded,solace_queue_msg_max_redelivered_discarded
[endpoint.solace-esb-sempv1]
QueueStatsV1=pksolocp01|*

And run with the argument:

I would expect to see a log like:
This is just a first attempt to run everything async. But incremental scraping is still a hard thing. During my tests, I observed:

What we could possibly try:
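
One way a parallelSempConnections setting like the one above could be enforced is a counting semaphore around every SEMP call. A minimal sketch follows; the names are illustrative, not the exporter's actual implementation:

```go
package semp

import "context"

// sempSemaphore is a buffered channel used as a counting semaphore so that at
// most parallelSempConnections SEMP calls are in flight at the same time.
type sempSemaphore chan struct{}

func newSempSemaphore(parallelConnections int) sempSemaphore {
	if parallelConnections < 1 {
		parallelConnections = 1
	}
	return make(sempSemaphore, parallelConnections)
}

// do runs fn while holding one of the connection slots, or gives up if the
// context expires before a slot becomes free.
func (s sempSemaphore) do(ctx context.Context, fn func(ctx context.Context) error) error {
	select {
	case s <- struct{}{}:
		defer func() { <-s }()
		return fn(ctx)
	case <-ctx.Done():
		return ctx.Err()
	}
}
```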
Hi,
yesterday evening I thought about another (maybe) option.

Already wishing you happy holidays and a good start into the year 2024 :-)
SEMP over messaging is just a different transport protocol; the problem for you is the slow SEMP query, so I see no improvement potential there. What I find strange in your log is:

Did you try adding more CPU cores to the broker, or asking Solace how the performance might be improved? What do you use as
Oh, I was directly using your config from the comment above, with the 5s timeout.
I already created a ticket (just now, actually). I think I have your mail as well from the communication with the Swiss Airport colleague. If so, I will add you in CC on the ticket.
Thanks for this attempt, but the log is not changing. What I would expect to see is either:

OR

........... I found the issue:

too

I improved the error logging to make similar issues easier to detect.
PS: Better not to use URL encoding in the configuration, to improve readability.
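
For readability, the URL-encoded filter from the testing configuration above decodes to plain characters (%21%3D is !=, %2F is /), so the same endpoint can be written as:

```ini
[endpoint.solace-esb-sempv2]
QueueStatsV2=pksolocp01|queueName!=esb/q/kfl/sib/data/in*|solace_queue_msg_shutdown_discarded,solace_queue_msg_max_redelivered_discarded
```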
@JanGrosse I have not yet received any ticket information from Solace. I guess you will have to forward the response/communication with Solace manually.
@GreenRover, I updated the image this morning, removed the URL encoding and added the correct SEMP v1 endpoint. Regarding the CPU question: the broker itself has an upper limit of 8 CPUs but usually never reaches it (it stays between 3 and 4 CPUs used), so I am not sure adding more CPUs will change anything. Here is the CPU usage over the last week (barely scratching 4 CPUs):
OK, very strange results.
If I were to only scrape the SEMP v2 pages that have no results, it would take 109.2 sec, which is slower than a full scan with SEMP v1.

If you would do this:

[solace]
timeout=5s
# 0s means disabled. When an interval is set, all well-configured endpoints will be fetched asynchronously.
# This may help you to deal with a slower broker or an extreme amount of results.
prefetchInterval=15s
# Solace advises having fewer than 10 SEMP requests per second. This is a scarce resource that needs to be taken care of.
parallelSempConnections=1
[endpoint.solace-v1-test]
QueueStatsV1=pksolocp01|*

and scrape with Prometheus every 60 sec - would this be a solution for you?
Just for my understanding: the only difference between how the exporter does it now and how it would do it in the upcoming version is that it prefetches the results in an asynchronous fashion and, on scrape requests from Prometheus, basically serves the "cached" data? You also mentioned

Could you elaborate on what you mean by that? Is providing this many metrics as an endpoint expensive for the exporter itself?
Yes, that is correct. Just try to open the metrics page, which will only serve metrics from the cache; even that might take up to 2 seconds. I can try to skip uninteresting pages for SEMP v1, but looking at the statistics, SEMP v2 is so slow that it is completely out of the race. Sorry, but without doing all the Prometheus magic myself I cannot help further.
Understood, thank you very much for your assistance. From our point of view, we would be happy if this feature gets "officially" released in an upcoming version of the solace-prometheus-exporter. I will discuss the results internally with my colleagues as well (once they are all back from vacation) and move further discussions about improved scrape times into the Solace ticket. Thank you very much (again) and happy holidays.
Let's say it this way: Solace managed to respond without giving any answers. If the current solution helps you, I will do:

I guess, if the current SEMP v1 implementation brings no improvement, you should consider renaming your queues so you can query all of them with a reasonable number of SEMP v1 queries.
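
A hypothetical example of what such a renaming could enable, reusing the existing QueueStats filter syntax; the prefixes and endpoint names below are invented for illustration:

```ini
# After renaming queues into a small number of prefixes, each endpoint only
# has to page through the queues it really needs.
[endpoint.solace-critical]
QueueStats=pksolocp01|critical/*

[endpoint.solace-bulk]
QueueStats=pksolocp01|bulk/*
```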
With #65 I added a new behavior that may help here.
Can you please give the
@andreas-habel I am closing this issue due to inactivity on your side.
Hi Solace Community,
we're using the Prometheus exporter very successfully, but since we get more and more assets on our brokers, we see timeouts and gaps in our metrics. Error messages like
Can't scrape QueueDetailsSemp1 ..... context deadline exceeded
show up in the logs. One reason surely is that on one broker we have over 4500 queues, which is obviously too much and causes timeouts. Additionally, we're afraid that gathering those metrics every minute may harm the broker.
In theory, it would be fine if we could filter the gathered queue metrics, because roughly 50% of them are not as time critical - we could export those metrics less often.
However, the exporter does not support blacklisting queue names with a ! or similar. There just is no option to do something like this:
http://your-exporter:9628/solace?m.QueueStats=myVpn|!ARBON
And using whitelisting instead would require us to add or change the query constantly.
So basically the feature request would be to be able to exclude metric gathering for queues that start with a specific name/string.
Of course this is not even a problem of the exporter, since it just forwards the filter to the SEMP API.
Because I currently see no good solution for how this could be implemented, I wanted to create this post for a potential discussion.
How did Solace solve this with Datadog and Insights?
Do you have similar challenges or issues?