Huge fixture to test slow down #62

Closed
wants to merge 11 commits into main from rdruon/EX-10117

Conversation

@RDruon
Contributor

RDruon commented Jul 9, 2024

Draft PR that includes a big fixture from jobstats to investigate performance slow down.


codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 85.74181% with 74 lines in your changes missing coverage. Please review.

Project coverage is 93.57%. Comparing base (0c3cd61) to head (a41a546).

Files                                   Patch %   Lines
lustrefs-exporter/src/main.rs             0.00%   63 Missing ⚠️
lustre-collector/src/oss/job_stats.rs    96.26%   2 Missing and 7 partials ⚠️
lustre-collector/src/lib.rs               0.00%   1 Missing ⚠️
lustre-collector/src/mds/job_stats.rs    99.24%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #62      +/-   ##
==========================================
- Coverage   94.28%   93.57%   -0.71%     
==========================================
  Files          43       43              
  Lines        5582     6071     +489     
  Branches     5582     6071     +489     
==========================================
+ Hits         5263     5681     +418     
- Misses        296      360      +64     
- Partials       23       30       +7     
Flag             Coverage Δ
2_14_0_ddn133    34.71% <48.02%> (+0.98%) ⬆️
2_14_0_ddn145    35.34% <48.02%> (+0.92%) ⬆️
all-tests        93.57% <85.74%> (-0.71%) ⬇️

Flags with carried forward coverage won't be shown.


@RDruon
Contributor Author

RDruon commented Jul 9, 2024

Since I am having issues with perf, here is the sample code I am using:

    use std::time::Instant;

    let repeat = 100000;

    let job = r#"
- job_id:          "SLURM_JOB_machine184_74186:0:ma"
  snapshot_time:   1720516680
  read_bytes:      { samples:           0, unit: bytes, min:        0, max:        0, sum:                0, sumsq:                  0 }
  write_bytes:     { samples:          52, unit: bytes, min:     4096, max:   475136, sum:          5468160, sumsq:      1071040692224 }
  read:            { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  write:           { samples:          52, unit: usecs, min:       12, max:    40081, sum:           692342, sumsq:        17432258604 }
  getattr:         { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  setattr:         { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  punch:           { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  sync:            { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  destroy:         { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  create:          { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  statfs:          { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  get_info:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  set_info:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  quotactl:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  prealloc:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }"#;
    let lctl_output = r#"obdfilter.ds002-OST0000.job_stats=
job_stats:"#;

    let input = format!("{lctl_output}{}", job.repeat(repeat));

    let now = Instant::now();

    let lctl_record = parse_lctl_output(input.as_bytes())?;

    let elapsed = now.elapsed();
    println!("'parse_lctl_output' took: {:.2?}", elapsed);

    
    let input = format!("job_stats:{}", job.repeat(repeat));
    
    let now = Instant::now();

    serde_yaml::from_str::<JobStatsOst>(&input).unwrap();

    let elapsed = now.elapsed();
    println!("'serde_yaml' took: {:.2?}", elapsed);

Here is the output:

'parse_lctl_output' took: 38.14s
'serde_yaml' took: 11.31s
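
For more stable numbers than a one-off Instant measurement, the same parse could be wrapped in a criterion benchmark. This is only a sketch: it assumes criterion is added as a dev-dependency, that parse_lctl_output is the public parser exported by the lustre-collector crate, and that fixtures/huge_jobstats.txt is a hypothetical file holding the repeated fixture above.

    // Hypothetical benches/parse_jobstats.rs (the bench target needs `harness = false`).
    use criterion::{criterion_group, criterion_main, Criterion};
    use lustre_collector::parse_lctl_output;

    fn bench_parse(c: &mut Criterion) {
        // Load the repeated jobstats fixture once, outside the measured loop.
        let input = std::fs::read("fixtures/huge_jobstats.txt").expect("fixture");

        c.bench_function("parse_lctl_output", |b| {
            b.iter(|| parse_lctl_output(&input).unwrap())
        });
    }

    criterion_group!(benches, bench_parse);
    criterion_main!(benches);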

@RDruon
Contributor Author

RDruon commented Jul 16, 2024

Caching the jobstats gives us a very low response time:

[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
3393045

real    0m1.232s
user    0m0.526s
sys     0m0.621s

The memory stays stable at ~2.3 GB, with peaks at ~6 GB when processing jobstats:

[root@ds86 ~]# systemctl status lustrefs_exporter | grep Memory
   Memory: 5.1G
[root@ds86 ~]# systemctl status lustrefs_exporter | grep Memory
   Memory: 2.0G

@RDruon force-pushed the rdruon/EX-10117 branch from 92e5047 to f3513a7 on July 17, 2024 at 17:42
@RDruon force-pushed the rdruon/EX-10117 branch from f3513a7 to b5dd7ce on July 18, 2024 at 07:04
@RDruon
Contributor Author

RDruon commented Jul 18, 2024

Using the latest commit, I was able to divide the parsing time by ~3:

'parse_lctl_output' took: 23.93s
'parse_lctl_output_old' took: 75.41s

It relies on the zero-copy range parsers from combine: https://docs.rs/combine/latest/combine/parser/range/index.html
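
As a rough illustration of the zero-copy idea (not the PR's actual grammar, which lives in lustre-collector), a combine range parser can return string slices borrowed straight from the input instead of allocating owned Strings:

    // Minimal sketch, assuming combine 4.x; the grammar here is hypothetical.
    use combine::parser::char::string;
    use combine::parser::range::take_while1;
    use combine::{EasyParser, Parser};

    fn main() {
        // Match `job_id:`, skip the spaces, then borrow the rest of the line as &str.
        let mut job_id = string("job_id:")
            .skip(take_while1(|c: char| c.is_whitespace()))
            .with(take_while1(|c: char| c != '\n'));

        let input = "job_id:          \"SLURM_JOB_machine184_74186:0:ma\"\n";
        let (value, _rest) = job_id.easy_parse(input).expect("parse failed");

        // `value` is a slice of `input`: no copy was made.
        assert_eq!(value, "\"SLURM_JOB_machine184_74186:0:ma\"");
    }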

The memory usage seems to be the same, but I will need to test the previous implementation with the same script:

[root@node ~]# for i in $(seq 0 1000); do systemctl status lustrefs_exporter.service |grep Memory >> /tmp/test_memory; curl -s http://localhost:32221/metrics -o /dev/null; sleep 1; done; cat /tmp/test_memory | cut -d':' -f2 | numfmt --from=iec | awk -v OFMT='%.10f' '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}' | xargs -n1 numfmt --to=iec | awk '{if (NR==1) avg=$1; else if (NR==2) max=$1; else if (NR==3) min=$1} END {print "avg " avg " | max " max " | min " min}'
avg 3.7G | max 7.5G | min 1.4G

Please find attached the samply profile from the latest commit: profile.zip. You can use https://profiler.firefox.com/ to visualize it.

From the flamegraph, it seems most of the time is now spent deserializing the YAML blob.

Next steps are:

  • Profile memory usage (most likely dominated by the YAML deserialization) with bytehound or heaptrack
  • Implement a zero-copy parser for the YAML blob; it does not need to be fully YAML-compliant, just good enough to ingest the jobstats YAML (see the sketch below)
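
As a hypothetical sketch of that "good enough" direction (field layout taken from the fixture above, everything else invented), a hand-rolled extractor can pull a counter out of a jobstats line purely by slicing, with no serde_yaml and no allocation:

    /// Extracts the stat name and its `samples` count from one jobstats line,
    /// returning slices borrowed from the input. Not YAML-compliant by design.
    fn samples_from_line(line: &str) -> Option<(&str, u64)> {
        let (name, rest) = line.trim_start().split_once(':')?;
        let digits = rest.split_once("samples:")?.1.trim_start();
        let end = digits
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(digits.len());

        Some((name, digits[..end].parse().ok()?))
    }

    fn main() {
        let line = "  write_bytes:     { samples:          52, unit: bytes, min: 4096 }";
        assert_eq!(samples_from_line(line), Some(("write_bytes", 52)));
    }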

@RDruon force-pushed the rdruon/EX-10117 branch 2 times, most recently from 0dfdd69 to 35a1320 on July 19, 2024 at 14:27
@RDruon force-pushed the rdruon/EX-10117 branch from 35a1320 to d776a9d on July 19, 2024 at 14:38
@RDruon
Contributor Author

RDruon commented Jul 19, 2024

Sub 10s (on my test system) with the latest commit:

'read_to_string' took: 222.52ms
'parse_lctl_output' took: 8.96s

On a real system

[root@ds86 ~]# time ./lustre_collector > /dev/null

real    0m5.466s
user    0m2.464s
sys     0m2.950s

@jgrund
Member

jgrund commented Jul 19, 2024

> Sub 10s (on my test system) with the latest commit:
>
> 'read_to_string' took: 222.52ms
> 'parse_lctl_output' took: 8.96s
>
> On a real system
>
> [root@ds86 ~]# time ./lustre_collector > /dev/null
>
> real    0m5.466s
> user    0m2.464s
> sys     0m2.950s

Good improvement. Is this compiled in release mode? Also, what does current memory usage look like?

@jgrund added the bug label on Jul 19, 2024
@breuhan
Contributor

breuhan commented Jul 20, 2024

Yes, it actually has to be compiled in release mode in order to work; there is a slight problem in debug mode. The memory consumption is down to 440 to 650 MB, which comes down to reading the source file into memory.
There is still a ~2 GB spike at the end when using the lustrefs_exporter, due to the conversion into the Prometheus string.

@jgrund
Member

jgrund commented Jul 21, 2024

> The memory consumption is down to 440 to 650 MB, which comes down to reading the source file into memory.

Which source file?

> There is still a ~2 GB spike at the end when using the lustrefs_exporter, due to the conversion into the Prometheus string.

PrometheusMetric::render() ?

@breuhan
Contributor

breuhan commented Jul 21, 2024

> Which source file?

In the case of our fixture, it is the read_to_string() call that reads ds86.txt.

> PrometheusMetric::render() ?

Exactly, Raphael thought about using the official prometheus crate, which supports writing to a stream.
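
For reference, a minimal sketch of that approach with the prometheus crate (illustrative metric names, not the exporter's code): TextEncoder encodes gathered metric families into any std::io::Write, so output can go to a buffer or socket incrementally instead of being built as one large String first.

    use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let registry = Registry::new();
        let counter = IntCounter::new("demo_requests_total", "demo counter")?;
        registry.register(Box::new(counter.clone()))?;
        counter.inc_by(42);

        // Encode directly into a Write implementor (here a Vec<u8>).
        let encoder = TextEncoder::new();
        let mut out = Vec::new();
        encoder.encode(&registry.gather(), &mut out)?;

        println!("{}", String::from_utf8(out)?);
        Ok(())
    }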

@jgrund
Member

jgrund commented Jul 21, 2024

> Exactly, Raphael thought about using the official prometheus crate, which supports writing to a stream.

Link to method for consideration?

@breuhan
Contributor

breuhan commented Jul 22, 2024

@RDruon
Contributor Author

RDruon commented Jul 23, 2024

Caching of jobstats metrics is mandatory:

  • To avoid timing out
  • To handle the multi-consumer use case: without a cache, jobstats would be parsed for each scrape; we want to avoid duplicating that work, which would increase CPU and memory consumption (a rough caching sketch follows below)
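
A minimal sketch of such a cache (hypothetical names, std-only, not the PR's implementation): a background thread periodically re-renders the jobstats output, and each scrape only reads the shared string.

    use std::sync::{Arc, RwLock};
    use std::thread;
    use std::time::Duration;

    fn main() {
        // Shared cache of the rendered jobstats metrics.
        let cache: Arc<RwLock<String>> = Arc::new(RwLock::new(String::new()));

        let writer = Arc::clone(&cache);
        thread::spawn(move || loop {
            // Placeholder for "run lctl, parse jobstats, render metrics".
            let rendered = String::from("lustre_job_stats_total 1");
            *writer.write().unwrap() = rendered;
            thread::sleep(Duration::from_secs(10));
        });

        // A scrape handler would just read the cache instead of re-parsing jobstats.
        thread::sleep(Duration::from_millis(50));
        let body = cache.read().unwrap().clone();
        println!("{body}");
    }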

@RDruon
Contributor Author

RDruon commented Jul 24, 2024

Here are the results of the latest commit running on a live system.

With jobstats enabled

[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.019s
user    0m0.007s
sys     0m0.002s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.021s
user    0m0.006s
sys     0m0.002s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.419s
user    0m0.193s
sys     0m0.297s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.451s
user    0m0.215s
sys     0m0.316s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.410s
user    0m0.189s
sys     0m0.290s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.430s
user    0m0.215s
sys     0m0.276s
[root@ds86 ~]# for i in $(seq 0 100); do systemctl status lustrefs_exporter.service |grep Memory >> /tmp/test_memory; curl -s http://localhost:32221/metrics -o /dev/null; sleep 1; done; cat /tmp/test_memory | cut -d':' -f2 | numfmt --from=iec | awk -v OFMT='%.10f' '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}' | xargs -n1 numfmt --to=iec | awk '{if (NR==1) avg=$1; else if (NR==2) max=$1; else if (NR==3) min=$1} END {print "avg " avg " | max " max " | min " min}'
avg 1.8G | max 2.5G | min 753M
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
70.1  2.0 /usr/bin/lustrefs_exporter
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
70.2  2.2 /usr/bin/lustrefs_exporter

After some time (~10 s) the jobstats cache is warm and the exporter starts outputting jobstats metrics.
The average response time is ~0.5s.
The average memory consumption is ~2GB, with a max of 2.5GB and a min of 800MB.
The average CPU consumption is ~70% (not 70% of the whole CPU, but 70% of a single thread).

With jobstats disabled

[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.019s
user    0m0.005s
sys     0m0.004s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.021s
user    0m0.005s
sys     0m0.004s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.019s
user    0m0.004s
sys     0m0.004s

[root@ds86 ~]# for i in $(seq 0 100); do systemctl status lustrefs_exporter.service |grep Memory >> /tmp/test_memory; curl -s http://localhost:32221/metrics -o /dev/null; sleep 1; done; cat /tmp/test_memory | cut -d':' -f2 | numfmt --from=iec | awk -v OFMT='%.10f' '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}' | xargs -n1 numfmt --to=iec | awk '{if (NR==1) avg=$1; else if (NR==2) max=$1; else if (NR==3) min=$1} END {print "avg " avg " | max " max " | min " min}'
avg 4.1M | max 5.0M | min 3.9M
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
 0.0  0.0 /usr/bin/lustrefs_exporter --disable-jobstats
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
 0.0  0.0 /usr/bin/lustrefs_exporter --disable-jobstats

The average response time is ~0.02s.
The average memory consumption is ~4MB, with a max of 5MB and a min of 3.9MB.
The average CPU consumption is ~0% (not ~0% of the whole CPU, but ~0% of a single thread).

Summary

                 Jobstats   No Jobstats
Memory           ~2GB       ~5MB
CPU (thread)     ~70%       ~0%
Response time    ~0.5s      ~0.02s

@RDruon force-pushed the rdruon/EX-10117 branch from 8d18ae9 to 971909a on July 25, 2024 at 08:08
@RDruon force-pushed the rdruon/EX-10117 branch from 971909a to a809c3a on July 25, 2024 at 08:10
Comment on lines -124 to -145
let mgs_fs_handle = thread::spawn(move || -> Result<Vec<Record>, LustreCollectorError> {
    let lctl_output = get_lctl_mgs_fs_output()?;
    let lctl_record = parse_mgs_fs_output(&lctl_output)?;

    Ok(lctl_record)
});

let lnetctl_stats_handle =
    thread::spawn(move || -> Result<Vec<Record>, LustreCollectorError> {
        let lnetctl_stats_output = get_lnetctl_stats_output()?;
        let lnetctl_stats_record = parse_lnetctl_stats(str::from_utf8(&lnetctl_stats_output)?)?;

        Ok(lnetctl_stats_record)
    });

let recovery_status_handle =
    thread::spawn(move || -> Result<Vec<Record>, LustreCollectorError> {
        let recovery_status_output = get_recovery_status_output()?;
        let recovery_statuses = parse_recovery_status_output(&recovery_status_output)?;

        Ok(recovery_statuses)
    });
Contributor Author


Removed thread::spawn as the result now references a variable local to the thread.
I tried to use Arc without success.

@jgrund any idea?
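
One possible direction (sketch only, with made-up data, not the collector's code): std::thread::scope, stable since Rust 1.63, lets spawned threads borrow non-'static data from the caller, which avoids the 'static bound that plain thread::spawn imposes and may remove the need for Arc here.

    use std::thread;

    fn main() {
        let lctl_output = String::from("obdfilter.ds002-OST0000.job_stats=");

        let parts: Vec<&str> = thread::scope(|s| {
            let handle = s.spawn(|| {
                // Borrows `lctl_output` directly; no Arc or clone required.
                lctl_output.split('.').collect::<Vec<&str>>()
            });

            handle.join().unwrap()
        });

        println!("{parts:?}");
    }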

@jgrund closed this on Aug 7, 2024
@jgrund
Member

jgrund commented Aug 7, 2024

No longer relevant.

@jgrund deleted the rdruon/EX-10117 branch on August 7, 2024 at 19:53