Huge fixture to test slow down #62

Closed
wants to merge 11 commits into main from rdruon/EX-10117

Conversation

@RDruon
Contributor

RDruon commented Jul 9, 2024

Draft PR that includes a big fixture from jobstats to investigate performance slow down.


codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 85.74181% with 74 lines in your changes missing coverage. Please review.

Project coverage is 93.57%. Comparing base (0c3cd61) to head (a41a546).

Files                                   Patch %   Lines
lustrefs-exporter/src/main.rs             0.00%   63 Missing ⚠️
lustre-collector/src/oss/job_stats.rs    96.26%   2 Missing and 7 partials ⚠️
lustre-collector/src/lib.rs               0.00%   1 Missing ⚠️
lustre-collector/src/mds/job_stats.rs    99.24%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #62      +/-   ##
==========================================
- Coverage   94.28%   93.57%   -0.71%     
==========================================
  Files          43       43              
  Lines        5582     6071     +489     
  Branches     5582     6071     +489     
==========================================
+ Hits         5263     5681     +418     
- Misses        296      360      +64     
- Partials       23       30       +7     
Flag             Coverage Δ
2_14_0_ddn133    34.71% <48.02%> (+0.98%) ⬆️
2_14_0_ddn145    35.34% <48.02%> (+0.92%) ⬆️
all-tests        93.57% <85.74%> (-0.71%) ⬇️

Flags with carried forward coverage won't be shown.


@RDruon
Contributor Author

RDruon commented Jul 9, 2024

Since I am having issues with perf, here is the sample code I am using:

    use std::time::Instant;

    let repeat = 100000;

    let job = r#"
- job_id:          "SLURM_JOB_machine184_74186:0:ma"
  snapshot_time:   1720516680
  read_bytes:      { samples:           0, unit: bytes, min:        0, max:        0, sum:                0, sumsq:                  0 }
  write_bytes:     { samples:          52, unit: bytes, min:     4096, max:   475136, sum:          5468160, sumsq:      1071040692224 }
  read:            { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  write:           { samples:          52, unit: usecs, min:       12, max:    40081, sum:           692342, sumsq:        17432258604 }
  getattr:         { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  setattr:         { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  punch:           { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  sync:            { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  destroy:         { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  create:          { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  statfs:          { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  get_info:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  set_info:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  quotactl:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }
  prealloc:        { samples:           0, unit: usecs, min:        0, max:        0, sum:                0, sumsq:                  0 }"#;
    let lctl_output = r#"obdfilter.ds002-OST0000.job_stats=
job_stats:"#;

    let input = format!("{lctl_output}{}", job.repeat(repeat));

    let now = Instant::now();

    let lctl_record = parse_lctl_output(input.as_bytes())?;

    let elapsed = now.elapsed();
    println!("'parse_lctl_output' took: {:.2?}", elapsed);

    
    let input = format!("job_stats:{}", job.repeat(repeat));
    
    let now = Instant::now();

    serde_yaml::from_str::<JobStatsOst>(&input).unwrap();

    let elapsed = now.elapsed();
    println!("'serde_yaml' took: {:.2?}", elapsed);

Here is the output:

'parse_lctl_output' took: 38.14s
'serde_yaml' took: 11.31s
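
For more stable numbers than a one-off Instant measurement, the same parse could be wrapped in a criterion benchmark. This is only a sketch: it assumes criterion is added as a dev-dependency, that parse_lctl_output is the public parser exported by the lustre-collector crate, and that fixtures/huge_jobstats.txt is a hypothetical file holding the repeated fixture above.

    // Hypothetical benches/parse_jobstats.rs (the bench target needs `harness = false`).
    use criterion::{criterion_group, criterion_main, Criterion};
    use lustre_collector::parse_lctl_output;

    fn bench_parse(c: &mut Criterion) {
        // Load the repeated jobstats fixture once, outside the measured loop.
        let input = std::fs::read("fixtures/huge_jobstats.txt").expect("fixture");

        c.bench_function("parse_lctl_output", |b| {
            b.iter(|| parse_lctl_output(&input).unwrap())
        });
    }

    criterion_group!(benches, bench_parse);
    criterion_main!(benches);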

@RDruon
Contributor Author

RDruon commented Jul 16, 2024

Caching the jobstats gives us a very low response time:

[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
3393045

real    0m1.232s
user    0m0.526s
sys     0m0.621s

The memory stays stable at ~2.3 GB, with peaks at ~6 GB when processing jobstats:

[root@ds86 ~]# systemctl status lustrefs_exporter | grep Memory
   Memory: 5.1G
[root@ds86 ~]# systemctl status lustrefs_exporter | grep Memory
   Memory: 2.0G

@RDruon force-pushed the rdruon/EX-10117 branch from 92e5047 to f3513a7 on July 17, 2024 at 17:42
@RDruon force-pushed the rdruon/EX-10117 branch from f3513a7 to b5dd7ce on July 18, 2024 at 07:04
@RDruon
Contributor Author

RDruon commented Jul 18, 2024

Using the latest commit, I was able to divide the parsing time by ~3:

'parse_lctl_output' took: 23.93s
'parse_lctl_output_old' took: 75.41s

It relies on the zero-copy range parsers from combine: https://docs.rs/combine/latest/combine/parser/range/index.html
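
As a rough illustration of the zero-copy idea (not the PR's actual grammar, which lives in lustre-collector), a combine range parser can return string slices borrowed straight from the input instead of allocating owned Strings:

    // Minimal sketch, assuming combine 4.x; the grammar here is hypothetical.
    use combine::parser::char::string;
    use combine::parser::range::take_while1;
    use combine::{EasyParser, Parser};

    fn main() {
        // Match `job_id:`, skip the spaces, then borrow the rest of the line as &str.
        let mut job_id = string("job_id:")
            .skip(take_while1(|c: char| c.is_whitespace()))
            .with(take_while1(|c: char| c != '\n'));

        let input = "job_id:          \"SLURM_JOB_machine184_74186:0:ma\"\n";
        let (value, _rest) = job_id.easy_parse(input).expect("parse failed");

        // `value` is a slice of `input`: no copy was made.
        assert_eq!(value, "\"SLURM_JOB_machine184_74186:0:ma\"");
    }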

The memory usage seems to be the same, but I will need to test the previous implementation with the same script:

[root@node ~]# for i in $(seq 0 1000); do systemctl status lustrefs_exporter.service |grep Memory >> /tmp/test_memory; curl -s http://localhost:32221/metrics -o /dev/null; sleep 1; done; cat /tmp/test_memory | cut -d':' -f2 | numfmt --from=iec | awk -v OFMT='%.10f' '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}' | xargs -n1 numfmt --to=iec | awk '{if (NR==1) avg=$1; else if (NR==2) max=$1; else if (NR==3) min=$1} END {print "avg " avg " | max " max " | min " min}'
avg 3.7G | max 7.5G | min 1.4G

Please find attached the samply profile from the latest commit: profile.zip. You can use https://profiler.firefox.com/ to visualize it.

From the flamegraph, it seems most of the time is now spent deserializing the YAML blob.

Next steps are:

  • Profile memory usage (most likely dominated by the YAML deserialization) with bytehound or heaptrack
  • Implement a zero-copy parser for the YAML blob; it does not need to be fully YAML-compliant, just good enough to ingest the jobstats YAML (see the sketch below)
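
As a hypothetical sketch of that "good enough" direction (field layout taken from the fixture above, everything else invented), a hand-rolled extractor can pull a counter out of a jobstats line purely by slicing, with no serde_yaml and no allocation:

    /// Extracts the stat name and its `samples` count from one jobstats line,
    /// returning slices borrowed from the input. Not YAML-compliant by design.
    fn samples_from_line(line: &str) -> Option<(&str, u64)> {
        let (name, rest) = line.trim_start().split_once(':')?;
        let digits = rest.split_once("samples:")?.1.trim_start();
        let end = digits
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(digits.len());

        Some((name, digits[..end].parse().ok()?))
    }

    fn main() {
        let line = "  write_bytes:     { samples:          52, unit: bytes, min: 4096 }";
        assert_eq!(samples_from_line(line), Some(("write_bytes", 52)));
    }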

@RDruon force-pushed the rdruon/EX-10117 branch 2 times, most recently from 0dfdd69 to 35a1320 on July 19, 2024 at 14:27
@RDruon force-pushed the rdruon/EX-10117 branch from 35a1320 to d776a9d on July 19, 2024 at 14:38
@RDruon
Contributor Author

RDruon commented Jul 19, 2024

Sub 10s (on my test system) with the latest commit:

'read_to_string' took: 222.52ms
'parse_lctl_output' took: 8.96s

On a real system

[root@ds86 ~]# time ./lustre_collector > /dev/null

real    0m5.466s
user    0m2.464s
sys     0m2.950s

@jgrund
Member

jgrund commented Jul 19, 2024

> Sub 10s (on my test system) with the latest commit:
>
> 'read_to_string' took: 222.52ms
> 'parse_lctl_output' took: 8.96s
>
> On a real system
>
> [root@ds86 ~]# time ./lustre_collector > /dev/null
>
> real    0m5.466s
> user    0m2.464s
> sys     0m2.950s

Good improvement. Is this compiled in release mode? Also, what does current memory usage look like?

@jgrund added the bug label on Jul 19, 2024
@breuhan
Contributor

breuhan commented Jul 20, 2024

Yes, it actually has to be compiled in release mode in order to work; there is a slight problem in debug mode. The memory consumption is down to 440 to 650 MB, which comes down to reading the source file into memory.
There is still a ~2 GB spike at the end when using the lustrefs_exporter, due to the conversion into the Prometheus string.

@jgrund
Member

jgrund commented Jul 21, 2024

> The memory consumption is down to 440 to 650 MB, which comes down to reading the source file into memory.

Which source file?

> There is still a ~2 GB spike at the end when using the lustrefs_exporter, due to the conversion into the Prometheus string.

PrometheusMetric::render() ?

@breuhan
Contributor

breuhan commented Jul 21, 2024

> Which source file?

In the case of our fixture, it is the read_to_string() call that reads ds86.txt.

> PrometheusMetric::render() ?

Exactly, Raphael thought about using the official prometheus crate, which supports writing to a stream.
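
For reference, a minimal sketch of that approach with the prometheus crate (illustrative metric names, not the exporter's code): TextEncoder encodes gathered metric families into any std::io::Write, so output can go to a buffer or socket incrementally instead of being built as one large String first.

    use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let registry = Registry::new();
        let counter = IntCounter::new("demo_requests_total", "demo counter")?;
        registry.register(Box::new(counter.clone()))?;
        counter.inc_by(42);

        // Encode directly into a Write implementor (here a Vec<u8>).
        let encoder = TextEncoder::new();
        let mut out = Vec::new();
        encoder.encode(&registry.gather(), &mut out)?;

        println!("{}", String::from_utf8(out)?);
        Ok(())
    }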

@jgrund
Member

jgrund commented Jul 21, 2024

> Exactly, Raphael thought about using the official prometheus crate, which supports writing to a stream.

Link to method for consideration?

@breuhan
Contributor

breuhan commented Jul 22, 2024

@RDruon
Contributor Author

RDruon commented Jul 23, 2024

Caching of jobstats metrics is mandatory:

  • To avoid timing out
  • To handle the multi-consumer use case: without a cache, jobstats would be parsed for each scrape; we want to avoid duplicating that work, which would increase CPU and memory consumption (a rough caching sketch follows below)
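
A minimal sketch of such a cache (hypothetical names, std-only, not the PR's implementation): a background thread periodically re-renders the jobstats output, and each scrape only reads the shared string.

    use std::sync::{Arc, RwLock};
    use std::thread;
    use std::time::Duration;

    fn main() {
        // Shared cache of the rendered jobstats metrics.
        let cache: Arc<RwLock<String>> = Arc::new(RwLock::new(String::new()));

        let writer = Arc::clone(&cache);
        thread::spawn(move || loop {
            // Placeholder for "run lctl, parse jobstats, render metrics".
            let rendered = String::from("lustre_job_stats_total 1");
            *writer.write().unwrap() = rendered;
            thread::sleep(Duration::from_secs(10));
        });

        // A scrape handler would just read the cache instead of re-parsing jobstats.
        thread::sleep(Duration::from_millis(50));
        let body = cache.read().unwrap().clone();
        println!("{body}");
    }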

@RDruon
Contributor Author

RDruon commented Jul 24, 2024

Here are the results of the latest commit running on a live system.

With jobstats enabled

[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.019s
user    0m0.007s
sys     0m0.002s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.021s
user    0m0.006s
sys     0m0.002s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.419s
user    0m0.193s
sys     0m0.297s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.451s
user    0m0.215s
sys     0m0.316s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.410s
user    0m0.189s
sys     0m0.290s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
1652586

real    0m0.430s
user    0m0.215s
sys     0m0.276s
[root@ds86 ~]# for i in $(seq 0 100); do systemctl status lustrefs_exporter.service |grep Memory >> /tmp/test_memory; curl -s http://localhost:32221/metrics -o /dev/null; sleep 1; done; cat /tmp/test_memory | cut -d':' -f2 | numfmt --from=iec | awk -v OFMT='%.10f' '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}' | xargs -n1 numfmt --to=iec | awk '{if (NR==1) avg=$1; else if (NR==2) max=$1; else if (NR==3) min=$1} END {print "avg " avg " | max " max " | min " min}'
avg 1.8G | max 2.5G | min 753M
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
70.1  2.0 /usr/bin/lustrefs_exporter
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
70.2  2.2 /usr/bin/lustrefs_exporter

After some time (~10 s) the jobstats cache is warm and the exporter starts outputting jobstats metrics.
The average response time is ~0.5s.
The average memory consumption is ~2GB, with a max of 2.5GB and a min of 800MB.
The average CPU consumption is ~70% (not 70% of the whole CPU, but 70% of a single thread).

With jobstats disabled

[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.019s
user    0m0.005s
sys     0m0.004s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.021s
user    0m0.005s
sys     0m0.004s
[root@ds86 ~]# time curl -s http://localhost:32221/metrics | grep job | wc -l
0

real    0m0.019s
user    0m0.004s
sys     0m0.004s

[root@ds86 ~]# for i in $(seq 0 100); do systemctl status lustrefs_exporter.service |grep Memory >> /tmp/test_memory; curl -s http://localhost:32221/metrics -o /dev/null; sleep 1; done; cat /tmp/test_memory | cut -d':' -f2 | numfmt --from=iec | awk -v OFMT='%.10f' '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}' | xargs -n1 numfmt --to=iec | awk '{if (NR==1) avg=$1; else if (NR==2) max=$1; else if (NR==3) min=$1} END {print "avg " avg " | max " max " | min " min}'
avg 4.1M | max 5.0M | min 3.9M
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
 0.0  0.0 /usr/bin/lustrefs_exporter --disable-jobstats
[root@ds86 ~]# ps -C lustrefs_exporter -o %cpu,%mem,cmd
%CPU %MEM CMD
 0.0  0.0 /usr/bin/lustrefs_exporter --disable-jobstats

The average response time is ~0.02s.
The average memory consumption is ~4MB, with a max of 5MB and a min of 3.9MB.
The average CPU consumption is ~0% (not ~0% of the whole CPU, but ~0% of a single thread).

Summary

                 Jobstats   No Jobstats
Memory           ~2GB       ~5MB
CPU (thread)     ~70%       ~0%
Response time    ~0.5s      ~0.02s

@RDruon force-pushed the rdruon/EX-10117 branch from 8d18ae9 to 971909a on July 25, 2024 at 08:08
@RDruon force-pushed the rdruon/EX-10117 branch from 971909a to a809c3a on July 25, 2024 at 08:10
Comment on lines -124 to -145
let mgs_fs_handle = thread::spawn(move || -> Result<Vec<Record>, LustreCollectorError> {
    let lctl_output = get_lctl_mgs_fs_output()?;
    let lctl_record = parse_mgs_fs_output(&lctl_output)?;

    Ok(lctl_record)
});

let lnetctl_stats_handle =
    thread::spawn(move || -> Result<Vec<Record>, LustreCollectorError> {
        let lnetctl_stats_output = get_lnetctl_stats_output()?;
        let lnetctl_stats_record = parse_lnetctl_stats(str::from_utf8(&lnetctl_stats_output)?)?;

        Ok(lnetctl_stats_record)
    });

let recovery_status_handle =
    thread::spawn(move || -> Result<Vec<Record>, LustreCollectorError> {
        let recovery_status_output = get_recovery_status_output()?;
        let recovery_statuses = parse_recovery_status_output(&recovery_status_output)?;

        Ok(recovery_statuses)
    });
Contributor Author


Removed thread::spawn as the result now references a variable local to the thread.
I tried to use Arc without success.

@jgrund any idea?
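
One possible direction (sketch only, with made-up data, not the collector's code): std::thread::scope, stable since Rust 1.63, lets spawned threads borrow non-'static data from the caller, which avoids the 'static bound that plain thread::spawn imposes and may remove the need for Arc here.

    use std::thread;

    fn main() {
        let lctl_output = String::from("obdfilter.ds002-OST0000.job_stats=");

        let parts: Vec<&str> = thread::scope(|s| {
            let handle = s.spawn(|| {
                // Borrows `lctl_output` directly; no Arc or clone required.
                lctl_output.split('.').collect::<Vec<&str>>()
            });

            handle.join().unwrap()
        });

        println!("{parts:?}");
    }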

@jgrund closed this on Aug 7, 2024
@jgrund
Member

jgrund commented Aug 7, 2024

No longer relevant.

@jgrund deleted the rdruon/EX-10117 branch on August 7, 2024 at 19:53