Investigate performance of Score Runs #270

Open
MrSerth opened this issue Jan 4, 2023 · 8 comments
Labels
deployment (Everything related to our production environment) · enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@MrSerth
Member

MrSerth commented Jan 4, 2023

Some score runs triggered in CodeOcean are notoriously slow and take quite some time to finish. This has led some teachers to spend a lot of time combining various test cases just to reduce the scoring time. Typical optimizations are to squash all test cases into one file, to skip the linter (in Python), or to not use a testing framework at all. Especially putting all tests into one file is somewhat understandable, as test files are executed sequentially in the current workflow.

Obviously, we want to optimize our tool rather than putting more effort on teachers to optimize their tests. Therefore, before taking any specific steps, we should get a better understanding of the score runs and collect some actual profiling data. The data could help us answer the following questions:

How much time is needed ...

  • in CodeOcean
    • to save an exercise before the actual scoring request is issued
    • to collect files
    • to request a runner
    • to send files to Poseidon
  • in Poseidon
    • to copy files to an allocation in Nomad
    • to prepare an execution
  • in the allocation
    • to compile the code (if applicable, e.g., for Java)
    • to execute the actual test cases
  • in CodeOcean
    • to parse the scoring result using the RegEx
    • to store the output of the test run (and other post-processing tasks)
    • to forward the response

Based on these numbers, we could identify the main pain points and tackle them individually.
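As a minimal illustration of how such per-phase timings could be captured (in Go, matching Poseidon; the phase names and sleeps below are placeholders, not actual CodeOcean or Poseidon functions):

```go
package main

import (
	"fmt"
	"time"
)

// measure runs fn and reports how long it took; the phases in main are
// placeholders standing in for the real steps listed above.
func measure(name string, fn func()) time.Duration {
	start := time.Now()
	fn()
	elapsed := time.Since(start)
	fmt.Printf("%-20s %v\n", name, elapsed)
	return elapsed
}

func main() {
	measure("request runner", func() { time.Sleep(10 * time.Millisecond) })
	measure("copy files", func() { time.Sleep(5 * time.Millisecond) })
	measure("execute tests", func() { time.Sleep(50 * time.Millisecond) })
}
```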

@MrSerth added the enhancement, help wanted, and deployment labels on Jan 4, 2023
@mpass99
Collaborator

mpass99 commented Jan 30, 2023

I see two methods of collecting the required data:

  • Add log statements including the duration of the individual parts; manually collect the data of interest.
  • Add an Influx client to CodeOcean (or use the Prometheus endpoint?); send the duration data for the individual parts to our Influx server; create appropriate panels.

I prefer the second option, as it allows easy continuous evaluation, although it requires more effort now. What do you think?
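A minimal sketch of the second option, assuming the official influxdb-client-go v2 library; the URL, token, org, bucket, measurement, and tag names are placeholders, not values from our setup:

```go
package main

import (
	"context"
	"time"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

func main() {
	// Placeholder connection details; the real values would come from configuration.
	client := influxdb2.NewClient("http://localhost:8086", "my-token")
	defer client.Close()
	writeAPI := client.WriteAPIBlocking("my-org", "my-bucket")

	// Report the duration of one part of a score run as a single data point.
	duration := 230 * time.Millisecond // e.g., measured around "request a runner"
	point := influxdb2.NewPoint(
		"score_run_phase",
		map[string]string{"phase": "request_runner"},
		map[string]interface{}{"duration_ms": duration.Milliseconds()},
		time.Now(),
	)
	_ = writeAPI.WritePoint(context.Background(), point)
}
```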

@mpass99
Collaborator

mpass99 commented Jan 30, 2023

* in CodeOcean
  
  * to request a runner
  * to send files to Poseidon

* in Poseidon
  
  * to copy files to an allocation in Nomad
  * to prepare an execution

We already collect this data with our Poseidon InfluxDB middleware. Do you want to collect additional data, such as the network delay or the Poseidon overhead of the Nomad Copy Files process?
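For context, the general shape of such a duration-recording middleware could look roughly like this (a sketch only, not the actual Poseidon implementation; the handler and reporting function are made up, and the report callback would write an InfluxDB point as in the client example above):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// influxTimer wraps a handler, measures each request, and hands the duration
// to report, which would write an InfluxDB point in a real setup.
func influxTimer(report func(path string, d time.Duration), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		report(r.URL.Path, time.Since(start))
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// Log instead of writing to InfluxDB to keep the sketch self-contained.
	handler := influxTimer(func(path string, d time.Duration) {
		log.Printf("%s took %v", path, d)
	}, hello)
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```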

@MrSerth
Member Author

MrSerth commented Feb 1, 2023

I like the idea 👍, let's have a look at Sentry's Distributed Tracing to capture the data and proceed with our evaluation.

@MrSerth
Member Author

MrSerth commented Feb 9, 2023

The Distributed Tracing on Sentry works so far, and traces are associated across CodeOcean and Poseidon. However, we don't have instrumentation for the WebSocket part in CodeOcean yet and are therefore missing the corresponding span in Poseidon. Hence:

  • Add a custom span around the WebSocket connection in CodeOcean
  • Ensure the TraceID is sent along when opening the WebSocket from CodeOcean to Poseidon (to associate the corresponding request; see the sketch after this list)
  • Add support for JavaScript Tracing (and keep trace ID between "create submission" and "code execution")
  • Add and configure Sentry Relay
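A rough sketch (in Go, using the sentry-go SDK; the route, transaction name, and span operation are made up for illustration) of continuing a caller's trace from the incoming sentry-trace header and adding a custom child span around the WebSocket-backed execution:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	if err := sentry.Init(sentry.ClientOptions{
		Dsn:              "", // placeholder; an empty DSN disables sending
		EnableTracing:    true,
		TracesSampleRate: 1.0,
	}); err != nil {
		log.Fatal(err)
	}
	defer sentry.Flush(2 * time.Second)

	http.HandleFunc("/execute", func(w http.ResponseWriter, r *http.Request) {
		// Continue the trace started by the caller (e.g., CodeOcean) using the
		// incoming sentry-trace header, so both sides show up in one trace.
		transaction := sentry.StartTransaction(r.Context(), "websocket.execute",
			sentry.ContinueFromRequest(r))
		defer transaction.Finish()

		// A custom child span around the actual execution.
		span := transaction.StartChild("websocket.execute.run")
		time.Sleep(10 * time.Millisecond) // placeholder for the real work
		span.Finish()
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```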

@MrSerth
Member Author

MrSerth commented Feb 10, 2023

I created PR openHPI/codeocean#1536 tackling the first two aspects of my list above.

@mpass99
Collaborator

mpass99 commented Feb 15, 2023

Additionally, we want to gain insights into the span nomad.execute.exec, which describes the Nomad execute request. To do so, we produce a special marker string in the bash command, scan the output for this string, and convert it into a Sentry span.
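A rough illustration of that idea (the marker name, command wrapping, and parsing below are invented for this sketch and are not Poseidon's actual format):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// The wrapped command echoes a marker with a nanosecond timestamp before the
// actual test command, so the output tells us when execution really started.
const marker = "EXECUTION_START"

func wrapCommand(userCmd string) string {
	return fmt.Sprintf(`echo "%s $(date +%%s%%N)"; %s`, marker, userCmd)
}

// extractStartTime scans the command output for the marker line and parses
// the timestamp back into a time.Time; this would become the start time of a
// Sentry span describing the execution inside the allocation.
func extractStartTime(output string) (time.Time, bool) {
	for _, line := range strings.Split(output, "\n") {
		if strings.HasPrefix(line, marker+" ") {
			rest := strings.TrimSpace(strings.TrimPrefix(line, marker+" "))
			var ns int64
			if _, err := fmt.Sscanf(rest, "%d", &ns); err == nil {
				return time.Unix(0, ns), true
			}
		}
	}
	return time.Time{}, false
}

func main() {
	fmt.Println(wrapCommand("make test"))
	output := fmt.Sprintf("%s %d\nok\n", marker, time.Now().UnixNano())
	if start, ok := extractStartTime(output); ok {
		fmt.Println("execution started at", start)
	}
}
```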

@MrSerth
Member Author

MrSerth commented May 12, 2023

I've added support for the JavaScript-based Sentry library for CodeOcean; those changes have been integrated. The final step is to finish the Sentry Relay, as tracked by #370.

@mpass99
Collaborator

mpass99 commented Sep 11, 2023

We have performed our first analysis, which leads to the next steps below.

[Image: Example Performance Breakdown]

[Image: Sentry Performance Measurement, from the 24th to the 28th of August]

Next steps

Until the next analysis, we want to complete the following issues:

Additionally, we want to complete these issues:
