Add multiprocessing / Split delivery code #46

jhunkeler · 2024-09-20T12:46:24Z

This PR

Implements parallel execution of INI [test:*] blocks.
Breaks up delivery.c into smaller more manageable files.

Changes

Running pip install in parallel will, more often than not, break the site-packages directory. To solve this I created a new test block key: setup_script. All of these setup scripts are executed in series before any test scripts run.
If a test block's disable key is true (default: false), the script will not be executed.
If a test block's parallel key is false (default: true) the script will be added to the serial task pool. This is useful when you have a huge test suite and want to use pytest-xdist without oversubscribing the system.
A test block's setup_script is always executed. This ensures all test blocks are using package versions defined in the stasis config.
A formatted table is printed after a pool has exhausted all tasks, or a fatal error occurs.
- STATUS: DONE, FAIL, TERM (TERM is used by --parallel-fail-fast to indicate processes have been kill()'d on tear down)
- PID: The task process ID
- DURATION: Seconds since the child task fork()
- IDENT: The task identifier string
_GNU_SOURCE is now defined globally at compile-time instead of within the source code.
Addresses many of the longstanding compiler warnings thrown by gcc and clang
Adds a cmake option to define _FORTIFY_SOURCE=1 (use cmake -DFORTIFY_SOURCE=ON [..] to enable).
- Unsurprisingly _FORTIFY_SOURCE=2 breaks the code. Variables of type const char * are optimized out all over the place for reasons unknown.
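
The routing rules for the disable and parallel keys described above can be sketched roughly as follows. This is only an illustration: `struct test_block`, `enum pool_choice`, and `choose_pool()` are hypothetical names, not the actual STASIS data structures, and setup_script handling is not modeled.

```c
#include <stdbool.h>

/* Hypothetical view of the keys read from a [test:*] block. */
struct test_block {
    bool disable;   /* default: false */
    bool parallel;  /* default: true  */
};

/* Hypothetical result: where the block's script ends up. */
enum pool_choice { POOL_NONE, POOL_SERIAL, POOL_PARALLEL };

/* Disabled blocks are skipped; otherwise the parallel key picks the pool. */
static enum pool_choice choose_pool(const struct test_block *t) {
    if (t->disable) {
        return POOL_NONE;
    }
    return t->parallel ? POOL_PARALLEL : POOL_SERIAL;
}
```

With the defaults (disable false, parallel true) a block lands in the parallel pool; setting parallel to false moves it to the serial pool.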

New CLI arguments:

--cpu-limit defines the number of tasks that will run concurrently. If the input value is <1 it is reset to 1. The default is CPU_COUNT - 1.
--parallel-fail-fast terminates all processes in a task pool when an error occurs. The behavior of --continue-on-error has not changed. If both "fail fast" and "continue on error" are enabled you may end up with a partially tested environment.
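
A minimal sketch of the --cpu-limit rule, assuming the CPU count comes from `sysconf(_SC_NPROCESSORS_ONLN)`. The helper names `clamp_cpu_limit()` and `default_cpu_limit()` are hypothetical, and clamping the default to 1 on a single-CPU machine is an assumption of this sketch rather than something the description states.

```c
#include <unistd.h>

/* Values below 1 are reset to 1, as described for --cpu-limit. */
static long clamp_cpu_limit(long requested) {
    return requested < 1 ? 1 : requested;
}

/* Default: CPU_COUNT - 1. Clamping the default is an assumption here;
 * it also covers sysconf() returning -1 on error. */
static long default_cpu_limit(void) {
    long cpu_count = sysconf(_SC_NPROCESSORS_ONLN);
    return clamp_cpu_limit(cpu_count - 1);
}
```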

Notes

You can probably see that workaround.tox_posargs has been replaced by a template function tox_run... However, I suggest avoiding tox altogether. Tox generates its own virtual environments that share nothing in common with the STASIS test environment, and because dependencies are managed by tox.ini directly (often wide open with no constraints) it's not even testing anything relevant to your delivery.

In the near future I'm going to rip out tox-related code. Use pytest, or whichever test runner is appropriate for the package you're testing.

pllim · 2024-09-27T16:54:46Z

+3,066 −2,040 👏

kmacdonald-stsci

Overall it looks fine. I have some questions, but no blockers. Also, I noted some areas where I think there could be memory leaks and a possible race condition.

kmacdonald-stsci · 2024-10-02T12:39:46Z

src/delivery_artifactory.c

+    union INIVal val;
+
+    memset(&data, 0, sizeof(data));
+    data.src = calloc(PATH_MAX, sizeof(*data.src));


Since this is a local buffer that is always allocated with the same size, maybe define it as a stack variable instead of allocating it on the heap. That would avoid potential memory leaks.
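
A rough sketch of the stack-variable suggestion. `struct upload_data` and `describe_source()` are hypothetical stand-ins, not the types used in delivery_artifactory.c; the point is only that a fixed-size automatic buffer is released on every return path without a matching free().

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/* Hypothetical stand-in for the structure that holds the src pointer. */
struct upload_data {
    const char *src;
};

/* Builds "dir/name" in a stack buffer and returns its length, or 0 if the
 * path would not fit. */
static size_t describe_source(const char *dir, const char *name) {
    char src[PATH_MAX] = {0};       /* automatic storage, nothing to free */
    struct upload_data data = {0};

    int n = snprintf(src, sizeof(src), "%s/%s", dir, name);
    if (n < 0 || (size_t) n >= sizeof(src)) {
        return 0;                   /* early return cannot leak src */
    }
    data.src = src;                 /* valid only while this function runs */
    return strlen(data.src);
}
```

The trade-off is lifetime: the pointer must not outlive the function, so this only fits when the buffer is not handed to code that keeps it.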

kmacdonald-stsci · 2024-10-02T13:04:26Z

src/utils.c

The functions pushd and popd don't appear to be safe for shared memory. It's possible for one or both of these functions to be called at the same time in two different processes, with undefined results.

Correct. On the bright side these are not used by the child process(es). I think the point of confusion (or at least the point where it looks like it would cause problems) stems from using pushd/popd to enter the package's source directory. At that point the directory is recorded in shared memory, and popd is called. This takes place before mp_task_fork() is called so it should be safe as-is.

When the fork() occurs later on the child runs chdir(dir_path_waiting_in_shared_memory);
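
The sequence described in this reply (record the directory first, then let the child change into it after fork()) might look roughly like this. `child_enters_dir()` is a hypothetical helper, not mp_task_fork(); it only shows that the child's chdir() affects the child alone.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that changes into `dir`. Returns the child's exit code
 * (0 if chdir succeeded, 1 if it failed), or -1 on fork/wait errors. */
static int child_enters_dir(const char *dir) {
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: only this process's working directory changes. */
        _exit(chdir(dir) == 0 ? 0 : 1);
    }
    /* Parent: its working directory is untouched by the child's chdir(). */
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```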

kmacdonald-stsci · 2024-10-02T13:06:01Z

src/delivery_build.c

+            }
+            recipe_type = recipe_get_type(recipe_dir);
+            pushd(recipe_dir);
+            {


Why is this segment of code within a set of curly brackets?

🤯
I think the bare pushd above the curly brackets was supposed to be if (!pushd(recipe_dir)).
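
A minimal sketch of the guarded form, where the braces belong to an if statement. `sketch_pushd()` and `sketch_popd()` are hypothetical one-level stand-ins built on getcwd() and chdir(), not the STASIS implementations; the 0-on-success convention is inferred from the `if (!pushd(...))` form.

```c
#include <limits.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

static char saved_dir[PATH_MAX];    /* one-level "stack" for the sketch */

/* Hypothetical stand-in: remember the current directory, then enter dir. */
static int sketch_pushd(const char *dir) {
    if (!getcwd(saved_dir, sizeof(saved_dir))) {
        return -1;
    }
    return chdir(dir);
}

/* Hypothetical stand-in: return to the remembered directory. */
static int sketch_popd(void) {
    return chdir(saved_dir);
}

/* Returns 0 if the guarded block ran, -1 if the directory change failed. */
static int enter_and_leave(const char *dir) {
    int result = -1;
    if (!sketch_pushd(dir)) {
        /* Work that depends on being inside dir goes here. */
        result = 0;
        sketch_popd();
    }
    return result;
}
```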

kmacdonald-stsci · 2024-10-02T14:51:41Z

src/delivery_build.c

+            }
+
+            pushd(srcdir);
+            {


Why this block inside curly brackets?

Same as the other one. I must have wrapped the code in brackets and then forgotten to write the if statement.

kmacdonald-stsci · 2024-10-02T15:01:03Z

src/delivery_init.c

+        if (globals.jfrog.url) {
+            guard_free(globals.jfrog.url);
+        }
+        globals.jfrog.url = strdup(jfurl);


Should this check for NULL returns?

kmacdonald-stsci · 2024-10-02T16:00:25Z

src/delivery_postprocess.c

+        sprintf(bottom_index, "%s/%s/index.html", ctx->storage.wheel_artifact_dir, rec->d_name);
+        bottom_fp = fopen(bottom_index, "w+");
+        if (!bottom_fp) {
+            return -3;


The dp and top_fp are still open. For a function that can return in several places, but still needs to clean up before returning, I suggest setting a return value, then jumping to a CLEANUP label at the end of the function, to ensure all the cleanup that needs to happen will happen before returning.
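
The suggested pattern could look something like the sketch below. `index_directory()` is a hypothetical, much simplified function, not delivery_index_wheel_artifacts(); it only demonstrates a single exit path that releases whatever was opened.

```c
#include <dirent.h>
#include <stdio.h>

/* Writes the entry names of dir_path into index_path.
 * Returns 0 on success or a negative code identifying the failed step. */
static int index_directory(const char *dir_path, const char *index_path) {
    int status = 0;
    DIR *dp = NULL;
    FILE *top_fp = NULL;
    struct dirent *rec = NULL;

    dp = opendir(dir_path);
    if (!dp) {
        status = -1;
        goto cleanup;
    }

    top_fp = fopen(index_path, "w+");
    if (!top_fp) {
        status = -2;
        goto cleanup;
    }

    while ((rec = readdir(dp)) != NULL) {
        if (fprintf(top_fp, "%s\n", rec->d_name) < 0) {
            status = -3;
            goto cleanup;
        }
    }

cleanup:
    /* Every return path passes through here, so nothing stays open. */
    if (top_fp) {
        fclose(top_fp);
    }
    if (dp) {
        closedir(dp);
    }
    return status;
}
```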

kmacdonald-stsci · 2024-10-02T16:00:52Z

src/delivery_postprocess.c

+        sprintf(dpath, "%s/%s", ctx->storage.wheel_artifact_dir, rec->d_name);
+        struct StrList *packages = listdir(dpath);
+        if (!packages) {
+            fclose(top_fp);


dp is still open.

kmacdonald-stsci · 2024-10-02T16:06:37Z

src/stasis_main.c

The main function is 500 lines long. I suggest refactoring that into smaller, easier to read and follow functions.

Agreed. I'll refactor it in a separate PR

kmacdonald-stsci · 2024-10-02T16:08:42Z

src/template_func_proto.c

+            }
+            char *basetemp_path = NULL;
+            if (get_basetemp_dir_entrypoint(f, &basetemp_path)) {
+                return -2;


Could this be a memory leak for output or is data_out handled by the calling function?

data_out is allocated at the start of get_basetemp_dir_entrypoint so this might be a leak. I'll have to run it through the debugger to make sure.

kmacdonald-stsci · 2024-10-02T16:36:50Z

src/multiprocessing.c

+    task->pid = pid;
+    task->parent_pid = pid;
+
+    mp_global_task_count++;


Will this increment properly without locking first? Isn't there a possibility for a race condition here?

Good question. I haven't observed any log clobbering at all, but it's possible that you're right and a lock needs to exist... even for a nanosecond.

jhunkeler added 30 commits September 18, 2024 23:06

Move guard_ macros to core_mem.h

b7251ce

* Move core_mem.h below config.h

Implement multiprocessing pool(s)

8f17199

* Adds --cpu-limit and --parallel-fail-fast arguments * Adds disable, parallel, and setup_script keys to [test] blocks

Update integration test to utilize the multiprocessing pool

261c91d

Fixing headers

8797163

Correct package name

4d68bd4

Darwin portability: Use sem_open and sem_close instead of sem_init an…

1fe385d

…d sem_destroy

Darwin: Remove mmap MAP_POPULATE flag

4e0e40b

Fix sem_open initial state

17d3d05

* Move slot->gate assignment to mp_pool_task() * Remove mmap() to slot->gate. * Change type of ident and log_root variables for the sake of easy (fewer maps)

Workaround for a bug in firewatch

da8196e

Fix mp_pool_join example

8573ad7

Fix opt_flags assignment.

f2a5bc6

Add multiprocessing.h to core.h

63cf314

* Remove multiprocessing.h from other files

Fix doxygen comments

1e320e2

Update example config

fd3b4bd

Guard against overrun

7c0b2a9

Split mp_task into to functions

ead0756

Add test_multiprocessing.c

754bb98

Break parent/child calls into static functions

6921cb4

Add comments, remove dead code

daf8c81

Set task status to -1 by default

3c3468d

Wait for signaled processes to hang up

0f95b43

* Only initiate a kill if we have more than one process. The current process is already failed out, no need to terminate it again.

Add pool summary and elapsed time output

db1a305

* Add get_task_duration() * Add get_pool_show_summary() * Add signaled_by member to MultiProcessingTask * Add time_data member to MultiProcessingTask for duration tracking

Implement mp_pool_show_summary

16597bc

Fix format spacing

60c0a3c

Fix test status expectation

3ef3a27

* Fix child not returning result of execvp(). task->status is for program status, not fork() status.

Remove short circuit test code

6f7cf6e

* Remove exmain() and dead comments from main()

Remove workaround.tox_posargs

76e8bac

Bugfix: log_show_contents() did not close FILE pointer

4efce32

Rename mp_task to mp_pool_task

8b47235

mp_pool_kill marks PIDs as unused

528b3b2

jhunkeler added the enhancement New feature or request label Sep 27, 2024

jhunkeler mentioned this pull request Sep 27, 2024

Add multiprocessing #43

Closed

jhunkeler changed the title ~~Split delivery code~~ Add multiprocessing / Split delivery code Sep 27, 2024

jhunkeler added 3 commits September 27, 2024 16:02

mp_pool_init(): return NULL when ident argument is NULL

108242c

* reported by @kmacdonald-stsci

Fix missing COMMAND string in the log header

9f535ba

Fix leak

a84b874

* When strdup fails and the temporary file handle is open, close the handle and die. * reported by @kmacdonald-stsci

jhunkeler force-pushed the split-delivery-code branch from 203708e to a84b874 Compare September 27, 2024 20:36

jhunkeler added 8 commits September 30, 2024 11:44

Fix typo

31db9bb

* pararm -> param

Fix leaking of basetemp_path and jxml_path on error

d1642b3

* Reported by @kmacdonald-stsci

Allocate runner_cmd using asprintf

73219c6

Shorten comment

4e45792

Replace sprintf with snprintf

3881ec0

Fix leaks in tests

239926d

Replace strcpy with strlcpy

8a0b0d1

Replace strlcpy with strncpy (maybe later)

d5fa0c2

jhunkeler requested a review from kmacdonald-stsci October 1, 2024 12:36

jhunkeler added 4 commits October 1, 2024 15:40

shell: exit program when stream redirection fails

9366c56

Add comment about use of xtrace

1ba1fcc

Add missing space to destdir to ensure its separate from the srcdir s…

e462c06

…tring * Fix leaks caused by css_filename path and the dirs array

Use watcher_diff to see how many seconds have elapsed.

23253d9

kmacdonald-stsci approved these changes Oct 2, 2024

jhunkeler added 7 commits October 2, 2024 14:54

Free resources on error in delivery_index_wheel_artifacts

4b948ef

Allow user to define the time interval for "task is running" message

f6c5046

Allow user to define the time interval for "task is running" message

04cf9ee

Allow user to define the time interval for "task is running" message

9028e5e

Allow user to disable parallel mode (shortcut for --cpu-limit=1)

6fe8c25

"Task started" is more accurate than "queued" when this is printed

b041e82

Rename argument --parallel-fail-fast to --fail-fast

f0ba8cd

* All tasks are executed by the same machinery under the hood. So have them all react the same way.

jhunkeler merged commit d7e3deb into spacetelescope:master Oct 4, 2024
3 checks passed
