Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiprocessing / Split delivery code #46

Merged
merged 84 commits into from
Oct 4, 2024

Conversation

jhunkeler
Copy link
Collaborator

@jhunkeler jhunkeler commented Sep 20, 2024

This PR

  • Implements parallel execution of INI [test:*] blocks.
  • Breaks up delivery.c into smaller more manageable files.

Changes

  • Running pip install in parallel will, more often than not, break the site-packages directory. In order to solve this I created a new test block key: script_setup. All of these setup scripts are executed in series prior to running any scripts.
  • If a test block's disable key is true (default: false), the script will not be executed.
  • If a test block's parallel key is false (default: true) the script will be added to the serial task pool. This is useful when you have a huge test suite and want to use pytest-xdist without oversubscribing the system.
  • A test block's setup_script is always executed. This ensures all test blocks are using package versions defined in the stasis config.
  • A formatted table is printed after a pool has exhausted all tasks, or a fatal error occurs.
    • STATUS: DONE, FAIL, TERM (TERM is used by --parallel-fail-fast to indicate processes have been kill()'d on tear down)
    • PID: The task process ID
    • DURATION: Seconds since the child task fork()
    • IDENT: The task identifier string
  • _GNU_SOURCE is now defined globally at compile-time instead of within the source code.
  • Addresses many of the longstanding compiler warnings thrown by gcc and clang
  • Adds a cmake option to define _FORTIFY_SOURCE=1 (use cmake -DFORTIFY_SOURCE=ON [..] to enable).
    • Unsurprisingly _FORTIFY_SOURCE=2 breaks the code. Variables of type const char * are optimized out all over the place for reasons unknown.

New CLI arguments:

  • --cpu-limit defines the number of tasks that will run concurrently. If the input value is <1 it is reset to 1. The default is CPU_COUNT - 1
  • --parallel-fail-fast terminates all processes in a task pool when an error occurs. The behavior of --continue-on-error has not changed. If both "fail fast" and "continue on error" are enabled you may end up with a partially tested environment.

Notes

You can probably see that workaround.tox_posargs has been replaced by a template function tox_run... However, I suggest avoiding tox altogether. Tox generates its own virtual environments that share nothing in common with the STASIS test environment, and because dependencies are managed by tox.ini directly (often wide open with no constraints) it's not even testing anything relevant to your delivery.

In the near future I'm going to rip out tox-related code. Use pytest, or whichever test runner is appropriate for the package you're testing.

* Move core_mem.h below config.h
* Adds --cpu-limit and --parallel-fail-fast arguments
* Adds disable, parallel, and setup_script keys to [test] blocks
* Move slot->gate assignment to mp_pool_task()
* Remove mmap() to slot->gate.
* Change type of ident and log_root variables for the sake of easy (fewer maps)
* Remove multiprocessing.h from other files
* Only initiate a kill if we have more than one process. The current process is already failed out, no need to terminate it again.
* Add get_task_duration()
* Add get_pool_show_summary()
* Add signaled_by member to MultiProcessingTask
* Add time_data member to MultiProcessingTask for duration tracking
* Fix child not returning result of execvp(). task->status is for program status, not fork() status.
* Remove exmain() and dead comments from main()
@jhunkeler jhunkeler added the enhancement New feature or request label Sep 27, 2024
@jhunkeler jhunkeler mentioned this pull request Sep 27, 2024
@jhunkeler jhunkeler changed the title Split delivery code Add multiprocessing / Split delivery code Sep 27, 2024
@pllim
Copy link
Contributor

pllim commented Sep 27, 2024

+3,066 −2,040 👏

* When strdup fails and the temporary file handle is open, close the handle and die.
* reported by @kmacdonald-stsci
Copy link
Collaborator

@kmacdonald-stsci kmacdonald-stsci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks fine. I have some questions, but no blockers. Also, I noted some areas where I think there could be memory leaks and a possible race condition.

union INIVal val;

memset(&data, 0, sizeof(data));
data.src = calloc(PATH_MAX, sizeof(*data.src));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a local array that is always allocated to the same size, maybe just define this as a stack variable, instead of allocating it on the heap to avoid potential memory leaks.

src/utils.c Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions pushd and popd don't appear to be safe for shared memory. It's possible for one or both of these functions to be called at the same time in two different processes, with undefined results.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. On the bright side these are not used by the child process(es). I think the point of confusion (or at least the point where it looks like it would cause problems) stems from using pushd/popd to enter the package's source directory. At that point the directory is recorded in shared shared memory, and popdd. This takes place before mp_task_fork() is called so it should be safe as-is.

When the fork() occurs later on the child runs chdir(dir_path_waiting_in_shared_memory);

}
recipe_type = recipe_get_type(recipe_dir);
pushd(recipe_dir);
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this segment of code within a set of curly brackets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯
I think the bare pushd above the curly brackets was supposed to be if (!pushd(receipe_dir)).

}

pushd(srcdir);
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this block inside curly brackets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other one. I must have encased the code in brackets before (forgetting to) write an if statement

if (globals.jfrog.url) {
guard_free(globals.jfrog.url);
}
globals.jfrog.url = strdup(jfurl);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check for NULL returns?

sprintf(bottom_index, "%s/%s/index.html", ctx->storage.wheel_artifact_dir, rec->d_name);
bottom_fp = fopen(bottom_index, "w+");
if (!bottom_fp) {
return -3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dp and top_fp are still open. For a function that can return in several places, but still needs to do clean up before returning, I suggest setting a return value, the jumping to a CLEANUP label at the end of the function, to ensure all clean up that needs to happen will happen before returning.

sprintf(dpath, "%s/%s", ctx->storage.wheel_artifact_dir, rec->d_name);
struct StrList *packages = listdir(dpath);
if (!packages) {
fclose(top_fp);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dp is sill open.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main function is 500 lines long. I suggest refactoring that into smaller, easier to read and follow functions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'll refactor it in a separate PR

}
char *basetemp_path = NULL;
if (get_basetemp_dir_entrypoint(f, &basetemp_path)) {
return -2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a memory leak for output or is data_out handled by the calling function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data_out is allocated at the start of get_basetemp_dir_entrypoint so this might be a leak. I'll have to run it through the debugger to make sure.

task->pid = pid;
task->parent_pid = pid;

mp_global_task_count++;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this increment properly without locking first? Isn't there a possibility for a race condition here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I haven't observed any log clobbering at all, but it's possible that you're right and a lock needs to exist... even for a nanosecond.

@jhunkeler jhunkeler merged commit d7e3deb into spacetelescope:master Oct 4, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants