
Implement ctl::shared_ptr #1229

Closed
mrdomino wants to merge 18 commits into master from ctl-shared

Conversation

@mrdomino (Collaborator) commented on Jun 18, 2024

No description provided.

@mrdomino mrdomino changed the title from "Ctl shared" to "Implement ctl::shared_ptr" on Jun 18, 2024
@github-actions github-actions bot added the libc label Jun 18, 2024
@mrdomino mrdomino mentioned this pull request Jun 18, 2024
This code does not segfault and it compiles cleanly, and it took a while
to get here.
__::shared_pointer::make takes ownership of p the same way as shared_ptr
does: either the allocation succeeds and the returned control block owns
it, or the allocation throws and the unique_ptr frees it.

Note that this construction is safe since C++17, in which the evaluation
of constructor arguments is sequenced after a new-expression allocation.
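
A minimal sketch of that hand-off, using illustrative names (ControlBlock, make_control_block) rather than the PR's actual __::shared_pointer::make:

#include <memory>

// Hypothetical control block standing in for the PR's __::shared_pointer;
// the real class also carries the reference counts, omitted here.
template<typename T>
struct ControlBlock
{
    explicit ControlBlock(T* p) : ptr(p) {}
    T* ptr;
};

template<typename T>
ControlBlock<T>*
make_control_block(T* p)
{
    std::unique_ptr<T> q(p);
    // Since C++17, the allocation performed by this new-expression is
    // sequenced before the evaluation of q.release(): if operator new
    // throws, q still owns p and its destructor frees it; if allocation
    // succeeds, release() hands ownership to the new control block.
    return new ControlBlock<T>(q.release());
}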
ctl is overloaded...
}

// TODO(mrdomino): exercise more of API
// TODO(mrdomino): threading stress-test
jart (Owner) commented:
You'll find numerous threading torture tests throughout the codebase, since it's sadly the best thing we have without TSAN.

incref(_Atomic(size_t)* r)
{
    size_t r2 = atomic_fetch_add_explicit(r, 1, memory_order_relaxed);
    if (r2 > ((size_t)-1) >> 1)
jart (Owner) commented:
Do we realistically expect 2**63 references to happen? Probably not worth the bloat. This function is basically just an XADD instruction. I'd say put it in the header. The atomics header in cosmo libc is pretty lightweight. It's a natural dependency of this class.

mrdomino (Collaborator, Author) replied:
It's a check against memory corruption mainly. Could see the argument for it not being worth it.

decref(_Atomic(size_t)* r)
{
    if (!atomic_fetch_sub_explicit(r, 1, memory_order_release)) {
        atomic_thread_fence(memory_order_acquire);
jart (Owner) commented:
I'm not sure how I feel about atomic_thread_fence(). I just tried using a similar trick to this for my syncthreads() function on an M2 processor and it made things slower. For example:

void
syncthreads(void)
{
    static atomic_uint count;
    static atomic_uint phaser;
    int phase = atomic_load_explicit(&phaser, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) + 1 == nth) {
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&phaser, phase + 1, memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&phaser, memory_order_acquire) == phase)
            backoff = delay(backoff);
    }
}

Became:

void
syncthreads(void)
{
    static atomic_uint count;
    static atomic_uint phaser;
    int phase = atomic_load_explicit(&phaser, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_release) + 1 == nth) {
        atomic_thread_fence(memory_order_acquire);
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&phaser, phase + 1, memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&phaser, memory_order_acquire) == phase)
            backoff = delay(backoff);
    }
}

One nasty thing about atomic_thread_fence() is it isn't supported by TSAN at all. So it's harder to prove code is correct.

How certain are you that this decref() implementation is optimal? Could you whip up a torture test + benchmark that demonstrates its merit in comparison to alternatives? I would have assumed that doing an acquire load beforehand to check for zero would be faster, since it'd let you avoid the costly XADD instruction in many cases.
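
For reference, a sketch of that alternative, assuming the count is zero-based (0 means exactly one reference, which matches the decref() quoted above but is a reading of the code, not something the PR states), written against the GCC __atomic builtins:

#include <cstddef>

// Returns true when the caller held the last reference and should destroy
// the managed object. A plain acquire load detects sole ownership first,
// skipping the locked XADD when no other reference is outstanding.
static bool
decref_last(std::size_t* r)
{
    if (!__atomic_load_n(r, __ATOMIC_ACQUIRE))
        return true;  // sole owner: nobody else can touch this count
    if (!__atomic_fetch_sub(r, 1, __ATOMIC_RELEASE)) {
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        return true;  // we observed the count reach its final value
    }
    return false;
}

Whether this actually wins is exactly what the requested benchmark would show; the extra load adds a branch and another cache access on the contended path.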

mrdomino (Collaborator, Author) replied:

Not certain at all; it would be interesting to see it benchmarked at some point. I just cribbed off of Rust, and Rust is pretty good on these fronts but not perfect.

If acq_rel is optimal then that's a point in libc++'s favor...

jart (Owner) commented:

Ignore my earlier comment, I've discovered how to correctly use atomic thread fence. Consider the following barrier, where nth is the number of threads and ith is the thread id:

void
syncthreads(int ith)
{
    static atomic_uint count;
    static atomic_uint phaser;
    int phase = atomic_load_explicit(&phaser, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) + 1 == nth) {
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&phaser, phase + 1, memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&phaser, memory_order_acquire) == phase)
            backoff = delay(backoff);
    }
}

Spinning on separate cache lines and using thread fence brings its latency down from 7.3 us to 6.2 us on my M2:

struct ggml_phaser
{
    alignas(64) atomic_uint i;
};

void
syncthreads(int ith)
{
    static atomic_uint count;
    static struct ggml_phaser ph[nth];
    int phase = atomic_load_explicit(&ph[ith].i, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) + 1 == nth) {
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        for (int i = 0; i < nth; ++i)
            atomic_store_explicit(&ph[i].i, phase + 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&ph[ith].i, memory_order_relaxed) == phase)
            backoff = delay(backoff);
        atomic_thread_fence(memory_order_acquire);
    }
}

@jart jart force-pushed the master branch 2 times, most recently from faa6285 to 2c4b887, on July 25, 2024 08:23
- Adds is_void_v.

- Uses enable_if to disable operator* and operator[] for void types (a sketch of the pattern follows these notes).

- Uses the new school C++-compatible atomic hack.

- There is no implicit conversion from raw pointers to unique_ptr.

Test compiles and passes now.
Just write out the constructor definitions as given on cppreference.com.

Turn the private constructor from a raw rc into a static private method instead.

Various other minor fixes.
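
The enable_if pattern referenced above, sketched with std:: traits instead of ctl's own (the PR adds ctl::is_void_v; the actual ctl declarations differ in detail):

#include <type_traits>

// Illustrative smart-pointer shell, not the PR's ctl::unique_ptr.
template<typename T>
struct pointer_shell
{
    T* p = nullptr;

    // operator* only participates in overload resolution when T is not
    // void, so pointer_shell<void> still compiles as long as nobody
    // dereferences it.
    template<typename U = T,
             typename = std::enable_if_t<!std::is_void_v<U>>>
    U& operator*() const
    {
        return *p;
    }
};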

// TODO(mrdomino): move
#ifndef __cplusplus
#define _CTL_ATOMIC(x) _Atomic(x)
jart (Owner) commented:
So _Atomic is really a C11 thing; the C++ compiler doesn't support it. It's only able to be supported in C++23 because, I think, it uses a macro to turn it into std::atomic<size_t>, which naturally requires schlepping in a million libcxx headers. Should we just use the GCC compiler builtins in this header to make things simpler? Also, why check __cplusplus when it's a C++-only header?

mrdomino (Collaborator, Author) replied:
Basically I just cribbed this stanza from the 5 other places where that pattern is used (rg '_Atomic\(x\)'). Indeed, there's no reason to check __cplusplus here.

The STL atomic thing is indeed a nightmare. Maybe we'll wind up writing our own stdatomic.h, with blackjack, etc.

Meanwhile, in the new PR I just use the GCC builtins as you suggest.
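
For illustration, refcount helpers written against the GCC builtins might look like the following sketch (not the follow-up PR's actual code):

#include <cstddef>

// Relaxed ordering suffices for incref: taking a new reference only
// requires that the caller already hold an existing one.
static inline std::size_t
incref(std::size_t* r)
{
    return __atomic_fetch_add(r, 1, __ATOMIC_RELAXED);
}

// Release ordering on the decrement; the caller pairs it with an acquire
// fence before destroying the object, as in the decref() discussed above.
static inline std::size_t
decref(std::size_t* r)
{
    return __atomic_fetch_sub(r, 1, __ATOMIC_RELEASE);
}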

// TODO(mrdomino): allocators

template<typename U>
requires is_convertible_v<U, T>
jart (Owner) commented:
This new requires syntax looks cool.
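
For readers new to it, a requires-clause constrains the template directly, with no enable_if plumbing; a small self-contained example (not the PR's declaration):

#include <type_traits>
#include <utility>

template<typename T>
struct box
{
    T value;

    // The requires-clause removes this constructor from overload
    // resolution unless U is convertible to T.
    template<typename U>
        requires std::is_convertible_v<U, T>
    explicit box(U&& u) : value(std::forward<U>(u))
    {
    }
};

For example, box<long> b(42) compiles, while box<long> s("hi") is rejected at overload resolution.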

@mrdomino (Collaborator, Author) commented:
Closing in favor of #1267.

@mrdomino mrdomino closed this Aug 28, 2024
@mrdomino mrdomino deleted the ctl-shared branch September 2, 2024 04:34