
Implement ctl::shared_ptr #1229

Closed
mrdomino wants to merge 18 commits into master from ctl-shared

Conversation

@mrdomino (Collaborator) commented on Jun 18, 2024

No description provided.

@mrdomino mrdomino changed the title from "Ctl shared" to "Implement ctl::shared_ptr" on Jun 18, 2024
@github-actions github-actions bot added the libc label Jun 18, 2024
@mrdomino mrdomino mentioned this pull request Jun 18, 2024
This code does not segfault and it compiles cleanly, and it took a while
to get here.
__::shared_pointer::make takes ownership of p the same way as shared_ptr
does: either the allocation succeeds and the returned control block owns
it, or the allocation throws and the unique_ptr frees it.

Note that this construction is safe since C++17, in which the evaluation
of constructor arguments is sequenced after a new-expression allocation.
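
A minimal sketch of that hand-off, using illustrative names (ControlBlock, make_control_block) rather than the PR's actual __::shared_pointer::make:

#include <memory>

// Hypothetical control block standing in for the PR's __::shared_pointer;
// the real class also carries the reference counts, omitted here.
template<typename T>
struct ControlBlock
{
    explicit ControlBlock(T* p) : ptr(p) {}
    T* ptr;
};

template<typename T>
ControlBlock<T>*
make_control_block(T* p)
{
    std::unique_ptr<T> q(p);
    // Since C++17, the allocation performed by this new-expression is
    // sequenced before the evaluation of q.release(): if operator new
    // throws, q still owns p and its destructor frees it; if allocation
    // succeeds, release() hands ownership to the new control block.
    return new ControlBlock<T>(q.release());
}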
ctl is overloaded...
}

// TODO(mrdomino): exercise more of API
// TODO(mrdomino): threading stress-test
jart (Owner) commented:
You'll find numerous threading torture tests throughout the codebase, since it's sadly the best thing we have without TSAN.

incref(_Atomic(size_t)* r)
{
    size_t r2 = atomic_fetch_add_explicit(r, 1, memory_order_relaxed);
    if (r2 > ((size_t)-1) >> 1)
jart (Owner) commented:
Do we realistically expect 2**63 references to happen? Probably not worth the bloat. This function is basically just an XADD instruction. I'd say put it in the header. The atomics header in cosmo libc is pretty lightweight. It's a natural dependency of this class.

mrdomino (Collaborator, Author) replied:
It's a check against memory corruption mainly. Could see the argument for it not being worth it.

decref(_Atomic(size_t)* r)
{
    if (!atomic_fetch_sub_explicit(r, 1, memory_order_release)) {
        atomic_thread_fence(memory_order_acquire);
jart (Owner) commented:
I'm not sure how I feel about atomic_thread_fence(). I just tried using a similar trick to this for my syncthreads() function on an M2 processor and it made things slower. For example:

void
syncthreads(void)
{
    static atomic_uint count;
    static atomic_uint phaser;
    int phase = atomic_load_explicit(&phaser, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) + 1 == nth) {
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&phaser, phase + 1, memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&phaser, memory_order_acquire) == phase)
            backoff = delay(backoff);
    }
}

Became:

void
syncthreads(void)
{
    static atomic_uint count;
    static atomic_uint phaser;
    int phase = atomic_load_explicit(&phaser, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_release) + 1 == nth) {
        atomic_thread_fence(memory_order_acquire);
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&phaser, phase + 1, memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&phaser, memory_order_acquire) == phase)
            backoff = delay(backoff);
    }
}

One nasty thing about atomic_thread_fence() is it isn't supported by TSAN at all. So it's harder to prove code is correct.

How certain are you that this decref() implementation is optimal? Could you whip up a torture test + benchmark that demonstrates its merit in comparison to alternatives? I would have assumed that doing an acquire load beforehand to check for zero would be faster, since it'd let you avoid the costly XADD instruction in many cases.
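
For reference, a sketch of that alternative, assuming the count is zero-based (0 means exactly one reference, which matches the decref() quoted above but is a reading of the code, not something the PR states), written against the GCC __atomic builtins:

#include <cstddef>

// Returns true when the caller held the last reference and should destroy
// the managed object. A plain acquire load detects sole ownership first,
// skipping the locked XADD when no other reference is outstanding.
static bool
decref_last(std::size_t* r)
{
    if (!__atomic_load_n(r, __ATOMIC_ACQUIRE))
        return true;  // sole owner: nobody else can touch this count
    if (!__atomic_fetch_sub(r, 1, __ATOMIC_RELEASE)) {
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        return true;  // we observed the count reach its final value
    }
    return false;
}

Whether this actually wins is exactly what the requested benchmark would show; the extra load adds a branch and another cache access on the contended path.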

mrdomino (Collaborator, Author) replied:

Not certain at all; it would be interesting to see it benchmarked at some point. I just cribbed off of Rust, and Rust is pretty good on these fronts but not perfect.

If acq_rel is optimal then that's a point in libc++'s favor...

jart (Owner) commented:

Ignore my earlier comment, I've discovered how to correctly use atomic thread fence. Consider the following barrier, where nth is the number of threads and ith is the thread id:

void
syncthreads(int ith)
{
    static atomic_uint count;
    static atomic_uint phaser;
    int phase = atomic_load_explicit(&phaser, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) + 1 == nth) {
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&phaser, phase + 1, memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&phaser, memory_order_acquire) == phase)
            backoff = delay(backoff);
    }
}

Spinning on separate cache lines and using thread fence brings its latency down from 7.3 us to 6.2 us on my M2:

struct ggml_phaser
{
    alignas(64) atomic_uint i;
};

void
syncthreads(int ith)
{
    static atomic_uint count;
    static struct ggml_phaser ph[nth];
    int phase = atomic_load_explicit(&ph[ith].i, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) + 1 == nth) {
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        for (int i = 0; i < nth; ++i)
            atomic_store_explicit(&ph[i].i, phase + 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
    } else {
        int backoff = 0;
        while (atomic_load_explicit(&ph[ith].i, memory_order_relaxed) == phase)
            backoff = delay(backoff);
        atomic_thread_fence(memory_order_acquire);
    }
}

@jart jart force-pushed the master branch 2 times, most recently from faa6285 to 2c4b887, on July 25, 2024 08:23
- Adds is_void_v.

- Uses enable_if to disable operator* and operator[] for void types (a sketch of the pattern follows these notes).

- Uses the new school C++-compatible atomic hack.

- There is no implicit conversion from raw pointers to unique_ptr.

Test compiles and passes now.
Just write out the constructor definitions as given on cppreference.com.

Turn the private constructor from a raw rc into a static private method instead.

Various other minor fixes.
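
The enable_if pattern referenced above, sketched with std:: traits instead of ctl's own (the PR adds ctl::is_void_v; the actual ctl declarations differ in detail):

#include <type_traits>

// Illustrative smart-pointer shell, not the PR's ctl::unique_ptr.
template<typename T>
struct pointer_shell
{
    T* p = nullptr;

    // operator* only participates in overload resolution when T is not
    // void, so pointer_shell<void> still compiles as long as nobody
    // dereferences it.
    template<typename U = T,
             typename = std::enable_if_t<!std::is_void_v<U>>>
    U& operator*() const
    {
        return *p;
    }
};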

// TODO(mrdomino): move
#ifndef __cplusplus
#define _CTL_ATOMIC(x) _Atomic(x)
jart (Owner) commented:
So _Atomic is really a C11 thing; the C++ compiler doesn't support it. It's only able to be supported in C++23 because, I think, it uses a macro to turn it into std::atomic<size_t>, which naturally requires schlepping in a million libcxx headers. Should we just use the GCC compiler builtins in this header to make things simpler? Also, why check __cplusplus when it's a C++-only header?

mrdomino (Collaborator, Author) replied:
Basically I just cribbed this stanza from the 5 other places where that pattern is used (rg '_Atomic\(x\)'). Indeed, there's no reason to check __cplusplus here.

The STL atomic thing is indeed a nightmare. Maybe we'll wind up writing our own stdatomic.h, with blackjack, etc.

Meanwhile, in the new PR I just use the GCC builtins as you suggest.
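
For illustration, refcount helpers written against the GCC builtins might look like the following sketch (not the follow-up PR's actual code):

#include <cstddef>

// Relaxed ordering suffices for incref: taking a new reference only
// requires that the caller already hold an existing one.
static inline std::size_t
incref(std::size_t* r)
{
    return __atomic_fetch_add(r, 1, __ATOMIC_RELAXED);
}

// Release ordering on the decrement; the caller pairs it with an acquire
// fence before destroying the object, as in the decref() discussed above.
static inline std::size_t
decref(std::size_t* r)
{
    return __atomic_fetch_sub(r, 1, __ATOMIC_RELEASE);
}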

// TODO(mrdomino): allocators

template<typename U>
requires is_convertible_v<U, T>
jart (Owner) commented:
This new requires syntax looks cool.
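
For readers new to it, a requires-clause constrains the template directly, with no enable_if plumbing; a small self-contained example (not the PR's declaration):

#include <type_traits>
#include <utility>

template<typename T>
struct box
{
    T value;

    // The requires-clause removes this constructor from overload
    // resolution unless U is convertible to T.
    template<typename U>
        requires std::is_convertible_v<U, T>
    explicit box(U&& u) : value(std::forward<U>(u))
    {
    }
};

For example, box<long> b(42) compiles, while box<long> s("hi") is rejected at overload resolution.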

@mrdomino (Collaborator, Author) commented:
Closing in favor of #1267.

@mrdomino mrdomino closed this Aug 28, 2024
@mrdomino mrdomino deleted the ctl-shared branch September 2, 2024 04:34