- Aim to lift common Go scalar operations to vectors
- Including those from standard libraries
- A simple API that does not expose the complexities of the underlying CPU architecture
- Should not require understanding of the underlying CPU vector instructions or be aware of vector size limits
- Sympathetic to hardware
- Support for common hardware vector lengths - currently small powers of 2
- Assume better performance from mixing nearby lanes than far-away lanes, and best performance from operating within lane
- Open to extension in future
- Adding new instructions
- Adding new vector lengths
- API should be free of side effects
- Favour returning a result over passing in a destination pointer
- Focus on vector lengths that have been implemented in hardware - currently up to 512 bits (e.g. AVX512, Fugaku supercomputer implements Arm SVE at 512 bits)
- Assume Go compiler can emulate longer vectors on vector hardware by ganging registers or looping
- Aim for straightforward integration of hardware instructions with larger/wider operations that require fallback implementations in Go
- Assume Go compiler can learn to fuse broadcast, memory and blend operations
- No need to support the full range of addressing modes, blending, broadcast and architectural registers in the API
- Utilise existing Go types
- float16, bfloat16, etc. out of scope of this API
- Utilise existing Go golang.org/x/exp/constraints where possible
- Propose adding `Number interface { constraints.Float | constraints.Integer }`
- Utilise Go arrays to represent vectors
- Allows direct access to array elements
- Avoids adding additional operations to insert/extract a scalar value from a vector
- Simplifies future vector size extensions by decoupling vector sizes from types declared in other vector packages
- Simple naming that follows Go library/intrinsics names where possible (particularly the math and bits packages)
- Except that we use opportunities that generics open up to reduce API call variants
- Where there is no Go name, adopt the name used by AVX512
- Except where there is a simpler name (e.g. `VPCONFLICTD` detects duplicates, so might be better named `IsDuplicate`)
- Names are generally singular (e.g. `IsNaN` rather than `AreNaNs`) for consistency with existing APIs
- Utilise generic types to abstract over different integer/float sizes
- i.e. favour `Add[T constraints.Number](a,b)` over `AddInt64`, `AddInt32`, `AddFloat64`, etc.
- Avoids an explosion in the number of variants of basic operations
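A minimal sketch of what such a generic fallback could look like, assuming a vec4-sized package and a `Number`-style constraint as proposed above (the names here are illustrative, not the settled API):

```go
package main

import "fmt"

// Number approximates the proposed constraints.Float | constraints.Integer union.
type Number interface {
	~int8 | ~int16 | ~int32 | ~int64 | ~int |
		~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uint |
		~float32 | ~float64
}

// Add is a scalar fallback for a vec4-style Add: one generic function
// instead of AddInt32, AddInt64, AddFloat64, and so on.
func Add[T Number](a, b [4]T) (z [4]T) {
	for i := range a {
		z[i] = a[i] + b[i]
	}
	return z
}

func main() {
	fmt.Println(Add([4]int{1, 2, 3, 4}, [4]int{4, 3, 2, 1}))
	fmt.Println(Add([4]float64{0.5, 1.5, 2.5, 3.5}, [4]float64{0.5, 0.5, 0.5, 0.5}))
}
```

The same function body serves every element type, which is exactly how generics shrink the API surface.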
- Utilise packages to represent different vector lengths
- i.e. favour `vec4.Add(a,b)` over `vec.Add4(a,b)`, `vec.Add8(a,b)`
- Avoids an explosion in the number of variants of basic operations in the documentation
- All operations on vectors of a specific length are grouped together and documented in a package
- Control flow/predication is expressed through mask/condition vectors
- Mask vectors should be opaque in order to support future vector length extension
- e.g. Arm SVE permits up to 2048 bit vector supporting byte operations, which implies 256-bit masks
- Mask vectors are a separate type
- Guide users that the package is for numeric vectors (including byte, rune and complex)
- If it were a standard package it should be `math/vec` or `math/vector`
I am not very familiar with programming vector CPUs, so it is likely that common idioms are not well supported. I have tried to build understanding by researching the Intel AVX instruction set flavours and by reading general details of Arm SVE. It is possible I have misunderstood the semantics of some instructions and implemented the fallbacks incorrectly.
No effort has been made to make fallback implementations efficient. The aim is for simple and succinct code that keeps the focus on the API.
There are likely many opportunities to express Go fallback implementations in terms of smaller vectors. Go compiler intrinsics could implement specific operations like `vec2.Add` in hardware, allowing unsupported operations like `vec4.Add` to gang 2 operations using shorter vectors. This would probably be the best way to structure fallback implementations of the different vector sizes.

Because I am generating the different vector size packages, `vec2.Deinterlace` returns `[1]T` instead of `T`.
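The ganging idea above could be sketched as follows, where `add2` stands in for a hypothetical vec2-sized intrinsic with a native instruction, and `add4` is built from two of them:

```go
package main

import "fmt"

// add2 stands in for a vec2.Add compiler intrinsic backed by a native
// 128-bit instruction (the name is hypothetical).
func add2(a, b [2]float64) (z [2]float64) {
	z[0], z[1] = a[0]+b[0], a[1]+b[1]
	return z
}

// add4 gangs two vec2-sized operations to cover a vec4-sized one,
// the fallback structure suggested in the notes.
func add4(a, b [4]float64) (z [4]float64) {
	lo := add2([2]float64{a[0], a[1]}, [2]float64{b[0], b[1]})
	hi := add2([2]float64{a[2], a[3]}, [2]float64{b[2], b[3]})
	return [4]float64{lo[0], lo[1], hi[0], hi[1]}
}

func main() {
	fmt.Println(add4([4]float64{1, 2, 3, 4}, [4]float64{10, 20, 30, 40}))
}
```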
- Arrange generic type params so hard-to-infer params appear earlier
- Revisit automatic broadcast - is it worth the extra type system complexity? (vs `Add(&z, &x, Broadcast[int,[8]int](&y))`)
- Interlace/Deinterlace should return 2 vectors of the same size (see Rust simd)
- Mix
- Vector.Slice() to return a slice of an otherwise-opaque vector's elements
- Reduction can be performed repeatedly across vectors via orchestrating remaining lanes
- Lanes/Bool should support different reduction strategies: FirstToLast, LastToFirst, BinarySplit
- Possible lane patterns
- 1111, 0111, 0011, 0001
- 1111, 1110, 1100, 1000
- 1000, 0100, 0010, 0001
- 0001, 0010, 0100, 1000
- 1111, 1010, 1000
- 1111, 0101, 0001
- An efficient way to reverse the lane mask halves number of patterns
- Possibly a bool 'reverse' param
- Presumably it should be possible to AND these with an initial lane mask
- Don't run an iteration in which pattern is all zero - it should be a no-op
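For instance, a BinarySplit-style reduction over a vec4 could proceed as in this scalar sketch (an illustration of the lane-halving strategy, not an implementation):

```go
package main

import "fmt"

// reduceMax reduces a vec4 with a BinarySplit strategy: combine the two
// halves (lanes 0,1 against lanes 2,3), then the remaining pair.
func reduceMax(v [4]int) int {
	// First split: fold the upper half onto the lower half.
	if v[2] > v[0] {
		v[0] = v[2]
	}
	if v[3] > v[1] {
		v[1] = v[3]
	}
	// Second split: fold the remaining pair.
	if v[1] > v[0] {
		v[0] = v[1]
	}
	return v[0]
}

func main() {
	fmt.Println(reduceMax([4]int{3, 7, 5, 1}))
}
```

Each step halves the number of active lanes, which is why an efficient lane-mask reverse halves the number of patterns needed.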
- Horizontal instructions and naming convention
- "Horizontal" "Across" "All" "Vector" "Lanes" "Element(s)"
- Loop tail and alignment operations
- Should there be a neighbour swap based on mask bits? (1 = swap, 0 = leave)
- Swap(mask, a,b *[N]T) ?
- VP2INTERSECT?
- AppendAndPermute (2-source table based permute in AVX512)
- Absolute difference? (SABD, UABD, FABD in SVE)
- Saturating arithmetic
- Dot product (SVE and AVX define partial integer dot product)
- Improve documentation by substituting `Length` with the exact length
- Vector address generation
- How to implement safely? Could return N x ptr to slice element
- Interlace and other ops may be over-constrained - no reason not to allow pointer elements
- May need better generic type constraints to allow number + pointer
- Extended multiply and divide (i.e. 32b x 32b -> 64b)
- Complex number operations?
- Interlace/Deinterlace could be Zip/Unzip
- By analogy to FP, Zip should take 2 x `[N]T` and return `[N][2]T`, but I'm not sure this really helps
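A sketch of that mused-about signature at a fixed length of 4 (the lower-case `zip` name is illustrative):

```go
package main

import "fmt"

// zip pairs corresponding lanes of two vectors, following the FP-style
// [N]T x [N]T -> [N][2]T signature discussed in the notes.
func zip[T any](a, b [4]T) (z [4][2]T) {
	for i := range a {
		z[i] = [2]T{a[i], b[i]}
	}
	return z
}

func main() {
	fmt.Println(zip([4]int{1, 2, 3, 4}, [4]int{5, 6, 7, 8}))
}
```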
- BlendAdd, BlendSub, etc. - m = false: return a; m = true: return a+b
- Emulates x += 10 (vs Add which emulates x+10)
- Make consistent with Add, Sub...
- Maybe `BlendAdd(m, &a, b)`? Or `AddBlend`?
- Should `Blend` also take a pointer to assign?
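A scalar sketch of the `BlendAdd(m, &a, b)` shape mused about above (the signature is one of the options under discussion, not a settled API):

```go
package main

import "fmt"

// BlendAdd adds b into a only where the mask is set, emulating a
// conditional x += y per lane (vs Add, which emulates x + y).
func BlendAdd[T int | float64](m [4]bool, a *[4]T, b [4]T) {
	for i := range a {
		if m[i] {
			a[i] += b[i]
		}
	}
}

func main() {
	a := [4]int{1, 1, 1, 1}
	BlendAdd([4]bool{true, false, true, false}, &a, [4]int{10, 10, 10, 10})
	fmt.Println(a)
}
```

Passing `&a` makes the in-place `+=` semantics explicit, which is the consistency question raised for `Blend` itself.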
- SVE-like mask operators
- Need to understand these better
- BreakBefore, BreakAfter
- Panic/error if 2+ Bunch elements can't fit in vector size
- How do we deal with Convert()? It will generally return a different vector length for a different type.
- Could return a slice of bunches?
- Could return a special Bunch that can hold more than 1 Bunch inside and hope multiple conversions don't blow up the vector size too much
- One optimisation would be when converting the special Bunch, aim to pack a full result vector even if it means converting multiple embedded input bunches
- Could represent as a bunch of bunches?
- If Element constrained to largest native (64 or 128 bits), worst case is converting from 64 int8 Bunch (el size=1) to 64 complex128 Bunch (el size=16)
- So it will fit inside a bunch[16]
- Parallel LoadCorresponding
- Loads each Bunch in parallel, stopping when the shortest one is full or the first Reader is empty
- `Truncate` a vector to get a smaller size (i.e. split and discard)
- https://en.wikipedia.org/wiki/Vector_processor
- https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
- https://en.wikipedia.org/wiki/Stream_processing
- Works with Go's type system to reduce the size of the API and the number of types
- Fallback Go implementations are currently duplicated across different vector sizes
- Pushes the effort of coping with baroque vector instruction sets onto the Go compiler
- But I believe this is the correct place for it
- Different packages for each vector size mean the user has to navigate between packages
- However, each vector size supports the same operations, so understanding `vec2` enables understanding of `vec4`, etc.
- There was originally 1 mask type (opaque, with 64 bits)
- Masks are now reimplemented at the vector size and called `vecN.Bool`
- Pro: removes the need for masks to be opaque in order to support future vector length extension
- Con: duplicates all the mask operations across each vector size package
- Con: may make it harder to implement in-vector conditional logic across differing vector sizes (need to resize the mask vector)
- Pro: we have `Any()` to detect set bits; if the mask follows the vector size, we can also have `All()`
- Pro: no longer have to pass `Length` to `ForEach()`
- `Broadcast` should not accept a mask parameter
- Debatable decision: simplifies the vecN packages but might not work with the fluent-style package I'm working on
- The assumption is that tracking the value of a mask would make it harder for the compiler to fuse a broadcast with another operation
- Using broadcast without a mask is simpler/shorter for the user
- The fallback implementation of broadcast may lose some performance through unnecessarily copying the scalar value across the full array
- Broadcasting scalars involves a design trade-off between
- allocating and copying scalar to target vector length or
- more complex type constraints to allow for a copy-free scalar argument
- the complex type constraints tend to spread through functions that can accept a broadcast argument
- hence may be better to copy and trust compiler intrinsics to elide the copy where possible
- It is possible to reduce the width of type supported by longer vectors in sympathy with hardware
- Can be done by having narrower constraints on types within the package
- e.g. `type Number interface { Float32 | Integer32 }`
- This is still open to extension in future by relaxing the type constraints
- Disadvantage is not reusing the standard `constraints` package and duplicating constraints in each vector size package
- Another disadvantage is that widening code to a larger vector size can make it uncompilable for reasons that seem arbitrary to the user
- It looks possible within Go's generic type system to include pointers within some operations
- Type constraints get a bit complicated when pointers and underlying values mix
- It is difficult to cleanly express pointers to pointers because recursive types not allowed
- You can almost fold all combinations of vector length N of type T down to a single (verbose) type constraint:
- i.e. `type Vector interface { [2]int32 | [2]int64 ... | [4]int32 | [4]int64 ... }`
- But fallback implementations in Go trip over this rule:
- https://go.dev/ref/spec#Index_expressions, specifically "The element types of all types in P's type set must be identical."
- This prevents indexing the elements of `Vector` if they can be of different types
- In practical terms, it seems better not to have types with specific (typically power-of-2) vector lengths
- Makes it easier to express fallbacks of vector length N operations in terms of operations on vectors of length N/2
- The compiler substitutes hardware instructions where they are available
- e.g. recognise `Add[[2]uint64](a, b)` has a native instruction, which `Add[[4]uint64](a, b)` can execute twice
- Similar to the approach in Zig linked from proposal #53171
- Another variant which is possible but inconvenient is to define the fallback as `Add[Element,Vector[Element]](a, b)` where `type Vector[E any] interface { [2]E | [4]E | [8]...}`
- This works, but automatic type inference isn't smart enough to allow type parameters to be elided at the call site:
- So you have to write `Add[int,[4]int](a, b)` instead of `Add(a, b)`
- TODO needs reconfirming - discovered this during a number of related generics experiments
- As with the previous bullet point, I prefer array types at hardware-sympathetic vector lengths where it is possible to write fallbacks of length N in terms of intrinsics of length N/2
- So not sure I would advocate this route even if type inference were better
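A sketch of this variant and its verbose call site (the element constraint is narrowed to two types for brevity; newer Go releases may infer more than the notes' experiments found, per the TODO above):

```go
package main

import "fmt"

// Vector folds several array lengths into one constraint. All member
// types share the element type E, so indexing is permitted.
type Vector[E any] interface {
	[2]E | [4]E | [8]E
}

// Add is a generic fallback over any Vector[E]. len works on a type
// parameter because it is valid for every type in the type set.
func Add[E int | float64, V Vector[E]](a, b V) (z V) {
	for i := 0; i < len(a); i++ {
		z[i] = a[i] + b[i]
	}
	return z
}

func main() {
	// The verbose call site the notes describe: both type parameters spelled out.
	fmt.Println(Add[int, [4]int]([4]int{1, 2, 3, 4}, [4]int{1, 1, 1, 1}))
}
```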
- Could adopt https://pkg.go.dev/math/big convention of assigning the method receiver
- This would require named Vector types and we might as well collapse all the vector types into 1 package
- i.e. replace `x := vec2.Add(m, a, b)` with `x := Vec2{}; x.Add(m, a, b)`
- Would we want to pass a blend/zero parameter in this case?
- i.e. `x.Add(m, vector.Blend, a, b)`
- One alternative is extend "fluent" API style
- e.g. `x.Blend(m).Add(a,b)`, `x.Add(a,b)` (not masked) and `x.Zero(m).Add(a,b)` zeroed instead of blended
- Horizontal ops could be...
- e.g. `x.Reduce().Max()`
- or generally: `x.Reduce().ForRange(func(i int, v Element){...})`
- Unclear if it's good/safe to pass element index `i`
- Note math/big passes around pointers (required for receivers to allow assignment but we might pass vectors as values to operations like Add)
- A vector as an opaque type (like big.Float) is in sympathy with Arm's SVE design
- SVE more or less requires an opaque vector
- SVE makes interesting and extensive use of predication to hide the vector length
- Not an immediately intuitive model (to me)
- There are hints that their primitives could allow lifting to code that is almost auto-vectorized in flavour
- This might provide an intuitive model in which "looping" is captured by pulling vector-sized chunks from an input slice, processing using SIMD operations, then reducing to final output in a loop tail
- SVE mask/predicate registers introduce a complication not present in x86:
- in x86 mask bits are aligned to vector element index - easy to use across element sizes
- in SVE mask bits are aligned to vector bytes
- so a predicate formed from comparing int16s can't be used in int32 ops without a spread/unpack operation on the predicate
- to cope:
- define type-width-specific mask types (which I don't favour)
- normalise masks after every operation
- i.e. pack to x86-format after every op and unpack before every op
- Could often be elided in practice as you will often work on the same width types (e.g. float64)
- hide the masks by bundling them with the vectors
- might work in a Fluent-style API
- might be awkward to mask a vector using another vector, rather than a mask
- might need to panic when combining vectors with different masks? (or you'd potentially be subtly losing data)
- Potentially get partway to auto-vectorization with this approach
- Assuming you can range over a vector and lift Go's scalar operations to vector equivalents
- Also require loop head and tail in many cases, e.g. memory alignment in head + tail, final reduction in tail
- Alignment may not be achievable when combining 2 or more slices and it's not clear SVE requires it
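The chunked-loop-plus-tail shape described above might look like this in scalar Go, with `sum4` standing in for a hypothetical vec4-sized horizontal-add intrinsic:

```go
package main

import "fmt"

// sum4 stands in for a vec4 horizontal-add intrinsic (hypothetical name).
func sum4(v [4]float64) float64 {
	return v[0] + v[1] + v[2] + v[3]
}

// sum processes vector-sized chunks in the loop body and finishes the
// remainder in a scalar loop tail - the loop head/body/tail structure
// the notes describe.
func sum(xs []float64) (total float64) {
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		total += sum4([4]float64{xs[i], xs[i+1], xs[i+2], xs[i+3]})
	}
	for ; i < len(xs); i++ { // tail: the last len(xs) % 4 elements
		total += xs[i]
	}
	return total
}

func main() {
	fmt.Println(sum([]float64{1, 2, 3, 4, 5, 6, 7}))
}
```

On SVE-style hardware the tail could instead be handled by a partial predicate, which is what makes the predication model attractive.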
- Allow non-power-of-2 vector sizes and have `binaryOp`, `split` et al. inspect the array length in binary to determine how to split up the vector into smaller ops
- 15 should be sized up to 16
- On a CPU which can blend and doesn't read/write/fault a masked-off memory access, we could still store 15 elements but safely operate on 16
- If we can't use raw Go arrays, it makes more sense to have a Vec15 type which is [16]E internally
- 9 should be sized down to 8 plus a scalar op
- Probably best to assume a given width should be half full or more
- Improve type inference so type parameters of e.g. `Add[E Number, V Vector[E]](z, x, y *V)` can be inferred
- Edit: I've just realised you only need to provide enough type params to allow the rest to be worked out!
- So it only requires `Add[float64](z, x, y)`, rather than `Add[float64,[8]float64](z, x, y)` as I originally thought
- Big improvement to readability
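A sketch of that realisation: with the element type listed first, the call site supplies only `E`, and the vector type is inferred from the pointer arguments (constraint names follow the notes but are illustrative):

```go
package main

import "fmt"

type Number interface {
	~int | ~int64 | ~float64
}

type Vector[E any] interface {
	[2]E | [4]E | [8]E
}

// Add assigns z = x + y lane-wise. Explicit dereferences are used because
// indexing through a pointer to a type parameter is not permitted.
func Add[E Number, V Vector[E]](z, x, y *V) {
	for i := 0; i < len(*z); i++ {
		(*z)[i] = (*x)[i] + (*y)[i]
	}
}

func main() {
	var z [8]float64
	x := [8]float64{1, 2, 3, 4, 5, 6, 7, 8}
	y := [8]float64{1, 1, 1, 1, 1, 1, 1, 1}
	// Only E is written; V is inferred as [8]float64 from the arguments.
	Add[float64](&z, &x, &y)
	fmt.Println(z)
}
```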
- Add ability to switch on a type parameter, e.g. `switch V.(type) { case [2]E: ... }`
- This should allow several same-type parameters (e.g. `(z, x, y *V)`) to be interpreted as a concrete type after the `case`
- In some instances this could avoid a lot of type assertions
- Compiler can check cases are possible, which can't be done with `switch any(v).(type)`
- In some cases it would help to be able to embed a type param in a constraint:
- `type Broadcast[E any, V Vector[E]] V` <- `V` can only be a struct field
- This stops us taking `len` of values that include a broadcast, even though they are all arrays
- Similarly `type VectorScalar[E any] interface { Vector[E] | E }` <- `E` must be wrapped in a struct
- Puzzle:
- Given `Broadcast[E any, V Vector[E]] struct { Replicated V }`
- And `VectorBroadcast[E any, V Vector[E]] interface { Vector[E] | Broadcast[E, V] }`
- Where I have a value of type `*Broadcast[E, V]` type-asserted from a `VectorBroadcast[E, V]`
- I'm not sure why I can find the `len` of `Broadcast.Replicated` but I can't slice it
- Go knows it has type `V` but it doesn't seem to know that it is also `Vector[E]` and therefore sliceable
- It would be nice to assist testing with `func (t *testing.T) MustPanic() { if r := recover(); r == nil { t.Fail() } }`
- So tests can check a panic easily with `defer t.MustPanic(); /* code that panics */`
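Since methods can't be added to `*testing.T` from outside the `testing` package, a standalone variant of the helper might look like this (demonstrated here without the `testing` package; in a test you would defer it with `t.Fail` as the callback):

```go
package main

import "fmt"

// mustPanic is deferred at the top of a test body; it calls fail unless
// the body panicked. recover works because mustPanic is itself the
// deferred function.
func mustPanic(fail func()) {
	if r := recover(); r == nil {
		fail()
	}
}

func main() {
	failed := false
	func() {
		defer mustPanic(func() { failed = true })
		panic("boom") // the code under test panics, as expected
	}()
	fmt.Println("failed:", failed)
}
```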
```go
func unsafeSlice[E any, V constraintsExt.Vector[E]](v *V) (slice []E) {
	// TODO not sure why I can't return (*v)[:]
	// A value of type V can be trivially converted to a value of type [n]E.
	// Unfortunately we can't explicitly convert to that type with a type constructor because we don't know n.
	// It would be nice if array sizes could be generic parameters, e.g. `type Vector[E any, N const] interface { ~[N]E }`.
	// However, in this case we would be giving up "blessed" power-of-two vector lengths unless Go also allowed the const
	// to be constrained. So it is not a trivial extension to the language conceptually.
	return unsafe.Slice((*E)(unsafe.Pointer(v)), len(*v))
}
```
- `z := x <op> y` for `op` in `+ - * / & | &^ ^ << >>`
- `z := <op> x` for `op` in `& * - ^`
- `z := <type>(x)`
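As an example of the conversion case, a vec4-sized lift of `z := <type>(x)` might look like this (the `Convert4` name is hypothetical; note the earlier open question that converting between element sizes generally changes the natural vector length):

```go
package main

import "fmt"

// Convert4 lifts the scalar conversion z := float64(x) across all four
// lanes at a fixed vec4 length.
func Convert4(x [4]int32) (z [4]float64) {
	for i, v := range x {
		z[i] = float64(v)
	}
	return z
}

func main() {
	fmt.Println(Convert4([4]int32{1, 2, 3, 4}))
}
```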