SIMD Classes¶
The purpose of the SIMD classes is to abstract and consolidate the use of compiler intrinsics for the manipulation of architecture-specific vector (SIMD) values.
The implementation is rather loosely based on the data-parallel vector types proposal P0214R6 for the C++ Parallelism TS 2.
Unless otherwise specified, all classes, namespaces and top-level functions described below are within the top-level arb::simd namespace.
Example usage¶
The following code performs an element-wise vector product, storing only non-zero values in the resultant array.
#include <simd/simd.hpp>
using namespace arb::simd;

void product_nonzero(int n, const double* a, const double* b, double* result) {
    constexpr int N = simd_abi::native_width<double>::value;
    using simd = simd<double, N>;
    using mask = simd::simd_mask;

    int i = 0;
    for (; i+N<=n; i+=N) {
        auto vp = simd(a+i)*simd(b+i);
        where(vp!=0, vp).copy_to(result+i);
    }

    int tail = n-i;
    auto m = mask::unpack((1<<tail)-1);
    auto vp = simd(a+i, m)*simd(b+i, m);
    where(m && vp!=0, vp).copy_to(result+i);
}
Classes¶
Three user-facing template classes are provided:

simd<V, N, I = simd_abi::default_abi>
An N-wide vector type of values of type V, using the architecture-specific implementation I. The implementation parameter is itself a template, acting as a type-map, with I<V, N>::type being the concrete implementation class (see below) for N-wide vectors of type V for this architecture.
The implementation simd_abi::generic provides a std::array-backed implementation for arbitrary V and N, while simd_abi::native maps to the native architecture implementation for V and N, if one is available for the target architecture. simd_abi::default_abi will use simd_abi::native if available, or else fall back to the generic implementation.

simd_mask<V, N, I = simd_abi::default_abi>
The result of performing a lanewise comparison/test operation on a simd<V, N, I> vector value. simd_mask objects support logical operations and are used as arguments to where expressions. simd_mask<V, N, I> is a type alias for simd<V, N, I>::simd_mask.

where_expression<simd<V, N, I>>
The result of a where expression, used for masked assignment.
There is, in addition, a templated class detail::indirect_expression
that holds the result of an indirect(…) expression. These arise in
gather and scatter operations, and are detailed below.
Implementation type-maps live in the simd_abi namespace, while concrete implementation classes live in detail. A particular specialization for an architecture, for example 4-wide double on AVX, then requires:

A concrete implementation class, e.g. detail::avx_double4.
A specialization of its ABI map, so that simd_abi::avx<double, 4>::type is an alias for detail::avx_double4.
A specialization of the native ABI map, so that simd_abi::native<double, 4>::type is an alias for simd_abi::avx<double, 4>::type.
The maximum natively supported width for a scalar type V is recorded in
simd_abi::native_width<V>::value
.
Indirect expressions¶
An expression of the form indirect(p, k)
or indirect(p, k, constraint)
describes
a sequence of memory locations based at the pointer p with offsets given by the
simd
variable k. A constraint of type index_constraint
can be provided, which
promises certain guarantees on the index values in k:
Constraint 	Guarantee
index_constraint::none 	No restrictions.
index_constraint::independent 	No indices are repeated, i.e. k_{i} = k_{j} implies i = j.
index_constraint::contiguous 	Indices are sequential, i.e. k_{i} = k_{0} + i.
index_constraint::constant 	Indices are all equal, i.e. k_{i} = k_{j} for all i and j.
Class simd¶
The class simd<V, N, I>
is an alias for detail::simd_impl<I<V, N>::type>
;
the class detail::simd_impl<C>
provides the public interface and
arithmetic operators for a concrete implementation class C.
In the following:
S stands for the class simd<V, N, I>.
s is a SIMD value of type S.
m is a mask value of type S::simd_mask.
t, u and v are const objects of type S.
w is a SIMD value of type simd<W, N, J>.
i is an index of type int.
j is a const object of type simd<U, N, J> where U is an integral type.
x is a value of type V.
p is a pointer to V.
c is a const pointer to V or a length N array of V.
Here and below, the value in lane i of a SIMD vector or mask v is denoted by v_{i}.
Type aliases and constexpr members
Name 	Type 	Description
S::scalar_type 	V 	The type of one lane of the SIMD type.
S::simd_mask 	 	The simd_mask specialization corresponding to S.
S::width 	unsigned 	The SIMD width N.
Constructors
Expression 
Description 


A SIMD value v with v_{i} equal to x for i = 0…N1. 

A copy of the SIMD value t. 

A SIMD value v with v_{i} equal to 

A copy or valuecast of the SIMD value w of a different type but same width. 

A SIMD value v with v_{i} equal to 

A SIMD value v with v_{i} equal to 
Member functions
Expression 
Type 
Description 



Set 


Set 


Set s_{i} to 


Set s_{i} to 


Sum of s_{i} for i = 0…N1. 
Expressions
Expression 
Type 
Description 



Lanewise sum. 


Lanewise difference. 


Lanewise product. 


Lanewise quotient. 


Lanewise FMA t * u + v. 


Lanewise lessthan comparison. 


Lanewise lessthanorequals comparison. 


Lanewise greaterthan comparison. 


Lanewise greaterthanorequals comparison. 


Lanewise equality test. 


Lanewise inequality test. 


Lanewise assignment. 


Equivalent to 


Equivalent to 


Equivalent to 


Equivalent to 


Equivalent to 


Equivalent to 


Compound indirect assignment: 


Compound indirect assignment: 


Value t_{i} 


Set value s_{i} to x. 
The (non-const) index operator operator[]
returns a proxy object of type S::reference
,
which writes the corresponding lane in the SIMD value on assignment, and has an
implicit conversion to scalar_type
.
Class simd_mask¶
simd_mask<V, N, I>
is an alias for simd<V, N, I>::simd_mask
, which in turn
will be an alias for a class detail::simd_mask_impl<D>
, where D is
a concrete implementation class for the SIMD mask representation. simd_mask_impl<D>
inherits from, and is implemented in terms of, detail::simd_impl<D>
,
but note that the concrete implementation class D may or may not be the same
as the concrete implementation class I<V, N>::type
used by simd<V, N, I>
.
Mask values are read and written as bool
values of 0 or 1, which may
differ from the internal representation in each lane of the SIMD implementation.
In the following:
M stands for the class simd_mask<V, N, I>.
m and q are const objects of type simd_mask<V, N, I>.
u is an object of type simd_mask<V, N, I>.
b is a boolean value.
q is a pointer to bool.
y is a const pointer to bool or a length N array of bool.
i is of type int.
k is of type unsigned long long.
Constructors
Expression 
Description 


A SIMD mask u with u_{i} equal to b for i = 0…N1. 

A copy of the SIMD mask m. 

A SIMD value u with u_{i} equal to 
Note that simd_mask
does not (currently) offer a masked pointer/array constructor.
Member functions
Expression 
Type 
Description 



Write the boolean value m_{i} to 


Set u_{i} to the boolean value 
Expressions
Expression 
Type 
Description 



Lanewise negation. 


Lanewise logical and. 


Lanewise logical or. 


Lanewise equality (equivalent to 


Lanewise logical xor. 


Lanewise assignment. 


Boolean value m_{i}. 


Set m_{i} to boolean value b. 
Static member functions
Expression 
Type 
Description 



Mask with value m_{i} equal to the ith bit of k. 
Class where_expression¶
where_expression<S>
represents a masked subset of the lanes
of a SIMD value of type S
, used for conditional assignment,
masked scatter, and masked gather. It is a type alias for
S::where_expression
, and is the result of an expression of the
form where(mask, simdvalue)
.
In the following:
W stands for the class where_expression<simd<V, N, I>>.
s is a reference to a SIMD value of type simd<V, N, I>&.
t is a SIMD value of type simd<V, N, I>.
m is a mask of type simd<V, N, I>::simd_mask.
j is a const object of type simd<U, N, J> where U is an integral type.
x is a scalar of type V.
p is a pointer to V.
c is a const pointer to V or a length N array of V.
Expression 
Type 
Description 



A proxy for maskedassignment operations. 


Set s_{i} to t_{i} for i where m_{i} is true. 


Set s_{i} to x for i where m_{i} is true. 


Set 


Set 


Set s_{i} to 


Set s_{i} to 
Top-level functions¶
Lanewise mathematical operations abs(x), min(x, y) and max(x, y) are offered for all SIMD value types, while the transcendental functions are only usable for SIMD floating point types.
Vectorized implementations of some of the transcendental functions are provided: refer to the vector transcendental functions documentation for details.
In the following:
I and J are SIMD implementations.
A is a SIMD class simd<K, N, I> for some scalar type K.
S is a SIMD class simd<V, N, I> for a floating point type V.
L is a scalar type implicitly convertible from K.
a and b are values of type A.
s and t are values of type S.
r is a value of type std::array<K, N>.
Expression 	Type 	Description
abs(a) 	A 	Lanewise absolute value of a.
min(a, b) 	A 	Lanewise minimum of a and b.
max(a, b) 	A 	Lanewise maximum of a and b.
sin(s) 	S 	Lanewise sine of s.
cos(s) 	S 	Lanewise cosine of s.
log(s) 	S 	Lanewise natural logarithm of s.
exp(s) 	S 	Lanewise exponential of s.
expm1(s) 	S 	Lanewise \(x \mapsto e^x - 1\).
exprelr(s) 	S 	Lanewise \(x \mapsto x / (e^x - 1)\).
pow(s, t) 	S 	Lanewise raise s to the power of t.
simd_cast<simd<L, N, J>>(a) 	simd<L, N, J> 	Lanewise cast of values in a to scalar type L in a SIMD value.
simd_cast<std::array<L, N>>(a) 	std::array<L, N> 	Lanewise cast of values in a to scalar type L in a std::array.
simd_cast<simd<L, N, J>>(r) 	simd<L, N, J> 	Lanewise cast of values in the array r to scalar type L in a SIMD value.
Implementation requirements¶
Each specific architecture is represented by a templated class I, with
I<V, N>::type
being the concrete implementation for an N-wide
SIMD value with scalar_type
V.
A concrete implementation class C inherits from detail::implbase<C>
,
which provides (via CRTP) generic implementations of most of the SIMD
functionality. The base class implbase<C>
in turn relies upon
detail::simd_traits<C>
to look up the SIMD width, and associated types.
All the required SIMD operations are given by static member functions of C.
Some arguments to static member functions use a tag class (detail::tag
)
parameterized on a concrete implementation class for dispatch purposes.
Minimal implementation¶
In the following, let C be the concrete implementation class for an
N-wide vector of scalar_type V, with low-level representation
arch_vector.
The specialization of detail::simd_traits<C>
then exposes these
types and values, and also provides the concrete implementation class M
for masks associated with C:
template <>
struct simd_traits<C> {
    static constexpr unsigned width = N;
    using scalar_type = V;
    using vector_type = arch_vector;
    using mask_impl = M;
};
The mask implementation class M may or may not be the same as C.
For example, detail::avx_double4
provides both the arithmetic operations and mask
operations for an AVX 4 × double SIMD vector, while the mask
implementation for detail::avx512_double8
is detail::avx512_mask8
.
The concrete implementation class must provide at minimum implementations
of copy_to
and copy_from
(see the section below for semantics):
struct C: implbase<C> {
    static void copy_to(const arch_vector&, V*);
    static arch_vector copy_from(const V*);
};
If the implementation is also acting as a mask implementation, it must also
provide mask_copy_to
, mask_copy_from
, mask_element
and
mask_set_element
:
struct C: implbase<C> {
    static void copy_to(const arch_vector&, V*);
    static arch_vector copy_from(const V*);
    static void mask_copy_to(const arch_vector& v, bool* w);
    static arch_vector mask_copy_from(const bool* y);
    static bool mask_element(const arch_vector& v, int i);
    static void mask_set_element(arch_vector& v, int i, bool x);
};
The class detail::generic<T, N> provides an example of a minimal
implementation based on an arch_vector type of std::array<T, N>.
Concrete implementation API¶
In the following, C represents the concrete implementation class for a SIMD class of width N and value type V.
u, v, and w are values of type C::vector_type.
r is a reference of type C::vector_type&.
x is a value of type C::scalar_type.
c is a const pointer of type const C::scalar_type*.
p is a pointer of type C::scalar_type*.
j is a SIMD index representation of type J::vector_type for an integral concrete implementation class J.
d is a SIMD representation of type D::vector_type for a (different) concrete implementation class D.
b is a bool value.
q is a pointer to bool.
y is a const pointer to bool.
i is an unsigned (index) value.
k is an unsigned long long value.
m is a mask representation of type C::mask_type.
z is an index_constraint value.
Types and constants
Name 	Type 	Description
C::vector_type 	 	Underlying SIMD representation type.
C::scalar_type 	 	Should be convertible to/from V.
C::mask_impl 	 	Concrete implementation class for mask SIMD type.
C::mask_type 	 	Underlying SIMD representation for masks.
C::width 	unsigned 	The SIMD width N.
Initialization, load, store
Expression 
Type 
Description 



Return a vector with values v_{i} = d_{i}, where 


Fill representation with scalar x. 


Store values v_{i} to p+i. p may be unaligned. 


Store values v_{i} to p+i wherever m_{i} is true. p may be unaligned. 


Return a vector with values v_{i} loaded from c+i. c may be unaligned. 


Return a vector with values v_{i} loaded from c+i wherever m_{i} is true. c may be unaligned. 


Return a vector with values v_{i} loaded from c+i wherever m_{i} is true, or equal to u_{i} otherwise. c may be unaligned. 
Lane access
Expression 
Type 
Description 



Value in ith lane of v. 


Set value in lane i of r to x. 
Gather and scatter
The offsets for gather and scatter operations are given
by a vector type J::vector_type
for some possibly
different concrete implementation class J, and the
static methods implementing gather and scatter are templated
on this class.
Implementations can provide optimized versions for specific index classes J; this process would be simplified with more support for casts between SIMD types and their concrete implementations, functionality which is not yet provided.
The first argument to these functions is a dummy argument of type J, used only to disambiguate overloads.
Expression 
Type 
Description 



Vector v with values v_{i} = 


Vector v with values v_{i} = m_{i} ? 


Write values u_{i} to 


Write values u_{i} to 


Update values 
Casting
Implementations can provide optimized versions of lanewise value casting from other specific implementation classes.
The first argument is a dummy argument of type J, used only to disambiguate overloads.
Expression 
Type 
Description 



Returns vector v with values v_{i} = d_{i}, cast from 
Arithmetic operations
Expression 
Type 
Description 



Lanewise unary minus. 


Lanewise multiplication. 


Lanewise addition. 


Lanewise subtraction. 


Lanewise division. 


Lanewise fused multiplyadd (u*v+w). 


(Horizontal) sum of values u_{i} in each lane. 
Comparison and blends
Expression 
Type 
Description 



Lanewise u = v. 


Lanewise u ≠ v. 


Lanewise u > v. 


Lanewise u ≥ v. 


Lanewise u < v. 


Lanewise u ≤ v. 


Vector w with values w_{i} = m_{i} ? u_{i} : v_{i}. 
Mathematical function support.
With the exception of abs
, min
and max
, these are only
required for floating point vector implementations.
Expression 
Type 
Description 



Lanewise absolute value. 


Lanewise minimum. 


Lanewise maximum. 


Lanewise sine. 


Lanewise cosine. 


Lanewise natural logarithm. 


Lanewise exponential. 


Lanewise \(x \mapsto e^x 1\). 


Lanewise \(x \mapsto x/(e^x 1)\). 


Lanewise u raised to the power of v. 
Mask value support
Mask operations are only required if C constitutes the implementation of a SIMD mask class.
Expression 
Type 
Description 



Fill mask representation with bool b. 


Mask value v_{i}. 


Set mask value u_{i} to b. 


Write bool values to memory (unaligned). 


Load bool values from memory (unaligned). 


Return vector v with boolean value v_{i} equal to the ith bit of k. 
Logical operations
Logical operations are only required if C constitutes the implementation of a SIMD mask class.
Expression 
Type 
Description 



Lanewise negation. 


Lanewise logical and. 


Lanewise logical or. 


Lanewise m? v: w. 
Missing functionality¶
There is no support yet for the following features, although some of these will need to be provided in order to improve the efficiency of SIMD versions of our generated mechanisms.
Constraint-based dispatch for indirect operations other than += and =.
Vectorizable implementations of trigonometric functions.
Implementation of vector transcendental functions¶
When building with the Intel C++ compiler, transcendental
functions on SIMD values in simd<double, 8, detail::avx512>
wrap calls to the Intel short vector math library (SVML).
Outside of this case, the functions exp, log, expm1 and exprelr use explicit approximations as detailed below. The algorithms follow those used in the Cephes library, with some accommodations.
Exponentials¶
\(\operatorname{exp}(x)\)¶
The exponential is computed as
\[e^x = 2^n \cdot e^g,\]
with \(|g| \le 0.5\) and \(n\) an integer. The power of two is computed via direct manipulation of the exponent bits of the floating point representation, while \(e^g\) is approximated by a rational polynomial.
\(n\) and \(g\) are computed by:
\[n = \left\lfloor \frac{x}{\log 2} + \frac{1}{2} \right\rfloor, \qquad g = x - n\log 2,\]
where the subtraction in the calculation of \(g\) is performed in two stages, to limit cancellation error:
\[g \leftarrow x - n\,c_1, \qquad g \leftarrow g - n\,c_2,\]
where \(c_1+c_2 = \log 2\), \(c_1\) comprising the first 32 bits of the mantissa. (In principle \(c_1\) might contain more bits of the logarithm, but this particular decomposition matches that used in the Cephes library.) This decomposition gives \(g\leq \frac{1}{2}\log 2\approx 0.347\).
The rational approximation for \(e^g\) is of the form
\[e^g \approx \frac{R(g)}{R(-g)},\]
where \(R(g)\) is a polynomial of order 6. The coefficients are again those used by Cephes, and probably are derived via a Remez algorithm. \(R(g)\) is decomposed into even and odd terms
\[R(g) = Q(g^2) + g\,P(g^2),\]
so that the ratio can be calculated by:
\[e^g \approx 1 + \frac{2\,g\,P(g^2)}{Q(g^2) - g\,P(g^2)}.\]
Randomized testing indicates the approximation is accurate to 2 ulp.
\(\operatorname{expm1}(x)\)¶
A similar decomposition of \(x = g + n\log 2\) is performed so that \(g\le 0.5\), with the exception that \(n\) is always taken to be zero for \(x\le 0.5\), i.e.
\[n = \begin{cases}0 & \text{if } x\le 0.5,\\ \left\lfloor x/\log 2 + \tfrac{1}{2}\right\rfloor & \text{otherwise.}\end{cases}\]
\(\operatorname{expm1}(x)\) is then computed as
\[e^x - 1 = 2^n\,(e^g - 1) + (2^n - 1),\]
and the same rational polynomial is used to approximate \(e^g-1\),
\[e^g - 1 \approx \frac{2\,g\,P(g^2)}{Q(g^2) - g\,P(g^2)}.\]
The scaling by \(2^n\) for \(n\neq 0\) is in practice calculated as
\[e^x - 1 = 2\left(2^{n-1}(e^g-1) + \left(2^{n-1} - \tfrac{1}{2}\right)\right)\]
in order to avoid overflow at the upper end of the range.
The order 6 rational polynomial approximation for small \(x\) is insufficiently accurate to maintain 1 ulp accuracy; randomized testing indicates a maximum error of up to 3 ulp.
\(\operatorname{exprelr}(x)\)¶
The function is defined as
\[\operatorname{exprelr}(x) = \frac{x}{e^x - 1},\]
and is the reciprocal of the relative exponential function,
\[\operatorname{exprel}(x) = \frac{e^x - 1}{x}.\]
This is computed in terms of expm1 by:
\[\operatorname{exprelr}(x) = \begin{cases}1 & \text{if } 1+x = 1,\\ x/\operatorname{expm1}(x) & \text{otherwise.}\end{cases}\]
With the approximation for \(\operatorname{expm1}\) used above, randomized testing demonstrates a maximum error on the order of 4 ulp.
Logarithms¶
The natural logarithm is computed as
\[\log x = \log u + n\log 2,\]
where \(n\) is an integer and \(u\) is in the interval \([ \frac{1}{2}\sqrt 2, \sqrt 2]\). The logarithm of \(u\) is then approximated by the rational polynomial used in the Cephes implementation,
\[\log u \approx z - \frac{z^2}{2} + z^3\,\frac{P(z)}{Q(z)},\]
where \(P\) and \(Q\) are polynomials of degree 5, with \(Q\) monic.
Cancellation error is minimized by computing the sum for \(\log x\) as:
\[\log x = \left(z^3\,\frac{P(z)}{Q(z)} + n\,c_4\right) - \frac{z^2}{2} + z + n\,c_3,\]
where \(z=u-1\) and \(c_3+c_4=\log 2\), \(c_3\) comprising the first 9 bits of the mantissa.