Crate safe_arch
A crate that safely exposes arch intrinsics via #[cfg()].
safe_arch lets you safely use CPU intrinsics: the functions in the core::arch modules. It works purely via #[cfg()] and compile time CPU feature declaration. If you want to check for a feature at runtime and then call an intrinsic or use a fallback path based on that, then this crate is sadly not for you.
SIMD register types are “newtype’d” so that better trait impls can be given to them, but the inner value is a pub field so feel free to just grab it out if you need to. Trait impls of the newtypes include: Default (zeroed), From/Into of appropriate data types, and appropriate operator overloading.
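As a rough illustration of the newtype pattern described above, here is a minimal sketch built directly on core::arch. The names M128Sketch, add_lanes, and zero_lanes are hypothetical and are not this crate's API; the real types may differ in their details. A scalar fallback is included only so the sketch also builds on non-x86_64 targets.

```rust
// Hypothetical sketch of a "newtype'd" register; not this crate's real source.
#[cfg(target_arch = "x86_64")]
mod newtype_sketch {
    use core::arch::x86_64::{__m128, _mm_add_ps, _mm_loadu_ps, _mm_setzero_ps, _mm_storeu_ps};
    use core::ops::Add;

    /// Newtype over the raw register type; the inner value stays `pub`.
    #[derive(Clone, Copy)]
    #[repr(transparent)]
    pub struct M128Sketch(pub __m128);

    impl Default for M128Sketch {
        #[allow(unused_unsafe)]
        fn default() -> Self {
            // SAFETY: `sse` is part of the x86_64 baseline.
            M128Sketch(unsafe { _mm_setzero_ps() })
        }
    }

    impl From<[f32; 4]> for M128Sketch {
        fn from(lanes: [f32; 4]) -> Self {
            // SAFETY: the array is 16 readable bytes and the unaligned load
            // has no alignment requirement.
            M128Sketch(unsafe { _mm_loadu_ps(lanes.as_ptr()) })
        }
    }

    impl From<M128Sketch> for [f32; 4] {
        fn from(reg: M128Sketch) -> Self {
            let mut out = [0.0_f32; 4];
            // SAFETY: `out` is 16 writable bytes; the store is unaligned.
            unsafe { _mm_storeu_ps(out.as_mut_ptr(), reg.0) };
            out
        }
    }

    impl Add for M128Sketch {
        type Output = Self;
        #[allow(unused_unsafe)]
        fn add(self, rhs: Self) -> Self {
            // SAFETY: `sse` is part of the x86_64 baseline.
            M128Sketch(unsafe { _mm_add_ps(self.0, rhs.0) })
        }
    }
}

/// Adds four `f32` lanes through the newtype's `From`/`Into` and `+` impls.
#[cfg(target_arch = "x86_64")]
pub fn add_lanes(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use newtype_sketch::M128Sketch;
    (M128Sketch::from(a) + M128Sketch::from(b)).into()
}

/// Returns the `Default` (zeroed) register as an array.
#[cfg(target_arch = "x86_64")]
pub fn zero_lanes() -> [f32; 4] {
    newtype_sketch::M128Sketch::default().into()
}

/// Scalar fallback so the sketch builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_lanes(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

/// Scalar fallback so the sketch builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn zero_lanes() -> [f32; 4] {
    [0.0; 4]
}
```

Because the inner field is pub, a caller who needs an intrinsic that the wrapper does not cover can still reach the raw register with `.0`.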
- Most intrinsics (like addition and multiplication) are totally safe to use as long as the CPU feature is available. In this case, what you get is 1:1 with the actual intrinsic.
- Some intrinsics take a pointer of an assumed minimum alignment and validity span. For these, the safe_arch function takes a reference of an appropriate type to uphold safety.
  - Try the bytemuck crate (and turn on the bytemuck feature of this crate) if you want help safely casting between reference types.
- Some intrinsics are not safe unless you’re very careful about how you use them, such as the streaming operations requiring you to use them in combination with an appropriate memory fence. Those operations aren’t exposed here.
- Some intrinsics mess with the processor state, such as changing the floating point flags, saving and loading special register state, and so on. LLVM doesn’t really support you messing with that within a high level language, so those operations aren’t exposed here. Use assembly or something if you want to do that.
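To illustrate the second bullet above, the following sketch shows how taking a reference to a suitably aligned type can stand in for the pointer, alignment, and validity requirements of an aligned load. Aligned16 and double_lanes are hypothetical names chosen for this example, not functions from this crate, and the scalar fallback exists only so the sketch builds everywhere.

```rust
/// Hypothetical 16-byte aligned wrapper around four `f32` values.
#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Aligned16(pub [f32; 4]);

/// Doubles every lane using the *aligned* load and store intrinsics.
#[cfg(target_arch = "x86_64")]
pub fn double_lanes(input: &Aligned16) -> Aligned16 {
    use core::arch::x86_64::{_mm_add_ps, _mm_load_ps, _mm_store_ps};
    let mut out = Aligned16([0.0; 4]);
    // SAFETY: a `&Aligned16` is always non-null, 16-byte aligned, and valid
    // for 16 bytes, which is what the aligned load and store require.
    // `sse` is part of the x86_64 baseline.
    #[allow(unused_unsafe)]
    unsafe {
        let reg = _mm_load_ps(input.0.as_ptr());
        _mm_store_ps(out.0.as_mut_ptr(), _mm_add_ps(reg, reg));
    }
    out
}

/// Scalar fallback so the sketch builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn double_lanes(input: &Aligned16) -> Aligned16 {
    let v = input.0;
    Aligned16([v[0] + v[0], v[1] + v[1], v[2] + v[2], v[3] + v[3]])
}
```

The caller never writes unsafe: the type system guarantees the alignment that the intrinsic would otherwise have to assume.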
Naming Conventions
The safe_arch crate does not simply use the “official” names for each intrinsic, because the official names are generally poor. Instead, the operations have been given better names that hopefully make things easier to understand when you’re reading the code.
For a full explanation of the naming used, see the Naming Conventions page.
Current Support
- x86/x86_64 (Intel, AMD, etc)
  - 128-bit: sse, sse2, sse3, ssse3, sse4.1, sse4.2
  - 256-bit: avx, avx2
  - Other: adx, aes, bmi1, bmi2, fma, lzcnt, pclmulqdq, popcnt, rdrand, rdseed
Compile Time CPU Target Features
At the time of me writing this, Rust enables the sse and sse2 CPU features by default for all i686 (x86) and x86_64 builds. Those CPU features are built into the design of x86_64, and you’d need a super old x86 CPU for it to not support at least sse and sse2, so they’re a safe bet for the language to enable all the time. In fact, because the standard library is compiled with them enabled, simply trying to disable those features would actually cause ABI issues and fill your program with UB (link).
If you want additional CPU features available at compile time you’ll have to enable them with an additional arg to rustc. For a feature named name you pass -C target-feature=+name, such as -C target-feature=+sse3 for sse3.
You can alternately enable all target features of the current CPU with -C target-cpu=native. This is primarily of use if you’re building a program you’ll only run on your own system.
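One way to see what those flags did is to have the program report which features it was compiled with. The sketch below is only an illustration: compiled_features is a hypothetical helper, and the RUSTFLAGS invocations in the comments are one assumed way of passing the rustc args through cargo.

```rust
// Example invocations (the flags are the ones described above):
//   rustc -C target-feature=+sse3 main.rs
//   RUSTFLAGS="-C target-feature=+sse3" cargo build   (assumed cargo setup)
//   RUSTFLAGS="-C target-cpu=native" cargo build      (assumed cargo setup)

/// Lists which of a few x86 target features this build was compiled with.
/// Every `cfg!` below is resolved at compile time, not at runtime.
pub fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "sse") {
        enabled.push("sse");
    }
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "sse3") {
        enabled.push("sse3");
    }
    if cfg!(target_feature = "ssse3") {
        enabled.push("ssse3");
    }
    if cfg!(target_feature = "sse4.1") {
        enabled.push("sse4.1");
    }
    if cfg!(target_feature = "sse4.2") {
        enabled.push("sse4.2");
    }
    if cfg!(target_feature = "avx") {
        enabled.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    enabled
}
```

On a default x86_64 build this list would be expected to contain only sse and sse2; adding -C target-feature=+sse3 should make sse3 appear as well.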
It’s sometimes hard to know if your target platform will support a given feature set, but the Steam Hardware Survey is generally taken as a guide to what you can expect people to have available. If you click “Other Settings” it’ll expand into a list of CPU target features and how common they are. These days, it seems that sse3 can be safely assumed, and ssse3, sse4.1, and sse4.2 are pretty safe bets as well. The stuff above 128-bit isn’t as common yet; give it another few years.
Please note that executing a program on a CPU that doesn’t support the target features it was compiled for is Undefined Behavior.
Currently, Rust doesn’t actually support an easy way for you to check that a feature enabled at compile time is actually available at runtime. There is the “feature_detected” family of macros, but if you enable a feature they will evaluate to a constant true instead of actually deferring the check for the feature to runtime. This means that, if you did want a check at the start of your program, to confirm that all the assumed features are present and error out when the assumptions don’t hold, you can’t use that macro. You gotta use CPUID and check manually. rip. Hopefully we can make that process easier in a future version of this crate.
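A manual CPUID check along those lines might look like the sketch below. It is not part of this crate: CpuidFeatures, cpuid_features, and missing_features are hypothetical names. It reads CPUID leaf 1 and tests the documented SSE-family bits only. The avx family is deliberately left out, since confirming it also requires checking that the OS saves the wider register state.

```rust
/// SSE-family bits reported by CPUID leaf 1 (hypothetical helper type).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CpuidFeatures {
    pub sse: bool,
    pub sse2: bool,
    pub sse3: bool,
    pub ssse3: bool,
    pub sse4_1: bool,
    pub sse4_2: bool,
}

/// Queries the CPU directly; returns `None` on targets other than x86_64.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
pub fn cpuid_features() -> Option<CpuidFeatures> {
    use core::arch::x86_64::__cpuid;
    // SAFETY: the CPUID instruction is always available on x86_64.
    let leaf1 = unsafe { __cpuid(1) };
    let bit = |reg: u32, n: u32| (reg >> n) & 1 == 1;
    Some(CpuidFeatures {
        sse: bit(leaf1.edx, 25),
        sse2: bit(leaf1.edx, 26),
        sse3: bit(leaf1.ecx, 0),
        ssse3: bit(leaf1.ecx, 9),
        sse4_1: bit(leaf1.ecx, 19),
        sse4_2: bit(leaf1.ecx, 20),
    })
}

#[cfg(not(target_arch = "x86_64"))]
pub fn cpuid_features() -> Option<CpuidFeatures> {
    None
}

/// Names every feature the build assumed but the running CPU lacks.
pub fn missing_features() -> Vec<&'static str> {
    let Some(cpu) = cpuid_features() else {
        return Vec::new();
    };
    // (name, assumed at compile time, present at runtime)
    let checks = [
        ("sse3", cfg!(target_feature = "sse3"), cpu.sse3),
        ("ssse3", cfg!(target_feature = "ssse3"), cpu.ssse3),
        ("sse4.1", cfg!(target_feature = "sse4.1"), cpu.sse4_1),
        ("sse4.2", cfg!(target_feature = "sse4.2"), cpu.sse4_2),
    ];
    checks
        .iter()
        .filter(|(_, assumed, present)| *assumed && !*present)
        .map(|(name, _, _)| *name)
        .collect()
}
```

A program could call missing_features() at the very top of main and exit with an error message if the returned list is not empty. Strictly speaking, any code compiled with an unsupported feature that runs before that check is still a problem, so this is a best-effort guard rather than a guarantee.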
A Note On Working With Cfg
There are two main ways to use cfg:
- Via an attribute placed on an item, block, or expression:
#[cfg(debug_assertions)] println!("hello");
- Via a macro used within an expression position:
if cfg!(debug_assertions) { println!("hello"); }
The difference might seem small but it’s actually very important:
- The attribute form includes or excludes the code before the compiler checks whether the items it names really exist. This means that code that is configured via attribute can safely name things that don’t always exist as long as the things they name do exist whenever that code is configured into the build.
- The macro form will include the configured code no matter what, and then the macro resolves to a constant true or false and the compiler uses dead code elimination to cut out the path not taken.
This crate uses cfg via the attribute, so the functions it exposes don’t exist at all when the appropriate CPU target features aren’t enabled. Accordingly, if you plan to call this crate or not depending on what features are enabled in the build you’ll also need to control your use of this crate via the cfg attribute, not the cfg macro.
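The difference can be sketched with a function that only exists in some builds. Here avx_only_path, pick_path, and describe_build are hypothetical names that stand in for a feature-gated function from this crate and for the code that calls it.

```rust
/// Exists only in builds where the `avx` target feature is enabled.
#[cfg(target_feature = "avx")]
fn avx_only_path() -> &'static str {
    "avx path"
}

/// Attribute form: when `avx` is off this whole definition is removed before
/// names are resolved, so it may freely mention `avx_only_path`.
#[cfg(target_feature = "avx")]
pub fn pick_path() -> &'static str {
    avx_only_path()
}

/// Attribute form: the replacement used when `avx` is off.
#[cfg(not(target_feature = "avx"))]
pub fn pick_path() -> &'static str {
    "fallback path"
}

// Macro form, for contrast. Uncommenting this fails to compile in builds
// without `avx`, because both branches are type-checked and `avx_only_path`
// does not exist there:
//
// pub fn pick_path_with_macro() -> &'static str {
//     if cfg!(target_feature = "avx") { avx_only_path() } else { "fallback path" }
// }

/// The macro form is fine when both branches only name things that always exist.
pub fn describe_build() -> &'static str {
    if cfg!(debug_assertions) {
        "debug build"
    } else {
        "release build"
    }
}
```

The same shape applies to calls into this crate: gate the calling code with the attribute, and provide a differently gated fallback if one is needed.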
Modules
An explanation of the crate’s naming conventions.
Macros
Turns a round operator token to the correct constant value.
Structs
The data for a 128-bit SSE register of four f32 lanes.
The data for a 128-bit SSE register of two f64 values.
The data for a 128-bit SSE register of integer data.
The data for a 256-bit AVX register of eight f32 lanes.
The data for a 256-bit AVX register of four f64 values.
The data for a 256-bit AVX register of integer data.
Functions
Lanewise a + b with lanes as i8.
Lanewise a + b with lanes as i16.
Lanewise a + b with lanes as i32.
Lanewise a + b with lanes as i64.
Lanewise a + b.
Low lane a + b, other lanes unchanged.
Lanewise a + b.
Lowest lane a + b, high lane unchanged.
Lanewise saturating a + b with lanes as i8.
Lanewise saturating a + b with lanes as i16.
Lanewise saturating a + b with lanes as u8.
Lanewise saturating a + b with lanes as u16.
Lanewise average of the u8 values.
Lanewise average of the u16 values.
Bitwise a & b.
Bitwise a & b.
Bitwise a & b.
Bitwise (!a) & b.
Bitwise (!a) & b.
Bitwise (!a) & b.
Bitwise a | b.
Bitwise a | b.
Bitwise a | b.
Bitwise a ^ b.
Bitwise a ^ b.
Bitwise a ^ b.
Shifts all bits in the entire register left by a number of bytes.
Shifts all bits in the entire register right by a number of bytes.
Swap the bytes of the given 32-bit value.
Swap the bytes of the given 64-bit value.
Bit-preserving cast to m128 from m128d.
Bit-preserving cast to m128 from m128i.
Bit-preserving cast to m128d from m128.
Bit-preserving cast to m128d from m128i.
Bit-preserving cast to m128i from m128.
Bit-preserving cast to m128i from m128d.
Low lane equality.
Low lane f64 equal to.
Lanewise a == b with lanes as i8.
Lanewise a == b with lanes as i16.
Lanewise a == b with lanes as i32.
Lanewise a == b.
Low lane a == b, other lanes unchanged.
Lanewise a == b, mask output.
Low lane a == b, other lanes unchanged.
Low lane greater than or equal to.
Low lane f64 greater than or equal to.
Lanewise a >= b.
Low lane a >= b, other lanes unchanged.
Lanewise a >= b.
Low lane a >= b, other lanes unchanged.
Low lane greater than.
Low lane f64 greater than.
Lanewise a > b with lanes as i8.
Lanewise a > b with lanes as i16.
Lanewise a > b with lanes as i32.
Lanewise a > b.
Low lane a > b, other lanes unchanged.
Lanewise a > b.
Low lane a > b, other lanes unchanged.
Low lane less than or equal to.
Low lane f64 less than or equal to.
Lanewise a <= b.
Low lane a <= b, other lanes unchanged.
Lanewise a <= b.
Low lane a <= b, other lanes unchanged.
Low lane less than.
Low lane f64 less than.
Lanewise a < b with lanes as i8.
Lanewise a < b with lanes as i16.
Lanewise a < b with lanes as i32.
Lanewise a < b.
Low lane a < b, other lanes unchanged.
Lanewise a < b.
Low lane a < b, other lane unchanged.
Low lane not equal to.
Low lane f64 not equal to.
Lanewise a != b.
Low lane a != b, other lanes unchanged.
Lanewise a != b.
Low lane a != b, other lane unchanged.
Lanewise !(a >= b).
Low lane !(a >= b), other lanes unchanged.
Lanewise !(a >= b).
Low lane !(a >= b), other lane unchanged.
Lanewise !(a > b).
Low lane !(a > b), other lanes unchanged.
Lanewise !(a > b).
Low lane !(a > b), other lane unchanged.
Lanewise !(a <= b).
Low lane !(a <= b), other lanes unchanged.
Lanewise !(a <= b).
Low lane !(a <= b), other lane unchanged.
Lanewise !(a < b).
Low lane !(a < b), other lanes unchanged.
Lanewise !(a < b).
Low lane !(a < b), other lane unchanged.
Lanewise (!a.is_nan()) & (!b.is_nan()).
Low lane (!a.is_nan()) & (!b.is_nan()), other lanes unchanged.
Lanewise (!a.is_nan()) & (!b.is_nan()).
Low lane (!a.is_nan()) & (!b.is_nan()), other lane unchanged.
Lanewise a.is_nan() | b.is_nan().
Low lane a.is_nan() | b.is_nan(), other lanes unchanged.
Lanewise a.is_nan() | b.is_nan().
Low lane a.is_nan() | b.is_nan(), other lane unchanged.
Convert i32 to f32 and replace the low lane of the input.
Convert i32 to f64 and replace the low lane of the input.
Convert i64 to f64 and replace the low lane of the input.
Converts the lower f32 to f64 and replaces the low lane of the input.
Converts the low f64 to f32 and replaces the low lane of the input.
Rounds the f32 lanes to i32 lanes.
Rounds the two f64 lanes to the low two i32 lanes.
Rounds the four i32 lanes to four f32 lanes.
Rounds the two f64 lanes to the low two f32 lanes.
Rounds the lower two i32 lanes to two f64 lanes.
Rounds the two f64 lanes to the low two f32 lanes.
Copy the low i64 lane to a new register, upper bits 0.
Copies the a value and replaces the low lane with the low b value.
Lanewise a / b.
Low lane a / b, other lanes unchanged.
Lanewise a / b.
Lowest lane a / b, high lane unchanged.
Gets an i16 value out of an m128i, returns as i32.
Gets the low lane as an individual f32 value.
Gets the lower lane as an f64 value.
Converts the low lane to i32 and extracts as an individual value.
Converts the lower lane to an i32 value.
Converts the lower lane to an i32 value.
Converts the lower lane to an i64 value.
Converts the lower lane to an i64 value.
Inserts the low 16 bits of an i32 value into an m128i.
Loads the f32 reference into the low lane of the register.
Loads the f32 reference into all lanes of a register.
Loads the reference into the low lane of the register.
Loads the f64 reference into all lanes of a register.
Loads the low i64 into a register.
Loads the reference into a register.
Loads the reference into a register.
Loads the reference into a register.
Loads the reference into a register, replacing the high lane.
Loads the reference into a register, replacing the low lane.
Loads the reference into a register with reversed order.
Loads the reference into a register with reversed order.
Loads the reference into a register.
Loads the reference into a register.
Loads the reference into a register.
Lanewise max(a, b) with lanes as i16.
Lanewise max(a, b).
Low lane max(a, b), other lanes unchanged.
Lanewise max(a, b).
Low lane max(a, b), other lanes unchanged.
Lanewise max(a, b) with lanes as u8.
Lanewise min(a, b) with lanes as i16.
Lanewise min(a, b).
Low lane min(a, b), other lanes unchanged.
Lanewise min(a, b).
Low lane min(a, b), other lanes unchanged.
Lanewise min(a, b) with lanes as u8.
Move the high lanes of b to the low lanes of a, other lanes unchanged.
Move the low lanes of b to the high lanes of a, other lanes unchanged.
Move the low lane of b to a, other lanes unchanged.
Gathers the i8 sign bit of each lane.
Gathers the sign bit of each lane.
Gathers the sign bit of each lane.
Multiply i16 lanes producing i32 values, horizontal add pairs of i32 values to produce the final output.
Lanewise a * b with lanes as i16, keep the high bits of the i32 intermediates.
Lanewise a * b with lanes as i16, keep the low bits of the i32 intermediates.
Lanewise a * b.
Low lane a * b, other lanes unchanged.
Lanewise a * b.
Lowest lane a * b, high lane unchanged.
Lanewise a * b with lanes as u16, keep the high bits of the u32 intermediates.
Multiplies the odd u32 lanes and gives the widened (u64) results.
Saturating convert i16 to i8, and pack the values.
Saturating convert i16 to u8, and pack the values.
Saturating convert i32 to i16, and pack the values.
Reads the CPU’s timestamp counter value.
Reads the CPU’s timestamp counter value and stores the processor signature.
Lanewise 1.0 / a approximation.
Low lane 1.0 / a approximation, other lanes unchanged.
Lanewise 1.0 / sqrt(a) approximation.
Low lane 1.0 / sqrt(a) approximation, other lanes unchanged.
Sets the args into an m128i, first arg is the high lane.
Sets the args into an m128i, first arg is the high lane.
Sets the args into an m128i, first arg is the high lane.
Set an i32 as the low 32-bit lane of an m128i, other lanes blank.
Sets the args into an m128i, first arg is the high lane.
Set an i64 as the low 64-bit lane of an m128i, other lanes blank.
Sets the args into an m128, first arg is the high lane.
Sets the args into an m128, first arg is the high lane.
Sets the args into an m128d, first arg is the high lane.
Sets the args into the low lane of a m128d.
Sets the args into an m128i, first arg is the low lane.
Sets the args into an m128i, first arg is the low lane.
Sets the args into an m128i, first arg is the low lane.
Sets the args into an m128, first arg is the low lane.
Sets the args into an m128d, first arg is the low lane.
Splats the i8 to all lanes of the m128i.
Splats the i16 to all lanes of the m128i.
Splats the i32 to all lanes of the m128i.
Splats the i64 to both lanes of the m128i.
Splats the value to all lanes.
Splats the args into both lanes of the m128d.
Shift all u16 lanes to the left by the count in the lower u64 lane.
Shift all u32 lanes to the left by the count in the lower u64 lane.
Shift all u64 lanes to the left by the count in the lower u64 lane.
Shifts all u16 lanes left by an immediate.
Shifts all u32 lanes left by an immediate.
Shifts both u64 lanes left by an immediate.
Shift each i16 lane to the right by the count in the lower i64 lane.
Shift each i32 lane to the right by the count in the lower i64 lane.
Shift each u16 lane to the right by the count in the lower u64 lane.
Shift each u32 lane to the right by the count in the lower u64 lane.
Shift each u64 lane to the right by the count in the lower u64 lane.
Shifts all i16 lanes right by an immediate.
Shifts all i32 lanes right by an immediate.
Shifts all u16 lanes right by an immediate.
Shifts all u32 lanes right by an immediate.
Shifts both u64 lanes right by an immediate.
Shuffle the f32 lanes from $a and $b together using an immediate control value.
Shuffle the f64 lanes from $a and $b together using an immediate control value.
Shuffle the i32 lanes in $a using an immediate control value.
Shuffle the high i16 lanes in $a using an immediate control value.
Shuffle the low i16 lanes in $a using an immediate control value.
Lanewise sqrt(a).
Low lane sqrt(a), other lanes unchanged.
Lanewise sqrt(a).
Low lane sqrt(b), upper lane is unchanged from a.
Stores the high lane value to the reference given.
Stores the value to the reference given.
Stores the value to the reference given.
Stores the low lane value to the reference given.
Stores the value to the reference given.
Stores the low lane value to the reference given.
Stores the value to the reference given.
Stores the value to the reference given in reverse order.
Stores the value to the reference given.
Stores the low lane value to all lanes of the reference given.
Stores the low lane value to all lanes of the reference given.
Stores the value to the reference given.
Stores the value to the reference given.
Stores the value to the reference given.
Lanewise a - b with lanes as i8.
Lanewise a - b with lanes as i16.
Lanewise a - b with lanes as i32.
Lanewise a - b with lanes as i64.
Lanewise a - b.
Low lane a - b, other lanes unchanged.
Lanewise a - b.
Lowest lane a - b, high lane unchanged.
Lanewise saturating a - b with lanes as i8.
Lanewise saturating a - b with lanes as i16.
Lanewise saturating a - b with lanes as u8.
Lanewise saturating a - b with lanes as u16.
Compute “sum of u8 absolute differences”.
Transpose four m128 as if they were a 4x4 matrix.
Truncate the f32 lanes to i32 lanes.
Truncate the f64 lanes to the lower i32 lanes (upper i32 lanes 0).
Truncate the lower lane into an i32.
Truncate the lower lane into an i64.
Unpack and interleave high i8 lanes of a and b.
Unpack and interleave high i16 lanes of a and b.
Unpack and interleave high i32 lanes of a and b.
Unpack and interleave high i64 lanes of a and b.
Unpack and interleave high lanes of a and b.
Unpack and interleave high lanes of a and b.
Unpack and interleave low i8 lanes of a and b.
Unpack and interleave low i16 lanes of a and b.
Unpack and interleave low i32 lanes of a and b.
Unpack and interleave low i64 lanes of a and b.
Unpack and interleave low lanes of a and b.
Unpack and interleave low lanes of a and b.
All lanes zero.
Both lanes zero.
All lanes zero.
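Several of the one-line summaries above are easier to read next to a scalar model of what a single lane computes. The functions below are hypothetical reference implementations written in plain Rust for illustration. They are not this crate's functions, and they model the described behaviour (as I understand the underlying SSE2 instructions) rather than calling any intrinsic.

```rust
/// One lane of "Lanewise saturating a + b with lanes as u8":
/// the sum clamps at 255 instead of wrapping.
pub fn saturating_add_u8_lane(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}

/// One lane of "Lanewise average of the u8 values".
/// The hardware average rounds up: (a + b + 1) >> 1.
pub fn average_u8_lane(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

/// "Bitwise (!a) & b" over a whole 128-bit register.
pub fn andnot_bits(a: u128, b: u128) -> u128 {
    !a & b
}

/// "Gathers the sign bit of each lane" for four f32 lanes:
/// bit `i` of the result is the sign bit of lane `i`.
pub fn sign_mask_f32x4(lanes: [f32; 4]) -> i32 {
    let mut mask = 0;
    for (i, lane) in lanes.iter().enumerate() {
        mask |= ((lane.to_bits() >> 31) as i32) << i;
    }
    mask
}

/// "Multiply i16 lanes producing i32 values, horizontal add pairs of i32
/// values to produce the final output."
pub fn mul_i16_horizontal_add(a: [i16; 8], b: [i16; 8]) -> [i32; 4] {
    let mut out = [0_i32; 4];
    for i in 0..4 {
        let even = (a[2 * i] as i32) * (b[2 * i] as i32);
        let odd = (a[2 * i + 1] as i32) * (b[2 * i + 1] as i32);
        // Wrapping add: the only overflowing case is all four inputs = i16::MIN.
        out[i] = even.wrapping_add(odd);
    }
    out
}

/// One lane of "Saturating convert i16 to i8": out-of-range values clamp.
pub fn saturate_i16_to_i8(x: i16) -> i8 {
    x.clamp(i8::MIN as i16, i8::MAX as i16) as i8
}
```

Models like these can double as test oracles when checking a SIMD code path against a plain scalar one.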