1. Introduction

1.1. Overview

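The Enflame GCU operator support list primarily covers the XLA operators used by TensorFlow (TF) and PyTorch.

For the semantics of the operators supported by Enflame GCU, refer to the TensorFlow XLA operation semantics.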

2. XLA Operator Support List

2.1. AfterAll

Semantics

AfterAll takes a variadic number of tokens and produces a single token. Tokens are primitive types which can be threaded between side-effecting operations to enforce ordering. AfterAll can be used as a join of tokens for ordering an operation after a set of operations.

Boundary

unimplemented

2.2. AllReduce

Semantics

Performs a custom computation across replicas.

Boundary

unimplemented

2.3. AllToAll

Semantics

AllToAll is a collective operation that sends data from all cores to all cores.

Boundary

unimplemented

2.4. BatchNormGrad

Semantics

Calculates gradients of batch norm.

Boundary

Supported DataType: FP32/BF16/FP16.

Constraints:

  1. Supported Format: NHWC/CHNW/NCHW

2.5. BatchNormInference

Semantics

Normalizes an array across batch and spatial dimensions.
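For orientation, the per-element arithmetic of inference-mode batch normalization can be sketched in plain Python. The function name `batch_norm_inference` and its scalar per-feature parameters are illustrative assumptions, not an Enflame or XLA API; the formula is the standard one, y = (x - mean) / sqrt(variance + epsilon) * scale + offset, applied to the values of a single feature channel.

```python
import math

def batch_norm_inference(values, scale, offset, mean, variance, epsilon):
    """Normalize the values of ONE feature channel with precomputed statistics.

    Hypothetical host-side helper for illustration only.
    """
    inv_std = 1.0 / math.sqrt(variance + epsilon)
    return [(v - mean) * inv_std * scale + offset for v in values]

# With mean 2, variance 1 and epsilon 0, the values 1 and 3 map to -1 and 1.
print(batch_norm_inference([1.0, 3.0], scale=1.0, offset=0.0,
                           mean=2.0, variance=1.0, epsilon=0.0))
```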

Boundary

Supported DataType: FP32/BF16/FP16.

Constraints:

  1. Supported Format: NHWC/CHNW/NCHW

2.6. BatchNormTraining

Semantics

Normalizes an array across batch and spatial dimensions.

Boundary

Supported DataType: FP32/BF16/FP16.

Constraints:

  1. Supported Format: NHWC/CHNW/NCHW

2.7. BitcastConvertType

Semantics

Similar to a tf.bitcast in TensorFlow, performs an element-wise bitcast operation from a data shape to a target shape. The dimensions must match, and the conversion is an element-wise one; e.g. s32 elements become f32 elements via bitcast routine. Bitcast is implemented as a low-level cast, so machines with different floating-point representations will give different results.
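To make the difference between a bitcast and a value conversion concrete, the following host-side sketch reinterprets the 32 bits of a signed integer as an IEEE 754 single-precision float using Python's standard `struct` module. The helper names `bitcast_s32_to_f32` and `bitcast_f32_to_s32` are hypothetical and are not part of any Enflame or XLA API.

```python
import struct

def bitcast_s32_to_f32(value: int) -> float:
    """Reinterpret the bits of a signed 32-bit integer as a float32."""
    return struct.unpack("<f", struct.pack("<i", value))[0]

def bitcast_f32_to_s32(value: float) -> int:
    """Reinterpret the bits of a float32 as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", value))[0]

# 0x3F800000 is the IEEE 754 single-precision bit pattern of 1.0.
print(bitcast_s32_to_f32(0x3F800000))   # 1.0 (bits reinterpreted)
print(float(0x3F800000))                # 1065353216.0 (value converted)
```

The second line of output is what a value conversion such as ConvertElementType would produce for the same input, which is why the two operators are not interchangeable.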

Boundary

Supported DataType: U64/S64/FP32/U32/S32/BF16/FP16/U16/S16/U8/S8/PRED.

Constraints:

  1. U64/S64 can be converted to U32/S32 only.

2.8. Broadcast

Semantics

Adds dimensions to an array by duplicating the data in the array.
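As a rough sketch of this behavior on nested Python lists: in the XLA operation semantics, Broadcast prepends new leading dimensions of the requested sizes and copies the operand along each of them. The function name `broadcast` below is an illustrative stand-in, not a device API.

```python
import copy

def broadcast(operand, broadcast_sizes):
    """Prepend dimensions of the given sizes, duplicating `operand` along each.

    The output shape is broadcast_sizes + shape(operand).
    """
    result = operand
    for size in reversed(broadcast_sizes):
        result = [copy.deepcopy(result) for _ in range(size)]
    return result

print(broadcast([1, 2], [3]))   # [[1, 2], [1, 2], [1, 2]]
print(broadcast(7, [2, 2]))     # [[7, 7], [7, 7]]
```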

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

None

2.9. BroadcastInDim

Semantics

Expands the size and rank of an array by duplicating the data in the array.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

None

2.10. Call

Semantics

Invokes a computation with the given arguments.

Boundary

Constraints: The same as XLA_CPU implementations

2.11. Cholesky

Semantics

Computes the Cholesky decomposition of a batch of symmetric (Hermitian) positive definite matrices.

Boundary

unimplemented

2.12. Clamp

Semantics

Clamps an operand to within the range between a minimum and maximum value.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.13. Collapse

Semantics

Collapses dimensions of an array into one dimension.

Boundary

unimplemented

2.14. CollectivePermute

Semantics

CollectivePermute is a collective operation that sends and receives data across replicas.

Boundary

unimplemented

2.15. Concatenate

Semantics

Concatenate composes an array from multiple array operands. The array is of the same rank as each of the input array operands (which must be of the same rank as each other) and contains the arguments in the order that they were specified.
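A minimal sketch of concatenation along a chosen dimension, restricted to rank-2 nested lists for brevity. The helper `concatenate` is hypothetical and only illustrates the ordering rule described above.

```python
def concatenate(operands, dimension):
    """Concatenate rank-2 nested lists along `dimension` (0 or 1)."""
    if dimension == 0:
        # Stack the rows of each operand in the order given.
        return [list(row) for op in operands for row in op]
    if dimension == 1:
        rows = len(operands[0])
        if any(len(op) != rows for op in operands):
            raise ValueError("all operands must have the same number of rows")
        # Join row r of every operand, left to right.
        return [[v for op in operands for v in op[r]] for r in range(rows)]
    raise ValueError("this sketch only handles rank-2 operands")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(concatenate([a, b], 0))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(concatenate([a, b], 1))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```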

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

  1. Only source dimension sizes less than 65536 are supported.

2.16. Conditional

Semantics

Executes true_computation if pred is true, false_computation if pred is false, and returns the result.

Boundary

Constraints: The same as XLA_CPU implementations

2.17. Conv

Semantics

As ConvWithGeneralPadding, but the padding is specified in a short-hand way as either SAME or VALID. SAME padding pads the input (lhs) with zeroes so that the output has the same shape as the input when not taking striding into account. VALID padding simply means no padding.
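The effect of SAME and VALID on one spatial dimension can be sketched as follows. The formulas follow the common TensorFlow convention and ignore dilation; treat them as an assumption rather than a statement about the GCU implementation. The helper names are hypothetical.

```python
import math

def conv_output_size(input_size, window_size, stride, padding):
    """Output length along one spatial dimension (no dilation)."""
    if padding == "VALID":
        return max(math.ceil((input_size - window_size + 1) / stride), 0)
    if padding == "SAME":
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

def same_padding(input_size, window_size, stride):
    """(low, high) zero padding that SAME adds along one dimension."""
    out = math.ceil(input_size / stride)
    total = max((out - 1) * stride + window_size - input_size, 0)
    low = total // 2
    return low, total - low

print(conv_output_size(224, 3, 1, "SAME"))   # 224
print(conv_output_size(224, 3, 1, "VALID"))  # 222
print(same_padding(224, 3, 1))               # (1, 1)
```

With stride 1, SAME keeps the spatial size unchanged, which matches the description above; with larger strides the output shrinks by the stride factor.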

Boundary

Supported DataType: FP32/BF16/FP16.

General constraints:

  1. Window size: R <= 13, S <= 13

  2. Stride H/W in [1, 2, 4]

  3. LHS dilation H/W in [1, 2, 4]

  4. RHS dilation H/W in [1, 2, 4]
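The general constraints listed above can be turned into a small pre-flight check. This is a sketch only: the function `conv_constraint_violations` and its argument layout are hypothetical, and the limits are copied from the list above.

```python
ALLOWED_STEPS = (1, 2, 4)   # permitted stride and dilation values
MAX_WINDOW = 13             # upper bound for window sizes R and S

def conv_constraint_violations(window_rs, stride_hw, lhs_dilation_hw, rhs_dilation_hw):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    r, s = window_rs
    if r > MAX_WINDOW or s > MAX_WINDOW:
        problems.append(f"window size {r}x{s} exceeds {MAX_WINDOW}")
    named = (("stride", stride_hw),
             ("LHS dilation", lhs_dilation_hw),
             ("RHS dilation", rhs_dilation_hw))
    for name, pair in named:
        for value in pair:
            if value not in ALLOWED_STEPS:
                problems.append(f"{name} {value} is not one of {list(ALLOWED_STEPS)}")
    return problems

print(conv_constraint_violations((3, 3), (1, 1), (1, 1), (1, 1)))  # []
```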

2.18. ConvWithGeneralPadding

Semantics

Computes a convolution of the kind used in neural networks. Here, a convolution can be thought of as an n-dimensional window moving across an n-dimensional base area, with a computation performed for each possible position of the window.

Boundary

Same as Conv.

2.19. ConvertElementType

Semantics

Similar to an element-wise static_cast in C++, performs an element-wise conversion operation from a data shape to a target shape. The dimensions must match, and the conversion is an element-wise one; e.g. s32 elements become f32 elements via an s32-to-f32 conversion routine.

Boundary

Supported converting DataType:

  1. source DataType: PRED

    destination DataType: U8, S32, F32

  2. source DataType: S8

    destination DataType: S8, S16, F16, BF16

  3. source DataType: U8

    destination DataType: U8, U16, F16, BF16, U32, S32, F32

  4. source DataType: U16

    destination DataType: U8, U16, U32, F32

  5. source DataType: S16

    destination DataType: S8, S16, S32, F32

  6. source DataType: F16

    destination DataType: S16, F16, S32, F32

  7. source DataType: BF16

    destination DataType: S16, BF16, S32, F32

  8. source DataType: U32

    destination DataType: U16, BF16, U32, F32, U64, S64

  9. source DataType: S32

    destination DataType: S16, BF16, S32, F32, U64, S64

  10. source DataType: F32

    destination DataType: S16, F16, BF16, U32, S32, F32

  11. source DataType: U64

    destination DataType: U32, S32

  12. source DataType: S64

    destination DataType: U32, S32
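The conversion list above can be transcribed into a lookup table so that a source/destination pair can be checked programmatically. The names `SUPPORTED_CONVERSIONS` and `is_conversion_supported` are hypothetical; the contents are copied from the list as written.

```python
# Source DataType -> set of supported destination DataTypes.
SUPPORTED_CONVERSIONS = {
    "PRED": {"U8", "S32", "F32"},
    "S8":   {"S8", "S16", "F16", "BF16"},
    "U8":   {"U8", "U16", "F16", "BF16", "U32", "S32", "F32"},
    "U16":  {"U8", "U16", "U32", "F32"},
    "S16":  {"S8", "S16", "S32", "F32"},
    "F16":  {"S16", "F16", "S32", "F32"},
    "BF16": {"S16", "BF16", "S32", "F32"},
    "U32":  {"U16", "BF16", "U32", "F32", "U64", "S64"},
    "S32":  {"S16", "BF16", "S32", "F32", "U64", "S64"},
    "F32":  {"S16", "F16", "BF16", "U32", "S32", "F32"},
    "U64":  {"U32", "S32"},
    "S64":  {"U32", "S32"},
}

def is_conversion_supported(source: str, destination: str) -> bool:
    """True if the source -> destination pair appears in the list above."""
    return destination in SUPPORTED_CONVERSIONS.get(source, set())

print(is_conversion_supported("F32", "F16"))   # True
print(is_conversion_supported("F16", "BF16"))  # False
```

A direct F16 to BF16 conversion is not listed, so a two-step route through F32 would be needed under this table.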

2.20. CrossReplicaSum

Semantics

Performs AllReduce with a summation computation.

Boundary

unimplemented

2.21. CustomCall

Semantics

Calls a user-provided function within a computation.

Boundary

unimplemented

2.22. Dot

Semantics

The operation performs sum of products over the second dimension of lhs (or the first if it has rank 1) and the first dimension of rhs. These are the “contracted” dimensions. The contracted dimensions of lhs and rhs must be of the same size. In practice, it can be used to perform dot products between vectors, vector/matrix multiplications or matrix/matrix multiplications.
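For the matrix/matrix case, the contraction described above reduces to the familiar triple loop. The sketch below uses nested Python lists; the function name `dot` is chosen for illustration only.

```python
def dot(lhs, rhs):
    """Matrix product: contracts dimension 1 of lhs with dimension 0 of rhs."""
    if len(lhs[0]) != len(rhs):
        raise ValueError("contracted dimensions must have the same size")
    contracted = len(rhs)
    return [[sum(lhs[i][k] * rhs[k][j] for k in range(contracted))
             for j in range(len(rhs[0]))]
            for i in range(len(lhs))]

print(dot([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```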

Boundary

Supported DataType: F32/BF16/FP16

Constraints:

None

2.23. DotGeneral

Semantics

DotGeneral performs the sum of products over contracting dimensions specified in ‘dimension_numbers’.

Boundary

Supported DataType: F32/BF16/FP16

Constraints:

None

2.24. DynamicSlice

Semantics

DynamicSlice extracts a sub-array from the input array at dynamic start_indices. The size of the slice in each dimension is passed in size_indices, which specify the end point of exclusive slice intervals in each dimension: [start, start + size). The shape of start_indices must be rank == 1, with dimension size equal to the rank of operand.
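A rank-2 sketch of the half-open interval [start, start + size) on nested lists. Clamping the start indices so that the slice stays in bounds follows the XLA operation semantics; the helper name `dynamic_slice` and the rank-2 restriction are assumptions made to keep the example short.

```python
def dynamic_slice(operand, start_indices, size_indices):
    """Extract a size_indices-shaped block from a rank-2 nested list."""
    dims = (len(operand), len(operand[0]))
    # Clamp each start so that start + size never exceeds the dimension.
    starts = [min(max(s, 0), d - n)
              for s, d, n in zip(start_indices, dims, size_indices)]
    r0, c0 = starts
    height, width = size_indices
    return [row[c0:c0 + width] for row in operand[r0:r0 + height]]

m = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
print(dynamic_slice(m, [1, 1], [2, 2]))  # [[4, 5], [7, 8]]
print(dynamic_slice(m, [2, 2], [2, 2]))  # [[4, 5], [7, 8]] (start clamped to [1, 1])
```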

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Supported number of slice dimensions: fewer than 10.

  2. An UpdateSlice dimension size of less than 65536 bytes is preferred; otherwise, performance will suffer.

2.25. DynamicUpdateSlice

Semantics

DynamicUpdateSlice generates a result which is the value of the input array operand, with a slice update overwritten at start_indices. The shape of update determines the shape of the sub-array of the result which is updated. The shape of start_indices must be rank == 1, with dimension size equal to the rank of operand.
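A rank-1 sketch of the overwrite described above. The input is left untouched and a new list is returned; the start index is clamped so the update fits, as in the XLA operation semantics. The helper name `dynamic_update_slice` is hypothetical.

```python
def dynamic_update_slice(operand, update, start_index):
    """Return a copy of `operand` with `update` written at `start_index`."""
    start = min(max(start_index, 0), len(operand) - len(update))
    result = list(operand)
    result[start:start + len(update)] = update
    return result

print(dynamic_update_slice([0, 0, 0, 0, 0], [7, 8], 2))  # [0, 0, 7, 8, 0]
print(dynamic_update_slice([0, 0, 0, 0, 0], [7, 8], 4))  # [0, 0, 0, 7, 8] (start clamped to 3)
```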

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Supported number of slice dimensions: fewer than 10.

  2. An UpdateSlice dimension size of less than 65536 bytes is preferred; otherwise, performance will suffer.

2.26. Element-wise binary arithmetic operations:Add

Semantics

Addition

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.27. Element-wise binary arithmetic operations:Sub

Semantics

Subtraction

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.28. Element-wise binary arithmetic operations:Mul

Semantics

Multiplication

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.29. Element-wise binary arithmetic operations:Div

Semantics

Division

Boundary

Supported DataType: FP32/BF16/FP16.

Constraints:

None

2.30. Element-wise binary arithmetic operations:Rem

Semantics

Remainder

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.31. Element-wise binary arithmetic operations:Max

Semantics

Maximum

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.32. Element-wise binary arithmetic operations:Min

Semantics

Minimum

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.33. Element-wise binary arithmetic operations:LogicalAnd

Semantics

Logical AND

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.34. Element-wise binary arithmetic operations:LogicalOR

Semantics

Logical OR

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.35. Element-wise comparison operations:Eq

Semantics

Equal-to

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.36. Element-wise comparison operations:Ne

Semantics

Not equal-to

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.37. Element-wise comparison operations:Ge

Semantics

Greater-or-equal-than

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.38. Element-wise comparison operations:Gt

Semantics

Greater-than

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.39. Element-wise comparison operations:Le

Semantics

Less-or-equal-than

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.40. Element-wise comparison operations:Lt

Semantics

Less-than

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.41. Element-wise unary functions:Abs

Semantics

Element-wise abs x -> |x|.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.42. Element-wise unary functions:Ceil

Semantics

Element-wise ceil x -> ⌈x⌉.

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.43. Element-wise unary functions:Cos

Semantics

Element-wise cosine x -> cos(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

  1. For better accuracy, use an equivalent x in the range [-pi, pi].
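One way to obtain such an equivalent argument on the host is the IEEE remainder with respect to 2*pi, which is available as `math.remainder` in the Python standard library (Python 3.7 or later). The helper name `reduce_to_pi_range` is hypothetical, and whether this reduction is done on the host or elsewhere is left to the caller.

```python
import math

def reduce_to_pi_range(x: float) -> float:
    """Return y in [-pi, pi] that differs from x by a multiple of 2*pi."""
    return math.remainder(x, 2.0 * math.pi)

x = 1000.0
y = reduce_to_pi_range(x)
print(-math.pi <= y <= math.pi)                  # True
print(abs(math.cos(y) - math.cos(x)) < 1e-9)     # True
```

The same reduction applies to sine, since both functions have period 2*pi.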

2.44. Element-wise unary functions:Exp

Semantics

Element-wise natural exponential x -> e^x.

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.45. Element-wise unary functions:Floor

Semantics

Element-wise floor x -> ⌊x⌋.

Boundary

Supported DataType: FP32/FP16.

Constraints:

None.

2.46. Element-wise unary functions:IsFinite

Semantics

Tests whether each element of operand is finite, i.e., is not positive or negative infinity, and is not NaN. Returns an array of PRED values with the same shape as the input, where each element is true if and only if the corresponding input element is finite.

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.47. Element-wise unary functions:Log

Semantics

Element-wise natural logarithm x -> ln(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.48. Element-wise unary functions:LogicalNot

Semantics

Element-wise logical not x -> !(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.49. Element-wise unary functions:PopulationCount

Semantics

Computes the number of bits set in each element of operand.

Boundary

unimplemented

2.50. Element-wise unary functions:Neg

Semantics

Element-wise negation x -> -x.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.51. Element-wise unary functions:Sign

Semantics

Element-wise sign operation x -> sgn(x)

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.52. Element-wise unary functions:Tanh

Semantics

Element-wise hyperbolic tangent x -> tanh(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.53. Element-wise unary functions:sin

Semantics

Element-wise sine x -> sin(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

  1. For better accuracy, use an equivalent x in the range [-pi, pi].

2.54. Element-wise unary functions:atan2

Semantics

Element-wise two-argument arctangent y, x -> atan2(y, x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.55. Element-wise unary functions:power

Semantics

Element-wise power x, y -> power(x, y).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.56. Element-wise unary functions:rsqrt

Semantics

Element-wise reciprocal square root x -> rsqrt(x), i.e. 1/sqrt(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.57. Element-wise unary functions:sqrt

Semantics

Element-wise square root x -> sqrt(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.58. Element-wise unary functions:round

Semantics

Element-wise rounding x -> round(x).

Boundary

Supported DataType: FP32/FP16/BF16.

Constraints:

None.

2.59. Element-wise unary functions:shift_left

Semantics

Element-wise left shift x, y -> x << y.

Boundary

Supported DataType: U32/S32/U16/S16/U8/S8.

Constraints:

None

2.60. Element-wise unary functions:shift_right_arithmetic

Semantics

Element-wise arithmetic right shift x, y -> x >> y (arithmetic).

Boundary

Supported DataType: U32/S32/U16/S16/U8/S8.

Constraints:

None

2.61. Element-wise unary functions:shift_right_logical

Semantics

Element-wise logical right shift x, y -> x >> y (logical).
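The three shift variants differ only in how the vacated bits are filled. The sketch below models them for 32-bit signed values in plain Python, assuming a shift amount in the range 0 to 31; the helper names are illustrative and do not correspond to a device API.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(bits: int) -> int:
    """Interpret the low 32 bits of `bits` as a two's-complement integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits

def shift_left(x: int, y: int) -> int:
    return to_signed32(x << y)

def shift_right_arithmetic(x: int, y: int) -> int:
    # Python's >> on a negative int replicates the sign bit.
    return to_signed32(x) >> y

def shift_right_logical(x: int, y: int) -> int:
    # Mask to the unsigned bit pattern first so that zeros are shifted in.
    return to_signed32((x & MASK32) >> y)

print(shift_right_arithmetic(-8, 1))  # -4
print(shift_right_logical(-8, 1))     # 2147483644
print(shift_left(1, 31))              # -2147483648
```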

Boundary

Supported DataType: U32/S32/U16/S16/U8/S8.

Constraints:

None

2.62. Fft

Semantics

The XLA FFT operation implements the forward and inverse Fourier Transforms for real and complex inputs/outputs. Multidimensional FFTs on up to 3 axes are supported, except on TPU, where only a single axis is supported (please file a github issue if you require higher order).

Boundary

unimplemented

2.63. Gather

Semantics

The XLA gather operation stitches together several slices (each slice at a potentially different runtime offset) of an input array.

Boundary

Supported DataType: C128/C64/F64/U64/S64/FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.64. GetDimensionSize

Semantics

Returns the size of the given dimension of the operand. The operand must be array shaped.

Boundary

Constraints:

None

2.65. GetTupleElement

Semantics

Indexes into a tuple with a compile-time-constant value. The index must be a compile-time constant so that shape inference can determine the type of the resulting value.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.66. Infeed

Semantics

Reads a single data item from the implicit Infeed streaming interface of the device, interpreting the data as the given shape and its layout, and returns an XlaOp of the data. Multiple Infeed operations are allowed in a computation, but there must be a total order among the Infeed operations. For example, two Infeed operations placed in separate while loops have a total order when there is a dependency between the loops.

Boundary

Constraints: The same as XLA_CPU implementations

2.67. Outfeed

Semantics

Enqueues an outfeed instruction onto the computation. This instruction generates outgoing data transfers for the given data.

Boundary

Constraints: The same as XLA_CPU implementations

2.68. Iota

Semantics

Builds a constant literal on device rather than a potentially large host transfer. Creates a rank 1 array of values starting at zero and incrementing by one. For floating-point types, the produced array is equivalent to ConvertElementType(Iota(…)) where the Iota is of integral type and the conversion is to the floating-point type.

Boundary

Supported DataType: U32.

2.69. Map

Semantics

Applies a scalar function over the given operands arrays, producing an array of the same dimensions where each element is the result of the mapped function applied to the corresponding elements in the input arrays.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.70. Pad

Semantics

Expands the given operand array by padding around the array as well as between the elements of the array with the given padding_value. padding_config specifies the amount of edge padding and the interior padding for each dimension.
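Edge and interior padding are easiest to see in one dimension. The sketch below handles non-negative padding amounts only; the helper name `pad_1d` and its flat argument list (instead of a padding_config structure) are simplifications made for illustration.

```python
def pad_1d(operand, padding_value, low, high, interior):
    """Pad a rank-1 list: `low`/`high` at the edges, `interior` between elements."""
    result = [padding_value] * low
    for index, value in enumerate(operand):
        if index > 0:
            result.extend([padding_value] * interior)
        result.append(value)
    result.extend([padding_value] * high)
    return result

print(pad_1d([1, 2, 3], 0, low=1, high=2, interior=1))
# [0, 1, 0, 2, 0, 3, 0, 0]
```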

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Only source dimension sizes less than 65536 are supported.

  2. Only padded dimension sizes less than 2048 are supported.

2.71. Recv

Semantics

Receives data of the given shape from a Send instruction in another computation that shares the same channel handle. Returns a XlaOp for the received data.

Boundary

unimplemented

2.72. Reduce

Semantics

Applies a reduction function to one or more arrays in parallel.

Boundary

Supported DataType: FP32/BF16/FP16/PRED

Constraints:

  1. Only the layout (3,2,1,0) is supported.

2.73. ReducePrecision

Semantics

Models the effect of converting floating-point values to a lower-precision format (such as IEEE-FP16) and back to the original format. The number of exponent and mantissa bits in the lower-precision format can be specified arbitrarily, although all bit sizes may not be supported on all hardware implementations.

Boundary

Supported DataType: FP32/BF16/FP16

Constraints:

None.

2.74. ReduceWindow

Semantics

Applies a reduction function to all elements in each window of the input multi-dimensional array, producing an output multi-dimensional array with the same number of elements as the number of valid positions of the window. A pooling layer can be expressed as a ReduceWindow. Similar to Reduce, the applied computation is always passed the init_value on the left-hand side.
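A one-dimensional max-pooling sketch shows how the window slides and how init_value enters on the left-hand side of the reduction. It covers only the maximum computation and no padding; the helper name `reduce_window_max_1d` is hypothetical.

```python
def reduce_window_max_1d(operand, window_size, stride, init_value=float("-inf")):
    """Max-reduce every valid window of a rank-1 list."""
    output = []
    for start in range(0, len(operand) - window_size + 1, stride):
        acc = init_value  # init_value is the left-hand operand of the first step
        for value in operand[start:start + window_size]:
            acc = max(acc, value)
        output.append(acc)
    return output

print(reduce_window_max_1d([1, 3, 2, 5, 4, 6], window_size=2, stride=2))  # [3, 5, 6]
print(reduce_window_max_1d([1, 3, 2, 5, 4, 6], window_size=3, stride=1))  # [3, 5, 5, 6]
```

The output has one element per valid window position, which is why the second call returns four values for six inputs.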

Boundary

Supported DataType: F32/S32/BF16/F16

Constraints:

  1. Rank 4

  2. Computation: kMaximum

2.75. ReplicaId

Semantics

Returns the unique ID (U32 scalar) of the replica.

Boundary

unimplemented

2.76. Reshape

Semantics

Reshapes the dimensions of an array into a new configuration.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. A Reshape dimension size of less than 65536 bytes is preferred; otherwise, performance will suffer.

2.77. Rev

Semantics

Reverses the order of elements in the operand array along the specified dimensions, generating an output array of the same shape. Each element of the operand array at a multidimensional index is stored into the output array at a transformed index. The multidimensional index is transformed by reversing the index in each dimension to be reversed (i.e., if a dimension of size N is one of the reversing dimensions, its index i is transformed into N - 1 - i).
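The index transformation i -> N - 1 - i can be sketched for a rank-2 nested list as follows. The helper name `rev_2d` is illustrative, and `dimensions` is the set of dimensions to reverse.

```python
def rev_2d(operand, dimensions):
    """Reverse a rank-2 nested list along every dimension in `dimensions`."""
    rows, cols = len(operand), len(operand[0])

    def source_index(i, size, dim):
        # Reversed dimensions read from N - 1 - i; others read from i.
        return size - 1 - i if dim in dimensions else i

    return [[operand[source_index(r, rows, 0)][source_index(c, cols, 1)]
             for c in range(cols)]
            for r in range(rows)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(rev_2d(m, {1}))     # [[3, 2, 1], [6, 5, 4]]
print(rev_2d(m, {0, 1}))  # [[6, 5, 4], [3, 2, 1]]
```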

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Only source dimension sizes less than 65536 are supported.

2.78. RngNormal

Semantics

Constructs an output of a given shape with random numbers generated following the N(μ,σ) normal distribution.

Boundary

Supported DataType: F32/INT32/BF16/FP16.

Constraints:

  1. Truncated normal is not supported

  2. If seed=0, the timestamp will be used as the random seed.

2.79. RngUniform

Semantics

Constructs an output of a given shape with random numbers generated following the uniform distribution over the interval [a,b).

Boundary

Supported DataType: F32/INT32/BF16/FP16.

Constraints:

  1. Truncated normal is not supported

  2. If seed=0, the timestamp will be used as the random seed.

2.80. Scatter

Semantics

The XLA scatter operation generates a result which is the value of the input array operand, with several slices (at indices specified by scatter_indices) updated with the values in updates using update_computation.

Boundary

Supported DataType: C128/C64/F64/U64/S64/FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.81. Select

Semantics

Constructs an output array from elements of two input arrays, based on the values of a predicate array.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

None.

2.82. SelectAndScatter

Semantics

This operation can be considered as a composite operation that first computes ReduceWindow on the operand array to select an element from each window, and then scatters the source array to the indices of the selected elements to construct an output array with the same shape as the operand array.
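The two phases can be sketched in one dimension, using a greater-than-or-equal select and an add scatter. The helper name `select_and_scatter_1d`, the rank-1 restriction, and the absence of padding are all simplifying assumptions for illustration.

```python
def select_and_scatter_1d(operand, source, window_size, stride, init_value=0):
    """For each window, add the matching `source` value at the selected position."""
    result = [init_value] * len(operand)
    starts = range(0, len(operand) - window_size + 1, stride)
    for window_index, start in enumerate(starts):
        selected = start
        for i in range(start + 1, start + window_size):
            # Select step (Ge): keep the current pick while it is >= the candidate.
            if not operand[selected] >= operand[i]:
                selected = i
        # Scatter step (add): accumulate the source value at the selected index.
        result[selected] += source[window_index]
    return result

print(select_and_scatter_1d([1, 5, 2, 4, 6, 3], [10, 20, 30],
                            window_size=2, stride=2))
# [0, 10, 0, 20, 30, 0]
```

This is the access pattern of a max-pooling gradient: each incoming value is routed back to the position that won its window.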

Boundary

Supported Data Rank: 4 dimensions

Supported Select Computation: greater than or equal to

Supported Scatter Computation: add

For data formats NHWC/NCHW/HWNC/NHCW/CHNW:

Supported DataType: F32/S32/BF16/F16

Padding: kValid or kSame

Constraints:

  1. Window size: R <= 13, S <= 13

  2. Stride H/W in [1, 2, 4]

  3. LHS dilation H/W in [1, 2, 4]

  4. RHS dilation H/W in [1, 2, 4]

2.83. Send

Semantics

Sends the given operand data to a Recv instruction in another computation that shares the same channel handle. Does not return any data.

Boundary

unimplemented

2.84. Slice

Semantics

Slicing extracts a sub-array from the input array. The sub-array is of the same rank as the input and contains the values inside a bounding box within the input array where the dimensions and indices of the bounding box are given as arguments to the slice operation.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Supported number of slice dimensions: fewer than 10.

  2. An UpdateSlice dimension size of less than 65536 bytes is preferred; otherwise, performance will suffer.

2.85. Sort

Semantics

Sorts the operands.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.86. Transpose

Semantics

Permutes the operand dimensions with the given permutation, so ∀ i . 0 ≤ i < rank ⇒ input_dimensions[permutation[i]] = output_dimensions[i].
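The shape rule in the formula above can be checked with a short sketch. `transposed_shape` applies the rule to a list of dimension sizes, and `transpose_2d` applies it to the data of a rank-2 nested list; both names are hypothetical.

```python
def transposed_shape(input_dimensions, permutation):
    """output_dimensions[i] = input_dimensions[permutation[i]]."""
    return [input_dimensions[p] for p in permutation]

def transpose_2d(operand, permutation):
    """Transpose a rank-2 nested list for permutation [0, 1] or [1, 0]."""
    if list(permutation) == [0, 1]:
        return [list(row) for row in operand]
    if list(permutation) == [1, 0]:
        return [list(column) for column in zip(*operand)]
    raise ValueError("permutation must be [0, 1] or [1, 0] for rank 2")

print(transposed_shape([2, 3, 4], [2, 0, 1]))          # [4, 2, 3]
print(transpose_2d([[1, 2, 3], [4, 5, 6]], [1, 0]))    # [[1, 4], [2, 5], [3, 6]]
```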

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Supported number of copy dimensions: fewer than 10.

  2. A Copy dimension size of less than 65536 bytes is preferred; otherwise, performance will suffer.

2.87. TriangularSolve

Semantics

Solves systems of linear equations with lower or upper triangular coefficient matrices by forward- or back-substitution. Broadcasting along leading dimensions, this routine solves one of the matrix systems op(a) * x = b, or x * op(a) = b, for the variable x, given a and b, where op(a) is either op(a) = a, or op(a) = Transpose(a), or op(a) = Conj(Transpose(a)).

Boundary

unimplemented

2.88. Tuple

Semantics

A tuple containing a variable number of data handles, each of which has its own shape.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8.

Constraints:

None

2.89. While

Semantics

Sequentially executes the body until the condition fails.

Boundary

Constraints: Stream cache can’t be enabled.

2.90. copy

Semantics

Copies the operands.

Boundary

Supported DataType: FP32/U32/S32/BF16/FP16/U16/S16/U8/S8

Constraints:

  1. Supported number of copy dimensions: fewer than 10.

  2. A Copy dimension size of less than 65536 bytes is preferred; otherwise, performance will suffer.

2.91. Clz

Semantics

Count leading zeros.

Boundary

unimplemented

3. Appendix

Table 3.3 Version history

Version    Description        Date
V2.0       Initial release    January 2022