1. Builder C++ API
These pages document the public portions of the HLIR Builder C++ API. The API can be roughly divided into four parts:
- Builder: Global handle for building up the high-level IR (Intermediate Representation).
- Op: Fundamental operators for building the IR; there are hundreds of interfaces.
- Type: Basic struct containing the shape and primitive data type of an Op's result.
- Attribute: Attributes that can be bound to a module, func, or op of the IR.
HLIR Builder is the entry point for building the high-level IR for Enflame DTU. It is currently used at Enflame in the ONNX Bridge, and is planned for the PyTorch/TensorFlow Bridge in the future. We look forward to welcoming more users of the HLIR Builder C++ API.
1.1. Builder
Builder is a global handle for building the IR, which consists of Ops. You can use it to
- turn shape inference for Ops on or off (this covers both shape and type inference);
- create functions (main is created by default);
- set the inputs/outputs of a specific function (main is used when no function name is specified);
- print or dump the built IR, with PrintingFlags controlling the print behavior;
- get the built IR as a hlir::Module instance.
A simple example of using Builder could look as follows:
#include "gcu/hlir/builder/hlir_builder.h"
auto builder = std::make_shared<builder::Builder>();
builder->SetShapeInference(true);
auto ptype = builder::PrimitiveType::F32();
std::vector<int64_t> shape = {3, 2};
builder::Type type(shape, ptype);
auto arg0 = builder->CreateInput(type);
auto arg1 = builder->CreateInput(type);
auto arg2 = builder->CreateInput(type);
auto res = arg0 + arg1 * arg2;
builder->SetOutput({res});
builder->AddFunc("foo");
arg0 = builder->CreateInput(type, "foo");
arg1 = builder->CreateInput(type, "foo");
arg2 = builder->CreateInput(type, "foo");
res = builder::Add(arg1, arg2);
res = builder::Add(arg0, res);
builder->SetOutput({res}, "foo");
builder->Dump();
The built IR is:
module {
  func @foo(%arg0: tensor<3x2xf32>, %arg1: tensor<3x2xf32>, %arg2: tensor<3x2xf32>) -> tensor<3x2xf32> {
    %0 = "dtu_hlir.add"(%arg1, %arg2) : (tensor<3x2xf32>, tensor<3x2xf32>) -> tensor<3x2xf32>
    %1 = "dtu_hlir.add"(%arg0, %0) : (tensor<3x2xf32>, tensor<3x2xf32>) -> tensor<3x2xf32>
    "dtu_hlir.return"(%1) : (tensor<3x2xf32>) -> ()
  }
  func @main(%arg0: tensor<3x2xf32>, %arg1: tensor<3x2xf32>, %arg2: tensor<3x2xf32>) -> tensor<3x2xf32> {
    %0 = "dtu_hlir.mul"(%arg1, %arg2) : (tensor<3x2xf32>, tensor<3x2xf32>) -> tensor<3x2xf32>
    %1 = "dtu_hlir.add"(%arg0, %0) : (tensor<3x2xf32>, tensor<3x2xf32>) -> tensor<3x2xf32>
    return %1 : tensor<3x2xf32>
  }
}
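In @main above, the mul is emitted before the add because C++ operator precedence evaluates arg1 * arg2 first, and its result becomes an operand of +. The sketch below illustrates that effect with a toy recorder. It is not the HLIR Builder implementation: ToyGraph, ToyValue, Record and BuildExample are hypothetical names made up for this illustration.

```cpp
#include <string>
#include <vector>

// Toy stand-ins for illustration only; none of these names belong to the
// HLIR Builder API.
struct ToyGraph {
  std::vector<std::string> lines;  // one textual line per recorded op
};

struct ToyValue {
  ToyGraph* graph;   // graph that owns the producing op
  std::string name;  // printed name, e.g. "%arg0" or "%1"
};

// Appends one binary op to the graph and returns a handle to its result.
inline ToyValue Record(const std::string& op, const ToyValue& lhs,
                       const ToyValue& rhs) {
  ToyGraph* g = lhs.graph;
  std::string result = "%" + std::to_string(g->lines.size());
  g->lines.push_back(result + " = " + op + "(" + lhs.name + ", " + rhs.name +
                     ")");
  return ToyValue{g, result};
}

// Overloaded operators simply forward to the recorder.
inline ToyValue operator+(const ToyValue& a, const ToyValue& b) {
  return Record("add", a, b);
}
inline ToyValue operator*(const ToyValue& a, const ToyValue& b) {
  return Record("mul", a, b);
}

// Records arg0 + arg1 * arg2 and returns the ops in creation order.
inline std::vector<std::string> BuildExample() {
  ToyGraph g;
  ToyValue arg0{&g, "%arg0"};
  ToyValue arg1{&g, "%arg1"};
  ToyValue arg2{&g, "%arg2"};
  ToyValue res = arg0 + arg1 * arg2;  // operator* runs before operator+
  (void)res;
  return g.lines;
}
```

BuildExample() returns two lines, the mul of %arg1 and %arg2 followed by the add of %arg0 and %0, which mirrors the op order in the dumped @main function.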
1.2. Op
Op is short for operator, the basic unit used to build the IR. Ops are classified as client ops and meta ops, but you can completely ignore the difference when using the interfaces.
- meta ops are atomic operators that are implemented by the hardware;
- client ops can be decomposed into meta ops and aim for ease of use.
There are hundreds of Op interfaces, including some overloaded operators, as follows:
| Overloaded Operator | Op Interface |
|---|---|
Op interfaces take an argument named resultType, which specifies the Type of the Op's output. In most cases it can be inferred when builder->SetShapeInference(true) is set.
For several ops, such as Reshape or Convert, it must be set explicitly.
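One plausible reading of why Reshape needs an explicit resultType: for an element-wise op the operands determine the result shape, whereas a reshape target cannot be derived from the operand and can only be validated against it. The sketch below shows that distinction with toy helpers. ToyShape, InferElementwise and CheckReshape are hypothetical names, not HLIR Builder interfaces, and the toy ignores broadcasting and dynamic dims.

```cpp
#include <cstdint>
#include <vector>

// Toy shape type for illustration only; not part of the HLIR Builder API.
using ToyShape = std::vector<int64_t>;

// Number of elements in a fully static shape.
inline int64_t NumElements(const ToyShape& shape) {
  int64_t n = 1;
  for (int64_t d : shape) n *= d;
  return n;
}

// Element-wise binary op: the result shape follows from the operands.
// Returns false when the operand shapes differ (no broadcasting in this toy).
inline bool InferElementwise(const ToyShape& lhs, const ToyShape& rhs,
                             ToyShape* out) {
  if (lhs != rhs) return false;
  *out = lhs;
  return true;
}

// Reshape: many targets are valid for one input, so the caller has to
// supply the target; all that can be done here is to validate it.
inline bool CheckReshape(const ToyShape& input, const ToyShape& target) {
  return NumElements(input) == NumElements(target);
}
```

For a {3, 2} input, both {2, 3} and {6} pass CheckReshape, which is why no single result can be inferred without being told the target.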
1.3. Type
Type represents the output type of an Op. It contains the information of both the shape and the PrimitiveType, which stands for the basic data types.
| PrimitiveType | Basic data type in C++ |
|---|---|
Note
In a shape of {-1, 3, 224, 224}, -1 means an unknown dim.
A shape of {-2} means unknown rank and shape; it is used for dynamic-shape models.
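The two sentinel values in the note can be checked with a small helper before a shape is used for size computations. This is a minimal sketch that operates on a plain std::vector<int64_t>; ShapeKind, ClassifyShape and StaticElementCount are hypothetical names introduced here, not part of the HLIR Builder API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical classification of a shape vector, following the note above.
enum class ShapeKind { kStatic, kDynamicDims, kUnknownRank };

inline ShapeKind ClassifyShape(const std::vector<int64_t>& shape) {
  // {-2} marks a tensor whose rank and dims are both unknown.
  if (shape.size() == 1 && shape[0] == -2) return ShapeKind::kUnknownRank;
  // Any -1 entry marks a single unknown dim.
  for (int64_t d : shape) {
    if (d == -1) return ShapeKind::kDynamicDims;
  }
  return ShapeKind::kStatic;
}

// Element count of a fully static shape, or -1 when the shape is dynamic.
inline int64_t StaticElementCount(const std::vector<int64_t>& shape) {
  if (ClassifyShape(shape) != ShapeKind::kStatic) return -1;
  int64_t n = 1;
  for (int64_t d : shape) n *= d;
  return n;
}
```

With this helper, {-1, 3, 224, 224} classifies as kDynamicDims, {-2} as kUnknownRank, and {3, 2} as kStatic with 6 elements.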
Warning
At the moment, unsigned integers (uint8_t, uint16_t, uint32_t, uint64_t) are not
supported because of the old version of MLIR we are using. The problem will be
solved by upgrading the MLIR dependency. In addition, users must provide
float16.h and bfloat16.h themselves when they need to use
F16() and BF16() as inputs or to construct Const ops with these primitive types.
1.4. Attribute
Attributes are known-constant values of modules, functions and operators in the IR. Several kinds of attributes can be defined and bound to these targets.
A simple example of setting attributes could look as follows:
#include "gcu/hlir/builder/hlir_builder.h"
auto builder = std::make_shared<builder::Builder>();
auto ptype = builder::PrimitiveType::S64();
std::vector<int64_t> shape = {3, 2};
builder::Type type(shape, ptype);
auto arg0 = builder->CreateInput(type);
auto arg1 = builder->CreateInput(type);
auto res = arg0 + arg1;
// set attributes of Op
res.SetAttribute("op_name", builder::Attribute("sum"));
res.SetAttribute("op_type", builder::Attribute("Add"));
builder->SetOutput({res});
// construct an Array Attribute
std::vector<builder::Attribute> min_shape_dim;
std::vector<int64_t> min_shape_0 = {1, 2};
auto dtype_i64 = builder::PrimitiveType::S64();
auto attr_0 = builder::Attribute(builder::Type({2}, dtype_i64),
                                 min_shape_0.data());
min_shape_dim.push_back(attr_0);
std::vector<int64_t> min_shape_1 = {1, 2};
auto attr_1 = builder::Attribute(builder::Type({2}, dtype_i64),
                                 min_shape_1.data());
min_shape_dim.push_back(attr_1);
// set attributes of function
builder->SetFuncAttribute("main", "input_min_shape_dim", min_shape_dim);
// set attributes of module
builder->SetModuleAttribute("module_id", builder::Attribute(int32_t(20)));
builder->Dump();
The built IR is:
module attributes {module_id = 20 : i32} {
  func @main(%arg0: tensor<3x2xi64>, %arg1: tensor<3x2xi64>) -> tensor<3x2xi64> attributes {input_min_shape_dim = [dense<[1, 2]> : tensor<2xi64>, dense<[1, 2]> : tensor<2xi64>]} {
    %0 = "dtu_hlir.add"(%arg0, %arg1) {op_name = "sum", op_type = "Add"} : (tensor<3x2xi64>, tensor<3x2xi64>) -> tensor<3x2xi64>
    return %0 : tensor<3x2xi64>
  }
}
1.5. Sample
Below we provide a minimal example that covers building the graph, compiling it, and running the resulting executable.
The first step is to get and install the released DTU SDK.
# First, get and install the released SDK, for example:
dpkg -i tops-sdk_***.deb
# Once the SDK is installed, the HLIR Builder header files are located at:
tree /usr/include/gcu/hlir/builder/
# |-- hlir_builder.h
# |-- hlir_builder_client_ops.h
# |-- hlir_builder_common.h
# |-- hlir_builder_ops.h
# |-- hlir_builder_structs.h
# and the lib is:
ls /usr/lib/libdtu_sdk.so
# /usr/lib/libdtu_sdk.so -> libdtu_sdk.so.2
ls /opt/tops/lib/libtopsrt.so
# /opt/tops/lib/libtopsrt.so -> libtopsrt.so.2
Next, we can write a minimal CMake build configuration to develop a small application that depends on dtu sdk. CMake is not a hard requirement for using dtu sdk, but it is the recommended and blessed build system and will be well supported into the future. A most basic CMakeLists.txt file could look like this:
cmake_minimum_required(VERSION 3.12)  # add_compile_definitions requires CMake 3.12+
project(hlir_builder_demo)
set(CMAKE_CXX_STANDARD 14)
# If libdtu_sdk.so is compiled with ABI=0 (default), keep the next line enabled.
add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0)
# If libdtu_sdk.so is compiled with ABI=1, comment out the line above and uncomment the next line.
#add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=1)
include_directories(/usr/include/gcu)
include_directories(/opt/tops/include)
aux_source_directory(${CMAKE_CURRENT_LIST_DIR}/src demo_src)
link_directories(/usr/lib)
link_directories(/opt/tops/lib)
add_executable(${PROJECT_NAME} ${demo_src})
target_link_libraries(${PROJECT_NAME} -ldtu_sdk -ltopsrt)
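To confirm which _GLIBCXX_USE_CXX11_ABI value your own sources are actually compiled with, a small helper can report the macro as seen by the compiler. This is a sketch that only inspects the current translation unit; it says nothing about how libdtu_sdk.so itself was built. Cxx11AbiSetting is a hypothetical helper name.

```cpp
#include <string>

// Reports the libstdc++ dual-ABI setting of this translation unit.
// libstdc++ defines _GLIBCXX_USE_CXX11_ABI once any standard header is
// included; other standard libraries leave it undefined.
inline std::string Cxx11AbiSetting() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
  return _GLIBCXX_USE_CXX11_ABI ? "1" : "0";
#else
  return "undefined";
#endif
}
```

Printing the returned string from a scratch target built with the same CMakeLists.txt shows whether the add_compile_definitions line took effect.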
The implementation of our example (src/demo.cpp) simply creates a MatMul op, compiles it, and runs it:
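#include <cstdint>   // int64_t, uint64_t
#include <cstdlib>   // malloc, free
#include <iostream>
#include <memory>    // std::make_shared
#include <numeric>   // std::accumulate
#include <sstream>
#include <string>
#include <vector>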
#include "hlir/builder/hlir_builder.h"
#include "tops_graph_compiler/tops_graph_compiler.h"
#include "tops/tops_runtime.h"
#include "tops/tops_ext.h"
int main() {
  // stage 1: build the ir
  auto builder = std::make_shared<builder::Builder>();
  builder->SetShapeInference(true);
  auto ptype = builder::PrimitiveType::F32();
  std::vector<int64_t> shape = {2, 2};
  builder::Type type(shape, ptype);
  auto arg0 = builder->CreateInput(type);
  auto arg1 = builder->CreateInput(type);
  builder::DotDimensionNumbers dims_attr({}, {}, {1}, {0});
  auto res = builder::DotGeneral(arg0, arg1, dims_attr);
  res.SetAttribute("op_name", builder::Attribute("MatMul"));
  builder->SetOutput({res});
  builder->Dump();
  auto hlir_module = builder->GetModule();
  // stage 2: compile
  topsgraphProgram program;
  auto ret = topsgraphCreateProgramFromModule(&program, hlir_module.get());
  const char * options[] = {
      "-arch=gcu210",
      "-resource=1c4s",
      "-hlir=tops-hlir-pipeline{}"};
  topsgraphCompileProgram(program, 3, options);
  size_t binary_size = 0;
  topsgraphGetBinSize(program, &binary_size);
  char* binary = new char[binary_size];
  ret = topsgraphGetBin(program, binary);
  // stage 3: run
  topsInit(0);
  int device_id = 0;
  topsSetDevice(device_id);
  topsExecutable_t exec;
  topsCreateExecutable(&exec, binary, binary_size);
  delete [] binary;
  topsgraphDestroyProgram(&program);
  topsResource_t resource;
  topsCreateResourceForExecutable(&resource, exec);
  topsStream_t stream;
  topsStreamCreate(&stream);
  std::vector<int *> dev_inputs;
  std::vector<int *> dev_outputs;
  std::vector<float> lhs{0, 1, 2, 3};
  std::vector<float> rhs{1, 2, 3, 4};
  std::vector<void*> data_ptrs;
  data_ptrs.emplace_back(static_cast<void*>(lhs.data()));
  data_ptrs.emplace_back(static_cast<void*>(rhs.data()));
  uint64_t input_count = 0;
  topsExecutableQueryInfo(exec, topsExecutableInfoInputCount, &input_count);
  uint64_t *input_size_list = (uint64_t *)malloc(sizeof(uint64_t)* input_count);
  topsExecutableQueryInfo(exec, topsExecutableInfoInputSizeList,
                          input_size_list);
  for (size_t index = 0; index < input_count; index++) {
    auto input_size = (size_t)input_size_list[index];
    int *input = nullptr;
    topsMallocForResource((void**)&input, input_size, resource);
    topsMemcpyAsync(input, data_ptrs[index], input_size_list[index],
                    topsMemcpyHostToDevice, stream);
    dev_inputs.push_back(input);
  }
  uint64_t output_count = 0;
  topsExecutableQueryInfo(exec, topsExecutableInfoOutputCount, &output_count);
  auto output_size_list = (uint64_t *)malloc(sizeof(uint64_t)* output_count);
  topsExecutableQueryInfo(exec, topsExecutableInfoOutputSizeList,
                          output_size_list);
  for (size_t i = 0; i < output_count; i++) {
    uint64_t output_size = output_size_list[i];
    int *output = nullptr;
    topsMallocForResource((void**)&output, output_size, resource);
    dev_outputs.push_back(output);
  }
  topsLaunchExecutableV2(exec, resource,
                        (void**)dev_inputs.data(), dev_inputs.size(),
                        nullptr, nullptr,
                        (void**)dev_outputs.data(), dev_outputs.size(),
                        stream);
  auto output_rank_list = (uint64_t *)malloc(sizeof(uint64_t)* output_count);
  topsExecutableQueryInfo(exec, topsExecutableInfoOutputRank, output_rank_list);
  // use a uint64_t initial value so the sum is not accumulated as int
  uint64_t output_dims_size = std::accumulate(
      output_rank_list, output_rank_list + output_count, uint64_t{0});
  uint64_t *output_dim_list =
          (uint64_t *)malloc(sizeof(uint64_t) * output_dims_size);
  topsExecutableQueryInfo(exec, topsExecutableInfoOutputDimsList,
                          output_dim_list);
  uint64_t dim_index = 0;
  for (size_t i = 0; i < output_count; i++) {
    uint64_t output_size = output_size_list[i];
    std::vector<uint64_t> shape_v;
    for(size_t j =0; j < output_rank_list[i]; j++) {
      shape_v.push_back(output_dim_list[dim_index++]);
    }
    void *host_output = malloc(output_size);
    topsMemcpyAsync(host_output, dev_outputs[i], output_size,
                    topsMemcpyDeviceToHost, stream);
    topsStreamSynchronize(stream);
    float* output_data = static_cast<float*>(host_output);
    std::cout << "output data: ";
    for (int j = 0; j < 4; ++j) {
      std::cout << output_data[j] << ", ";
    }
    std::cout << std::endl;
    free(host_output);
  }
  for (auto dev_input : dev_inputs) {
    topsFree(dev_input);
  }
  for (auto dev_output : dev_outputs) {
    topsFree(dev_output);
  }
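  // release the host-side query buffers allocated with malloc above
  free(input_size_list);
  free(output_size_list);
  free(output_rank_list);
  free(output_dim_list);
  topsStreamDestroy(stream);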
  topsDestroyResource(resource);
  topsDestroyExecutable(exec);
  return 0;
}
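The listing above queries the output ranks and a single flat list of output dims, then walks the flat list with a running index to rebuild each output's shape. That unpacking step can be sketched in isolation as follows. SplitDimsByRank is a hypothetical helper name, and the sketch works on plain vectors instead of the buffers returned by the runtime.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits a flat list of dims into one shape per output, using each
// output's rank as the slice length. Hypothetical helper for illustration.
inline std::vector<std::vector<uint64_t>> SplitDimsByRank(
    const std::vector<uint64_t>& ranks,
    const std::vector<uint64_t>& flat_dims) {
  std::vector<std::vector<uint64_t>> shapes;
  std::size_t offset = 0;
  for (uint64_t rank : ranks) {
    // Stop early if the flat list is shorter than the ranks claim.
    if (rank > flat_dims.size() - offset) break;
    auto first = flat_dims.begin() + static_cast<std::ptrdiff_t>(offset);
    auto last = first + static_cast<std::ptrdiff_t>(rank);
    shapes.emplace_back(first, last);
    offset += static_cast<std::size_t>(rank);
  }
  return shapes;
}
```

For ranks {2, 1} and flat dims {2, 2, 5}, the helper yields the shapes {2, 2} and {5}.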
The last step is to build the application. For this, assume our example directory is laid out like the following.
demo
|-- CMakeLists.txt
`-- src
    `-- demo.cpp
We can now run the following commands to build the application from within the demo/ folder:
mkdir build
cd build
cmake ..
make
If all goes well, it will look something like this:
root@c76cafeb287f:/home/develop/hlir_builder/demo/build# cmake ..
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/develop/hlir_builder/demo/build
root@c76cafeb287f:/home/develop/hlir_builder/demo/build# make
[ 50%] Building CXX object CMakeFiles/hlir_builder_demo.dir/src/demo.cpp.o
[100%] Linking CXX executable hlir_builder_demo
[100%] Built target hlir_builder_demo
Executing the resulting hlir_builder_demo binary found in the build folder should now merrily print the built IR and the calculation result:
root@c76cafeb287f:/home/develop/hlir_builder/demo/build# ./hlir_builder_demo
# dumped ir
module {
  func @main(%arg0: tensor<2x2xf32>, %arg1: tensor<2x2xf32>) -> tensor<2x2xf32> {
    %0 = "dtu_hlir.dot_general"(%arg0, %arg1) {dot_dimension_numbers = {lhs_batching_dimensions = dense<[]> : tensor<0xi64>, lhs_contracting_dimensions = dense<1> : tensor<1xi64>, rhs_batching_dimensions = dense<[]> : tensor<0xi64>, rhs_contracting_dimensions = dense<0> : tensor<1xi64>}, op_name = "MatMul"} : (tensor<2x2xf32>, tensor<2x2xf32>) -> tensor<2x2xf32>
    return %0 : tensor<2x2xf32>
  }
}
# calculation result
output data: 3, 4, 11, 16,
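The printed result can be cross-checked on the host with a plain reference matrix product. The sketch below assumes the host buffers are laid out row-major, which is consistent with the output above but is an assumption rather than something the SDK documentation states here. MatMulReference is a hypothetical helper name.

```cpp
#include <cstddef>
#include <vector>

// Reference (m x k) * (k x n) product on row-major buffers, contracting
// lhs dimension 1 with rhs dimension 0 as in the dot_general example.
inline std::vector<float> MatMulReference(const std::vector<float>& lhs,
                                          const std::vector<float>& rhs,
                                          std::size_t m, std::size_t k,
                                          std::size_t n) {
  std::vector<float> out(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      for (std::size_t p = 0; p < k; ++p) {
        out[i * n + j] += lhs[i * k + p] * rhs[p * n + j];
      }
    }
  }
  return out;
}
```

With lhs = {0, 1, 2, 3} and rhs = {1, 2, 3, 4} as 2x2 matrices, the helper returns {3, 4, 11, 16}, matching the device output.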