5.1. TopsIDEAS gcu mix_precision

描述

用于自动或手动调整topsinference混合精度时每个op的计算精度。

自动混精功能,用户可给定模型输入和输出等,自动找出哪些op需要用fp32计算。 手动混精功能,打开topsinference的统计开关,使用fp32和fp16分别进行一次推理,对比每个op的输出,给出逐元素比较和cos similarity的结果,用户可以通过这些信息自行判断哪些op需要用fp32计算。

注:使用本功能找到的混精方案(需要设置为fp32计算的op id)只在当前版本topsinference上有效,当变更topsinference版本时,需要重新验证或者重新调整。

命令行

准备

  1. 准备需要校准fp16混精计算的onnx,特指使用topsinference fp32计算结果精度正常而使用默认的fp16混精方案计算精度有损失的onnx

  2. 准备一些真实样本,方法:将推理部署时,经过前处理的数据,保存成npz格式,npz中以模型每个输入的名称作为key,输入数据的ndarray作为value。npz文件必须命名为sample{id}_inputs.npz,如sample0_inputs.npz。对于自动混精功能的命令行使用方式,也需要将golden输出(如cpu fp32计算结果)也保存成同格式、名称对应的npz文件,如sample0_outputs.npz。

  3. 准备一个workspace文件夹,校准过程中产生的中间文件会保存在workspace中,对同一个onnx文件、同一个编译设置的多步校准可以复用workspace,节省部分编译时间和空间占用,否则应清空workspace。

使用方法

usage: topsideas gcu mix_precision [-h] [--manual] --input_onnx INPUT_ONNX --sample_dir SAMPLE_DIR --workspace WORKSPACE [--clean] [--output_json OUTPUT_JSON]
                                   [--fp32_nodes [FP32_NODES ...]] [--finalize] [--compile_options COMPILE_OPTIONS] [--resource_mode RESOURCE_MODE]
                                   [--min_shapes [MIN_SHAPES ...]] [--max_shapes [MAX_SHAPES ...]] [--device DEVICE] [--cluster [CLUSTER ...]]
                                   [--mode {RAW,CALC}] [--rtol RTOL] [--atol ATOL] [--ntol NTOL] [--cos_sim COS_SIM] [--le_value_thres LE_VALUE_THRES]
                                   [--ge_value_thres GE_VALUE_THRES] [--op_id_filter [OP_ID_FILTER ...] | --op_id_bypass [OP_ID_BYPASS ...] | --op_type_filter
                                   [OP_TYPE_FILTER ...] | --op_type_bypass [OP_TYPE_BYPASS ...]]

参数

short long default help
-h --help show this help message and exit
--manual Auto or manual mode, default is auto mode
--input_onnx None Onnx path
--sample_dir None Directory containing sampleN_inputs/outputs.npz
--workspace ./workspace Set a workspace path to save intermediate results and inter-run cache
--clean Clean the workspace
--output_json None Report json path
--fp32_nodes None Set op ids to fp32 precision. Only effective when enable --manual
--finalize Enable this to generate executable for deployment. This will cancel the compile options used to compare statistic op results, and disable op outputs statistic feature. Only effective when enable --manual

Compile Arguments

short long default help
--compile_options {} TopsInference compile options, json format, see TopsInference docs for more info
--resource_mode None TopsInference compile option, overwrite setting in --compile_options
--min_shapes None Min input shapes. Format: --min_shapes NAME:SHAPE. For example: --min_shapes input1:[1,3,224,224] input2:[1,3,224,224]
--max_shapes None Max input shapes. Format: --max_shapes NAME:SHAPE. For example: --max_shapes input1:[1,3,224,224] input2:[1,3,224,224]

Run Arguments

short long default help
--device 0 Device id
--cluster [-1] Cluster ids

Statistic Arguments

short long default help
--mode RAW Statistic mode: CALC mode use topsinference built-in feature to statistic op outputs in fp32 and fp16-mix runs separately , including ratio of value exceed defined threshold (default and usually set as fp16 min max value) and Nan; RAW mode use topsinference to dump op-level raw data and do extending statistics mainly between fp32 and fp16-mix runs, including number of Nan/Inf, numpy allclose, cos similarity, and vector norm (length) relative ratio. Only effective when enable --manual
--rtol 0.01 Relative tolerance
--atol 0.01 Absolute tolerance
--ntol 0 Mismatch number tolerance, eg. ntol=0.01 means 1%% mismatch is allowed. Only effective when enable --manual
--cos_sim 0 Use cosine similarity instead of numerical tolerance. Only effective when enable --manual
--le_value_thres 5.9604645e-08 Statistic value number less or equal than thres. Only effective when enable --manual and --mode=CALC
--ge_value_thres 65504 Statistic value number greater or equal than thres. Only effective when enable --manual and --mode=CALC
--op_id_filter [] Only statistic specific op ids. Conflict with other filter/bypass options. Only effective when enable --manual
--op_id_bypass [] Statistic skip specific op ids. Conflict with other filter/bypass options. Only effective when enable --manual
--op_type_filter [] Only statistic specific op types. Conflict with other filter/bypass options. Only effective when enable --manual
--op_type_bypass [] Statistic skip specific op types. Conflict with other filter/bypass options. Only effective when enable --manual

示例1 自动混精

命令行方式默认使用自动混精方式。

以lprnet-pt-op13-fp32-N.onnx为例,准备好真实样本存放到lprnet-samples文件夹,准备一个空文件夹workspace。

步骤1. 以自动方式运行混精校准功能

topsideas gcu mix_precision --input_onnx lprnet-pt-op13-fp32-N.onnx --sample_dir ./lprnet-samples/ --workspace=./workspace --output_json result.json

自动混精工具查找出设置62号和63号节点后,精度达标 result

步骤2. 产生可用于部署的executable

参考示例2中步骤4.

示例2 手动混精

手动混精需要增加–manual参数

以lprnet-pt-op13-fp32-N.onnx为例,准备好真实样本存放到lprnet-samples文件夹,准备一个空文件夹workspace。

步骤1. 先运行一次混精校准功能,使用CALC模式

topsideas gcu mix_precision --input_onnx lprnet-pt-op13-fp32-N.onnx --manual --sample_dir ./lprnet-samples/ --workspace=./workspace --output_json result.json --mode=CALC

整网fp16 fail,fp32 pass

result

所有节点中,没有nan,有且只有62号出现上溢出 result result

步骤2. 将62号设为fp32,再运行一次

topsideas gcu mix_precision --input_onnx lprnet-pt-op13-fp32-N.onnx --manual --sample_dir ./lprnet-samples/ --workspace=./workspace --output_json result.json --mode=CALC --fp32_nodes 62

整网fp16 fail,fp32 pass

result

所有节点中,没有nan,除了已经设为fp32的62号,63也出现上溢出, result result

步骤3. 将62 63设为fp32

topsideas gcu mix_precision --input_onnx lprnet-pt-op13-fp32-N.onnx --manual --sample_dir ./lprnet-samples/ --workspace=./workspace --output_json result.json --mode=CALC --fp32_nodes 62 63

整网fp16 pass

result

步骤4. 关闭调试模式,产生可用于部署的executable

topsideas gcu mix_precision --input_onnx lprnet-pt-op13-fp32-N.onnx --manual --sample_dir ./lprnet-samples/ --workspace=./workspace --output_json result.json --mode=CALC --fp32_nodes 62 63 --finalize

RAW模式

上例中,CALC模式提供的信息已经足够判断哪些节点需要使用fp32,如果不够,可以用RAW模式,提供更多统计信息,如numpy allclose、cos similarity等

json格式报表

每次执行后,除了控制台输出外,也会输出json格式报表

CALC模式输出json格式:

{
  "op_statistic": [{
    "2": {
      "id": 2,
      "type": "conv_bias",
      "outputs": [
        {
          "fp16_ge_ratio": 0.0,
          "fp16_le_ratio": 0.7616068124771118,
          "fp16_clip_ratio": 0.0,
          "fp16_check_nan": false,
          "fp32_ge_ratio": 0.0,
          "fp32_le_ratio": 0.761599063873291,
          "fp32_clip_ratio": 0.0,
          "fp32_check_nan": false
        }
      ]
    },
    // ......
    "79": {
      "id": 79,
      "type": "div",
      "outputs": [
        {
          "fp16_ge_ratio": 0.0,
          "fp16_le_ratio": 0.0,
          "fp16_clip_ratio": 0.0,
          "fp16_check_nan": true,
          "fp32_ge_ratio": 0.0,
          "fp32_le_ratio": 0.9844771027565002,
          "fp32_clip_ratio": 0.0,
          "fp32_check_nan": false
        }
      ]
    }
  }],
  "output_files": {
    "fp16_executable": "/path/to/workspace/fp16.executable",
    "fp16_hlir": "/path/to/workspace/irdump/tops_ir_dump_xxxxxx.log",
    "fp32_executable": "/path/to/workspace/fp32.executable",
    "fp32_hlir": "/path/to/workspace/irdump/tops_ir_dump_xxxxxx.log"
  },
  "model_outputs": [{
    "output": {
      "fp16_golden": {
        "fp16_element_count": 1224,
        "fp16_nan_count": 0,
        "golden_element_count": 1224,
        "golden_nan_count": 0,
        "numpy_allclose": "\nNot equal to tolerance rtol=0.01, atol=0.01\n\nMismatched elements: 1224 / 1224 (100%)\nMax absolute difference: 16357.057\nMax relative difference: 9755.003\n x: array([-16376., -16376., -11216., ...,  -9968.,  -9544.,  -2160.],\n      dtype=float32)\n y: array([-42.526203, -24.896917, -36.32142 , ...,  -9.142527, -14.362114,\n         1.765432], dtype=float32)",
        "exceed_tol_count": 1224,
        "cos_sim_value": 0.9692484736442566,
        "norm_rel": 0.0026877305936068296,
        "pass": false
      },
      "fp32_golden": {
        "fp32_element_count": 1224,
        "fp32_nan_count": 0,
        "golden_element_count": 1224,
        "golden_nan_count": 0,
        "numpy_allclose": "\nNot equal to tolerance rtol=0.01, atol=0.01\n\nMismatched elements: 1224 / 1224 (100%)\nMax absolute difference: 33129.676\nMax relative difference: 14842.743\n x: array([-19224.795 , -18641.494 , -10029.526 , ...,  -4032.8682,\n        -3562.9648,   2456.1064], dtype=float32)\n y: array([-42.526203, -24.896917, -36.32142 , ...,  -9.142527, -14.362114,\n         1.765432], dtype=float32)",
        "exceed_tol_count": 1224,
        "cos_sim_value": 0.9645612835884094,
        "norm_rel": 0.002206918317824602,
        "pass": false
      },
      "fp16_fp32": {
        "fp16_element_count": 1224,
        "fp16_nan_count": 0,
        "fp32_element_count": 1224,
        "fp32_nan_count": 0,
        "numpy_allclose": "\nNot equal to tolerance rtol=0.01, atol=0.01\n\nMismatched elements: 1219 / 1224 (99.6%)\nMax absolute difference: 16795.21\nMax relative difference: 34.36054\n x: array([-16376., -16376., -11216., ...,  -9968.,  -9544.,  -2160.],\n      dtype=float32)\n y: array([-19224.795 , -18641.494 , -10029.526 , ...,  -4032.8682,\n        -3562.9648,   2456.1064], dtype=float32)",
        "exceed_tol_count": 1219,
        "cos_sim_value": 0.9701548218727112,
        "norm_rel": 1.2178659439086914,
        "pass": false
      }
    }
  }]
}

RAW模式输出json格式:

{
  "op_statistic": [{
    "2": {
      "id": 2,
      "type": "conv_bias",
      "outputs": [
        {
          "fp16_element_count": 802816,
          "fp16_nan_count": 0,
          "fp32_element_count": 802816,
          "fp32_nan_count": 0,
          "numpy_allclose": "pass",
          "exceed_tol_count": 0,
          "cos_sim_value": 0.9999958276748657,
          "norm_rel": 0.9999895095825195
        }
      ]
    },
    // ......
    "127": {
      "id": 127,
      "type": "dot_general_bias",
      "outputs": [
        {
          "fp16_element_count": 1000,
          "fp16_nan_count": 0,
          "fp32_element_count": 1000,
          "fp32_nan_count": 0,
          "numpy_allclose": "pass",
          "exceed_tol_count": 0,
          "cos_sim_value": 0.9999998211860657,
          "norm_rel": 1.0006778240203857
        }
      ]
    }
  }],
  "output_files": {
    "fp16_executable": "/path/to/workspace/fp16.executable",
    "fp16_hlir": "/path/to/workspace/irdump/tops_ir_dump_xxxxxx.log",
    "fp32_executable": "/path/to/workspace/fp32.executable",
    "fp32_hlir": "/path/to/workspace/irdump/tops_ir_dump_xxxxxx.log"
  },
  "model_outputs": [{
    "output": {
      "fp16_golden": {
        "fp16_element_count": 1224,
        "fp16_nan_count": 0,
        "golden_element_count": 1224,
        "golden_nan_count": 0,
        "numpy_allclose": "\nNot equal to tolerance rtol=0.01, atol=0.01\n\nMismatched elements: 1224 / 1224 (100%)\nMax absolute difference: 16357.057\nMax relative difference: 9755.003\n x: array([-16376., -16376., -11216., ...,  -9968.,  -9544.,  -2160.],\n      dtype=float32)\n y: array([-42.526203, -24.896917, -36.32142 , ...,  -9.142527, -14.362114,\n         1.765432], dtype=float32)",
        "exceed_tol_count": 1224,
        "cos_sim_value": 0.9692484736442566,
        "norm_rel": 0.0026877305936068296,
        "pass": false
      },
      "fp32_golden": {
        "fp32_element_count": 1224,
        "fp32_nan_count": 0,
        "golden_element_count": 1224,
        "golden_nan_count": 0,
        "numpy_allclose": "\nNot equal to tolerance rtol=0.01, atol=0.01\n\nMismatched elements: 1224 / 1224 (100%)\nMax absolute difference: 33129.676\nMax relative difference: 14842.743\n x: array([-19224.795 , -18641.494 , -10029.526 , ...,  -4032.8682,\n        -3562.9648,   2456.1064], dtype=float32)\n y: array([-42.526203, -24.896917, -36.32142 , ...,  -9.142527, -14.362114,\n         1.765432], dtype=float32)",
        "exceed_tol_count": 1224,
        "cos_sim_value": 0.9645612835884094,
        "norm_rel": 0.002206918317824602,
        "pass": false
      },
      "fp16_fp32": {
        "fp16_element_count": 1224,
        "fp16_nan_count": 0,
        "fp32_element_count": 1224,
        "fp32_nan_count": 0,
        "numpy_allclose": "\nNot equal to tolerance rtol=0.01, atol=0.01\n\nMismatched elements: 1219 / 1224 (99.6%)\nMax absolute difference: 16795.21\nMax relative difference: 34.36054\n x: array([-16376., -16376., -11216., ...,  -9968.,  -9544.,  -2160.],\n      dtype=float32)\n y: array([-19224.795 , -18641.494 , -10029.526 , ...,  -4032.8682,\n        -3562.9648,   2456.1064], dtype=float32)",
        "exceed_tol_count": 1219,
        "cos_sim_value": 0.9701548218727112,
        "norm_rel": 1.2178659439086914,
        "pass": false
      }
    }
  }]
}

API

AutoMixPrecision

AutoMixPrecision是自动混精工具入口。

参数

argument type default
input_onnx onnx.onnx_ml_pb2.ModelProto
input_names List[str]
input_shapes List[List[int]]
min_shapes List[List[int]] None
max_shapes List[List[int]] None

构造示例

import onnx
from topsideas.gcu.mix_precision.auto_mix_precision import AutoMixPrecision
model = onnx.load('onnx_path')
tool = AutoMixPrecision(model, ['input_name'], [[shape_0,shape_1,shape_2,shape_3]])

AutoMixPrecision.run

参数

argument type default
sample_inputs List[Any]
sample_outputs List[Any]
block_op_ids List[int] None
rtol float 1e-2
atol float 1e-2
val_func Callable[[Any, Any], bool] None
card_id int 0
cluster_id int -1
resource_mode str None
RETURN List[int]

说明

  • sample_inputs: onnx模型输入样本,可以是多个样本

  • sample_outputs: 对应每个输入样本的输出

  • block_op_ids:哪些node必须使用fp32,默认不包含任何node

  • rtol/atol:相对/绝对容忍度,如用户指定val_func,该项不起作用,否则用rtol和atol判断输出是否可接受。

  • val_func:可以是用户自定义评价函数,输入1:模型参考输出结果,输入2:当前混精输出结果,返回值为bool值,True表示结果可接受,False表示结果不可接受。函数体中可以对两个输入做任何后处理,用户可自行定义。

  • card_id:gcu card id。

  • cluster_id:gcu card cluster id。

  • resource_mode:TopsInference 编译选项,例如1c4s, 1c12s等。

  • 返回值:fp32 id list,包含为满足混精精度需要设置为fp32的节点id。

使用示例

···

def post(res):
    return res.astype(np.int32)

def val(cpu_out, gcu_out):
  Tp = 0
  Tn_1 = 0
  Tn_2 = 0
    for i, label in enumerate(gcu_out):
        if len(label) != len(cpu_out[i]):
            Tn_1 += 1
            continue
        if (np.asarray(gcu_out[i]) == np.asarray(label)).all():
            Tp += 1
        else:
            Tn_2 += 1
    Acc = Tp * 1.0 / (Tp + Tn_1 + Tn_2)
    if Acc >= 0.9:
        return True
    else:
        return False

def val_func(res0, res1):
    cpu_out = post(res0[0])
    gcu_out = post(res1[0])
    return val(cpu_out, gcu_out)

tool = AutoMixPrecision(model, ['input_name'], [[shape_0,shape_1,shape_2,shape_3]])

tool.run(sample_inputs, sample_outputs, val_func=val_func)
...

AutoMixPrecision.finalize

参数

argument type default
fp32_ids List[int]
card_id int 0
cluster_id int -1
resource_mode str None
RETURN TopsInference.PyEngine

使用示例

···

def post(res):
    return res.astype(np.int32)

def val(cpu_out, gcu_out):
  Tp = 0
  Tn_1 = 0
  Tn_2 = 0
    for i, label in enumerate(gcu_out):
        if len(label) != len(cpu_out[i]):
            Tn_1 += 1
            continue
        if (np.asarray(gcu_out[i]) == np.asarray(label)).all():
            Tp += 1
        else:
            Tn_2 += 1
    Acc = Tp * 1.0 / (Tp + Tn_1 + Tn_2)
    if Acc >= 0.9:
        return True
    else:
        return False

def val_func(res0, res1):
    cpu_out = post(res0[0])
    gcu_out = post(res1[0])
    return val(cpu_out, gcu_out)

tool = AutoMixPrecision(model, ['input_name'], [[shape_0,shape_1,shape_2,shape_3]])

fp32_ids = tool.run(sample_inputs, sample_outputs, val_func=val_func)

engine = tools.finalize(fp32_ids)