Appendix: Data Preparation

4.1. dagm2007 Preprocessed Dataset Preparation Guide

Step

  1. Create an account on ‘https://hci.iwr.uni-heidelberg.de/node/3616’

  2. Download the Class1.zip file

  3. unzip -d <dagm2007/private/> <Class1.zip>, the directory structure looks like:

    dagm2007
        └── private
                └── Class1
                    |── Test
                    |   └── Label
                    |   |   |── 0002_label.PNG
                    |   |   |── ...
                    |   |   |── 0568_label.PNG
                    |   |   └── Labels.txt
                    |   |── 0001.PNG
                    |   |── ...
                    |   └── 0575.PNG
                    |── ...
                    └── test_list.csv
    
  4. Generate test_list.csv: python3 preprocess_dagm2007.py --data_dir=dagm2007/private/

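4.2. dagm2007 Dataset Preparation Guide

Step

  1. Register an account at https://hci.iwr.uni-heidelberg.de/node/3616

  2. After registering, you will receive a download link; download Class1.zip (the Class 1 data)

  3. Unzip it into the dagm2007/private/ folder: unzip -d <dagm2007/private/> <Class1.zip>. The resulting directory structure is as follows: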

    dagm2007
        └── private
                └── Class1
                    |── Test
                    |   └── Label
                    |   |   |── 0002_label.PNG
                    |   |   |── ...
                    |   |   |── 0568_label.PNG
                    |   |   └── Labels.txt
                    |   |── 0001.PNG
                    |   |── ...
                    |   └── 0575.PNG
                    |── ...
                    └── test_list.csv
    
  4. Run the following command to generate test_list.csv: python3 preprocess_dagm2007.py --data_dir=dagm2007/private/

4.3. OCR Recognition LMDB Dataset Preparation Guide

Step

  • Download deep-text-recognition-benchmark dataset

  • Unzip the dataset

  • Extract evaluation data

Run commands as below

wget "https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0" -O data.zip
mkdir ./tmp
unzip -d ./tmp data.zip
unzip ./tmp/evaluation.zip
mv ./evaluation ./data
rm -rf ./tmp

Processed Dataset Structure

data/
  ├── CUTE80
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC03_860
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC03_867
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC13_857
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC13_1015
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC15_1811
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC15_2077
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IIIT5k_3000
  │     ├── data.mdb
  │     └── lock.mdb
  ├── SVT
  │     ├── data.mdb
  │     └── lock.mdb
  └── SVTP
        ├── data.mdb
        └── lock.mdb
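
After extraction it can help to confirm that every dataset folder holds both LMDB files shown in the tree above. The following is a minimal sketch, not part of the original tooling; the function name `incomplete_lmdb_dirs` is hypothetical, and it only checks that the files exist, not that the databases are readable.

```python
from pathlib import Path


def incomplete_lmdb_dirs(data_root):
    """Return the names of subfolders that lack data.mdb or lock.mdb."""
    incomplete = []
    subdirs = sorted(p for p in Path(data_root).iterdir() if p.is_dir())
    for subdir in subdirs:
        has_data = (subdir / "data.mdb").is_file()
        has_lock = (subdir / "lock.mdb").is_file()
        if not (has_data and has_lock):
            incomplete.append(subdir.name)
    return incomplete
```

Calling `incomplete_lmdb_dirs("./data")` returns an empty list when every folder such as CUTE80 or SVT contains both files.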

4.4. ST-GCN Preprocessed Dataset Preparation Guide

Step

ST-GCN uses the authors' preprocessed Kinetics and NTU datasets, which can be downloaded directly.

  1. download from https://drive.google.com/open?id=103NOL9YYZSW1hLoWmYnv5Fs8mK-Ij7qb

  2. unzip <path to st-gcn-processed-data.zip>

Processed Dataset Structure

data
|-- Kinetics
|   `-- kinetics-skeleton
|       |-- train_data.npy
|       |-- train_label.pkl
|       |-- val_data.npy
|       `-- val_label.pkl
`-- NTU-RGB-D
    |-- xsub
    |   |-- train_data.npy
    |   |-- train_label.pkl
    |   |-- val_data.npy
    |   `-- val_label.pkl
    `-- xview
        |-- train_data.npy
        |-- train_label.pkl
        |-- val_data.npy
        `-- val_label.pkl

4.5. PASCAL VOC 2012 Preparation Guide

Step

  1. Download archive.zip. Put it under a directory.

  2. Run commands below

pip3 install -r requirements.txt
python3 convert_voc2012.py --input_path=<path/to/the/directory/containing/the/file> --output_path=<path/to/data>

Processed Dataset Structure

data/
└── VOC2012

4.6. PASCAL VOC 2012 Dataset Preparation Guide

Step

  1. Download archive.zip and put it in a folder.

  2. Run the commands below

pip3 install -r requirements.txt
python3 convert_voc2012.py --input_path=<the folder you just created> --output_path=<target path>

Processed Dataset Structure

data/
└── VOC2012

4.7. sst2 Preprocessed Dataset Preparation Guide

Step

  1. Open the sst2 test data online

  2. Copy the text into a local file named sst2_test.tsv; the directory structure looks like:

   data/sst2
       └── sst2_test.tsv

4.8. sst2 Dataset Preparation

Step

  1. Open the sst2 test data

  2. Copy the text content of the file and save it to sst2_test.tsv; the directory structure is as follows:

    data/sst2
        └── sst2_test.tsv
    

4.9. China-people-daily-ner-corpus Dataset Preparation Guide

Step

  1. Download china-people-daily-ner-corpus dataset

$ cd ${project_root}
$ mkdir data
$ cd data
$ wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
$ tar -zxvf china-people-daily-ner-corpus.tar.gz
  2. Download vocabulary

The vocabulary file, vocab.txt, is also needed. Download it from the Hugging Face page (https://huggingface.co/bert-base-chinese/blob/main/vocab.txt) by clicking the down arrow next to vocab.txt, then move vocab.txt to ./model/bert_crf/ (create the folders if they do not exist).

Processed Data Structure

After processing, the project directory contains the following files:

data
├── china-people-daily-ner-corpus
│   ├── example.dev
│   ├── example.test
│   └── example.train
model
└── bert_crf
    └── vocab.txt
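
To inspect the downloaded corpus, the files can be read sentence by sentence. The sketch below assumes a CoNLL-style layout in example.train, example.dev, and example.test: one token and its tag per line, separated by whitespace, with a blank line between sentences. The guide does not state this layout, so treat it as an assumption; `read_tagged_sentences` is a hypothetical helper name.

```python
def read_tagged_sentences(lines):
    """Group 'token tag' lines into (tokens, tags) pairs, one per sentence."""
    sentences = []
    tokens, tags = [], []
    for raw in lines:
        parts = raw.split()
        if not parts:
            # A blank line closes the current sentence (assumed layout).
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        if len(parts) < 2:
            continue  # skip malformed lines that carry no tag
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        sentences.append((tokens, tags))
    return sentences
```

For example, `read_tagged_sentences(open("data/china-people-daily-ner-corpus/example.dev", encoding="utf-8"))` would return one (tokens, tags) pair per sentence if the assumption about the layout holds.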

China People's Daily NER Corpus Dataset Preparation Guide

Step

  1. Download the China People's Daily NER corpus dataset

$ cd ${project_root}
$ mkdir data
$ cd data
$ wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
$ tar -zxvf china-people-daily-ner-corpus.tar.gz
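  2. Download the vocabulary file

The vocabulary file required by the model comes from this Hugging Face page (https://huggingface.co/bert-base-chinese/tree/main). Find vocab.txt, click the down arrow to its right to download it, then move the downloaded vocab.txt to the ./model/bert_crf folder (create the folder if it does not exist).

Processed Directory Structure

The directory structure after processing is as follows: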

data
├── china-people-daily-ner-corpus
│   ├── example.dev
│   ├── example.test
│   └── example.train
model
└── bert_crf
    └── vocab.txt

4.11. AN4 Dataset Preparation Guide

Step

  1. Download an4. Put it under a directory.

  2. Run commands below

pip3 install -r requirements.txt
python3 sph2wav.py --dir_path=<path/to/the/directory/containing/an4_sphere.tar.gz>
python3 build_mainfest.py --dataset_path=<path/to/the/directory/containing/an4test_clstk> --dir_path=<path/to/the/directory/to/output/test_manifest.json>

As far as we know, the an4 download link is currently inaccessible.

Processed Data Structure

data/
└── an4
     ├── etc
     |     ├──an4_test.transcription
     |     └── ...
     ├── wav
     |    ├── an4_clstk
     |    └── an4test_clstk
     └──  test_manifest.json

4.12. Market 1501 Dataset Preparation Guide

Step

  1. Download the dataset from Baidu netdisk or Google Drive.

  2. Generate dataset with the following command

python3 convert_market1501.py --dataset <path/to/the/compressed/dataset> --output <path/to/output/directory>

Processed Dataset Structure

data/
├── query
├── gt_query
├── gt_bbox
├── bounding_box_train
└── bounding_box_test

4.13. Market 1501 Dataset Preparation Guide

Step

  1. Download the dataset from Baidu netdisk or Google Drive.

  2. Run the command below

python3 convert_market1501.py --dataset <path/to/the/compressed/dataset> --output <path/to/output/directory>

Processed Dataset Structure

data/
├── query
├── gt_query
├── gt_bbox
├── bounding_box_train
└── bounding_box_test

4.14. IIIT5K Preprocessed Dataset Preparation Guide

Step

The IIIT 5K-word dataset is harvested from Google image search. Query words like billboards, signboard, house numbers, house name plates, movie posters were used to collect images. The dataset contains 5000 cropped word images from Scene Texts and born-digital images. The dataset is divided into train and test parts. This dataset can be used for large lexicon cropped word recognition. We also provide a lexicon of more than 0.5 million dictionary words with this dataset.

  1. download dataset from http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K-Word_V3.0.tar.gz

  2. tar xf <path to IIIT5K-Word_V3.0.tar.gz>

  3. download label from https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt

  4. put test_label.txt into <IIIT5K folder>

Processed Dataset Structure

IIIT5K
├── lexicon.txt
├── README
├── test
├── testCharBound.mat
├── testdata.mat
├── test_label.txt
├── train
├── trainCharBound.mat
└── traindata.mat

4.15. People daily ner corpus subset Preparation Guide

Step

The mentioned dataset can be downloaded directly:

  1. download from https://github.com/TodoListIOS/NER-PyTorch/archive/refs/heads/master.zip

  2. unzip <path to NER-PyTorch-master.zip>

  3. the dataset can be located in NER-PyTorch-master/data/ren_min_newspaper

You can also get the dataset by git clone:

  1. git clone https://github.com/TodoListIOS/NER-PyTorch.git

  2. cd NER-PyTorch/data/ren_min_newspaper

The dev and train files under ren_min_newspaper are already processed and can be used directly for testing and training, respectively.

Processed Dataset Structure

data/
├── dev
└── train

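People's Daily NER Corpus Subset Preparation Guide

Step

The dataset above can be downloaded directly:

  1. Download from GitHub: https://github.com/TodoListIOS/NER-PyTorch/archive/refs/heads/master.zip

  2. Unzip <path to NER-PyTorch-master.zip>

  3. The dataset is located in NER-PyTorch-master/data/ren_min_newspaper

You can also get the dataset with git clone:

  1. git clone https://github.com/TodoListIOS/NER-PyTorch.git

  2. cd NER-PyTorch/data/ren_min_newspaper

The dev and train files in the ren_min_newspaper folder are already processed and can be used directly for testing and training, respectively.

Processed Dataset Structure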

data/
├── dev
└── train

4.17. voxceleb1 voxceleb2 dataset preparation guide

step

  1. Download the Dev files and Test files of voxceleb1 from the official website. Note that you need to apply for a download account yourself. If the official website download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.

  2. Download all files and concatenate into zip file

    cat vox1_dev* > vox1_dev_wav.zip
    
  3. Download the Dev files and Test files of voxceleb2 from the official website. Note that you need to apply for a download account yourself. If the official website download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.

  4. Download all files and concatenate into zip file

    cat vox2_dev_aac* > vox2_aac.zip
    
  5. After the download is complete, there will be four zip files vox2_test_mp4.zip vox2_aac.zip vox1_test_wav.zip vox1_dev_wav.zip.

  6. Execute the following command

    mkdir -p vox1_2/wav
    unzip vox1_dev_wav.zip -d vox1
    unzip vox1_test_wav.zip -d vox1
    
    unzip vox2_aac.zip -d vox2
    
    cp -r vox2/dev/aac/id*  vox1_2/wav
    cp -r vox1/wav/id* vox1_2/wav
    
    cp convert.sh vox1_2
    cd vox1_2
    bash ./convert.sh
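
On systems without `cat`, the concatenation in steps 2 and 4 can be done in Python. This is a minimal sketch assuming the downloaded parts sort into the correct order by file name, as the shell glob does; `concatenate_parts` is a hypothetical helper, not part of the provided scripts.

```python
import shutil
from pathlib import Path


def concatenate_parts(directory, prefix, output_name):
    """Join all files whose names start with prefix into output_name.

    Parts are appended in sorted name order. The output file itself is
    skipped so that a rerun does not read its own earlier result.
    """
    directory = Path(directory)
    output = directory / output_name
    parts = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p != output
    )
    with open(output, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    return [p.name for p in parts]
```

For instance, `concatenate_parts(".", "vox1_dev", "vox1_dev_wav.zip")` mirrors `cat vox1_dev* > vox1_dev_wav.zip`, and the returned list shows which parts were joined and in what order.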
    

processed data structure

vox1_2
├── convert.sh
└── wav
   ├── id00012
      ....
   └── id11251

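4.18. voxceleb1 voxceleb2 Dataset Preparation Guide

Step

  1. Download the Dev files and Test files of voxceleb1 from the official website. Note that you need to apply for a download account yourself. If the official download link no longer works, you can use https://mm.kaist.ac.kr/datasets/voxceleb/

  2. Download all files and concatenate them into a zip file

    cat vox1_dev* > vox1_dev_wav.zip
    
  3. Download the Dev files and Test files of voxceleb2 from the official website. Note that you need to apply for a download account yourself. If the official download link no longer works, you can use https://mm.kaist.ac.kr/datasets/voxceleb/

  4. Download all files and concatenate them into a zip file

    cat vox2_dev_aac* > vox2_aac.zip
    
  5. After the download is complete, there will be four zip files: vox2_test_mp4.zip, vox2_aac.zip, vox1_test_wav.zip, and vox1_dev_wav.zip.

  6. Execute the following commands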

    mkdir -p vox1_2/wav
    unzip vox1_dev_wav.zip -d vox1
    unzip vox1_test_wav.zip -d vox1
    
    unzip vox2_aac.zip -d vox2
    
    cp -r vox2/dev/aac/id*  vox1_2/wav
    cp -r vox1/wav/id* vox1_2/wav
    
    cp convert.sh vox1_2
    cd vox1_2
    bash ./convert.sh
    

Processed Data Structure

vox1_2
├── convert.sh
└── wav
   ├── id00012
      ....
   └── id11251

4.19. Deep Fake Detection Challenge Dataset Preparation Guide

Step

  1. Apply for admission to Deep Fake Detection Challenge Dataset and download it.

  2. Generate retinaface onnx from this codebase with its script.

  3. Run commands below

python3 convert_dfdc.py --retinaface <path/to/retinaface/onnx> --dfdc_root <path/to/uncompressed/dfdc/dataset/root> --output <output/directory>

Processed Dataset Structure

data/dfdc
        └── dataset.pkl

4.20. Deep Fake Detection Challenge Dataset Preparation Guide

Step

  1. Apply for access to the Deep Fake Detection Challenge Dataset and download it.

  2. Generate the retinaface onnx using the script from this codebase.

  3. Run the command below

python3 convert_dfdc.py --retinaface <path/to/retinaface/onnx> --dfdc_root <path/to/uncompressed/dfdc/dataset/root> --output <output/directory>

Processed Dataset Structure

data/dfdc
        └── dataset.pkl

4.21. kitti Dataset Preparation Guide

Step

Download the KITTI dataset from Kaggle or the official KITTI website, then uncompress it. The directory structure looks like:

    data/
        └── kitti/
            ├──ImageSets/
            │   ├── test.txt
            │   ├── train.txt
            │   └── val.txt
            ├── training/
            │   ├── image_2/
            │   ├── calib/
            │   ├── label_2/
            │   └── velodyne/
            ├── testing/
            │   ├── image_2/
            │   ├── calib/
            │   └── velodyne/
            └── classes_names.txt
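
The split files under ImageSets/ can be turned into per-sample file paths. The sketch below assumes that each split file lists one sample id per line, that train and val ids refer to files under training/ while test ids refer to testing/, and that the files use the usual KITTI extensions (.png images, .bin point clouds, .txt calibration and labels). None of this is stated in the guide; `kitti_sample_paths` is a hypothetical helper.

```python
from pathlib import Path


def kitti_sample_paths(kitti_root, split="val"):
    """Map every id in ImageSets/<split>.txt to its expected data files."""
    root = Path(kitti_root)
    sample_ids = (root / "ImageSets" / f"{split}.txt").read_text().split()
    # Assumption: only the test split lives under testing/.
    subset = root / ("testing" if split == "test" else "training")
    samples = []
    for sample_id in sample_ids:
        entry = {
            "image": subset / "image_2" / f"{sample_id}.png",
            "calib": subset / "calib" / f"{sample_id}.txt",
            "velodyne": subset / "velodyne" / f"{sample_id}.bin",
        }
        if subset.name == "training":
            # testing/ has no label_2 folder in the tree above.
            entry["label"] = subset / "label_2" / f"{sample_id}.txt"
        samples.append(entry)
    return samples
```

Checking `path.exists()` on the returned entries is a quick way to spot an incomplete extraction.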

4.22. CITYSCAPES Dataset Preparation Guide

Step

  1. First, register an account on this webpage, then activate the account and log in

  2. Download the dataset and annotations from this page: the annotations are in gtFine_trainvaltest.zip and the dataset is in leftImg8bit_trainvaltest.zip

  3. Unzip the two files downloaded in step 2; the directory structure looks like:

cityscapes/
├── gtFine
│   ├── test
│   ├── train
│   └── val
└── leftImg8bit
    ├── test
    ├── train
    └── val
  4. By convention, **labelTrainIds.png files are used for Cityscapes training. Open-mmlab provides a script based on cityscapesscripts to generate **labelTrainIds.png; you can refer to this tutorial to generate the Cityscapes labels. Note that you may need to install mmcv and mmsegmentation before you can generate these labels; see this install guide.

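4.23. CITYSCAPES Dataset Preparation

Step

  1. First, register an account on the Cityscapes official website, then activate the account and log in

  2. On that webpage, click to download the dataset and annotations; the annotation file is gtFine_trainvaltest.zip and the dataset file is leftImg8bit_trainvaltest.zip

  3. Unzip the two files downloaded in step 2 so that the dataset directory structure is as follows: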

    cityscapes/
    ├── gtFine
    │   ├── test
    │   ├── train
    │   └── val
    └── leftImg8bit
        ├── test
        ├── train
        └── val
    
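  4. By convention, **labelTrainIds.png files are used for Cityscapes training. open-mmlab provides a script based on cityscapesscripts to generate **labelTrainIds.png. You can refer to this page to generate suitable Cityscapes labels. Note that you may need to install mmcv and mmsegmentation before you can generate these labels; see this installation guide.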

4.24. ICDAR 2015 Dataset Preparation Guide

Step

  1. Download the ICDAR 2015 test set (registration is required). After registering and logging in, download “Test Set Images” and “Test Set Ground Truth” from the section “Task 4.1: Text Localization (2015 edition)”. Save the Test Set Images download in the folder ch4_test_images and the Test Set Ground Truth download in the folder ch4_test_localization_transcription_gt.

  2. Decompress the test set as follows:

cd path/to/ch4_test_images
unzip ch4_test_images.zip
cd path/to/ch4_test_localization_transcription_gt
unzip ch4_test_localization_transcription_gt.zip
  3. Download the PaddleOCR format annotation file. Put it in the same folder as ch4_test_images.

  4. Download https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json. Put it in the same folder as ch4_test_images, then run python3 modify_directory.py.

Processed Dataset Structure

data/
├── ch4_test_images
│        ├── img_1.jpg
│        ├── img_2.jpg
|        └── ……
├── ch4_test_localization_transcription_gt
│        ├── gt_img_1.txt
│        ├── gt_img_2.txt
|        └── ……
├── instances_test.json
└── test_icdar2015_label.txt
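
The ground-truth files such as gt_img_1.txt can be read line by line. The sketch below assumes the commonly used ICDAR 2015 layout, which this guide does not describe: eight comma-separated integer coordinates (four corner points) followed by the transcription, with "###" marking regions to ignore and an optional byte-order mark at the start of the file. `parse_icdar2015_gt_line` is a hypothetical helper.

```python
def parse_icdar2015_gt_line(line):
    """Split one ground-truth line into (points, text, ignore)."""
    # Strip an optional BOM and the line ending, then split off the
    # first eight fields only, because the transcription may contain commas.
    parts = line.lstrip("\ufeff").rstrip("\r\n").split(",", 8)
    if len(parts) < 9:
        raise ValueError(f"expected 9 comma-separated fields: {line!r}")
    coords = [int(value) for value in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    text = parts[8]
    return points, text, text == "###"
```

Each returned `points` list holds four (x, y) tuples in the order they appear in the file.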

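4.25. ICDAR 2015 Dataset Preparation Guide

Step

  1. Download the ICDAR 2015 test set (registration is required). After registering and logging in, download “Test Set Images” and “Test Set Ground Truth” from “Task 4.1: Text Localization (2015 edition)”. Save the Test Set Images download in the folder ch4_test_images and the Test Set Ground Truth download in the folder ch4_test_localization_transcription_gt.

  2. Unzip the downloaded archives: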

cd path/to/ch4_test_images
unzip ch4_test_images.zip
cd path/to/ch4_test_localization_transcription_gt
unzip ch4_test_localization_transcription_gt.zip
  3. Download the label file and put it in the same directory as ch4_test_images.

  4. Download https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json, put it in the same directory as ch4_test_images, and run python3 modify_directory.py.

Processed Dataset Structure

data/
├── ch4_test_images
│        ├── img_1.jpg
│        ├── img_2.jpg
|        └── ……
├── ch4_test_localization_transcription_gt
│        ├── gt_img_1.txt
│        ├── gt_img_2.txt
|        └── ……
├── instances_test.json
└── test_icdar2015_label.txt

4.26. brats2019 Preprocessed Dataset Preparation Guide

Step

  1. Download archive.zip. Put it under a directory.

  2. unzip -d <data/> <archive.zip>, the directory structure looks like:

    data/
        └── MICCAI_BraTS_2019_Data_Training
            |-- HGG
            |   |-- BraTS19_2013_10_1
            |   |   |-- BraTS19_2013_10_1_flair.nii
            |   |   |-- BraTS19_2013_10_1_seg.nii
            |   |   |-- BraTS19_2013_10_1_t1.nii
            |   |   |-- BraTS19_2013_10_1_t1ce.nii
            |   |   └── BraTS19_2013_10_1_t2.nii
            |   |-- BraTS19_2013_11_1
            |   └── ...
            |-- LGG
            |   |-- BraTS19_2013_0_1
            |   |   |-- BraTS19_2013_0_1_flair.nii
            |   |   |-- BraTS19_2013_0_1_seg.nii
            |   |   |-- BraTS19_2013_0_1_t1.nii
            |   |   |-- BraTS19_2013_0_1_t1ce.nii
            |   |   └── BraTS19_2013_0_1_t2.nii
            |   |-- BraTS19_2013_15_1
            |   └── ...
            |── name_mapping.csv
            └── survival_data.csv
    
  3. Compress the .nii files into .nii.gz with gzip -r data/MICCAI_BraTS_2019_Data_Training/HGG/* and gzip -r data/MICCAI_BraTS_2019_Data_Training/LGG/*; the directory structure looks like:

    data/
        └── MICCAI_BraTS_2019_Data_Training
            |-- HGG
            |   |-- BraTS19_2013_10_1
            |   |   |-- BraTS19_2013_10_1_flair.nii.gz
            |   |   |-- BraTS19_2013_10_1_seg.nii.gz
            |   |   |-- BraTS19_2013_10_1_t1.nii.gz
            |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
            |   |   └── BraTS19_2013_10_1_t2.nii.gz
            |   |-- BraTS19_2013_11_1
            |   └── ...
            |-- LGG
            |   |-- BraTS19_2013_0_1
            |   |   |-- BraTS19_2013_0_1_flair.nii.gz
            |   |   |-- BraTS19_2013_0_1_seg.nii.gz
            |   |   |-- BraTS19_2013_0_1_t1.nii.gz
            |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
            |   |   └── BraTS19_2013_0_1_t2.nii.gz
            |   |-- BraTS19_2013_15_1
            |   └── ...
            |── name_mapping.csv
            └── survival_data.csv
    
  4. Download the model, which is needed for later data preprocessing, and unzip -d <data/> <fold_1.zip>; the directory structure looks like:

    data/
        |-- MICCAI_BraTS_2019_Data_Training
        |   |-- HGG
        |   |   |-- BraTS19_2013_10_1
        |   |   |   |-- BraTS19_2013_10_1_flair.nii.gz
        |   |   |   |-- BraTS19_2013_10_1_seg.nii.gz
        |   |   |   |-- BraTS19_2013_10_1_t1.nii.gz
        |   |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
        |   |   |   └── BraTS19_2013_10_1_t2.nii.gz
        |   |   |-- BraTS19_2013_11_1
        |   |   └── ...
        |   |-- LGG
        |   |   |-- BraTS19_2013_0_1
        |   |   |   |-- BraTS19_2013_0_1_flair.nii.gz
        |   |   |   |-- BraTS19_2013_0_1_seg.nii.gz
        |   |   |   |-- BraTS19_2013_0_1_t1.nii.gz
        |   |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
        |   |   |   └── BraTS19_2013_0_1_t2.nii.gz
        |   |   |-- BraTS19_2013_15_1
        |   |   └── ...
        |   |── name_mapping.csv
        |   └── survival_data.csv
        |-- nnUNet
        |   └── 3d_fullres
        |       └── Task043_BraTS2019
        |           └── nnUNetTrainerV2__nnUNetPlansv2.mlperf.1
        |               |-- fold_1
        |               |   |-- debug.json
        |               |   |-- model_best.model
        |               |   |-- model_best.model.pkl
        |               |   |-- model_final_checkpoint.model
        |               |   |-- model_final_checkpoint.model.pkl
        |               |   |-- postprocessing.json
        |               |   |-- progress.png
        |               |   |-- training_log_2020_5_25_19_07_42.txt
        |               |   |-- training_log_2020_6_15_14_50_42.txt
        |               |   └── training_log_2020_6_8_08_12_03.txt
        |               └── plans.pkl
        └── joblog.log
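
Where the `gzip` command is unavailable, the compression in step 3 can be approximated in Python. This is a minimal sketch rather than the guide's own method: unlike `gzip -r`, it touches only files ending in .nii, and like `gzip` it removes each original after writing the .nii.gz copy. `gzip_nii_files` is a hypothetical helper name.

```python
import gzip
import shutil
from pathlib import Path


def gzip_nii_files(root):
    """Compress every .nii file under root to .nii.gz and delete the original."""
    compressed = []
    # Materialize the listing first so that new .gz files do not disturb it.
    for src in sorted(Path(root).rglob("*.nii")):
        if not src.is_file():
            continue
        dst = src.with_name(src.name + ".gz")
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        src.unlink()
        compressed.append(dst)
    return compressed
```

Calling `gzip_nii_files("data/MICCAI_BraTS_2019_Data_Training")` covers both the HGG and LGG folders in one pass and leaves the .csv files untouched.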
    

4.27. brats2019 Dataset Preparation Guide

Step

  1. Download archive.zip and put it in a folder.

  2. Unzip it into the data/ folder: unzip -d <data/> <archive.zip>. The directory structure is as follows:

    data/
        └── MICCAI_BraTS_2019_Data_Training
            |-- HGG
            |   |-- BraTS19_2013_10_1
            |   |   |-- BraTS19_2013_10_1_flair.nii
            |   |   |-- BraTS19_2013_10_1_seg.nii
            |   |   |-- BraTS19_2013_10_1_t1.nii
            |   |   |-- BraTS19_2013_10_1_t1ce.nii
            |   |   └── BraTS19_2013_10_1_t2.nii
            |   |-- BraTS19_2013_11_1
            |   └── ...
            |-- LGG
            |   |-- BraTS19_2013_0_1
            |   |   |-- BraTS19_2013_0_1_flair.nii
            |   |   |-- BraTS19_2013_0_1_seg.nii
            |   |   |-- BraTS19_2013_0_1_t1.nii
            |   |   |-- BraTS19_2013_0_1_t1ce.nii
            |   |   └── BraTS19_2013_0_1_t2.nii
            |   |-- BraTS19_2013_15_1
            |   └── ...
            |── name_mapping.csv
            └── survival_data.csv
    
  3. Compress all .nii files in the data/MICCAI_BraTS_2019_Data_Training folder into .nii.gz format: gzip -r data/MICCAI_BraTS_2019_Data_Training/HGG/*, gzip -r data/MICCAI_BraTS_2019_Data_Training/LGG/*. The directory structure is as follows:

    data/
        └── MICCAI_BraTS_2019_Data_Training
            |-- HGG
            |   |-- BraTS19_2013_10_1
            |   |   |-- BraTS19_2013_10_1_flair.nii.gz
            |   |   |-- BraTS19_2013_10_1_seg.nii.gz
            |   |   |-- BraTS19_2013_10_1_t1.nii.gz
            |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
            |   |   └── BraTS19_2013_10_1_t2.nii.gz
            |   |-- BraTS19_2013_11_1
            |   └── ...
            |-- LGG
            |   |-- BraTS19_2013_0_1
            |   |   |-- BraTS19_2013_0_1_flair.nii.gz
            |   |   |-- BraTS19_2013_0_1_seg.nii.gz
            |   |   |-- BraTS19_2013_0_1_t1.nii.gz
            |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
            |   |   └── BraTS19_2013_0_1_t2.nii.gz
            |   |-- BraTS19_2013_15_1
            |   └── ...
            |── name_mapping.csv
            └── survival_data.csv
    
  4. Model files are needed for data preprocessing. Download the model and unzip it into the data/ folder: unzip -d <data/> <fold_1.zip>. The directory structure is as follows:

    data/
        |-- MICCAI_BraTS_2019_Data_Training
        |   |-- HGG
        |   |   |-- BraTS19_2013_10_1
        |   |   |   |-- BraTS19_2013_10_1_flair.nii.gz
        |   |   |   |-- BraTS19_2013_10_1_seg.nii.gz
        |   |   |   |-- BraTS19_2013_10_1_t1.nii.gz
        |   |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
        |   |   |   └── BraTS19_2013_10_1_t2.nii.gz
        |   |   |-- BraTS19_2013_11_1
        |   |   └── ...
        |   |-- LGG
        |   |   |-- BraTS19_2013_0_1
        |   |   |   |-- BraTS19_2013_0_1_flair.nii.gz
        |   |   |   |-- BraTS19_2013_0_1_seg.nii.gz
        |   |   |   |-- BraTS19_2013_0_1_t1.nii.gz
        |   |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
        |   |   |   └── BraTS19_2013_0_1_t2.nii.gz
        |   |   |-- BraTS19_2013_15_1
        |   |   └── ...
        |   |── name_mapping.csv
        |   └── survival_data.csv
        |-- nnUNet
        |   └── 3d_fullres
        |       └── Task043_BraTS2019
        |           └── nnUNetTrainerV2__nnUNetPlansv2.mlperf.1
        |               |-- fold_1
        |               |   |-- debug.json
        |               |   |-- model_best.model
        |               |   |-- model_best.model.pkl
        |               |   |-- model_final_checkpoint.model
        |               |   |-- model_final_checkpoint.model.pkl
        |               |   |-- postprocessing.json
        |               |   |-- progress.png
        |               |   |-- training_log_2020_5_25_19_07_42.txt
        |               |   |-- training_log_2020_6_15_14_50_42.txt
        |               |   └── training_log_2020_6_8_08_12_03.txt
        |               └── plans.pkl
        └── joblog.log
    

4.28. WIKI zh_CN Dataset Preparation Guide

Step

  1. Download the checkpoint of the model you want to verify

  2. Generate dataset with the following command

pip3 install -r requirements.txt
python3 preprocess_wiki_zh.py --ckpt <path/to/ckpt>

Processed Dataset Structure

data/wiki_zh/
        └── wiki_zh_test.txt

4.29. WIKI zh_CN Dataset Preparation Guide

Step

  1. Download the checkpoint of the model you want to verify

  2. Generate the dataset with the commands below

pip3 install -r requirements.txt
python3 preprocess_wiki_zh.py --ckpt <path/to/ckpt>

Processed Dataset Structure

data/wiki_zh/
        └── wiki_zh_test.txt

4.30. Widerface Dataset Preparation Guide

Step

  1. Download the following files and put them under one directory.

    file                    url
    WIDER_val.zip           https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing
    wider_easy_val.mat      https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_easy_val.mat
    wider_face_val.mat      https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_face_val.mat
    wider_hard_val.mat      https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_hard_val.mat
    wider_medium_val.mat    https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_medium_val.mat

  2. Run commands below

pip3 install -r requirements.txt
python3 convert_widerface.py --input_path=<path/to/the/directory/containing/the/downloaded/files> --output_path=<path/to/data>

Processed Dataset Structure

data/widerface/
        ├── annotations
        └── WIDER_val

4.31. Widerface Dataset Preparation Guide

Step

  1. Download the following files and put them in one folder.

    file                      url
    WIDER_val.zip             https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing
    wider_face_split.zip      http://shuoyang1213.me/WIDERFACE/support/bbx_annotation/wider_face_split.zip
    retinaface_gt_v1.1.zip    https://pan.baidu.com/s/1Laby0EctfuJGgGMgRRgykA
    wider_easy_val.mat        https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_easy_val.mat
    wider_face_val.mat        https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_face_val.mat
    wider_hard_val.mat        https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_hard_val.mat
    wider_medium_val.mat      https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_medium_val.mat

  2. Run the commands below

pip3 install -r requirements.txt
python3 convert_widerface.py --input_path=<the folder you just created> --output_path=<target path>

Processed Dataset Structure

data/widerface/
        ├── annotations
        └── WIDER_val

4.32. MOT-16 data Preparation Guide

Step

  1. MOT-16 dataset is from this website: https://motchallenge.net/data/MOT16/, the download link is https://motchallenge.net/data/MOT16.zip.

  2. After downloading the dataset, unzip the zip file.

Processed Dataset Structure

MOT16/
├── test
│   ├── MOT16-01
│   │   ├── det
│   │   │   └── det.txt
│   │   ├── img1
│   │   │   ├── 000001.jpg
│   │   │   ├── xxxxxx.jpg
│   │   │   └── 000450.jpg
│   │   └── seqinfo.ini
│   ├── MOT16-03
│   ├── MOT16-06
│   ├── MOT16-07
│   ├── MOT16-08
│   ├── MOT16-12
│   └── MOT16-14
│
└── train
    ├── MOT16-02
    │   ├── det
    │   │   └── det.txt
    │   ├── gt
    │   │   └── gt.txt
    │   ├── img1
    │   │   ├── 000001.jpg
    │   │   ├── xxxxxx.jpg
    │   │   └── 000600.jpg
    │   └── seqinfo.ini
    ├── MOT16-04
    ├── MOT16-05
    ├── MOT16-09
    ├── MOT16-10
    ├── MOT16-11
    └── MOT16-13
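
Each sequence folder carries a seqinfo.ini file with its metadata. The sketch below assumes the usual MOTChallenge layout, which this guide does not show: a single [Sequence] section with the keys name, imDir, frameRate, seqLength, imWidth, imHeight, and imExt. `read_seqinfo` is a hypothetical helper built on the standard-library configparser module.

```python
import configparser


def read_seqinfo(path):
    """Read the [Sequence] section of a seqinfo.ini file into a dict."""
    parser = configparser.ConfigParser()
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    seq = parser["Sequence"]  # assumed section name
    return {
        "name": seq["name"],
        "image_dir": seq["imDir"],
        "frame_rate": seq.getint("frameRate"),
        "length": seq.getint("seqLength"),
        "width": seq.getint("imWidth"),
        "height": seq.getint("imHeight"),
        "extension": seq["imExt"],
    }
```

Comparing the `length` value with the number of images in img1/ is a simple way to verify that a sequence was extracted completely.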

4.33. MOT-16 Dataset Preparation Guide

Step

  1. The MOT-16 dataset is available at https://motchallenge.net/data/MOT16/; the download link is https://motchallenge.net/data/MOT16.zip

  2. After downloading, unzip the zip file

Dataset Structure

MOT16/
├── test
│   ├── MOT16-01
│   │   ├── det
│   │   │   └── det.txt
│   │   ├── img1
│   │   │   ├── 000001.jpg
│   │   │   ├── xxxxxx.jpg
│   │   │   └── 000450.jpg
│   │   └── seqinfo.ini
│   ├── MOT16-03
│   ├── MOT16-06
│   ├── MOT16-07
│   ├── MOT16-08
│   ├── MOT16-12
│   └── MOT16-14
│
└── train
    ├── MOT16-02
    │   ├── det
    │   │   └── det.txt
    │   ├── gt
    │   │   └── gt.txt
    │   ├── img1
    │   │   ├── 000001.jpg
    │   │   ├── xxxxxx.jpg
    │   │   └── 000600.jpg
    │   └── seqinfo.ini
    ├── MOT16-04
    ├── MOT16-05
    ├── MOT16-09
    ├── MOT16-10
    ├── MOT16-11
    └── MOT16-13

4.34. Bert-qa Preprocessed Dataset Preparation Guide

Step

  1. Open the bert squad dev_data online and copy it to a local file named dev-v1.1.json

  2. Open the evaluation script evaluate-v1.1.py online and copy it to a local file named evaluate-v1.1.py

  3. Download the vocab.txt from [BERT-Base, Uncased] (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) to a local directory

  4. Copy the files to the designated local directory; the directory structure looks like:

    data
      └──bert_qa
           |── vocab.txt
           |── evaluate-v1.1.py
           └── dev-v1.1.json
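
Once dev-v1.1.json is in place, a quick walk over its contents confirms that the copy is complete. The sketch below assumes the standard SQuAD v1.1 nesting (data, then paragraphs, then qas); the helper names `load_squad` and `iter_squad_questions` are hypothetical.

```python
import json


def load_squad(path):
    """Load a SQuAD-style JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_squad_questions(squad):
    """Yield (question_id, question, context, answer_texts) tuples."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [answer["text"] for answer in qa.get("answers", [])]
                yield qa["id"], qa["question"], context, answers
```

For example, `sum(1 for _ in iter_squad_questions(load_squad("data/bert_qa/dev-v1.1.json")))` counts the questions in the file.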

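4.35. Bert-qa Dataset Preparation

Step

  1. Open the bert squad dev_data and copy it to a local file named dev-v1.1.json

  2. Open evaluate-v1.1.py and copy it to a local file named evaluate-v1.1.py

  3. Download [BERT-Base, Uncased] (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) and copy the vocab.txt file to the corresponding directory

  4. After all files are saved to the corresponding directory, the directory structure is as follows: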

     data
       └──bert_qa
            |── vocab.txt
            |── evaluate-v1.1.py
            └── dev-v1.1.json
    

4.63. LFW Dataset Preparation Guide

Step

  1. install facexlib

pip3 install facexlib==0.3.0
  2. Download LFW and uncompress it.

  3. Run commands below

python3 preprocess_lfw_data.py --img_root <path/to/uncompressed/lfw/directory> --out_root <path/to/output/directory>

Processed Dataset Structure

root_dir/
        ├── real_imgs
        └── distort_imgs

4.64. LFW Dataset Preparation Guide

Step

  1. Install facexlib

pip3 install facexlib==0.3.0

  2. Download LFW and uncompress it

  3. Run the command below

python3 preprocess_lfw_data.py --img_root <path/to/uncompressed/lfw/directory> --out_root <path/to/output/directory>

Processed Dataset Structure

root_dir/
        ├── real_imgs
        └── distort_imgs

4.56. AFW Preprocessed Dataset Preparation Guide

Step

  1. Download the images of AFW

  2. unzip -d <data/AFW> <afw_images.zip>, mv data/AFW/afw_images data/AFW/images, the directory structure looks like:

    data
        └── AFW
            └── images
                ├── 1004109301.jpg
                ├── 1051618982.jpg
                ├── ...
                ├── README
                └── anno.mat
    

4.57. AFW Dataset Preparation

Step

  1. Download AFW

  2. Unzip the afw_images.zip file: unzip -d <data/AFW> <afw_images.zip>, mv data/AFW/afw_images data/AFW/images. The directory structure is as follows:

    data
        └── AFW
            └── images
                ├── 1004109301.jpg
                ├── 1051618982.jpg
                ├── ...
                ├── README
                └── anno.mat
    

4.56. tnews Preprocessed Dataset Preparation Guide

Step

  1. Please download the text of tnews test data online

  2. Copy the text to the local test.txt, the directory structure looks like:

   data
       └── test.txt

4.57. tnews Dataset Preparation

Step

  1. Download the tnews test data

  2. Copy the text content of the file and save it to test.txt; the directory structure is as follows:

    data
        └── test.txt
    

4.42. IJBB Dataset Preparation Guide

Step

  1. Download facial crops from the IJBB dataset

  2. Uncompress the downloaded ijb-testsuite.tar to get IJBB.zip. Uncompress IJBB.zip to get directory loose_crop and meta.

  3. Organize the directories into the following structure.

Processed Dataset Structure

data/
├── loose_crop
└── meta

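4.43. IJBB Dataset Preparation Guide

Step

  1. Download the face-crop validation set of the IJBB dataset

  2. Uncompress the downloaded ijb-testsuite.tar to get IJBB.zip. Uncompress IJBB.zip to get the loose_crop and meta folders.

  3. Organize the folders into the following structure.

Processed Dataset Structure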

data/
├── loose_crop
└── meta

4.44. Kinetics400 Dataset Preparation Guide

Step

  1. Download Kinetics 400 Validation Video compressed file kinetics_400_val_320.tar.

  2. Download Kinetics 400 label file kinetics_val_list.txt

  3. Unzip kinetics_400_val_320.tar and rename directory kinetics_400_val_320 to val_320

  4. Organize the aforementioned files as the structure shown below.

Processed Dataset Structure

data/kinetics400/
        ├── label
        │     └── kinetics_val_list.txt
        └── val_320

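4.45. Kinetics400 Dataset Preparation Guide

Step

  1. Download the Kinetics validation video archive kinetics_400_val_320.tar.

  2. Download the Kinetics400 label file kinetics_val_list.txt.

  3. Extract kinetics_400_val_320.tar and rename the folder kinetics_400_val_320 to val_320

  4. Organize the files above into the following structure

Processed Dataset Structure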

data/kinetics400/
        ├── label
        │     └── kinetics_val_list.txt
        └── val_320

4.46. ImageNet Dataset Preparation Guide

Step

  1. Download ILSVRC2012_img_val.tar from https://image-net.org/challenges/LSVRC/2012/ (registration required)

  2. Extract the archive

    mkdir val
    tar -xvf ILSVRC2012_img_val.tar -C val/
    
  3. Download labels

    wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_2012_validation_synset_labels.txt
    wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_lsvrc_2015_synsets.txt
    
  4. Put images into category folders (skip this step if a flat directory structure is needed)

    python3 preprocess_imagenet_validation_data.py val/ imagenet_2012_validation_synset_labels.txt imagenet_lsvrc_2015_synsets.txt
    cp imagenet_2012_validation_synset_labels.txt val/synset_labels.txt
    
  5. Generate val_map.txt

    python3 convert_imagenet.py val/ imagenet_2012_validation_synset_labels.txt imagenet_lsvrc_2015_synsets.txt val/val_map.txt
    
  6. Move val into data/

    mkdir -p data
    mv val data/
    

Processed Dataset Structure

data/val/
      ├── n01440764
      │   ├── ILSVRC2012_val_00000293.JPEG
      │   ├── ILSVRC2012_val_00002138.JPEG
      │   └── ……
      ……
      └── val_map.txt

val_map.txt maps each image path to its label, for example:

./n01751748/ILSVRC2012_val_00000001.JPEG 65
./n09193705/ILSVRC2012_val_00000002.JPEG 970
./n02105855/ILSVRC2012_val_00000003.JPEG 230
./n04263257/ILSVRC2012_val_00000004.JPEG 809
……
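
To illustrate how val_map.txt can be consumed, here is a minimal sketch of a reader. The function name read_val_map is hypothetical. It assumes every non-empty line has the form "relative path, whitespace, integer label" as in the sample above, and that paths are relative to the directory holding val_map.txt.

```python
from pathlib import Path


def read_val_map(val_map_path):
    """Return a list of (image_path, label) pairs read from val_map.txt."""
    val_map_path = Path(val_map_path)
    root = val_map_path.parent  # paths in the file are assumed relative to val/
    samples = []
    for line in val_map_path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace run: the path first, the label last.
        rel_path, label = line.rsplit(maxsplit=1)
        samples.append((root / rel_path, int(label)))
    return samples
```

For example, read_val_map("data/val/val_map.txt") returns pairs that can be fed directly to an evaluation loop.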

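4.47. VCTK-Corpus Dataset Preparation

Step

  1. Download the VCTK-Corpus dataset from the Kaggle website

  2. The download is an archive.zip file. Unzip it under ./data/. The directory structure looks like: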

    data/VCTK-Corpus/
            ├── COPYING
            ├── NOTE
            ├── README
            ├── speaker-info.txt
            ├── txt
            └── wav48
                ├── p225
                │   ├── p225_001.wav
                │   ├── ...
                │   └── p225_366.wav
                ├── ...
                └── p376
    

4.48. DIV2k Dataset Preparation Guide

Step

  1. Download DIV2K_valid_HR.zip, DIV2K_valid_LR_bicubic_X4.zip. Put these files under one directory.

  2. Run commands below

pip3 install -r requirements.txt
python3 convert_div2k.py --input_path=<path/to/the/directory/containing/the/two/files> --output_path=<path/to/data>

Processed Dataset Structure

data/
├── DIV2K_valid_HR
└── DIV2K_valid_LR_bicubic

4.49. DIV2k 数据集准备指南

步骤

  1. 下载 DIV2K_valid_HR.zip, DIV2K_valid_LR_bicubic_X4.zip。把它们放到一个文件夹下。

  2. 运行下面的命令

pip3 install -r requirements.txt
python3 convert_div2k.py --input_path=<你刚创建的文件夹> --output_path=<目标路径>

处理完成的数据结构

data/
├── DIV2K_valid_HR
└── DIV2K_valid_LR_bicubic

4.50. Segment Anything Prompt Dataset Preparation Guide

Step

  • Install segment-anything

  • Download pretrained model

  • Generate prompt dataset

Run commands as below

  • Install segment-anything and copy images

git clone https://github.com/facebookresearch/segment-anything.git
cd segment-anything
git checkout 6fdee8f
python3 setup.py install
cp -r ./notebooks/images ../
# return to the working directory that holds requirements.txt and the scripts
cd ..

  • Install requirements

pip3 install -r requirements.txt

  • Download pretrained models

mkdir models
cd models
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
# go back so that ./models and ./images resolve in the next step
cd ..

  • Generate prompt dataset

mkdir -p SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_h_4b8939.pth --image_path ./images --save_path ./SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_l_0b3195.pth --image_path ./images --save_path ./SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_b_01ec64.pth --image_path ./images --save_path ./SAM/prompt

Processed Dataset Structure

./SAM/prompt
         ├── sam_vit_h
         │        ├── annotations
         │        │      ├── vit_h-dog-sample-0.npz
         │        │      ├── vit_h-dog-sample-1.npz
         │        │      ...
         │        │      └── vit_h-truck-sample-4.npz
         │        └── images
         │              ├── dog.jpg
         │              ├── groceries.jpg
         │              └── truck.jpg
         ├── sam_vit_b
         │        ├── annotations
         │        │      ├── vit_h-dog-sample-0.npz
         │        │      ├── vit_h-dog-sample-1.npz
         │        │      ...
         │        │      └── vit_h-truck-sample-4.npz
         │        └── images
         │              ├── dog.jpg
         │              ├── groceries.jpg
         │              └── truck.jpg
         └──sam_vit_l
                 ├── annotations
                 │      ├── vit_h-dog-sample-0.npz
                 │      ├── vit_h-dog-sample-1.npz
                 │      ...
                 │      └── vit_h-truck-sample-4.npz
                 └── images
                        ├── dog.jpg
                        ├── groceries.jpg
                        └── truck.jpg

4.51. MNIST Dataset Preparation Guide

Step

  1. Download the training and test files of MNIST and place them under data/

    wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
    wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
    wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
    wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
    

Processed Data Structure

   data/
   ├── t10k-images-idx3-ubyte.gz
   ├── t10k-labels-idx1-ubyte.gz
   ├── train-images-idx3-ubyte.gz
   └── train-labels-idx1-ubyte.gz
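
A quick way to confirm that the downloads are intact is to read the IDX header of each file. The sketch below is illustrative only and the function name read_idx_header is hypothetical. It relies on the IDX layout used by MNIST: big-endian 32-bit integers, magic number 2051 for image files (followed by count, rows, columns) and 2049 for label files (followed by count).

```python
import gzip
import struct


def read_idx_header(path):
    """Read the header of a gzip-compressed MNIST IDX file."""
    with gzip.open(path, "rb") as f:
        # The first two big-endian uint32 values are the magic number and item count.
        magic, count = struct.unpack(">II", f.read(8))
        if magic == 2051:
            # Image files store the row and column counts next.
            rows, cols = struct.unpack(">II", f.read(8))
            return {"kind": "images", "count": count, "rows": rows, "cols": cols}
        if magic == 2049:
            return {"kind": "labels", "count": count}
        raise ValueError(f"{path}: unexpected IDX magic number {magic}")
```

Running it on data/t10k-images-idx3-ubyte.gz and data/t10k-labels-idx1-ubyte.gz and comparing the two counts is a cheap sanity check before evaluation.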

4.52. MNIST Dataset Preparation Guide

Step

  1. Download the MNIST dataset from the official website.

    wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
    wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
    wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
    wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
    

Processed Data Structure

   data/
   ├── t10k-images-idx3-ubyte.gz
   ├── t10k-labels-idx1-ubyte.gz
   ├── train-images-idx3-ubyte.gz
   └── train-labels-idx1-ubyte.gz

4.53. COCO 2017 Dataset Preparation Guide

Step

  1. Download COCO 2017. Put it under a directory.

  2. Run commands below

pip3 install -r requirements.txt
python3 convert_coco2017.py --input_path=<path/to/the/directory/containing/coco2017.zip> --output_path=<path/to/data>

Processed Data Structure

data/COCO/
        ├── annotations
        ├── test2017
        ├── train2017
        └── val2017

4.54. COCO 2017 Dataset Preparation Guide

Step

  1. Download COCO 2017. Put it under a directory.

  2. Run the commands below

pip3 install -r requirements.txt
python3 convert_coco2017.py --input_path=<the directory you just created> --output_path=<target path>

Processed Data Structure

data/COCO/
        ├── annotations
        ├── test2017
        ├── train2017
        └── val2017

4.55. Segment Anything Automask Dataset Preparation Guide

Step

  • Install requirements

  • Download dataset

  • Extract evaluation data

Run commands as below

  • Install requirements

pip3 install -r requirements.txt

  • Download dataset and extract evaluation data

python3 prepare_sam_automask_data.py

Processed Dataset Structure

./SAM/automask
         ├── sa_1.jpg
         ├── sa_1.json
         ├── sa_2.jpg
         ├── sa_2.json
         ...
         ├── sa_10.jpg
         └── sa_10.json
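
After the script finishes, each image should have a matching annotation file. The following minimal sketch checks that pairing. The function name pair_automask_files is hypothetical, and it assumes file names of the form sa_<number>.jpg and sa_<number>.json, as in the listing above.

```python
from pathlib import Path


def pair_automask_files(root="./SAM/automask"):
    """Pair each sa_<n>.jpg with its sa_<n>.json and report images lacking one."""
    root = Path(root)
    # Sort numerically so that sa_10 comes after sa_2
    # (assumes names of the form sa_<number>.jpg).
    images = sorted(root.glob("sa_*.jpg"), key=lambda p: int(p.stem.split("_")[1]))
    pairs, missing = [], []
    for image in images:
        annotation = image.with_suffix(".json")
        if annotation.is_file():
            pairs.append((image, annotation))
        else:
            missing.append(image)
    return pairs, missing
```

An empty missing list indicates that every downloaded image has its annotation.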

4.56. ernie preprocessed Dataset Preparation Guide

Step

  1. Open the text of the ernie dev_data online

  2. Copy the text into a local file named dev_1.txt. The directory structure looks like:

   data/ernie
           └── dev_1.txt

4.57. ernie Dataset Preparation

Step

  1. Open the ernie dev_data

  2. Copy the text content of the file and save it to dev_1.txt. The directory structure looks like:

    data/ernie
            └── dev_1.txt
    

4.58. Face MS1M validation data Preparation Guide

Step

  1. Download faces_ms1m_112x112.zip. Put it under a directory.

  2. Run commands below

# Python 3.7 is suggested
pip3 install -r requirements.txt
python3 convert_ms1m_face.py --input_data_dir=<path/to/the/directory/containing/the/file> --output_data_dir=<path/to/the/converted/data>

Processed Dataset Structure

converted_ms1m_face
   ├── agedb_30.bin
   ├── cfp_ff.bin
   ├── cfp_fp.bin
   └── lfw.bin

4.59. LPRNet validation data Preparation Guide

Step

  1. LPRNet validation data is in the repo: https://github.com/sirius-ai/LPRNet_Pytorch. Commit id is 7c976664b3f3879efabeaff59c7a117e49d5f29e.

  2. Run commands below

git clone https://github.com/sirius-ai/LPRNet_Pytorch.git
cd LPRNet_Pytorch/data/test
cp *.jpg <target/folder>

Processed Dataset Structure

<target folder>
   ├── 京PL3N67.jpg
   ├── 川JK0707.jpg
   ├── ...
   └── 鲁R8D57Z.jpg
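
The listing above suggests that each file name carries the plate text. Under that assumption, which the guide does not state explicitly, ground-truth labels can be recovered from the file names alone. The function name plate_labels is hypothetical.

```python
from pathlib import Path


def plate_labels(folder):
    """Map each .jpg file name to its assumed plate label (the file stem)."""
    # Assumption: the stem of every image file name is the plate text,
    # e.g. 京PL3N67.jpg -> 京PL3N67.
    return {p.name: p.stem for p in sorted(Path(folder).glob("*.jpg"))}
```

If the evaluation code expects a different naming convention, the mapping should be adjusted accordingly.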

4.60. LPRNet Dataset Preparation Guide

Step

  1. LPRNet test data is in the repo: https://github.com/sirius-ai/LPRNet_Pytorch. Commit id is 7c976664b3f3879efabeaff59c7a117e49d5f29e.

  2. Run the commands below

git clone https://github.com/sirius-ai/LPRNet_Pytorch.git
cd LPRNet_Pytorch/data/test
cp *.jpg <target/folder>

Dataset Structure

<target folder>
   ├── 京PL3N67.jpg
   ├── 川JK0707.jpg
   ├── ...
   └── 鲁R8D57Z.jpg

4.61. MiniGO Dataset Preparation Guide

Generation

pip3 install -r requirements.txt
for i in {001..100} ; do PYTHONPATH=../../../ python3 selfplay.py --load_file=minigo-op13-fp32-N.onnx --num_readouts 10 --verbose 3 --selfplay_dir=data/selfplay --holdout_dir=data/holdout --sgf_dir=data/sgf; done

Processed Dataset Structure

data/
├── selfplay
├── holdout
└── sgf

4.62. MiniGO Dataset Generation Guide

Generation

pip3 install -r requirements.txt
for i in {001..100} ; do PYTHONPATH=../../../ python3 selfplay.py --load_file=minigo-op13-fp32-N.onnx --num_readouts 10 --verbose 3 --selfplay_dir=data/selfplay --holdout_dir=data/holdout --sgf_dir=data/sgf; done

Processed Dataset Structure

data/
├── selfplay
├── holdout
└── sgf

4.63. LFW Dataset Preparation Guide

Step

  1. Download the facenet code base

  2. Download LFW and the pair list (pairs.txt), and uncompress them.

  3. Run commands below

cd <path/to/facenet>
pip3 install -r requirements.txt

for N in {1..4}
do
PYTHONPATH=src python3 src/align/align_dataset_mtcnn.py <path/to/uncompressed/lfw/directory> <path/to/output/directory> --image_size 160 --margin 32 --random_order --gpu_memory_fraction 0.25
done

mkdir -p data/lfw
cp -r <path/to/output/directory> data/lfw/lfw
cp <path/to/pairs.txt> data/lfw

Processed Dataset Structure

data/lfw/
        ├── lfw
        └── pairs.txt
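
For reference, the sketch below shows one way to turn pairs.txt into image pairs. It is not the facenet implementation and the function name read_pairs is hypothetical. It assumes the standard LFW pair list layout: a header line, then 3-field lines (name, index, index) for same-person pairs and 4-field lines (name, index, name, index) for different-person pairs. The .png extension is an assumption about the aligned output and is exposed as a parameter.

```python
from pathlib import Path


def read_pairs(pairs_path, image_root, ext=".png"):
    """Return (image_a, image_b, is_same) tuples described by pairs.txt."""
    image_root = Path(image_root)

    def image(name, index):
        # LFW images are named <name>_<4-digit index>, inside a folder per person.
        return image_root / name / f"{name}_{int(index):04d}{ext}"

    pairs = []
    # The first line is a header with fold counts, so it is skipped.
    for line in Path(pairs_path).read_text().splitlines()[1:]:
        fields = line.split()
        if len(fields) == 3:  # same person, two image indices
            name, a, b = fields
            pairs.append((image(name, a), image(name, b), True))
        elif len(fields) == 4:  # two different people
            name_a, a, name_b, b = fields
            pairs.append((image(name_a, a), image(name_b, b), False))
    return pairs
```

With the layout above, read_pairs("data/lfw/pairs.txt", "data/lfw/lfw") yields the pairs to verify.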

4.64. LFW Dataset Preparation Guide

Step

  1. Download the facenet code

  2. Download LFW and pairs.txt, and uncompress them

  3. Run the commands below

cd <path/to/facenet>
pip3 install -r requirements.txt
for N in {1..4}
do
PYTHONPATH=src python3 src/align/align_dataset_mtcnn.py <path/to/uncompressed/lfw/directory> <path/to/output/directory> --image_size 160 --margin 32 --random_order --gpu_memory_fraction 0.25
done
mkdir -p data/lfw
cp -r <path/to/output/directory> data/lfw/lfw
cp <path/to/pairs.txt> data/lfw

Processed Dataset Structure

data/lfw/
        ├── lfw
        └── pairs.txt

4.65. Librispeech preprocessed Dataset Preparation Guide

Step

1. Download the dataset

Run the commands below to download the Librispeech dataset

mkdir -p data/LibriSpeech
python3 download_librispeech.py ./librispeech-inference.csv ./data/LibriSpeech -e ./data

2. Process the data

Run the command below to convert the dataset and generate the JSON file

python3 convert_librispeech.py --input_dir ./data/LibriSpeech/dev-clean --dest_dir ./data/dev-clean-wav --output_json ./data/dev-clean-wav.json

3. Processed Data Structure

./data
├── dev-clean-wav
│   ├── 1272
│   ├── 1462
│   ├── 1673
│   ├── 174
│   ├── 1919
│   ├── 1988
│   ├── 1993
│   ├── 2035
│   ├── 2078
│   ├── 2086
│   ├── 2277
│   ├── 2412
│   ├── 2428
│   ├── 251
│   ├── 2803
│   ├── 2902
│   ├── 3000
│   ├── 3081
│   ├── 3170
│   ├── 3536
│   ├── 3576
│   ├── 3752
│   ├── 3853
│   ├── 422
│   ├── 5338
│   ├── 5536
│   ├── 5694
│   ├── 5895
│   ├── 6241
│   ├── 6295
│   ├── 6313
│   ├── 6319
│   ├── 6345
│   ├── 652
│   ├── 777
│   ├── 7850
│   ├── 7976
│   ├── 8297
│   ├── 84
│   └── 8842
├── dev-clean-wav.json
└── LibriSpeech
    ├── BOOKS.TXT
    ├── CHAPTERS.TXT
    ├── dev-clean
    ├── dev-clean.tar.gz
    ├── LICENSE.TXT
    ├── README.TXT
    └── SPEAKERS.TXT

4.66. Librispeech Dataset Preparation Guide

Step

1. Download the dataset

Run the commands below to download Librispeech locally

mkdir -p data/LibriSpeech
python3 download_librispeech.py ./librispeech-inference.csv ./data/LibriSpeech -e ./data

2. Process the data

Use the command below to convert Librispeech into JSON format

python3 convert_librispeech.py --input_dir ./data/LibriSpeech/dev-clean --dest_dir ./data/dev-clean-wav --output_json ./data/dev-clean-wav.json

3. Directory structure

After the download and processing steps above, the data directory should have the following structure

./data
├── dev-clean-wav
│   ├── 1272
│   ├── 1462
│   ├── 1673
│   ├── 174
│   ├── 1919
│   ├── 1988
│   ├── 1993
│   ├── 2035
│   ├── 2078
│   ├── 2086
│   ├── 2277
│   ├── 2412
│   ├── 2428
│   ├── 251
│   ├── 2803
│   ├── 2902
│   ├── 3000
│   ├── 3081
│   ├── 3170
│   ├── 3536
│   ├── 3576
│   ├── 3752
│   ├── 3853
│   ├── 422
│   ├── 5338
│   ├── 5536
│   ├── 5694
│   ├── 5895
│   ├── 6241
│   ├── 6295
│   ├── 6313
│   ├── 6319
│   ├── 6345
│   ├── 652
│   ├── 777
│   ├── 7850
│   ├── 7976
│   ├── 8297
│   ├── 84
│   └── 8842
├── dev-clean-wav.json
└── LibriSpeech
    ├── BOOKS.TXT
    ├── CHAPTERS.TXT
    ├── dev-clean
    ├── dev-clean.tar.gz
    ├── LICENSE.TXT
    ├── README.TXT
    └── SPEAKERS.TXT

4.67. PASCAL preprocessed Dataset Preparation Guide

Step

  1. Download the images of PASCAL

  2. Unzip the archive and rename the image folder: unzip -d <data/PASCAL> <pascal_images.zip>, then mv data/PASCAL/pascal_images data/PASCAL/images. The directory structure looks like:

    data
        └── PASCAL
            └── images
                ├── 2007_000272.jpg
                ├── 2007_000664.jpg
                ├── ...
                ├── 2011_003272.jpg
                └── 2011_003273.jpg
    

4.68. PASCAL Dataset Preparation

Step

  1. Download PASCAL

  2. Unzip pascal_images.zip: unzip -d <data/PASCAL> <pascal_images.zip>, then mv data/PASCAL/pascal_images data/PASCAL/images. The directory structure looks like:

    data
        └── PASCAL
            └── images
                ├── 2007_000272.jpg
                ├── 2007_000664.jpg
                ├── ...
                ├── 2011_003272.jpg
                └── 2011_003273.jpg
    

4.69. Criteo Dataset Preparation Guide

Step

  1. Log in to Criteo and click the day_2.gz download button. Put the file under a directory.

  2. Run the preprocessing commands from the repo

git clone http://git.enflame.cn/sse_ard/Algo_internals/tree/master/scripts/34_hugectr

cd Algo_internals/scripts/34_hugectr/preprocess

bash preprocess.sh

Processed Data Structure

data/critero_data/
        └── val
             ├── sparse_embedding0.data
             ├── sparse_embedding1.data
             ├── ...
             └── sparse_embedding111.data

4.70. Criteo Dataset Preparation Guide

Step

  1. Log in to Criteo. Click the day_2.gz download button and put the file under a directory.

  2. Run the data processing commands from the repo below

git clone http://git.enflame.cn/sse_ard/Algo_internals/tree/master/scripts/34_hugectr

cd Algo_internals/scripts/34_hugectr/preprocess

bash preprocess.sh

Processed Data Structure

data/critero_data/
        └── val
             ├── sparse_embedding0.data
             ├── sparse_embedding1.data
             ├── ...
             └── sparse_embedding111.data