Appendix: Data Preparation

4.1. dagm2007 Preprocessed Dataset Preparation Guide

Step

Create an account at https://hci.iwr.uni-heidelberg.de/node/3616
Download the Class1.zip file

Extract it with unzip -d <dagm2007/private/> <Class1.zip>. Once the preprocessing command at the end of this section has run, the directory structure looks like:

dagm2007
    └── private
            └── Class1
                |── Test
                |   └── Label
                |   |   |── 0002_label.PNG
                |   |   |── ...
                |   |   |── 0568_label.PNG
                |   |   └── Labels.txt
                |   |── 0001.PNG
                |   |── ...
                |   └── 0575.PNG
                |── ...
                └── test_list.csv

Generate test_list.csv with the preprocessing script:

python3 preprocess_dagm2007.py --data_dir=dagm2007/private/
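
As a rough illustration of how the extracted layout can be consumed, the sketch below pairs every image in a split folder with its defect mask. It assumes, based only on the tree above, that the mask for NNNN.PNG is stored as Label/NNNN_label.PNG, and it treats a missing mask as "no label available". The function name pair_images_with_labels is hypothetical and is not part of preprocess_dagm2007.py.

```python
from pathlib import Path


def pair_images_with_labels(split_dir):
    """Map each image file name to its mask path, or None when no mask exists.

    Assumption: masks live in <split_dir>/Label/<stem>_label.PNG.
    """
    split_dir = Path(split_dir)
    pairs = {}
    # Only top-level images are considered; the Label folder is not searched.
    for image in sorted(split_dir.glob("*.PNG")):
        mask = split_dir / "Label" / f"{image.stem}_label.PNG"
        pairs[image.name] = mask if mask.is_file() else None
    return pairs
```

A call such as pair_images_with_labels("dagm2007/private/Class1/Test") would then return one entry per test image.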

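4.2. dagm2007 Dataset Preparation Guide

Step

Register an account at https://hci.iwr.uni-heidelberg.de/node/3616
After registering, you receive a download link; download Class1.zip, the data for class 1

Extract it into the dagm2007/private/ folder with unzip -d <dagm2007/private/> <Class1.zip>. After processing, the data structure is as follows: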

dagm2007
    └── private
            └── Class1
                |── Test
                |   └── Label
                |   |   |── 0002_label.PNG
                |   |   |── ...
                |   |   |── 0568_label.PNG
                |   |   └── Labels.txt
                |   |── 0001.PNG
                |   |── ...
                |   └── 0575.PNG
                |── ...
                └── test_list.csv

Run this command to generate test_list.csv: python3 preprocess_dagm2007.py --data_dir=dagm2007/private/

4.3. OCR Recognition LMDB Dataset Preparation Guide

Step

Download the deep-text-recognition-benchmark dataset
Unzip the dataset
Extract the evaluation data

Run the commands below

wget "https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0" -O data.zip
mkdir ./tmp
unzip -d ./tmp data.zip
unzip ./tmp/evaluation.zip
mv ./evaluation ./data
rm -rf ./tmp

Processed Dataset Structure

data/
  ├── CUTE80
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC03_860
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC03_867
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC13_857
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC13_1015
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC15_1811
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IC15_2077
  │     ├── data.mdb
  │     └── lock.mdb
  ├── IIIT5k_3000
  │     ├── data.mdb
  │     └── lock.mdb
  ├── SVT
  │     ├── data.mdb
  │     └── lock.mdb
  └── SVTP
        ├── data.mdb
        └── lock.mdb
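
A quick sanity check after extraction can be sketched as follows. It assumes only what the tree above shows, namely that every benchmark folder should contain data.mdb and lock.mdb. The helper name incomplete_lmdb_dirs is hypothetical, and the check does not open the LMDB files themselves.

```python
from pathlib import Path


def incomplete_lmdb_dirs(root):
    """Return the names of subfolders that lack data.mdb or lock.mdb."""
    required = ("data.mdb", "lock.mdb")
    return sorted(
        sub.name
        for sub in Path(root).iterdir()
        if sub.is_dir() and not all((sub / name).is_file() for name in required)
    )
```

An empty list from incomplete_lmdb_dirs("data") suggests the layout matches the tree.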

4.4. ST-GCN Preprocessed Dataset Preparation Guide

Step

ST-GCN uses the Kinetics and NTU datasets preprocessed by its authors, which can be downloaded directly.

Download st-gcn-processed-data.zip from https://drive.google.com/open?id=103NOL9YYZSW1hLoWmYnv5Fs8mK-Ij7qb
unzip <path to st-gcn-processed-data.zip>

Processed Dataset Structure

data
|-- Kinetics
|   `-- kinetics-skeleton
|       |-- train_data.npy
|       |-- train_label.pkl
|       |-- val_data.npy
|       `-- val_label.pkl
`-- NTU-RGB-D
    |-- xsub
    |   |-- train_data.npy
    |   |-- train_label.pkl
    |   |-- val_data.npy
    |   `-- val_label.pkl
    `-- xview
        |-- train_data.npy
        |-- train_label.pkl
        |-- val_data.npy
        `-- val_label.pkl
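
To inspect one of the label files above, a minimal sketch is shown below. It assumes each *_label.pkl holds a pickled pair (sample_names, labels) of equal length, which is how the ST-GCN data feeder is commonly described; this is an assumption and not something stated in this guide. The function name load_label_file is hypothetical. Only unpickle files obtained from a source you trust.

```python
import pickle


def load_label_file(label_path):
    """Load a label pickle assumed to contain (sample_names, labels)."""
    with open(label_path, "rb") as f:
        # latin1 keeps pickles written by Python 2 readable under Python 3.
        sample_names, labels = pickle.load(f, encoding="latin1")
    if len(sample_names) != len(labels):
        raise ValueError("sample names and labels differ in length")
    return list(sample_names), list(labels)
```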

4.5. PASCAL VOC 2012 Preparation Guide

Step

Download archive.zip and place it in a directory.
Run the commands below:

pip3 install -r requirements.txt
python3 convert_voc2012.py --input_path=<path/to/the/directory/containing/the/file> --output_path=<path/to/data>

Processed Dataset Structure

data/
└── VOC2012

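4.6. PASCAL VOC 2012 Dataset Preparation Guide

Step

Download archive.zip and place it in a folder.
Run the commands below:

pip3 install -r requirements.txt
python3 convert_voc2012.py --input_path=<path/to/the/folder/you/just/created> --output_path=<path/to/target>

Processed Dataset Structure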

data/
└── VOC2012

4.7. sst2 Preprocessed Dataset Preparation Guide

Step

Open the sst2 test data online.
Copy the text into a local file named sst2_test.tsv. The directory structure looks like:

```
data/sst2
    └── sst2_test.tsv
```

4.8. sst2 Dataset Preparation

Step

Open the sst2 test data
Copy the text content of the file and save it to sst2_test.tsv. The directory structure is as follows:
```
data/sst2
    └── sst2_test.tsv
```

4.9. China-people-daily-ner-corpus Dataset Preparation Guide

Step

Download the china-people-daily-ner-corpus dataset:

$ cd ${project_root}
$ mkdir data
$ cd data
$ wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
$ tar -zxvf china-people-daily-ner-corpus.tar.gz

Download the vocabulary

The vocabulary file vocab.txt is also needed. Download it from the Hugging Face page (https://huggingface.co/bert-base-chinese/blob/main/vocab.txt) by clicking the download arrow next to vocab.txt, then move it to ./model/bert_crf/, creating the folders if they do not exist.

Processed Data Structure

After processing, the data and model folders contain the following files:

data
├── china-people-daily-ner-corpus
│   ├── example.dev
│   ├── example.test
│   └── example.train

model
└── bert_crf
    └── vocab.txt
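
The sketch below shows one way to read the example.train, example.dev, and example.test files. It assumes they use the usual character-level CoNLL layout, with one "token tag" pair per line and a blank line between sentences; that format is an assumption, since this guide does not describe it. The function name read_conll is hypothetical.

```python
def read_conll(path):
    """Read a CoNLL-style file into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                # A blank line closes the current sentence.
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:
        # Flush the last sentence when the file has no trailing blank line.
        sentences.append((tokens, tags))
    return sentences
```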

4.10. China People's Daily NER Dataset Preparation Guide

Step

Download the China People's Daily NER dataset:

$ cd ${project_root}
$ mkdir data
$ cd data
$ wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
$ tar -zxvf china-people-daily-ner-corpus.tar.gz

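Download the vocabulary file

The vocabulary file required by the model comes from this Hugging Face page (https://huggingface.co/bert-base-chinese/tree/main). Find vocab.txt, click the download arrow to its right, then move the downloaded vocab.txt to the ./model/bert_crf folder (create the folder if it does not exist).

Processed Directory Structure

The processed directory structure is as follows: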

data
├── china-people-daily-ner-corpus
│   ├── example.dev
│   ├── example.test
│   └── example.train

model
└── bert_crf
    └── vocab.txt

4.11. AN4 Dataset Preparation Guide

Step

Download an4 and place it in a directory.
Run the commands below:

pip3 install -r requirements.txt
python3 sph2wav.py --dir_path=<path/to/the/directory/containing/an4_sphere.tar.gz>
python3 build_mainfest.py --dataset_path=<path/to/the/directory/containing/an4test_clstk> --dir_path=<path/to/the/output/directory/for/test_manifest.json>

As far as we know, the an4 download link is currently inaccessible.

Processed Data Structure

data/
└── an4
     ├── etc
     |     ├──an4_test.transcription
     |     └── ...
     ├── wav
     |    ├── an4_clstk
     |    └── an4test_clstk
     └──  test_manifest.json

4.12. Market 1501 Dataset Preparation Guide

Step

Download the dataset from Baidu Netdisk or Google Drive.
Generate the dataset with the following command:

python3 convert_market1501.py --dataset <path/to/the/compressed/dataset> --output <path/to/output/directory>

Processed Dataset Structure

data/
├── query
├── gt_query
├── gt_bbox
├── bounding_box_train
└── bounding_box_test
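
Image names in these folders are usually of the form 0001_c1s1_001051_00.jpg, encoding the person ID, camera, sequence, frame, and box index. The sketch below parses that pattern; the naming scheme is the commonly documented Market-1501 convention and is an assumption here, since convert_market1501.py may rename files. The function name parse_market1501_name is hypothetical.

```python
import re

# person id (may be -1 for junk images), camera, sequence, frame, box index
_NAME_PATTERN = re.compile(r"^(-?\d+)_c(\d+)s(\d+)_(\d+)_(\d+)")


def parse_market1501_name(filename):
    """Split a Market-1501 style file name into its numeric fields."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    person_id, camera, sequence, frame, box = map(int, match.groups())
    return {
        "person_id": person_id,
        "camera": camera,
        "sequence": sequence,
        "frame": frame,
        "box": box,
    }
```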

4.13. Market 1501 Dataset Preparation Guide

Step

Download the dataset from Baidu Netdisk or Google Drive.
Run the command below:

python3 convert_market1501.py --dataset <path/to/the/compressed/dataset> --output <path/to/output/directory>

Processed Dataset Structure

data/
├── query
├── gt_query
├── gt_bbox
├── bounding_box_train
└── bounding_box_test

4.14. IIIT5K Preprocessed Dataset Preparation Guide

Step

The IIIT 5K-word dataset is harvested from Google image search. Query words such as billboards, signboards, house numbers, house name plates, and movie posters were used to collect images. The dataset contains 5000 cropped word images from scene texts and born-digital images, and is divided into train and test parts. It can be used for large-lexicon cropped word recognition. A lexicon of more than 0.5 million dictionary words is also provided with the dataset.

Download the dataset from http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K-Word_V3.0.tar.gz
tar xf <path to IIIT5K-Word_V3.0.tar.gz>
Download the label file from https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt
Put test_label.txt into <IIIT5K folder>

Processed Dataset Structure

IIIT5K
├── lexicon.txt
├── README
├── test
├── testCharBound.mat
├── testdata.mat
├── test_label.txt
├── train
├── trainCharBound.mat
└── traindata.mat

4.15. People Daily NER Corpus Subset Preparation Guide

Step

The dataset can be downloaded directly:

Download https://github.com/TodoListIOS/NER-PyTorch/archive/refs/heads/master.zip
unzip <path to NER-PyTorch-master.zip>
The dataset is located in NER-PyTorch-master/data/ren_min_newspaper

You can also get the dataset with git clone:

git clone https://github.com/TodoListIOS/NER-PyTorch.git
cd NER-PyTorch/data/ren_min_newspaper

The dev and train files under ren_min_newspaper are already processed and can be used directly for testing and training, respectively.

Processed Dataset Structure

data/
├── dev
└── train

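4.16. People's Daily NER Corpus Subset Preparation Guide

Step

The dataset can be downloaded directly:

Download from GitHub: https://github.com/TodoListIOS/NER-PyTorch/archive/refs/heads/master.zip
unzip <path to NER-PyTorch-master.zip>
The dataset is located in the NER-PyTorch-master/data/ren_min_newspaper directory

You can also get the dataset with git clone:

git clone https://github.com/TodoListIOS/NER-PyTorch.git
cd NER-PyTorch/data/ren_min_newspaper

The dev and train files in the ren_min_newspaper folder are already processed and can be used directly for testing and training, respectively

Processed Dataset Structure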

data/
├── dev
└── train

4.17. VoxCeleb1 and VoxCeleb2 Dataset Preparation Guide

Step

Download the Dev and Test files of voxceleb1 from the official website. Note that you need to apply for a download account yourself. If the official download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all the parts and concatenate them into a single zip file:
```
cat vox1_dev* > vox1_dev_wav.zip
```
Download the Dev and Test files of voxceleb2 from the official website. Note that you need to apply for a download account yourself. If the official download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all the parts and concatenate them into a single zip file:
```
cat vox2_dev_aac* > vox2_aac.zip
```
After the downloads are complete, there will be four zip files: vox2_test_mp4.zip, vox2_aac.zip, vox1_test_wav.zip, and vox1_dev_wav.zip.
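
On systems without cat, the concatenation step can be approximated with the sketch below. It joins every file whose name starts with a given prefix, in sorted name order, which mirrors how the shell expands vox1_dev*. The function name concatenate_parts is hypothetical, and the sketch assumes the part files sort correctly by name.

```python
import shutil
from pathlib import Path


def concatenate_parts(directory, prefix, output_name):
    """Join all files starting with prefix into output_name, in name order."""
    directory = Path(directory)
    output = directory / output_name
    # Exclude the output itself in case it already exists and shares the prefix.
    parts = sorted(p for p in directory.glob(prefix + "*") if p != output)
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return [p.name for p in parts]
```

For example, concatenate_parts(".", "vox1_dev", "vox1_dev_wav.zip") corresponds to the first cat command above.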

Execute the following commands:

mkdir -p vox1_2/wav
unzip vox1_dev_wav.zip -d vox1
unzip vox1_test_wav.zip -d vox1

unzip vox2_aac.zip -d vox2

cp -r vox2/dev/aac/id*  vox1_2/wav
cp -r vox1/wav/id* vox1_2/wav

cp convert.sh vox1_2
cd vox1_2
bash ./convert.sh

Processed Data Structure

vox1_2
├── convert.sh
└── wav
   ├── id00012
      ....
   └── id11251

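4.18. VoxCeleb1 and VoxCeleb2 Dataset Preparation Guide

Step

Download the Dev and Test files of voxceleb1 from the official website. Note that you need to apply for a download account yourself. If the official download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all the parts and concatenate them into a single zip file:
```
cat vox1_dev* > vox1_dev_wav.zip
```
Download the Dev and Test files of voxceleb2 from the official website. Note that you need to apply for a download account yourself. If the official download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all the parts and concatenate them into a single zip file:
```
cat vox2_dev_aac* > vox2_aac.zip
```
After the downloads are complete, there will be four zip files: vox2_test_mp4.zip, vox2_aac.zip, vox1_test_wav.zip, and vox1_dev_wav.zip.

Execute the following commands: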

mkdir -p vox1_2/wav
unzip vox1_dev_wav.zip -d vox1
unzip vox1_test_wav.zip -d vox1

unzip vox2_aac.zip -d vox2

cp -r vox2/dev/aac/id*  vox1_2/wav
cp -r vox1/wav/id* vox1_2/wav

cp convert.sh vox1_2
cd vox1_2
bash ./convert.sh

Processed Data Structure

vox1_2
├── convert.sh
└── wav
   ├── id00012
      ....
   └── id11251

4.19. Deep Fake Detection Challenge Dataset Preparation Guide

Step

Apply for access to the Deep Fake Detection Challenge Dataset and download it.
Generate the RetinaFace ONNX model from this codebase using its script.
Run the command below:

python3 convert_dfdc.py --retinaface <path/to/retinaface/onnx> --dfdc_root <path/to/uncompressed/dfdc/dataset/root> --output <output/directory>

Processed Dataset Structure

data/dfdc
        └── dataset.pkl

4.20. Deep Fake Detection Challenge Dataset Preparation Guide

Step

Apply for access to the Deep Fake Detection Challenge Dataset and download it.
Generate the RetinaFace ONNX model using the script from this codebase.
Run the command below:

python3 convert_dfdc.py --retinaface <path/to/retinaface/onnx> --dfdc_root <path/to/uncompressed/dfdc/dataset/root> --output <output/directory>

Processed Dataset Structure

data/dfdc
        └── dataset.pkl

4.21. KITTI Dataset Preparation Guide

Step

Download the KITTI dataset from Kaggle or the official KITTI website, then uncompress it. The directory structure looks like:

    data/
        └── kitti/
            ├──ImageSets/
            │   ├── test.txt
            │   ├── train.txt
            │   └── val.txt
            ├── training/
            │   ├── image_2/
            │   ├── calib/
            │   ├── label_2/
            │   └── velodyne/
            ├── testing/
            │   ├── image_2/
            │   ├── calib/
            │   └── velodyne/
            └── classes_names.txt

4.22. CITYSCAPES Dataset Preparation Guide

Step

First, register an account on the Cityscapes website, then activate it and log in.
Download the dataset and annotations from the download page: the annotations are in gtFine_trainvaltest.zip and the images are in leftImg8bit_trainvaltest.zip.
Unzip the two files downloaded in step 2. The directory structure looks like:

cityscapes/
├── gtFine
│   ├── test
│   ├── train
│   └── val
└── leftImg8bit
    ├── test
    ├── train
    └── val

By convention, **labelTrainIds.png files are used for Cityscapes training. OpenMMLab provides a script based on cityscapesscripts to generate them; refer to its tutorial to generate the Cityscapes labels. Note that you may need to install mmcv and mmsegmentation before you can generate these labels; see their installation guide.

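4.23. CITYSCAPES Dataset Preparation

Step

First, register an account on the official Cityscapes website, then activate it and log in.
On the download page, download the dataset and annotations: the annotation file is gtFine_trainvaltest.zip and the dataset file is leftImg8bit_trainvaltest.zip.

Unzip the two files downloaded in step 2 so that the dataset directory structure is as follows: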

cityscapes/
├── gtFine
│   ├── test
│   ├── train
│   └── val
└── leftImg8bit
    ├── test
    ├── train
    └── val

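Usually, **labelTrainIds.png files are used to train on Cityscapes. OpenMMLab provides a script based on cityscapesscripts to generate **labelTrainIds.png. You can refer to its web page to generate suitable Cityscapes labels. Note that before you can generate these labels you may need to install mmcv and mmsegmentation; refer to their installation guide.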

4.24. ICDAR 2015 Dataset Preparation Guide

Step

Download the ICDAR 2015 test set (registration is required). After registering and logging in, download "Test Set Images" and "Test Set Ground Truth" from the section "Task 4.1: Text Localization (2015 edition)". Save the Test Set Images download in the folder ch4_test_images and the Test Set Ground Truth download in the folder ch4_test_localization_transcription_gt.

Decompress the test set as follows:

cd path/to/ch4_test_images
unzip ch4_test_images.zip
cd path/to/ch4_test_localization_transcription_gt
unzip ch4_test_localization_transcription_gt.zip

Download the PaddleOCR format annotation file and put it in the same folder as ch4_test_images.
Download https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json, put it in the same folder as ch4_test_images, and run python3 modify_directory.py.

Processed Dataset Structure

data/
├── ch4_test_images
│        ├── img_1.jpg
│        ├── img_2.jpg
|        └── ……
├── ch4_test_localization_transcription_gt
│        ├── gt_img_1.txt
│        ├── gt_img_2.txt
|        └── ……
├── instances_test.json
└── test_icdar2015_label.txt
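
The gt_img_*.txt files can be read with a sketch like the one below. It assumes the usual ICDAR 2015 ground-truth layout: eight comma-separated corner coordinates followed by the transcription, a UTF-8 byte order mark at the start of the file, and "###" marking regions to ignore. These details are assumptions not stated in this guide, and the function name parse_icdar2015_gt is hypothetical.

```python
def parse_icdar2015_gt(path):
    """Parse one ground-truth file into a list of quadrilateral annotations."""
    boxes = []
    # utf-8-sig drops a leading byte order mark if one is present.
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split only 8 times so commas inside the transcription survive.
            fields = line.split(",", 8)
            coords = [int(value) for value in fields[:8]]
            text = fields[8]
            boxes.append(
                {
                    "points": list(zip(coords[0::2], coords[1::2])),
                    "text": text,
                    "ignore": text == "###",
                }
            )
    return boxes
```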

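4.25. ICDAR 2015 Dataset Preparation Guide

Step

Download the ICDAR 2015 test set (registration is required). After registering and logging in, download "Test Set Images" and "Test Set Ground Truth" from "Task 4.1: Text Localization (2015 edition)". Save the Test Set Images download in the ch4_test_images folder and the Test Set Ground Truth download in the ch4_test_localization_transcription_gt folder.
Unzip the downloaded archives: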

cd path/to/ch4_test_images
unzip ch4_test_images.zip
cd path/to/ch4_test_localization_transcription_gt
unzip ch4_test_localization_transcription_gt.zip

Download the label file and put it in the same directory as ch4_test_images.
Download https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json, put it in the same directory as ch4_test_images, and run python3 modify_directory.py.

Processed Dataset Structure

data/
├── ch4_test_images
│        ├── img_1.jpg
│        ├── img_2.jpg
|        └── ……
├── ch4_test_localization_transcription_gt
│        ├── gt_img_1.txt
│        ├── gt_img_2.txt
|        └── ……
├── instances_test.json
└── test_icdar2015_label.txt

4.26. brats2019 Preprocessed Dataset Preparation Guide

Step

Download archive.zip and place it in a directory.

Extract it with unzip -d <data/> <archive.zip>. The directory structure looks like:

data/
    └── MICCAI_BraTS_2019_Data_Training
        |-- HGG
        |   |-- BraTS19_2013_10_1
        |   |   |-- BraTS19_2013_10_1_flair.nii
        |   |   |-- BraTS19_2013_10_1_seg.nii
        |   |   |-- BraTS19_2013_10_1_t1.nii
        |   |   |-- BraTS19_2013_10_1_t1ce.nii
        |   |   └── BraTS19_2013_10_1_t2.nii
        |   |-- BraTS19_2013_11_1
        |   └── ...
        |-- LGG
        |   |-- BraTS19_2013_0_1
        |   |   |-- BraTS19_2013_0_1_flair.nii
        |   |   |-- BraTS19_2013_0_1_seg.nii
        |   |   |-- BraTS19_2013_0_1_t1.nii
        |   |   |-- BraTS19_2013_0_1_t1ce.nii
        |   |   └── BraTS19_2013_0_1_t2.nii
        |   |-- BraTS19_2013_15_1
        |   └── ...
        |── name_mapping.csv
        └── survival_data.csv

Compress the .nii files into .nii.gz files with gzip -r data/MICCAI_BraTS_2019_Data_Training/HGG/* and gzip -r data/MICCAI_BraTS_2019_Data_Training/LGG/*. The directory structure then looks like:

data/
    └── MICCAI_BraTS_2019_Data_Training
        |-- HGG
        |   |-- BraTS19_2013_10_1
        |   |   |-- BraTS19_2013_10_1_flair.nii.gz
        |   |   |-- BraTS19_2013_10_1_seg.nii.gz
        |   |   |-- BraTS19_2013_10_1_t1.nii.gz
        |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
        |   |   └── BraTS19_2013_10_1_t2.nii.gz
        |   |-- BraTS19_2013_11_1
        |   └── ...
        |-- LGG
        |   |-- BraTS19_2013_0_1
        |   |   |-- BraTS19_2013_0_1_flair.nii.gz
        |   |   |-- BraTS19_2013_0_1_seg.nii.gz
        |   |   |-- BraTS19_2013_0_1_t1.nii.gz
        |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
        |   |   └── BraTS19_2013_0_1_t2.nii.gz
        |   |-- BraTS19_2013_15_1
        |   └── ...
        |── name_mapping.csv
        └── survival_data.csv
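
Where the gzip command line tool is not available, the compression step can be approximated with the sketch below. Like gzip -r, it replaces each .nii file with a .nii.gz file next to it. The function name gzip_nii_files is hypothetical, and the sketch makes no attempt to preserve timestamps or permissions.

```python
import gzip
import shutil
from pathlib import Path


def gzip_nii_files(root):
    """Compress every .nii file under root to .nii.gz and delete the original."""
    compressed = []
    # sorted() materialises the file list before any file is changed.
    for nii in sorted(Path(root).rglob("*.nii")):
        target = nii.with_name(nii.name + ".gz")
        with open(nii, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        nii.unlink()
        compressed.append(target)
    return compressed
```

Calling gzip_nii_files("data/MICCAI_BraTS_2019_Data_Training") would cover both the HGG and LGG folders in one pass.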

Download the model used for the later data preprocessing and extract it with unzip -d <data/> <fold_1.zip>. The directory structure looks like:

data/
    |-- MICCAI_BraTS_2019_Data_Training
    |   |-- HGG
    |   |   |-- BraTS19_2013_10_1
    |   |   |   |-- BraTS19_2013_10_1_flair.nii.gz
    |   |   |   |-- BraTS19_2013_10_1_seg.nii.gz
    |   |   |   |-- BraTS19_2013_10_1_t1.nii.gz
    |   |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
    |   |   |   └── BraTS19_2013_10_1_t2.nii.gz
    |   |   |-- BraTS19_2013_11_1
    |   |   └── ...
    |   |-- LGG
    |   |   |-- BraTS19_2013_0_1
    |   |   |   |-- BraTS19_2013_0_1_flair.nii.gz
    |   |   |   |-- BraTS19_2013_0_1_seg.nii.gz
    |   |   |   |-- BraTS19_2013_0_1_t1.nii.gz
    |   |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
    |   |   |   └── BraTS19_2013_0_1_t2.nii.gz
    |   |   |-- BraTS19_2013_15_1
    |   |   └── ...
    |   |── name_mapping.csv
    |   └── survival_data.csv
    |-- nnUNet
    |   └── 3d_fullres
    |       └── Task043_BraTS2019
    |           └── nnUNetTrainerV2__nnUNetPlansv2.mlperf.1
    |               |-- fold_1
    |               |   |-- debug.json
    |               |   |-- model_best.model
    |               |   |-- model_best.model.pkl
    |               |   |-- model_final_checkpoint.model
    |               |   |-- model_final_checkpoint.model.pkl
    |               |   |-- postprocessing.json
    |               |   |-- progress.png
    |               |   |-- training_log_2020_5_25_19_07_42.txt
    |               |   |-- training_log_2020_6_15_14_50_42.txt
    |               |   └── training_log_2020_6_8_08_12_03.txt
    |               └── plans.pkl
    └── joblog.log

4.27. brats2019 Dataset Preparation Guide

Step

Download archive.zip and place it in a folder.

Extract it into the data/ folder with unzip -d <data/> <archive.zip>. The directory structure is as follows:

data/
    └── MICCAI_BraTS_2019_Data_Training
        |-- HGG
        |   |-- BraTS19_2013_10_1
        |   |   |-- BraTS19_2013_10_1_flair.nii
        |   |   |-- BraTS19_2013_10_1_seg.nii
        |   |   |-- BraTS19_2013_10_1_t1.nii
        |   |   |-- BraTS19_2013_10_1_t1ce.nii
        |   |   └── BraTS19_2013_10_1_t2.nii
        |   |-- BraTS19_2013_11_1
        |   └── ...
        |-- LGG
        |   |-- BraTS19_2013_0_1
        |   |   |-- BraTS19_2013_0_1_flair.nii
        |   |   |-- BraTS19_2013_0_1_seg.nii
        |   |   |-- BraTS19_2013_0_1_t1.nii
        |   |   |-- BraTS19_2013_0_1_t1ce.nii
        |   |   └── BraTS19_2013_0_1_t2.nii
        |   |-- BraTS19_2013_15_1
        |   └── ...
        |── name_mapping.csv
        └── survival_data.csv

Compress all .nii files in the data/MICCAI_BraTS_2019_Data_Training folder into .nii.gz format with gzip -r data/MICCAI_BraTS_2019_Data_Training/HGG/* and gzip -r data/MICCAI_BraTS_2019_Data_Training/LGG/*. The directory structure is as follows:

data/
    └── MICCAI_BraTS_2019_Data_Training
        |-- HGG
        |   |-- BraTS19_2013_10_1
        |   |   |-- BraTS19_2013_10_1_flair.nii.gz
        |   |   |-- BraTS19_2013_10_1_seg.nii.gz
        |   |   |-- BraTS19_2013_10_1_t1.nii.gz
        |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
        |   |   └── BraTS19_2013_10_1_t2.nii.gz
        |   |-- BraTS19_2013_11_1
        |   └── ...
        |-- LGG
        |   |-- BraTS19_2013_0_1
        |   |   |-- BraTS19_2013_0_1_flair.nii.gz
        |   |   |-- BraTS19_2013_0_1_seg.nii.gz
        |   |   |-- BraTS19_2013_0_1_t1.nii.gz
        |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
        |   |   └── BraTS19_2013_0_1_t2.nii.gz
        |   |-- BraTS19_2013_15_1
        |   └── ...
        |── name_mapping.csv
        └── survival_data.csv

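Model files for data preprocessing are provided. Download the model and extract it into the data/ folder with unzip -d <data/> <fold_1.zip>. The directory structure is as follows: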

data/
    |-- MICCAI_BraTS_2019_Data_Training
    |   |-- HGG
    |   |   |-- BraTS19_2013_10_1
    |   |   |   |-- BraTS19_2013_10_1_flair.nii.gz
    |   |   |   |-- BraTS19_2013_10_1_seg.nii.gz
    |   |   |   |-- BraTS19_2013_10_1_t1.nii.gz
    |   |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
    |   |   |   └── BraTS19_2013_10_1_t2.nii.gz
    |   |   |-- BraTS19_2013_11_1
    |   |   └── ...
    |   |-- LGG
    |   |   |-- BraTS19_2013_0_1
    |   |   |   |-- BraTS19_2013_0_1_flair.nii.gz
    |   |   |   |-- BraTS19_2013_0_1_seg.nii.gz
    |   |   |   |-- BraTS19_2013_0_1_t1.nii.gz
    |   |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
    |   |   |   └── BraTS19_2013_0_1_t2.nii.gz
    |   |   |-- BraTS19_2013_15_1
    |   |   └── ...
    |   |── name_mapping.csv
    |   └── survival_data.csv
    |-- nnUNet
    |   └── 3d_fullres
    |       └── Task043_BraTS2019
    |           └── nnUNetTrainerV2__nnUNetPlansv2.mlperf.1
    |               |-- fold_1
    |               |   |-- debug.json
    |               |   |-- model_best.model
    |               |   |-- model_best.model.pkl
    |               |   |-- model_final_checkpoint.model
    |               |   |-- model_final_checkpoint.model.pkl
    |               |   |-- postprocessing.json
    |               |   |-- progress.png
    |               |   |-- training_log_2020_5_25_19_07_42.txt
    |               |   |-- training_log_2020_6_15_14_50_42.txt
    |               |   └── training_log_2020_6_8_08_12_03.txt
    |               └── plans.pkl
    └── joblog.log

4.28. WIKI zh_CN Dataset Preparation Guide

Step

Download the checkpoint of the model you want to verify.
Generate the dataset with the following commands:

pip3 install -r requirements.txt
python3 preprocess_wiki_zh.py --ckpt <path/to/ckpt>

Processed Dataset Structure

data/wiki_zh/
        └── wiki_zh_test.txt

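4.29. WIKI zh_CN Dataset Preparation Guide

Step

Download the checkpoint of the model to be verified.
Generate the dataset with the commands below:

pip3 install -r requirements.txt
python3 preprocess_wiki_zh.py --ckpt <path/to/ckpt>

Processed Dataset Structure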

data/wiki_zh/
        └── wiki_zh_test.txt

4.30. Widerface Dataset Preparation Guide

Step

Download the following files and put them under one directory.

file	url
WIDER_val.zip	https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing
wider_easy_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_easy_val.mat
wider_face_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_face_val.mat
wider_hard_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_hard_val.mat
wider_medium_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_medium_val.mat

Run the commands below:

pip3 install -r requirements.txt
python3 convert_widerface.py --input_path=<path/to/the/directory/containing/the/downloaded/files> --output_path=<path/to/data>

Processed Dataset Structure

data/widerface/
        ├── annotations
        └── WIDER_val

4.31. Widerface Dataset Preparation Guide

Step

Download the following files and put them in one folder.

file	url
WIDER_val.zip	https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing
wider_face_split.zip	http://shuoyang1213.me/WIDERFACE/support/bbx_annotation/wider_face_split.zip
retinaface_gt_v1.1.zip	https://pan.baidu.com/s/1Laby0EctfuJGgGMgRRgykA
wider_easy_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_easy_val.mat
wider_face_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_face_val.mat
wider_hard_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_hard_val.mat
wider_medium_val.mat	https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_medium_val.mat

Run the commands below:

pip3 install -r requirements.txt
python3 convert_widerface.py --input_path=<path/to/the/folder/you/just/created> --output_path=<path/to/target>

Processed Dataset Structure

data/widerface/
        ├── annotations
        └── WIDER_val

4.32. MOT-16 Dataset Preparation Guide

Step

The MOT-16 dataset is available from https://motchallenge.net/data/MOT16/; the direct download link is https://motchallenge.net/data/MOT16.zip.
After downloading the dataset, unzip the zip file.

Processed Dataset Structure

MOT16/
├── test
│   ├── MOT16-01
│   │   ├── det
│   │   │   └── det.txt
│   │   ├── img1
│   │   │   ├── 000001.jpg
│   │   │   ├── xxxxxx.jpg
│   │   │   └── 000450.jpg
│   │   └── seqinfo.ini
│   ├── MOT16-03
│   ├── MOT16-06
│   ├── MOT16-07
│   ├── MOT16-08
│   ├── MOT16-12
│   └── MOT16-14
│
└── train
    ├── MOT16-02
    │   ├── det
    │   │   └── det.txt
    │   ├── gt
    │   │   └── gt.txt
    │   ├── img1
    │   │   ├── 000001.jpg
    │   │   ├── xxxxxx.jpg
    │   │   └── 000600.jpg
    │   └── seqinfo.ini
    ├── MOT16-04
    ├── MOT16-05
    ├── MOT16-09
    ├── MOT16-10
    ├── MOT16-11
    └── MOT16-13
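
Each sequence folder carries a seqinfo.ini file with its metadata. The sketch below reads it with the standard configparser module. It assumes the file has a [Sequence] section with the keys name, imDir, frameRate, seqLength, and imExt, as is usual for MOTChallenge data; those key names are assumptions not listed in this guide, and the function name read_seqinfo is hypothetical.

```python
import configparser


def read_seqinfo(path):
    """Read the [Sequence] section of a seqinfo.ini file into a dict."""
    parser = configparser.ConfigParser()
    with open(path, encoding="utf-8") as f:
        parser.read_file(f)
    # Option lookups are case-insensitive with the default ConfigParser.
    seq = parser["Sequence"]
    return {
        "name": seq["name"],
        "image_dir": seq["imDir"],
        "frame_rate": seq.getint("frameRate"),
        "length": seq.getint("seqLength"),
        "extension": seq["imExt"],
    }
```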

4.33. MOT-16 Dataset Preparation Guide

Step

The MOT-16 dataset is available at https://motchallenge.net/data/MOT16/; the download link is https://motchallenge.net/data/MOT16.zip
After downloading, unzip the zip file

Dataset Structure

MOT16/
├── test
│   ├── MOT16-01
│   │   ├── det
│   │   │   └── det.txt
│   │   ├── img1
│   │   │   ├── 000001.jpg
│   │   │   ├── xxxxxx.jpg
│   │   │   └── 000450.jpg
│   │   └── seqinfo.ini
│   ├── MOT16-03
│   ├── MOT16-06
│   ├── MOT16-07
│   ├── MOT16-08
│   ├── MOT16-12
│   └── MOT16-14
│
└── train
    ├── MOT16-02
    │   ├── det
    │   │   └── det.txt
    │   ├── gt
    │   │   └── gt.txt
    │   ├── img1
    │   │   ├── 000001.jpg
    │   │   ├── xxxxxx.jpg
    │   │   └── 000600.jpg
    │   └── seqinfo.ini
    ├── MOT16-04
    ├── MOT16-05
    ├── MOT16-09
    ├── MOT16-10
    ├── MOT16-11
    └── MOT16-13

4.34. Bert-qa Preprocessed Dataset Preparation Guide

Step

Open the BERT SQuAD dev data online and save it locally as dev-v1.1.json.
Open the evaluation script evaluate-v1.1.py online and save it locally as evaluate-v1.1.py.
Download BERT-Base, Uncased (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) and take vocab.txt from it.
Copy the files into the target directory. The directory structure looks like:

    data
      └──bert_qa
           |── vocab.txt
           |── evaluate-v1.1.py
           └── dev-v1.1.json
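
To confirm that dev-v1.1.json was saved intact, it can be walked with a sketch like the one below. It assumes the standard SQuAD v1.1 layout, in which the top-level "data" list holds articles, each with "paragraphs" that carry a "qas" list; the function name iter_squad_questions is hypothetical.

```python
import json


def iter_squad_questions(path):
    """Yield (question id, question text, reference answers) from a SQuAD file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [answer["text"] for answer in qa["answers"]]
                yield qa["id"], qa["question"], answers
```

Counting the items with sum(1 for _ in iter_squad_questions("data/bert_qa/dev-v1.1.json")) gives the number of questions in the file.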

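4.35. Bert-qa Dataset Preparation

Step

Open the BERT SQuAD dev data and copy it to a local file named dev-v1.1.json
Open evaluate-v1.1.py and copy it to a local file named evaluate-v1.1.py
Download BERT-Base, Uncased (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) and copy vocab.txt to the corresponding directory

After all files are saved to the corresponding directory, the directory structure is as follows: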

 data
   └──bert_qa
        |── vocab.txt
        |── evaluate-v1.1.py
        └── dev-v1.1.json

4.63. LFW Dataset Preparation Guide

Step

Install facexlib:

pip3 install facexlib==0.3.0

Download LFW and uncompress it.
Run the command below:

python3 preprocess_lfw_data.py --img_root <path/to/uncompressed/lfw/directory> --out_root <path/to/output/directory>

Processed Dataset Structure

root_dir/
        ├── real_imgs
        └── distort_imgs

4.64. LFW Dataset Preparation Guide

Step

Install facexlib:

pip3 install facexlib==0.3.0

Download LFW and uncompress it.
Run the command below:

python3 preprocess_lfw_data.py --img_root <path/to/uncompressed/lfw/directory> --out_root <path/to/output/directory>

Processed Dataset Structure

root_dir/
        ├── real_imgs
        └── distort_imgs

4.56. AFW Preprocessed Dataset Preparation Guide

Step

Download the images of AFW

Extract the archive with unzip -d <data/AFW> <afw_images.zip>, then rename the image folder with mv data/AFW/afw_images data/AFW/images. The directory structure looks like:

data
    └── AFW
        └── images
            ├── 1004109301.jpg
            ├── 1051618982.jpg
            ├── ...
            ├── README
            └── anno.mat

4.57. AFW Dataset Preparation

Step

Download AFW

Extract afw_images.zip with unzip -d <data/AFW> <afw_images.zip>, then run mv data/AFW/afw_images data/AFW/images. The directory structure is as follows:

data
    └── AFW
        └── images
            ├── 1004109301.jpg
            ├── 1051618982.jpg
            ├── ...
            ├── README
            └── anno.mat

4.56. tnews Preprocessed Dataset Preparation Guide

Step

Download the tnews test data online.
Copy the text into a local file named test.txt. The directory structure looks like:

```
data
    └── test.txt
```

4.57. tnews Dataset Preparation

Step

Download the tnews test data
Copy the text content of the file and save it to test.txt. The directory structure is as follows:
```
data
    └── test.txt
```

4.42. IJBB Dataset Preparation Guide

Step

Download the facial crops of the IJBB dataset.
Uncompress the downloaded ijb-testsuite.tar to get IJBB.zip. Uncompress IJBB.zip to get the directories loose_crop and meta.
Organize the directories into the following structure.

Processed Dataset Structure

data/
├── loose_crop
└── meta

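4.43. IJBB Dataset Preparation Guide

Step

Download the face-crop validation set of the IJBB dataset.
Uncompress the downloaded ijb-testsuite.tar to get IJBB.zip. Uncompress IJBB.zip to get the loose_crop and meta folders.
Organize the folders into the following structure.

Processed Dataset Structure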

data/
├── loose_crop
└── meta

4.44. Kinetics400 Dataset Preparation Guide

Step

Download the Kinetics 400 validation video archive kinetics_400_val_320.tar.
Download the Kinetics 400 label file kinetics_val_list.txt.
Extract kinetics_400_val_320.tar and rename the directory kinetics_400_val_320 to val_320.
Organize these files into the structure shown below.

Processed Dataset Structure

data/kinetics400/
        ├── label
        │     └── kinetics_val_list.txt
        └── val_320

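4.45. Kinetics400 Dataset Preparation Guide

Step

Download the Kinetics evaluation video archive kinetics_400_val_320.tar.
Download the Kinetics400 label file kinetics_val_list.txt.
Extract kinetics_400_val_320.tar and rename the folder kinetics_400_val_320 to val_320.
Organize these files into the following structure.

Processed Dataset Structure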

data/kinetics400/
        ├── label
        │     └── kinetics_val_list.txt
        └── val_320

4.46. ImageNet Dataset Preparation Guide

Step

Download ILSVRC2012_img_val.tar from https://image-net.org/challenges/LSVRC/2012/ (registration required)

Extract the archive:

mkdir val
tar -xvf ILSVRC2012_img_val.tar -C val/

Download the labels:

wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_2012_validation_synset_labels.txt
wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_lsvrc_2015_synsets.txt

Put the images into category folders (skip this step if a flat directory structure is needed):

python3 preprocess_imagenet_validation_data.py val/ imagenet_2012_validation_synset_labels.txt imagenet_lsvrc_2015_synsets.txt
cp imagenet_2012_validation_synset_labels.txt val/synset_labels.txt

Generate val_map.txt:

python3 convert_imagenet.py val/ imagenet_2012_validation_synset_labels.txt imagenet_lsvrc_2015_synsets.txt val/val_map.txt

Move val under data/:
```
mkdir -p data
mv val data/
```

Processed Dataset Structure

data/val/
      ├── n01440764
      │   ├── ILSVRC2012_val_00000293.JPEG
      │   ├── ILSVRC2012_val_00002138.JPEG
      |   └── ……
      ……
      └── val_map.txt

val_map.txt maps each image path to its label, for example:

./n01751748/ILSVRC2012_val_00000001.JPEG 65
./n09193705/ILSVRC2012_val_00000002.JPEG 970
./n02105855/ILSVRC2012_val_00000003.JPEG 230
./n04263257/ILSVRC2012_val_00000004.JPEG 809
……
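
A file in this format can be read back with a few lines of Python. This is a minimal sketch; the function name `parse_val_map` is hypothetical, and it assumes each non-empty line holds a relative path followed by an integer label, as in the sample above.

```python
def parse_val_map(text):
    """Parse val_map.txt content into a list of (relative_path, label) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The label is the last whitespace-separated field on the line.
        path, label = line.rsplit(maxsplit=1)
        entries.append((path, int(label)))
    return entries


sample = """./n01751748/ILSVRC2012_val_00000001.JPEG 65
./n09193705/ILSVRC2012_val_00000002.JPEG 970
"""
entries = parse_val_map(sample)
print(entries[0])  # ('./n01751748/ILSVRC2012_val_00000001.JPEG', 65)
```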

4.47. VCTK-Corpus Dataset Preparation¶

Step¶

Download the VCTK-Corpus dataset from Kaggle.

The download is an archive.zip file. Extract it under ./data/; the directory structure is as follows:

data/VCTK-Corpus/
        ├── COPYING
        ├── NOTE
        ├── README
        ├── speaker-info.txt
        ├── txt
        └── wav48
            ├── p225
            │   ├── p225_001.wav
            │   ├── ...
            │   └── p225_366.wav
            ├── ...
            └── p376

4.48. DIV2K Dataset Preparation Guide¶

Step¶

Download DIV2K_valid_HR.zip, DIV2K_valid_LR_bicubic_X4.zip. Put these files under one directory.
Run commands below

pip3 install -r requirements.txt
python3 convert_div2k.py --input_path=<path/to/the/directory/containing/the/two/files> --output_path=<path/to/data>

Processed Dataset Structure¶

data/
├── DIV2K_valid_HR
└── DIV2K_valid_LR_bicubic

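4.49. DIV2K Dataset Preparation Guide¶

Step¶

Download DIV2K_valid_HR.zip and DIV2K_valid_LR_bicubic_X4.zip. Put both files in one directory.
Run the commands below

pip3 install -r requirements.txt
python3 convert_div2k.py --input_path=<path/to/the/directory/you/just/created> --output_path=<path/to/target>

Processed Dataset Structure¶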

data/
├── DIV2K_valid_HR
└── DIV2K_valid_LR_bicubic

4.50. Segment Anything Prompt Dataset Preparation Guide¶

Step¶

Install segment-anything
Download pretrained model
Generate prompt dataset

Run commands as below¶

install segment-anything and copy images

git clone https://github.com/facebookresearch/segment-anything.git
cd segment-anything
git checkout 6fdee8f
python3 setup.py install
cp -r ./notebooks/images ../
cd ..

install requirements

pip3 install -r requirements.txt

download pretrained model

mkdir models
cd models
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
cd ..

generate prompt dataset

mkdir -p SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_h_4b8939.pth --image_path ./images --save_path ./SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_l_0b3195.pth --image_path ./images --save_path ./SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_b_01ec64.pth --image_path ./images --save_path ./SAM/prompt

Processed Dataset Structure¶

./SAM/prompt
         ├── sam_vit_h
         │        ├── annotations
         │        │      ├── vit_h-dog-sample-0.npz
         │        │      ├── vit_h-dog-sample-1.npz
         │        │      ...
         │        │      └── vit_h-truck-sample-4.npz
         │        └── images
         │              ├── dog.jpg
         │              ├── groceries.jpg
         │              └── truck.jpg
         ├── sam_vit_b
         │        ├── annotations
         │        │      ├── vit_h-dog-sample-0.npz
         │        │      ├── vit_h-dog-sample-1.npz
         │        │      ...
         │        │      └── vit_h-truck-sample-4.npz
         │        └── images
         │              ├── dog.jpg
         │              ├── groceries.jpg
         │              └── truck.jpg
         └──sam_vit_l
                 ├── annotations
                 │      ├── vit_h-dog-sample-0.npz
                 │      ├── vit_h-dog-sample-1.npz
                 │      ...
                 │      └── vit_h-truck-sample-4.npz
                 └── images
                        ├── dog.jpg
                        ├── groceries.jpg
                        └── truck.jpg

4.51. MNIST Dataset Preparation Guide¶

Step¶

Download the MNIST training and test files and place them under data/

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Processed Dataset Structure¶

   data/
   ├── t10k-images-idx3-ubyte.gz
   ├── t10k-labels-idx1-ubyte.gz
   ├── train-images-idx3-ubyte.gz
   └── train-labels-idx1-ubyte.gz
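
The downloaded files are gzip-compressed IDX files and can be read without extra dependencies. The sketch below relies on the publicly documented IDX layout (big-endian header, magic number 2051 for image files and 2049 for label files); the function names are hypothetical and not part of this project.

```python
import gzip
import struct


def read_idx_images(path):
    """Return (count, rows, cols, raw_pixels) from a gzipped IDX3 image file."""
    with gzip.open(path, "rb") as f:
        # Header: four big-endian unsigned 32-bit integers.
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"unexpected image magic number {magic}")
        pixels = f.read(count * rows * cols)
    return count, rows, cols, pixels


def read_idx_labels(path):
    """Return the labels from a gzipped IDX1 label file as a list of ints."""
    with gzip.open(path, "rb") as f:
        # Header: two big-endian unsigned 32-bit integers.
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"unexpected label magic number {magic}")
        return list(f.read(count))
```

For example, `read_idx_labels("data/t10k-labels-idx1-ubyte.gz")` returns one integer label per test image.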

4.52. MNIST Dataset Preparation Guide¶

Step¶

Download the MNIST dataset from the official website.

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Processed Dataset Structure¶

   data/
   ├── t10k-images-idx3-ubyte.gz
   ├── t10k-labels-idx1-ubyte.gz
   ├── train-images-idx3-ubyte.gz
   └── train-labels-idx1-ubyte.gz

4.53. COCO 2017 Dataset Preparation Guide¶

Step¶

Download COCO 2017. Put it under a directory.
Run commands below

pip3 install -r requirements.txt
python3 convert_coco2017.py --input_path=<path/to/the/directory/containing/coco2017.zip> --output_path=<path/to/data>

Processed Data Structure¶

data/COCO/
        ├── annotations
        ├── test2017
        ├── train2017
        └── val2017
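
After conversion, a quick check can confirm that the expected subdirectories exist. This is a minimal sketch; `missing_subdirs` is a hypothetical helper, and the expected names are taken from the structure listed above.

```python
from pathlib import Path


def missing_subdirs(root="data/COCO",
                    expected=("annotations", "test2017", "train2017", "val2017")):
    """Return the expected subdirectory names that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list means the layout matches; any returned names point to a step that did not complete.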

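4.54. COCO 2017 Dataset Preparation Guide¶

Step¶

Download COCO 2017. Put it under a directory.
Run the commands below

pip3 install -r requirements.txt
python3 convert_coco2017.py --input_path=<path/to/the/directory/you/just/created> --output_path=<path/to/target>

Processed Dataset Structure¶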

data/COCO/
        ├── annotations
        ├── test2017
        ├── train2017
        └── val2017

4.55. Segment Anything Automask Dataset Preparation Guide¶

Step¶

Install requirements
Download dataset
Extract evaluation data

Run commands as below¶

install requirements

pip3 install -r requirements.txt

download dataset and extract evaluation data

python3 prepare_sam_automask_data.py

Processed Dataset Structure¶

./SAM/automask
         ├── sa_1.jpg
         ├── sa_1.json
         ├── sa_2.jpg
         ├── sa_2.json
         ...
         ├── sa_10.jpg
         └── sa_10.json
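
Each image in this layout is expected to have a matching JSON annotation. The sketch below checks that pairing; `unpaired_samples` is a hypothetical helper and assumes only the file naming shown above.

```python
from pathlib import Path


def unpaired_samples(root="SAM/automask"):
    """Return sample names that have a .jpg without a .json, or the reverse."""
    root = Path(root)
    images = {p.stem for p in root.glob("*.jpg")}
    annotations = {p.stem for p in root.glob("*.json")}
    # The symmetric difference holds names present in only one of the two sets.
    return sorted(images ^ annotations)
```

An empty result means every `sa_N.jpg` has a corresponding `sa_N.json`.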

4.56. AFW preprocessed Dataset Preparation Guide¶

Step¶

Open the ERNIE dev_data text online.
Copy the text into a local file named dev_1.txt. The directory structure looks like:

   data/ernie
           └── dev_1.txt

4.57. AFW Dataset Preparation¶

Step¶

Open the ERNIE dev_data text.
Copy the text in the file and save it to dev_1.txt. The directory structure is as follows:
```
data/ernie
        └── dev_1.txt
```

4.58. Face MS1M validation data Preparation Guide¶

Step¶

Download faces_ms1m_112x112.zip. Put it under a directory.
Run commands below

# Python 3.7 is recommended
pip3 install -r requirements.txt
python3 convert_ms1m_face.py --input_data_dir=<path/to/the/directory/containing/the/file> --output_data_dir=<path/to/the/converted/data>

Processed Dataset Structure¶

converted_ms1m_face
   ├── agedb_30.bin
   ├── cfp_ff.bin
   ├── cfp_fp.bin
   └── lfw.bin

4.59. LPRNet validation data Preparation Guide¶

Step¶

The LPRNet validation data is in the repository https://github.com/sirius-ai/LPRNet_Pytorch, at commit 7c976664b3f3879efabeaff59c7a117e49d5f29e.
Run commands below

git clone https://github.com/sirius-ai/LPRNet_Pytorch.git
cd LPRNet_Pytorch/data/test
cp *.jpg <target/folder>

Processed Dataset Structure¶

<target folder>
   ├── 京PL3N67.jpg
   ├── 川JK0707.jpg
   ├── ...
   └── 鲁R8D57Z.jpg
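
The file names in this listing look like the plate text itself. Under that assumption (it is not confirmed by this guide), ground-truth labels can be collected from the file names alone. `plate_labels` is a hypothetical helper:

```python
from pathlib import Path


def plate_labels(folder):
    """Map each .jpg file name to its stem, assumed here to be the plate text."""
    return {p.name: p.stem for p in sorted(Path(folder).glob("*.jpg"))}
```

If the evaluation code derives labels differently, for example by stripping a suffix from the name, this mapping would need the same adjustment.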

4.60. LPRNet Dataset Preparation Guide¶

Step¶

The LPRNet test data is in the repository https://github.com/sirius-ai/LPRNet_Pytorch, at commit 7c976664b3f3879efabeaff59c7a117e49d5f29e.
Run the commands below

git clone https://github.com/sirius-ai/LPRNet_Pytorch.git
cd LPRNet_Pytorch/data/test
cp *.jpg <target/folder>

Dataset Structure¶

<target folder>
   ├── 京PL3N67.jpg
   ├── 川JK0707.jpg
   ├── ...
   └── 鲁R8D57Z.jpg

4.61. MiniGO Dataset Preparation Guide¶

Generation¶

pip3 install -r requirements.txt
for i in {001..100} ; do PYTHONPATH=../../../ python3 selfplay.py --load_file=minigo-op13-fp32-N.onnx --num_readouts 10 --verbose 3 --selfplay_dir=data/selfplay --holdout_dir=data/holdout --sgf_dir=data/sgf; done

Processed Dataset Structure¶

data/
├── selfplay
├── holdout
└── sgf

4.62. MiniGO Dataset Generation Guide¶

Generation¶

pip3 install -r requirements.txt
for i in {001..100} ; do PYTHONPATH=../../../ python3 selfplay.py --load_file=minigo-op13-fp32-N.onnx --num_readouts 10 --verbose 3 --selfplay_dir=data/selfplay --holdout_dir=data/holdout --sgf_dir=data/sgf; done

Processed Dataset Structure¶

data/
├── selfplay
├── holdout
└── sgf

4.63. LFW Dataset Preparation Guide¶

Step¶

Download the facenet code base.
Download LFW and the pair list (pairs.txt), then uncompress them.
Run commands below

cd <path/to/facenet>
pip3 install -r requirements.txt

for N in {1..4}
do
PYTHONPATH=src python3 src/align/align_dataset_mtcnn.py <path/to/uncompressed/lfw/directory> <path/to/output/directory> --image_size 160 --margin 32 --random_order --gpu_memory_fraction 0.25
done

mkdir -p data/lfw
cp -r <path/to/output/directory> data/lfw/lfw
cp <path/to/pairs.txt> data/lfw

Processed Dataset Structure¶

data/lfw/
        ├── lfw
        └── pairs.txt
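
For reference, pairs.txt can be parsed as follows. This is a minimal sketch assuming the standard LFW layout: a header line, then three-field lines for same-person pairs (name, index, index) and four-field lines for different-person pairs (name, index, name, index). The function name `parse_pairs` is hypothetical.

```python
def parse_pairs(text):
    """Return a list of (name1, index1, name2, index2, is_same) tuples."""
    pairs = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line is a header, not a pair
        fields = line.split()
        if len(fields) == 3:
            # Same person: one name, two image indices.
            name, i, j = fields
            pairs.append((name, int(i), name, int(j), True))
        elif len(fields) == 4:
            # Different people: two names, one image index each.
            name1, i, name2, j = fields
            pairs.append((name1, int(i), name2, int(j), False))
    return pairs


sample = """10 300
Abel_Pacheco 1 4
Abdel_Madi_Shabneh 1 Dean_Barker 1
"""
pairs = parse_pairs(sample)
print(pairs[0])  # ('Abel_Pacheco', 1, 'Abel_Pacheco', 4, True)
```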

4.64. LFW Dataset Preparation Guide¶

Step¶

Download the facenet code.
Download LFW and pairs.txt, then uncompress them.
Run the commands below

cd <path/to/facenet>
pip3 install -r requirements.txt
for N in {1..4}
do
PYTHONPATH=src python3 src/align/align_dataset_mtcnn.py <path/to/uncompressed/lfw/directory> <path/to/output/directory> --image_size 160 --margin 32 --random_order --gpu_memory_fraction 0.25
done
mkdir -p data/lfw
cp -r <path/to/output/directory> data/lfw/lfw
cp <path/to/pairs.txt> data/lfw

Processed Dataset Structure¶

data/lfw/
        ├── lfw
        └── pairs.txt

4.65. LibriSpeech preprocessed Dataset Preparation Guide¶

Step¶

1. Dataset download¶

Run the commands below to download the LibriSpeech dataset

mkdir -p data/LibriSpeech
python3 download_librispeech.py  ./librispeech-inference.csv ./data/LibriSpeech   -e ./data

2. Data processing¶

Run the command below to convert the dataset and generate a JSON file

python3 convert_librispeech.py --input_dir ./data/LibriSpeech/dev-clean --dest_dir ./data/dev-clean-wav --output_json ./data/dev-clean-wav.json

3. Processed Data Structure¶

./data
├── dev-clean-wav
│   ├── 1272
│   ├── 1462
│   ├── 1673
│   ├── 174
│   ├── 1919
│   ├── 1988
│   ├── 1993
│   ├── 2035
│   ├── 2078
│   ├── 2086
│   ├── 2277
│   ├── 2412
│   ├── 2428
│   ├── 251
│   ├── 2803
│   ├── 2902
│   ├── 3000
│   ├── 3081
│   ├── 3170
│   ├── 3536
│   ├── 3576
│   ├── 3752
│   ├── 3853
│   ├── 422
│   ├── 5338
│   ├── 5536
│   ├── 5694
│   ├── 5895
│   ├── 6241
│   ├── 6295
│   ├── 6313
│   ├── 6319
│   ├── 6345
│   ├── 652
│   ├── 777
│   ├── 7850
│   ├── 7976
│   ├── 8297
│   ├── 84
│   └── 8842
├── dev-clean-wav.json
└── LibriSpeech
    ├── BOOKS.TXT
    ├── CHAPTERS.TXT
    ├── dev-clean
    ├── dev-clean.tar.gz
    ├── LICENSE.TXT
    ├── README.TXT
    └── SPEAKERS.TXT

4.66. LibriSpeech Dataset Preparation Guide¶

Step¶

1. Dataset download¶

Run the commands below to download LibriSpeech locally

mkdir -p data/LibriSpeech
python3 download_librispeech.py  ./librispeech-inference.csv ./data/LibriSpeech   -e ./data

2. Data processing¶

Use the script below to convert LibriSpeech into JSON format

python3 convert_librispeech.py --input_dir ./data/LibriSpeech/dev-clean --dest_dir ./data/dev-clean-wav --output_json ./data/dev-clean-wav.json

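3. Directory structure¶

After downloading and processing the dataset as described above, the data directory should have the following structure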

./data
├── dev-clean-wav
│   ├── 1272
│   ├── 1462
│   ├── 1673
│   ├── 174
│   ├── 1919
│   ├── 1988
│   ├── 1993
│   ├── 2035
│   ├── 2078
│   ├── 2086
│   ├── 2277
│   ├── 2412
│   ├── 2428
│   ├── 251
│   ├── 2803
│   ├── 2902
│   ├── 3000
│   ├── 3081
│   ├── 3170
│   ├── 3536
│   ├── 3576
│   ├── 3752
│   ├── 3853
│   ├── 422
│   ├── 5338
│   ├── 5536
│   ├── 5694
│   ├── 5895
│   ├── 6241
│   ├── 6295
│   ├── 6313
│   ├── 6319
│   ├── 6345
│   ├── 652
│   ├── 777
│   ├── 7850
│   ├── 7976
│   ├── 8297
│   ├── 84
│   └── 8842
├── dev-clean-wav.json
└── LibriSpeech
    ├── BOOKS.TXT
    ├── CHAPTERS.TXT
    ├── dev-clean
    ├── dev-clean.tar.gz
    ├── LICENSE.TXT
    ├── README.TXT
    └── SPEAKERS.TXT

4.67. PASCAL preprocessed Dataset Preparation Guide¶

Step¶

Download the PASCAL images.

Unzip the archive with unzip -d <data/PASCAL> <pascal_images.zip>, then rename the extracted folder with mv data/PASCAL/pascal_images data/PASCAL/images. The directory structure looks like:

data
    └── PASCAL
        └── images
            ├── 2007_000272.jpg
            ├── 2007_000664.jpg
            ├── ...
            ├── 2011_003272.jpg
            └── 2011_003273.jpg

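4.68. PASCAL Dataset Preparation¶

Step¶

Download PASCAL.

Unzip pascal_images.zip with unzip -d <data/PASCAL> <pascal_images.zip>, then rename the extracted folder with mv data/PASCAL/pascal_images data/PASCAL/images. The directory structure is as follows: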

data
    └── PASCAL
        └── images
            ├── 2007_000272.jpg
            ├── 2007_000664.jpg
            ├── ...
            ├── 2011_003272.jpg
            └── 2011_003273.jpg

4.69. Criteo Dataset Preparation Guide¶

Step¶

Log in to Criteo and download day_2.gz. Put it under a directory.
Run the preprocessing commands from the repository

git clone http://git.enflame.cn/sse_ard/Algo_internals/tree/master/scripts/34_hugectr

cd Algo_internals/scripts/34_hugectr/preprocess

bash preprocess.sh

Processed Data Structure¶

data/critero_data/
        └── val
             ├── sparse_embedding0.data
             ├── sparse_embedding1.data
             ├── ...
             └── sparse_embedding111.data

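4.70. Criteo Dataset Preparation Guide¶

Step¶

Log in to Criteo. Click the day_2.gz download button and put the file under a directory.
Run the data processing commands from the repository below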

git clone http://git.enflame.cn/sse_ard/Algo_internals/tree/master/scripts/34_hugectr

cd Algo_internals/scripts/34_hugectr/preprocess

bash preprocess.sh

Processed Data Structure¶

data/critero_data/
        └── val
             ├── sparse_embedding0.data
             ├── sparse_embedding1.data
             ├── ...
             └── sparse_embedding111.data