Appendix: Data Preparation¶
4.1. dagm2007 preprocessed Dataset Preparation Guide¶
Step¶
Create an account at https://hci.iwr.uni-heidelberg.de/node/3616.
Download the Class1.zip file.
Extract it into dagm2007/private/:
unzip -d <dagm2007/private/> <Class1.zip>
Generate test_list.csv:
python3 preprocess_dagm2007.py --data_dir=dagm2007/private/
After processing, the directory structure looks like:
dagm2007
└── private
    └── Class1
        |── Test
        |   └── Label
        |   |   |── 0002_label.PNG
        |   |   |── ...
        |   |   |── 0568_label.PNG
        |   |   └── Labels.txt
        |   |── 0001.PNG
        |   |── ...
        |   └── 0575.PNG
        |── ...
        └── test_list.csv
4.2. dagm2007 Dataset Preparation Guide¶
Step¶
Register an account at https://hci.iwr.uni-heidelberg.de/node/3616.
After registering, you receive a download link. Download Class1.zip, the data for class 1.
Extract it into the dagm2007/private/ folder:
unzip -d <dagm2007/private/> <Class1.zip>
Run the following command to generate test_list.csv:
python3 preprocess_dagm2007.py --data_dir=dagm2007/private/
After processing, the data structure is as follows:
dagm2007
└── private
    └── Class1
        |── Test
        |   └── Label
        |   |   |── 0002_label.PNG
        |   |   |── ...
        |   |   |── 0568_label.PNG
        |   |   └── Labels.txt
        |   |── 0001.PNG
        |   |── ...
        |   └── 0575.PNG
        |── ...
        └── test_list.csv
4.3. OCR Recognition LMDB Dataset Preparation Guide¶
Step¶
Download deep-text-recognition-benchmark dataset
Unzip the dataset
Extract evaluation data
Run commands as below¶
wget https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0 -O data.zip
mkdir ./tmp
unzip -d ./tmp data.zip
unzip ./tmp/evaluation.zip
mv ./evaluation ./data
rm -rf ./tmp
Processed Dataset Structure¶
data/
├── CUTE80
│ ├── data.mdb
│ └── lock.mdb
├── IC03_860
│ ├── data.mdb
│ └── lock.mdb
├── IC03_867
│ ├── data.mdb
│ └── lock.mdb
├── IC13_857
│ ├── data.mdb
│ └── lock.mdb
├── IC13_1015
│ ├── data.mdb
│ └── lock.mdb
├── IC15_1811
│ ├── data.mdb
│ └── lock.mdb
├── IC15_2077
│ ├── data.mdb
│ └── lock.mdb
├── IIIT5k_3000
│ ├── data.mdb
│ └── lock.mdb
├── SVT
│ ├── data.mdb
│ └── lock.mdb
└── SVTP
├── data.mdb
└── lock.mdb
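To confirm that extraction produced the layout above, a small check can list any dataset folder that is missing one of its two LMDB files. This is a minimal sketch, not part of the original tooling; the function name `find_incomplete_lmdb_dirs` is hypothetical, and it only assumes that every subfolder of `data/` should contain `data.mdb` and `lock.mdb`, as the tree shows.

```python
from pathlib import Path


def find_incomplete_lmdb_dirs(root):
    """Return names of subdirectories of root lacking data.mdb or lock.mdb."""
    required = ("data.mdb", "lock.mdb")
    incomplete = []
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        # A folder is complete only if both LMDB files are present.
        if not all((sub / name).is_file() for name in required):
            incomplete.append(sub.name)
    return incomplete
```

Calling `find_incomplete_lmdb_dirs("data")` after the commands above should return an empty list if every benchmark folder was extracted completely.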
4.4. ST-GCN preprocessed Dataset Preparation Guide¶
Step¶
ST-GCN uses the Kinetics and NTU datasets preprocessed by its authors, which can be downloaded directly.
Download from https://drive.google.com/open?id=103NOL9YYZSW1hLoWmYnv5Fs8mK-Ij7qb
unzip <path to st-gcn-processed-data.zip>
Processed Dataset Structure¶
data
|-- Kinetics
| `-- kinetics-skeleton
| |-- train_data.npy
| |-- train_label.pkl
| |-- val_data.npy
| `-- val_label.pkl
`-- NTU-RGB-D
|-- xsub
| |-- train_data.npy
| |-- train_label.pkl
| |-- val_data.npy
| `-- val_label.pkl
`-- xview
|-- train_data.npy
|-- train_label.pkl
|-- val_data.npy
`-- val_label.pkl
4.5. PASCAL VOC 2012 Preparation Guide¶
Step¶
Download archive.zip. Put it under a directory.
Run commands below
pip3 install -r requirements.txt
python3 convert_voc2012.py --input_path=<path/to/the/directory/containing/the/file> --output_path=<path/to/data>
Processed Dataset Structure¶
data/
└── VOC2012
4.6. PASCAL VOC 2012 Dataset Preparation Guide¶
Step¶
Download archive.zip. Put it under a folder.
Run the commands below
pip3 install -r requirements.txt
python3 convert_voc2012.py --input_path=<the folder you just created> --output_path=<target path>
Processed Dataset Structure¶
data/
└── VOC2012
4.7. sst2 preprocessed Dataset Preparation Guide¶
Step¶
Open the sst2 test data online.
Copy the text into a local file named sst2_test.tsv. The directory structure looks like:
data/sst2
└── sst2_test.tsv
4.8. sst2 Dataset Preparation¶
Step¶
Copy the text content of the file and save it to sst2_test.tsv. The directory structure is as follows:
data/sst2
└── sst2_test.tsv
4.9. China-people-daily-ner-corpus Dataset Preparation Guide¶
Step¶
Download china-people-daily-ner-corpus dataset
$ cd ${project_root}
$ mkdir data
$ cd data
$ wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
$ tar -zxvf china-people-daily-ner-corpus.tar.gz
Download vocabulary
The vocabulary file vocab.txt is also needed. Download it from the Hugging Face page (https://huggingface.co/bert-base-chinese/blob/main/vocab.txt) by clicking the down arrow next to vocab.txt, then move it to ./model/bert_crf/, creating the folders if they do not exist.
Processed Data Structure¶
After processing, the data and model folders contain the following files:
data
├── china-people-daily-ner-corpus
│ ├── example.dev
│ ├── example.test
│ └── example.train
model
└── bert_crf
└── vocab.txt
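As a quick sanity check on example.train, example.dev, and example.test, the files can be grouped into sentences. The sketch below assumes a layout this guide does not state: one token and one tag per line separated by whitespace, with blank lines between sentences. The function name `read_ner_sentences` is hypothetical.

```python
def read_ner_sentences(lines):
    """Group 'token tag' lines into (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence (assumed convention).
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        # Flush the last sentence when the input has no trailing blank line.
        sentences.append((tokens, tags))
    return sentences
```

For a real file, pass an open file handle, for example `read_ner_sentences(open("data/china-people-daily-ner-corpus/example.dev", encoding="utf-8"))`. If the files use a different layout, the parsing line needs adjusting.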
China People's Daily NER Corpus Dataset Preparation Guide¶
Step¶
Download the China People's Daily NER corpus dataset
$ cd ${project_root}
$ mkdir data
$ cd data
$ wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
$ tar -zxvf china-people-daily-ner-corpus.tar.gz
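Download the vocabulary file
The vocabulary file required by the model comes from this Hugging Face page (https://huggingface.co/bert-base-chinese/tree/main). Find vocab.txt, click the down arrow to its right to download it, then move the downloaded vocab.txt to the ./model/bert_crf folder (create the folder if it does not exist).
Processed Directory Structure¶
The processed directory structure is as follows: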
data
├── china-people-daily-ner-corpus
│ ├── example.dev
│ ├── example.test
│ └── example.train
model
└── bert_crf
└── vocab.txt
4.11. AN4 Dataset Preparation Guide¶
Step¶
Download an4. Put it under a directory.
Run commands below
pip3 install -r requirements.txt
python3 sph2wav.py --dir_path=<path/to/the/directory/you/containing/an4_sphere.tar.gz>
python3 build_mainfest.py --dataset_path=<path/to/the/directory/you/containing/an4test_clstk> --dir_path=<path/to/the/directory/you/output/test_manifest.json>
As far as we know, the an4 download link is currently inaccessible.
Processed Data Structure¶
data/
└── an4
├── etc
| ├──an4_test.transcription
| └── ...
├── wav
| ├── an4_clstk
| └── an4test_clstk
└── test_manifest.json
4.12. Market 1501 Dataset Preparation Guide¶
Step¶
Download the dataset from Baidu netdisk or Google Drive.
Generate dataset with the following command
python3 convert_market1501.py --dataset <path/to/the/compressed/dataset> --output <path/to/output/directory>
Processed Dataset Structure¶
data/
├── query
├── gt_query
├── gt_bbox
├── bounding_box_train
└── bounding_box_test
4.13. Market 1501 Dataset Preparation Guide¶
Step¶
python3 convert_market1501.py --dataset <path/to/the/compressed/dataset> --output <path/to/output/directory>
Processed Dataset Structure¶
data/
├── query
├── gt_query
├── gt_bbox
├── bounding_box_train
└── bounding_box_test
4.14. IIIT5K preprocessed Dataset Preparation Guide¶
Step¶
The IIIT 5K-word dataset is harvested from Google image search. Query words like billboards, signboard, house numbers, house name plates, movie posters were used to collect images. The dataset contains 5000 cropped word images from Scene Texts and born-digital images. The dataset is divided into train and test parts. This dataset can be used for large lexicon cropped word recognition. We also provide a lexicon of more than 0.5 million dictionary words with this dataset.
Download the dataset from http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K-Word_V3.0.tar.gz
tar xf <path to IIIT5K-Word_V3.0.tar.gz>
Download the label file from https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt
Put test_label.txt into <IIIT5K folder>.
Processed Dataset Structure¶
IIIT5K
├── lexicon.txt
├── README
├── test
├── testCharBound.mat
├── testdata.mat
├── test_label.txt
├── train
├── trainCharBound.mat
└── traindata.mat
4.15. People daily ner corpus subset Preparation Guide¶
Step¶
The dataset can be downloaded directly:
Download from https://github.com/TodoListIOS/NER-PyTorch/archive/refs/heads/master.zip
unzip <path to NER-PyTorch-master.zip>
The dataset is located in NER-PyTorch-master/data/ren_min_newspaper.
You can also get the dataset with git clone:
git clone https://github.com/TodoListIOS/NER-PyTorch.git
cd NER-PyTorch/data/ren_min_newspaper
The dev and train files under ren_min_newspaper are already processed and can be used directly for testing and training, respectively.
Processed Dataset Structure¶
data/
├── dev
└── train
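People's Daily NER Corpus Subset Preparation Guide¶
Step¶
The dataset can be downloaded directly:
Download from GitHub: https://github.com/TodoListIOS/NER-PyTorch/archive/refs/heads/master.zip
unzip <path to NER-PyTorch-master.zip>
The dataset is located in the NER-PyTorch-master/data/ren_min_newspaper directory.
You can also get the dataset with git clone:
git clone https://github.com/TodoListIOS/NER-PyTorch.git
cd NER-PyTorch/data/ren_min_newspaper
The dev and train files in the ren_min_newspaper folder are already processed and are used directly for testing and training, respectively.
Processed Dataset Structure¶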
data/
├── dev
└── train
4.17. voxceleb1 voxceleb2 dataset preparation guide¶
step¶
Download the Dev files and Test files of voxceleb1 from the official website. Note that you need to apply for a download account yourself. If the official website download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all files and concatenate into zip file
cat vox1_dev* > vox1_dev_wav.zip
Download the Dev files and Test files of voxceleb2 from the official website. Note that you need to apply for a download account yourself. If the official website download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all files and concatenate into zip file
cat vox2_dev_aac* > vox2_aac.zip
After the download is complete, there will be four zip files vox2_test_mp4.zip vox2_aac.zip vox1_test_wav.zip vox1_dev_wav.zip.
Execute the following command
mkdir -p vox1_2/wav
unzip vox1_dev_wav.zip -d vox1
unzip vox1_test_wav.zip -d vox1
unzip vox2_aac.zip -d vox2
cp -r vox2/dev/aac/id* vox1_2/wav
cp -r vox1/wav/id* vox1_2/wav
cp convert.sh vox1_2
cd vox1_2
bash ./convert.sh
processed data structure¶
vox1_2
├── convert.sh
└── wav
├── id00012
....
└── id11251
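The `cat vox1_dev* > vox1_dev_wav.zip` step joins the downloaded parts in name order. Where `cat` is unavailable, the same concatenation could be sketched in Python as below. The helper name `concatenate_parts` is hypothetical, and the sketch assumes the parts sort correctly by file name, which is what the shell glob relies on as well.

```python
import shutil
from pathlib import Path


def concatenate_parts(directory, prefix, output_name):
    """Join all files starting with prefix, in name order, into output_name."""
    directory = Path(directory)
    # Collect the parts before the output file is created, and skip the
    # output itself in case it is left over from an earlier run.
    parts = sorted(p for p in directory.glob(prefix + "*") if p.name != output_name)
    with open(directory / output_name, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, not all in memory
    return [p.name for p in parts]
```

For example, `concatenate_parts(".", "vox1_dev", "vox1_dev_wav.zip")` mirrors the first `cat` command, and `concatenate_parts(".", "vox2_dev_aac", "vox2_aac.zip")` mirrors the second.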
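4.18. voxceleb1 voxceleb2 dataset preparation guide¶
step¶
Download the Dev files and Test files of voxceleb1 from the official website. Note that you need to apply for a download account yourself. If the official website download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all files and concatenate them into a zip file
cat vox1_dev* > vox1_dev_wav.zip
Download the Dev files and Test files of voxceleb2 from the official website. Note that you need to apply for a download account yourself. If the official website download link is invalid, you can use https://mm.kaist.ac.kr/datasets/voxceleb/.
Download all files and concatenate them into a zip file
cat vox2_dev_aac* > vox2_aac.zip
After the download is complete, there will be four zip files: vox2_test_mp4.zip, vox2_aac.zip, vox1_test_wav.zip, and vox1_dev_wav.zip.
Execute the following commands
mkdir -p vox1_2/wav
unzip vox1_dev_wav.zip -d vox1
unzip vox1_test_wav.zip -d vox1
unzip vox2_aac.zip -d vox2
cp -r vox2/dev/aac/id* vox1_2/wav
cp -r vox1/wav/id* vox1_2/wav
cp convert.sh vox1_2
cd vox1_2
bash ./convert.sh
processed data structure¶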
vox1_2
├── convert.sh
└── wav
├── id00012
....
└── id11251
4.19. Deep Fake Detection Challenge Dataset Preparation Guide¶
Step¶
Apply for admission to Deep Fake Detection Challenge Dataset and download it.
Generate retinaface onnx from this codebase with its script.
Run commands below
python3 convert_dfdc.py --retinaface <path/to/retinaface/onnx> --dfdc_root <path/to/uncompressed/dfdc/dataset/root> --output <output/directory>
Processed Dataset Structure¶
data/dfdc
└── dataset.pkl
4.20. Deep Fake Detection Challenge Dataset Preparation Guide¶
Step¶
Apply for access to the Deep Fake Detection Challenge Dataset and download it.
Run the command below
python3 convert_dfdc.py --retinaface <path/to/retinaface/onnx> --dfdc_root <path/to/uncompressed/dfdc/dataset/root> --output <output/directory>
Processed Dataset Structure¶
data/dfdc
└── dataset.pkl
4.21. kitti Dataset Preparation Guide¶
Step¶
Download kitti-dataset from kaggle or the kitti official website, then uncompress it. The directory structure looks like:
data/
└── kitti/
├──ImageSets/
│ ├── test.txt
│ ├── train.txt
│ └── val.txt
├── training/
│ ├── image_2/
│ ├── calib/
│ ├── label_2/
│ └── velodyne/
├── testing/
│ ├── image_2/
│ ├── calib/
│ └── velodyne/
└── classes_names.txt
4.22. CITYSCAPES Dataset Preparation Guide¶
Step¶
First, register an account on the Cityscapes website, then activate the account and log in.
Download the dataset and the annotations from the download page. The annotations are in gtFine_trainvaltest.zip and the dataset is in leftImg8bit_trainvaltest.zip.
Unzip the two files downloaded in step 2. The directory structure looks like:
cityscapes/
├── gtFine
│ ├── test
│ ├── train
│ └── val
└── leftImg8bit
├── test
├── train
└── val
By convention, **labelTrainIds.png files are used for cityscapes training. Open-mmlab provides a script based on cityscapesscripts to generate **labelTrainIds.png. You can refer to this tutorial to generate the cityscapes labels. Note that you may need to install mmcv and mmsegmentation before you can generate these labels; refer to this install guide.
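4.23. CITYSCAPES Dataset Preparation¶
Step¶
First, register an account on the cityscapes official website, then activate the account and log in.
On that page, click to download the dataset and annotations. The annotation file is gtFine_trainvaltest.zip and the dataset file is leftImg8bit_trainvaltest.zip.
Unzip the two files downloaded in step 2 so that the dataset directory structure is as follows:
cityscapes/
├── gtFine
│   ├── test
│   ├── train
│   └── val
└── leftImg8bit
    ├── test
    ├── train
    └── val
By convention, **labelTrainIds.png files are used for cityscapes training. Open-mmlab provides a script based on cityscapesscripts to generate **labelTrainIds.png. You can refer to this page to generate suitable cityscapes labels. Note that you may need to install mmcv and mmsegmentation before you can generate these labels; refer to this install guide.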
4.24. ICDAR 2015 Dataset Preparation Guide¶
Step¶
Download the ICDAR 2015 test set (registration is required). After registering and logging in, download “Test Set Images” and “Test Set Ground Truth” from the section “Task 4.1: Text Localization (2015 edition)”. Save the Test Set Images download in the folder ch4_test_images and the Test Set Ground Truth download in the folder ch4_test_localization_transcription_gt.
Decompress the test set as follows:
cd path/to/ch4_test_images
unzip ch4_test_images.zip
cd path/to/ch4_test_localization_transcription_gt
unzip ch4_test_localization_transcription_gt.zip
Download the PaddleOCR format annotation file. Put it in the folder that contains ch4_test_images.
Download https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json and put it in the folder that contains ch4_test_images. Then run python3 modify_directory.py.
Processed Dataset Structure¶
data/
├── ch4_test_images
│ ├── img_1.jpg
│ ├── img_2.jpg
| └── ……
├── ch4_test_localization_transcription_gt
│ ├── gt_img_1.txt
│ ├── gt_img_2.txt
| └── ……
├── instances_test.json
└── test_icdar2015_label.txt
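To inspect the extracted ground truth, each line of a gt_img_N.txt file can be split into a quadrilateral and its transcription. The sketch below rests on assumptions this guide does not state: that each line holds eight comma-separated integer coordinates followed by the transcription, that files may start with a UTF-8 byte order mark, and that `###` marks a region to ignore. The function name `parse_icdar2015_gt_line` is hypothetical.

```python
def parse_icdar2015_gt_line(line):
    """Split one ground-truth line into four corner points and its text."""
    # Drop an optional byte order mark and the line ending.
    cleaned = line.lstrip("\ufeff").rstrip("\r\n")
    # Split at most 8 times so commas inside the transcription survive.
    fields = cleaned.split(",", 8)
    coords = [int(value) for value in fields[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    transcription = fields[8]
    return {
        "points": points,
        "transcription": transcription,
        "ignore": transcription == "###",  # assumed "do not care" marker
    }
```

When reading a whole file, opening it with `encoding="utf-8-sig"` removes the byte order mark up front, and the `lstrip` call then has nothing left to strip.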
4.25. ICDAR 2015 Dataset Preparation Guide¶
Step¶
Download the ICDAR 2015 test set (registration is required). After registering and logging in, download “Test Set Images” and “Test Set Ground Truth” from “Task 4.1: Text Localization (2015 edition)”. Save the Test Set Images download in the folder ch4_test_images and the Test Set Ground Truth download in the folder ch4_test_localization_transcription_gt.
Decompress the downloaded archives:
cd path/to/ch4_test_images
unzip ch4_test_images.zip
cd path/to/ch4_test_localization_transcription_gt
unzip ch4_test_localization_transcription_gt.zip
Download the label file and put it in the folder that contains ch4_test_images.
Download https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json, put it in the folder that contains ch4_test_images, and run python3 modify_directory.py.
Processed Dataset Structure¶
data/
├── ch4_test_images
│ ├── img_1.jpg
│ ├── img_2.jpg
| └── ……
├── ch4_test_localization_transcription_gt
│ ├── gt_img_1.txt
│ ├── gt_img_2.txt
| └── ……
├── instances_test.json
└── test_icdar2015_label.txt
4.26. brats2019 preprocessed Dataset Preparation Guide¶
Step¶
Download archive.zip. Put it under a directory.
Extract it into data/:
unzip -d <data/> <archive.zip>
The directory structure looks like:
data/
└── MICCAI_BraTS_2019_Data_Training
    |-- HGG
    |   |-- BraTS19_2013_10_1
    |   |   |-- BraTS19_2013_10_1_flair.nii
    |   |   |-- BraTS19_2013_10_1_seg.nii
    |   |   |-- BraTS19_2013_10_1_t1.nii
    |   |   |-- BraTS19_2013_10_1_t1ce.nii
    |   |   └── BraTS19_2013_10_1_t2.nii
    |   |-- BraTS19_2013_11_1
    |   └── ...
    |-- LGG
    |   |-- BraTS19_2013_0_1
    |   |   |-- BraTS19_2013_0_1_flair.nii
    |   |   |-- BraTS19_2013_0_1_seg.nii
    |   |   |-- BraTS19_2013_0_1_t1.nii
    |   |   |-- BraTS19_2013_0_1_t1ce.nii
    |   |   └── BraTS19_2013_0_1_t2.nii
    |   |-- BraTS19_2013_15_1
    |   └── ...
    |── name_mapping.csv
    └── survival_data.csv
Compress the .nii files into .nii.gz:
gzip -r data/MICCAI_BraTS_2019_Data_Training/HGG/*
gzip -r data/MICCAI_BraTS_2019_Data_Training/LGG/*
The directory structure looks like:
data/
└── MICCAI_BraTS_2019_Data_Training
    |-- HGG
    |   |-- BraTS19_2013_10_1
    |   |   |-- BraTS19_2013_10_1_flair.nii.gz
    |   |   |-- BraTS19_2013_10_1_seg.nii.gz
    |   |   |-- BraTS19_2013_10_1_t1.nii.gz
    |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
    |   |   └── BraTS19_2013_10_1_t2.nii.gz
    |   |-- BraTS19_2013_11_1
    |   └── ...
    |-- LGG
    |   |-- BraTS19_2013_0_1
    |   |   |-- BraTS19_2013_0_1_flair.nii.gz
    |   |   |-- BraTS19_2013_0_1_seg.nii.gz
    |   |   |-- BraTS19_2013_0_1_t1.nii.gz
    |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
    |   |   └── BraTS19_2013_0_1_t2.nii.gz
    |   |-- BraTS19_2013_15_1
    |   └── ...
    |── name_mapping.csv
    └── survival_data.csv
Download the model and extract it into data/ for later data preprocessing:
unzip -d <data/> <fold_1.zip>
The directory structure looks like:
data/
|-- MICCAI_BraTS_2019_Data_Training
|   |-- HGG
|   |   |-- BraTS19_2013_10_1
|   |   |   |-- BraTS19_2013_10_1_flair.nii.gz
|   |   |   |-- BraTS19_2013_10_1_seg.nii.gz
|   |   |   |-- BraTS19_2013_10_1_t1.nii.gz
|   |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
|   |   |   └── BraTS19_2013_10_1_t2.nii.gz
|   |   |-- BraTS19_2013_11_1
|   |   └── ...
|   |-- LGG
|   |   |-- BraTS19_2013_0_1
|   |   |   |-- BraTS19_2013_0_1_flair.nii.gz
|   |   |   |-- BraTS19_2013_0_1_seg.nii.gz
|   |   |   |-- BraTS19_2013_0_1_t1.nii.gz
|   |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
|   |   |   └── BraTS19_2013_0_1_t2.nii.gz
|   |   |-- BraTS19_2013_15_1
|   |   └── ...
|   |── name_mapping.csv
|   └── survival_data.csv
|-- nnUNet
|   └── 3d_fullres
|       └── Task043_BraTS2019
|           └── nnUNetTrainerV2__nnUNetPlansv2.mlperf.1
|               |-- fold_1
|               |   |-- debug.json
|               |   |-- model_best.model
|               |   |-- model_best.model.pkl
|               |   |-- model_final_checkpoint.model
|               |   |-- model_final_checkpoint.model.pkl
|               |   |-- postprocessing.json
|               |   |-- progress.png
|               |   |-- training_log_2020_5_25_19_07_42.txt
|               |   |-- training_log_2020_6_15_14_50_42.txt
|               |   └── training_log_2020_6_8_08_12_03.txt
|               └── plans.pkl
└── joblog.log
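The gzip step in this guide can also be sketched in Python for environments without the gzip command. This is a minimal sketch under stated assumptions: the helper name `gzip_nii_files` is hypothetical, and unlike `gzip -r`, which compresses every file it finds, it only touches files ending in .nii. Like gzip, it removes each original after writing the compressed copy.

```python
import gzip
import shutil
from pathlib import Path


def gzip_nii_files(root):
    """Compress every .nii file under root to .nii.gz and delete the original."""
    compressed = []
    # sorted() materializes the file list before anything is modified.
    for nii in sorted(Path(root).rglob("*.nii")):
        target = nii.with_name(nii.name + ".gz")
        with open(nii, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        nii.unlink()  # mirror gzip, which replaces the input file
        compressed.append(target)
    return compressed
```

Calling `gzip_nii_files("data/MICCAI_BraTS_2019_Data_Training")` would cover both the HGG and LGG folders in one pass.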
4.27. brats2019 Dataset Preparation Guide¶
Step¶
Download archive.zip. Put it under a folder.
Extract it into the data/ folder:
unzip -d <data/> <archive.zip>
The directory structure is as follows:
data/
└── MICCAI_BraTS_2019_Data_Training
    |-- HGG
    |   |-- BraTS19_2013_10_1
    |   |   |-- BraTS19_2013_10_1_flair.nii
    |   |   |-- BraTS19_2013_10_1_seg.nii
    |   |   |-- BraTS19_2013_10_1_t1.nii
    |   |   |-- BraTS19_2013_10_1_t1ce.nii
    |   |   └── BraTS19_2013_10_1_t2.nii
    |   |-- BraTS19_2013_11_1
    |   └── ...
    |-- LGG
    |   |-- BraTS19_2013_0_1
    |   |   |-- BraTS19_2013_0_1_flair.nii
    |   |   |-- BraTS19_2013_0_1_seg.nii
    |   |   |-- BraTS19_2013_0_1_t1.nii
    |   |   |-- BraTS19_2013_0_1_t1ce.nii
    |   |   └── BraTS19_2013_0_1_t2.nii
    |   |-- BraTS19_2013_15_1
    |   └── ...
    |── name_mapping.csv
    └── survival_data.csv
Compress all .nii files in the data/MICCAI_BraTS_2019_Data_Training folder into .nii.gz format:
gzip -r data/MICCAI_BraTS_2019_Data_Training/HGG/*
gzip -r data/MICCAI_BraTS_2019_Data_Training/LGG/*
The directory structure is as follows:
data/
└── MICCAI_BraTS_2019_Data_Training
    |-- HGG
    |   |-- BraTS19_2013_10_1
    |   |   |-- BraTS19_2013_10_1_flair.nii.gz
    |   |   |-- BraTS19_2013_10_1_seg.nii.gz
    |   |   |-- BraTS19_2013_10_1_t1.nii.gz
    |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
    |   |   └── BraTS19_2013_10_1_t2.nii.gz
    |   |-- BraTS19_2013_11_1
    |   └── ...
    |-- LGG
    |   |-- BraTS19_2013_0_1
    |   |   |-- BraTS19_2013_0_1_flair.nii.gz
    |   |   |-- BraTS19_2013_0_1_seg.nii.gz
    |   |   |-- BraTS19_2013_0_1_t1.nii.gz
    |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
    |   |   └── BraTS19_2013_0_1_t2.nii.gz
    |   |-- BraTS19_2013_15_1
    |   └── ...
    |── name_mapping.csv
    └── survival_data.csv
A model file used for data preprocessing is provided. Download the model and extract it into the data/ folder:
unzip -d <data/> <fold_1.zip>
The directory structure is as follows:
data/
|-- MICCAI_BraTS_2019_Data_Training
|   |-- HGG
|   |   |-- BraTS19_2013_10_1
|   |   |   |-- BraTS19_2013_10_1_flair.nii.gz
|   |   |   |-- BraTS19_2013_10_1_seg.nii.gz
|   |   |   |-- BraTS19_2013_10_1_t1.nii.gz
|   |   |   |-- BraTS19_2013_10_1_t1ce.nii.gz
|   |   |   └── BraTS19_2013_10_1_t2.nii.gz
|   |   |-- BraTS19_2013_11_1
|   |   └── ...
|   |-- LGG
|   |   |-- BraTS19_2013_0_1
|   |   |   |-- BraTS19_2013_0_1_flair.nii.gz
|   |   |   |-- BraTS19_2013_0_1_seg.nii.gz
|   |   |   |-- BraTS19_2013_0_1_t1.nii.gz
|   |   |   |-- BraTS19_2013_0_1_t1ce.nii.gz
|   |   |   └── BraTS19_2013_0_1_t2.nii.gz
|   |   |-- BraTS19_2013_15_1
|   |   └── ...
|   |── name_mapping.csv
|   └── survival_data.csv
|-- nnUNet
|   └── 3d_fullres
|       └── Task043_BraTS2019
|           └── nnUNetTrainerV2__nnUNetPlansv2.mlperf.1
|               |-- fold_1
|               |   |-- debug.json
|               |   |-- model_best.model
|               |   |-- model_best.model.pkl
|               |   |-- model_final_checkpoint.model
|               |   |-- model_final_checkpoint.model.pkl
|               |   |-- postprocessing.json
|               |   |-- progress.png
|               |   |-- training_log_2020_5_25_19_07_42.txt
|               |   |-- training_log_2020_6_15_14_50_42.txt
|               |   └── training_log_2020_6_8_08_12_03.txt
|               └── plans.pkl
└── joblog.log
4.28. WIKI zh_CN Dataset Preparation Guide¶
Step¶
Download the checkpoint of the model you want to verify
Generate dataset with the following command
pip3 install -r requirements.txt
python3 preprocess_wiki_zh.py --ckpt <path/to/ckpt>
Processed Dataset Structure¶
data/wiki_zh/
└── wiki_zh_test.txt
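4.29. WIKI zh_CN Dataset Preparation Guide¶
Step¶
Download the checkpoint of the model you want to verify
Generate the dataset with the following commands
pip3 install -r requirements.txt
python3 preprocess_wiki_zh.py --ckpt <path/to/ckpt>
Processed Dataset Structure¶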
data/wiki_zh/
└── wiki_zh_test.txt
4.30. Widerface Dataset Preparation Guide¶
Step¶
Download the following files and put them under one directory.
file | url |
---|---|
WIDER_val.zip | https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing |
wider_easy_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_easy_val.mat |
wider_face_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_face_val.mat |
wider_hard_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_hard_val.mat |
wider_medium_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_medium_val.mat |
Run commands below
pip3 install -r requirements.txt
python3 convert_widerface.py --input_path=<path/to/the/directory/containing/the/three/files> --output_path=<path/to/data>
Processed Dataset Structure¶
data/widerface/
├── annotations
└── WIDER_val
4.31. Widerface Dataset Preparation Guide¶
Step¶
Download the following files and put them under one folder.
file | url |
---|---|
WIDER_val.zip | https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing |
wider_face_split.zip | http://shuoyang1213.me/WIDERFACE/support/bbx_annotation/wider_face_split.zip |
retinaface_gt_v1.1.zip | https://pan.baidu.com/s/1Laby0EctfuJGgGMgRRgykA |
wider_easy_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_easy_val.mat |
wider_face_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_face_val.mat |
wider_hard_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_hard_val.mat |
wider_medium_val.mat | https://github.com/biubug6/Pytorch_Retinaface/raw/master/widerface_evaluate/ground_truth/wider_medium_val.mat |
Run the commands below
pip3 install -r requirements.txt
python3 convert_widerface.py --input_path=<the folder you just created> --output_path=<target path>
Processed Dataset Structure¶
data/widerface/
├── annotations
└── WIDER_val
4.32. MOT-16 data Preparation Guide¶
Step¶
MOT-16 dataset is from this website: https://motchallenge.net/data/MOT16/, the download link is https://motchallenge.net/data/MOT16.zip.
After downloading the dataset, unzip the zip file.
Processed Dataset Structure¶
MOT16/
├── test
│ ├── MOT16-01
│ │ ├── det
│ │ │ └── det.txt
│ │ ├── img1
│ │ │ ├── 000001.jpg
│ │ │ ├── xxxxxx.jpg
│ │ │ └── 000450.jpg
│ │ └── seqinfo.ini
│ ├── MOT16-03
│ ├── MOT16-06
│ ├── MOT16-07
│ ├── MOT16-08
│ ├── MOT16-12
│ └── MOT16-14
│
└── train
├── MOT16-02
│ ├── det
│ │ └── det.txt
│ ├── gt
│ │ └── gt.txt
│ ├── img1
│ │ ├── 000001.jpg
│ │ ├── xxxxxx.jpg
│ │ └── 000600.jpg
│ └── seqinfo.ini
├── MOT16-04
├── MOT16-05
├── MOT16-09
├── MOT16-10
├── MOT16-11
└── MOT16-13
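Each sequence folder carries a seqinfo.ini file that can be read with the standard library. The sketch below assumes the file has a `[Sequence]` section with the keys `name`, `imDir`, `frameRate`, `seqLength`, `imWidth`, `imHeight`, and `imExt`; this guide does not list them, so treat the key names as an assumption to verify against a real file. The function name `parse_seqinfo` is hypothetical.

```python
import configparser


def parse_seqinfo(text):
    """Parse the text of a seqinfo.ini file into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    seq = parser["Sequence"]  # assumed section name
    return {
        "name": seq["name"],
        "image_dir": seq["imDir"],
        "frame_rate": seq.getint("frameRate"),
        "length": seq.getint("seqLength"),
        "width": seq.getint("imWidth"),
        "height": seq.getint("imHeight"),
        "extension": seq["imExt"],
    }
```

A typical call would be `parse_seqinfo(open("MOT16/train/MOT16-02/seqinfo.ini").read())`; the `length` value can then be compared with the number of images in `img1`.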
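4.33. MOT-16 Dataset Preparation Guide¶
Step¶
The MOT-16 dataset is on this page: https://motchallenge.net/data/MOT16/, and the download link is https://motchallenge.net/data/MOT16.zip
After downloading, unzip the zip file.
Dataset Structure¶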
MOT16/
├── test
│ ├── MOT16-01
│ │ ├── det
│ │ │ └── det.txt
│ │ ├── img1
│ │ │ ├── 000001.jpg
│ │ │ ├── xxxxxx.jpg
│ │ │ └── 000450.jpg
│ │ └── seqinfo.ini
│ ├── MOT16-03
│ ├── MOT16-06
│ ├── MOT16-07
│ ├── MOT16-08
│ ├── MOT16-12
│ └── MOT16-14
│
└── train
├── MOT16-02
│ ├── det
│ │ └── det.txt
│ ├── gt
│ │ └── gt.txt
│ ├── img1
│ │ ├── 000001.jpg
│ │ ├── xxxxxx.jpg
│ │ └── 000600.jpg
│ └── seqinfo.ini
├── MOT16-04
├── MOT16-05
├── MOT16-09
├── MOT16-10
├── MOT16-11
└── MOT16-13
4.34. Bert-qa preprocessed Dataset Preparation Guide¶
Step¶
Open the bert squad dev_data online and copy it to a local file named dev-v1.1.json.
Open the evaluation script evaluate-v1.1.py online and copy it to a local file named evaluate-v1.1.py.
Download vocab.txt from [BERT-Base, Uncased] (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) to a local directory.
Copy the files to the target directory. The directory structure looks like:
data
└──bert_qa
|── vocab.txt
|── evaluate-v1.1.py
└── dev-v1.1.json
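Once dev-v1.1.json is in place, its contents can be walked to confirm the copy is complete and well formed. The sketch below assumes the usual SQuAD v1.1 nesting (`data`, then `paragraphs`, then `qas`, each with an `id`, a `question`, and a list of `answers` holding `text`), which this guide does not spell out. The function names are hypothetical.

```python
import json


def load_squad(path):
    """Load a SQuAD-style JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_squad_questions(squad):
    """Yield (question id, question text, answer texts) for every question."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [answer["text"] for answer in qa["answers"]]
                yield qa["id"], qa["question"], answers
```

For example, `sum(1 for _ in iter_squad_questions(load_squad("data/bert_qa/dev-v1.1.json")))` counts the questions; a `KeyError` or a JSON decode error would suggest the file was truncated or altered while copying.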
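4.35. Bert-qa Dataset Preparation¶
Step¶
Open the bert squad dev_data and copy it to a local file named dev-v1.1.json.
Open evaluate-v1.1.py and copy it to a local file named evaluate-v1.1.py.
Download [BERT-Base, Uncased] (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) and copy the vocab.txt file to the corresponding directory.
After all files are saved to the corresponding directory, the directory structure is as follows:
data
└──bert_qa
   |── vocab.txt
   |── evaluate-v1.1.py
   └── dev-v1.1.json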
4.63. LFW Dataset Preparation Guide¶
Step¶
Install facexlib
pip3 install facexlib==0.3.0
Download LFW and uncompress it.
Run commands below
python3 preprocess_lfw_data.py --img_root <path/to/uncompressed/lfw/directory> --out_root <path/to/output/directory>
Processed Dataset Structure¶
root_dir/
├── real_imgs
└── distort_imgs
4.64. LFW Dataset Preparation Guide¶
Step¶
Install facexlib
pip3 install facexlib==0.3.0
Download LFW and uncompress it.
Run the command below
python3 preprocess_lfw_data.py --img_root <path/to/uncompressed/lfw/directory> --out_root <path/to/output/directory>
Processed Dataset Structure¶
root_dir/
├── real_imgs
└── distort_imgs
4.56. AFW preprocessed Dataset Preparation Guide¶
Step¶
Download the images of AFW.
Extract them and rename the image folder:
unzip -d <data/AFW> <afw_images.zip>
mv data/AFW/afw_images data/AFW/images
The directory structure looks like:
data
└── AFW
    └── images
        ├── 1004109301.jpg
        ├── 1051618982.jpg
        ├── ...
        ├── README
        └── anno.mat
4.57. AFW Dataset Preparation¶
Step¶
Download AFW.
Extract the afw_images.zip file and rename the image folder:
unzip -d <data/AFW> <afw_images.zip>
mv data/AFW/afw_images data/AFW/images
The directory structure is as follows:
data
└── AFW
    └── images
        ├── 1004109301.jpg
        ├── 1051618982.jpg
        ├── ...
        ├── README
        └── anno.mat
4.56. tnews preprocessed Dataset Preparation Guide¶
Step¶
Download the text of the tnews test data online.
Copy the text into a local file named test.txt. The directory structure looks like:
data
└── test.txt
4.57. tnews Dataset Preparation¶
Step¶
Copy the text content of the file and save it to test.txt. The directory structure is as follows:
data
└── test.txt
4.42. IJBB Dataset Preparation Guide¶
Step¶
Download the facial crops from the IJBB dataset.
Uncompress the downloaded ijb-testsuite.tar to get IJBB.zip. Uncompress IJBB.zip to get the directories loose_crop and meta.
Organize the directories into the following structure.
Processed Dataset Structure¶
data/
├── loose_crop
└── meta
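4.43. IJBB Dataset Preparation Guide¶
Step¶
Download the facial crop validation set of the IJBB dataset.
Uncompress the downloaded ijb-testsuite.tar to get IJBB.zip. Uncompress IJBB.zip to get the loose_crop and meta folders.
Organize the folders into the following structure.
Processed Dataset Structure¶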
data/
├── loose_crop
└── meta
4.44. Kinetics400 Dataset Preparation Guide¶
Step¶
Download the Kinetics 400 validation video archive kinetics_400_val_320.tar.
Download the Kinetics 400 label file kinetics_val_list.txt.
Extract kinetics_400_val_320.tar and rename the directory kinetics_400_val_320 to val_320.
Organize these files into the structure shown below.
Processed Dataset Structure¶
data/kinetics400/
├── label
│ └── kinetics_val_list.txt
└── val_320
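4.45. Kinetics400 Dataset Preparation Guide¶
Step¶
Download the Kinetics validation video archive kinetics_400_val_320.tar.
Download the Kinetics400 label file kinetics_val_list.txt.
Extract kinetics_400_val_320.tar and rename the folder kinetics_400_val_320 to val_320.
Organize these files into the following structure.
Processed Dataset Structure¶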
data/kinetics400/
├── label
│ └── kinetics_val_list.txt
└── val_320
4.46. Imagenet Dataset Preparation Guide¶
Step¶
Download ILSVRC2012_img_val.tar from https://image-net.org/challenges/LSVRC/2012/ (registration is required)
Extract
mkdir val
tar -xvf ILSVRC2012_img_val.tar -C val/
Download labels
wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_2012_validation_synset_labels.txt
wget https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_lsvrc_2015_synsets.txt
Put images into category folders (skip this step if a flat directory structure is needed)
python3 preprocess_imagenet_validation_data.py val/ imagenet_2012_validation_synset_labels.txt imagenet_lsvrc_2015_synsets.txt
cp imagenet_2012_validation_synset_labels.txt val/synset_labels.txt
Generate val_map.txt
python3 convert_imagenet.py val/ imagenet_2012_validation_synset_labels.txt imagenet_lsvrc_2015_synsets.txt val/val_map.txt
Rename
mv val data/
Processed Dataset Structure¶
data/val/
├── n01440764
│ ├── ILSVRC2012_val_00000293.JPEG
│ ├── ILSVRC2012_val_00002138.JPEG
| └── ……
……
└── val_map.txt
val_map.txt maps each image path to its label, for example:
./n01751748/ILSVRC2012_val_00000001.JPEG 65
./n09193705/ILSVRC2012_val_00000002.JPEG 970
./n02105855/ILSVRC2012_val_00000003.JPEG 230
./n04263257/ILSVRC2012_val_00000004.JPEG 809
……
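The val_map.txt format shown above, one relative image path and one integer label per line, can be read back with a few lines of Python. This is a minimal sketch for checking the generated file; the function name `parse_val_map` is hypothetical and not part of the conversion scripts.

```python
def parse_val_map(lines):
    """Map each image path in val_map.txt to its integer label."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so the label is always last.
        path, label = line.rsplit(maxsplit=1)
        mapping[path] = int(label)
    return mapping
```

For example, `len(parse_val_map(open("data/val/val_map.txt")))` gives the number of mapped validation images.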
4.47. VCTK-Corpus Dataset Preparation¶
Step¶
Download the VCTK-Corpus dataset from the kaggle website.
The download is archive.zip. Extract it under ./data/. The directory structure is as follows:
data/VCTK-Corpus/
├── COPYING
├── NOTE
├── README
├── speaker-info.txt
├── txt
└── wav48
    ├── p225
    │   ├── p225_001.wav
    │   ├── ...
    │   └── p225_366.wav
    ├── ...
    └── p376
4.48. DIV2k Dataset Preparation Guide¶
Step¶
Download DIV2K_valid_HR.zip, DIV2K_valid_LR_bicubic_X4.zip. Put these files under one directory.
Run commands below
pip3 install -r requirements.txt
python3 convert_div2k.py --input_path=<path/to/the/directory/containing/the/two/files> --output_path=<path/to/data>
Processed Dataset Structure¶
data/
├── DIV2K_valid_HR
└── DIV2K_valid_LR_bicubic
4.49. DIV2k Dataset Preparation Guide¶
Step¶
Download DIV2K_valid_HR.zip and DIV2K_valid_LR_bicubic_X4.zip. Put them under one folder.
Run the commands below
pip3 install -r requirements.txt
python3 convert_div2k.py --input_path=<the folder you just created> --output_path=<target path>
Processed Dataset Structure¶
data/
├── DIV2K_valid_HR
└── DIV2K_valid_LR_bicubic
4.50. Segment Anything Prompt Dataset Preparation Guide¶
Step¶
Install segment-anything
Download pretrained model
generate prompt dataset
Run commands as below¶
install segment-anything and copy images
git clone https://github.com/facebookresearch/segment-anything.git
cd segment-anything
git checkout 6fdee8f
python3 setup.py install
cp -r ./notebooks/images ../
install requirements
pip3 install -r requirements.txt
download pretrained model
mkdir models
cd models
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
generate prompt dataset
mkdir -p SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_h_4b8939.pth --image_path ./images --save_path ./SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_l_0b3195.pth --image_path ./images --save_path ./SAM/prompt
python3 prepare_sam_prompt_data.py --checkpoint ./models/sam_vit_b_01ec64.pth --image_path ./images --save_path ./SAM/prompt
Processed Dataset Structure¶
./SAM/prompt
├── sam_vit_h
│ ├── annotations
│ │ ├── vit_h-dog-sample-0.npz
│ │ ├── vit_h-dog-sample-1.npz
│ │ ...
│ │ └── vit_h-truck-sample-4.npz
│ └── images
│ ├── dog.jpg
│ ├── groceries.jpg
│ └── truck.jpg
├── sam_vit_b
│ ├── annotations
│ │ ├── vit_h-dog-sample-0.npz
│ │ ├── vit_h-dog-sample-1.npz
│ │ ...
│ │ └── vit_h-truck-sample-4.npz
│ └── images
│ ├── dog.jpg
│ ├── groceries.jpg
│ └── truck.jpg
└──sam_vit_l
├── annotations
│ ├── vit_h-dog-sample-0.npz
│ ├── vit_h-dog-sample-1.npz
│ ...
│ └── vit_h-truck-sample-4.npz
└── images
├── dog.jpg
├── groceries.jpg
└── truck.jpg
4.51. mnist dataset preparation guide¶
step¶
Download the training and test files of mnist
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
processed data structure¶
data/
├── t10k-images-idx3-ubyte.gz
├── t10k-labels-idx1-ubyte.gz
├── train-images-idx3-ubyte.gz
└── train-labels-idx1-ubyte.gz
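A quick way to confirm that the four downloads are intact is to read the IDX header of each file. The sketch below relies on the IDX layout: two zero bytes, one byte for the data type, one byte for the number of dimensions, then one big-endian 32-bit size per dimension. The function name `read_idx_header` is hypothetical.

```python
import gzip
import struct


def read_idx_header(path):
    """Return (data type code, dimension sizes) from a gzipped IDX file."""
    with gzip.open(path, "rb") as handle:
        zero, dtype_code, ndim = struct.unpack(">HBB", handle.read(4))
        if zero != 0:
            raise ValueError("not an IDX file: first two bytes must be zero")
        # Each dimension size is a big-endian unsigned 32-bit integer.
        dims = struct.unpack(">" + "I" * ndim, handle.read(4 * ndim))
    return dtype_code, dims
```

An image file should report three dimensions (count, rows, columns) and a label file one. A truncated or HTML error-page download usually fails inside `gzip.open` or raises the `ValueError` above.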
4.52. mnist dataset preparation guide¶
step¶
Download the mnist dataset from the official website.
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
processed data structure¶
data/
├── t10k-images-idx3-ubyte.gz
├── t10k-labels-idx1-ubyte.gz
├── train-images-idx3-ubyte.gz
└── train-labels-idx1-ubyte.gz
4.53. COCO 2017 Dataset Preparation Guide¶
Step¶
Download COCO 2017. Put it under a directory.
Run commands below
pip3 install -r requirements.txt
python3 convert_coco2017.py --input_path=<path/to/the/directory/you/containing/coco2017.zip> --output_path=<path/to/data>
Processed Data Structure¶
data/COCO/
├── annotations
├── test2017
├── train2017
└── val2017
4.54. COCO 2017 Dataset Preparation Guide¶
Step¶
Download COCO 2017. Put it under a directory.
Run the commands below
pip3 install -r requirements.txt
python3 convert_coco2017.py --input_path=<the directory you just created> --output_path=<target path>
Processed Data Structure¶
data/COCO/
├── annotations
├── test2017
├── train2017
└── val2017
4.55. Segment Anything Automask Dataset Preparation Guide¶
Step¶
Install requirements
Download dataset
Extract evaluation data
Run commands as below¶
install requirements
pip3 install -r requirements.txt
download dataset and extract evaluation data
python3 prepare_sam_automask_data.py
Processed Dataset Structure¶
./SAM/automask
├── sa_1.jpg
├── sa_1.json
├── sa_2.jpg
├── sa_2.json
...
├── sa_10.jpg
└── sa_10.json
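Each image in the listing above is accompanied by a JSON file with the same stem. A small check like the one below can flag an interrupted download; it is a sketch that assumes only this one-to-one naming convention, and `unpaired_samples` is a hypothetical helper, not part of `prepare_sam_automask_data.py`.

```python
from pathlib import Path


def unpaired_samples(root):
    """Return sorted stems that have a .jpg without a .json, or vice versa."""
    root = Path(root)
    images = {p.stem for p in root.glob("*.jpg")}
    annotations = {p.stem for p in root.glob("*.json")}
    # The symmetric difference keeps stems present in exactly one of the sets.
    return sorted(images ^ annotations)
```

An empty result from `unpaired_samples("./SAM/automask")` means every image has its annotation file and every annotation has its image.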
4.56. AFW Preprocessed Dataset Preparation Guide¶
Step¶
Please open the text of ernie dev_data online
Copy the text to the local dev_1.txt, the directory structure looks like:
data/ernie
└── dev_1.txt
4.57. AFW Dataset Preparation¶
Step¶
Copy the text content of the file and save it to dev_1.txt. The directory structure looks like:
data/ernie
└── dev_1.txt
4.58. Face MS1M validation data Preparation Guide¶
Step¶
Download faces_ms1m_112x112.zip. Put it under a directory.
Run commands below
# Python 3.7 is recommended
pip3 install -r requirements.txt
python3 convert_ms1m_face.py --input_data_dir=<path/to/the/directory/containing/the/file> --output_data_dir=<path/to/the/converted/data>
Processed Dataset Structure¶
converted_ms1m_face
├── agedb_30.bin
├── cfp_ff.bin
├── cfp_fp.bin
└── lfw.bin
4.59. LPRNet validation data Preparation Guide¶
Step¶
LPRNet validation data is in the repo: https://github.com/sirius-ai/LPRNet_Pytorch. Commit id is 7c976664b3f3879efabeaff59c7a117e49d5f29e.
Run commands below
git clone https://github.com/sirius-ai/LPRNet_Pytorch.git
cd LPRNet_Pytorch/data/test
cp *.jpg <target/folder>
Processed Dataset Structure¶
<target folder>
├── 京PL3N67.jpg
├── 川JK0707.jpg
├── ...
└── 鲁R8D57Z.jpg
4.60. LPRNet Dataset Preparation Guide¶
Step¶
LPRNet test data is in the repo: https://github.com/sirius-ai/LPRNet_Pytorch. Commit id is 7c976664b3f3879efabeaff59c7a117e49d5f29e.
Run the commands below
git clone https://github.com/sirius-ai/LPRNet_Pytorch.git
cd LPRNet_Pytorch/data/test
cp *.jpg <target/folder>
Dataset Structure¶
<target folder>
├── 京PL3N67.jpg
├── 川JK0707.jpg
├── ...
└── 鲁R8D57Z.jpg
4.61. MiniGO Dataset Preparation Guide¶
Generation¶
pip3 install -r requirements.txt
for i in {001..100} ; do PYTHONPATH=../../../ python3 selfplay.py --load_file=minigo-op13-fp32-N.onnx --num_readouts 10 --verbose 3 --selfplay_dir=data/selfplay --holdout_dir=data/holdout --sgf_dir=data/sgf; done
Processed Dataset Structure¶
data/
├── selfplay
├── holdout
└── sgf
4.62. MiniGO Dataset Generation Guide¶
Generation¶
pip3 install -r requirements.txt
for i in {001..100} ; do PYTHONPATH=../../../ python3 selfplay.py --load_file=minigo-op13-fp32-N.onnx --num_readouts 10 --verbose 3 --selfplay_dir=data/selfplay --holdout_dir=data/holdout --sgf_dir=data/sgf; done
Processed Dataset Structure¶
data/
├── selfplay
├── holdout
└── sgf
4.63. LFW Dataset Preparation Guide¶
Step¶
cd <path/to/facenet>
pip3 install -r requirements.txt
for N in {1..4}
do
PYTHONPATH=src python3 src/align/align_dataset_mtcnn.py <path/to/uncompressed/lfw/directory> <path/to/output/directory> --image_size 160 --margin 32 --random_order --gpu_memory_fraction 0.25
done
mkdir -p data/lfw
cp -r <path/to/output/directory> data/lfw/lfw
cp <path/to/pairs.txt> data/lfw
Processed Dataset Structure¶
data/lfw/
├── lfw
└── pairs.txt
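The `pairs.txt` file copied above drives the LFW verification protocol: after a header line, a line with three fields names two images of the same person, and a line with four fields names one image each of two different people. The parser below is a minimal sketch of that layout; `parse_pairs` is a hypothetical helper and the person names in the usage note are made up for illustration.

```python
def parse_pairs(lines):
    """Parse LFW pairs.txt lines into (name_a, idx_a, name_b, idx_b, same) tuples.

    The first line is treated as the header and skipped. Lines that have
    neither three nor four whitespace-separated fields are ignored.
    """
    pairs = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) == 3:
            # Matched pair: one person, two image indices.
            name, idx_a, idx_b = fields
            pairs.append((name, int(idx_a), name, int(idx_b), True))
        elif len(fields) == 4:
            # Mismatched pair: two people, one image index each.
            name_a, idx_a, name_b, idx_b = fields
            pairs.append((name_a, int(idx_a), name_b, int(idx_b), False))
    return pairs
```

For instance, `parse_pairs(["10\t300", "Person_A\t1\t4", "Person_B\t2\tPerson_C\t1"])` yields one matched pair for `Person_A` and one mismatched pair for `Person_B` and `Person_C`. To run it on the real file, pass `open("data/lfw/pairs.txt").read().splitlines()`.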
4.64. LFW Dataset Preparation Guide¶
Step¶
cd <path/to/facenet>
pip3 install -r requirements.txt
for N in {1..4}
do
PYTHONPATH=src python3 src/align/align_dataset_mtcnn.py <path/to/uncompressed/lfw/directory> <path/to/output/directory> --image_size 160 --margin 32 --random_order --gpu_memory_fraction 0.25
done
mkdir -p data/lfw
cp -r <path/to/output/directory> data/lfw/lfw
cp <path/to/pairs.txt> data/lfw
Processed Data Structure¶
data/lfw/
├── lfw
└── pairs.txt
4.65. Librispeech Preprocessed Dataset Preparation Guide¶
Step¶
1. Dataset Download¶
Run the commands below to download the Librispeech dataset
mkdir -p data/LibriSpeech
python3 download_librispeech.py ./librispeech-inference.csv ./data/LibriSpeech -e ./data
2. Data Processing¶
Run the command below to convert the dataset and generate the json file
python3 convert_librispeech.py --input_dir ./data/LibriSpeech/dev-clean --dest_dir ./data/dev-clean-wav --output_json ./data/dev-clean-wav.json
3. Processed Data Structure¶
./data
├── dev-clean-wav
│ ├── 1272
│ ├── 1462
│ ├── 1673
│ ├── 174
│ ├── 1919
│ ├── 1988
│ ├── 1993
│ ├── 2035
│ ├── 2078
│ ├── 2086
│ ├── 2277
│ ├── 2412
│ ├── 2428
│ ├── 251
│ ├── 2803
│ ├── 2902
│ ├── 3000
│ ├── 3081
│ ├── 3170
│ ├── 3536
│ ├── 3576
│ ├── 3752
│ ├── 3853
│ ├── 422
│ ├── 5338
│ ├── 5536
│ ├── 5694
│ ├── 5895
│ ├── 6241
│ ├── 6295
│ ├── 6313
│ ├── 6319
│ ├── 6345
│ ├── 652
│ ├── 777
│ ├── 7850
│ ├── 7976
│ ├── 8297
│ ├── 84
│ └── 8842
├── dev-clean-wav.json
└── LibriSpeech
├── BOOKS.TXT
├── CHAPTERS.TXT
├── dev-clean
├── dev-clean.tar.gz
├── LICENSE.TXT
├── README.TXT
└── SPEAKERS.TXT
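The listing above shows one folder per speaker ID under `dev-clean-wav`. Assuming the conversion mirrors the speaker folders of `LibriSpeech/dev-clean` (an assumption drawn from the listing, not from the script itself), a comparison like the sketch below can reveal speakers that were skipped. Both function names are hypothetical.

```python
from pathlib import Path


def speaker_ids(directory):
    """Return the set of sub-directory names (speaker IDs) in directory."""
    return {p.name for p in Path(directory).iterdir() if p.is_dir()}


def missing_speakers(source_dir, converted_dir):
    """Return speaker IDs present in source_dir but absent from converted_dir."""
    return sorted(speaker_ids(source_dir) - speaker_ids(converted_dir))
```

Calling `missing_speakers("./data/LibriSpeech/dev-clean", "./data/dev-clean-wav")` returns an empty list when every speaker folder has a converted counterpart.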
4.66. Librispeech Dataset Preparation Guide¶
Step¶
1. Dataset Download¶
Run the commands below to download Librispeech to the local machine
mkdir -p data/LibriSpeech
python3 download_librispeech.py ./librispeech-inference.csv ./data/LibriSpeech -e ./data
2. Data Processing¶
Use the command below to convert Librispeech and generate the json file
python3 convert_librispeech.py --input_dir ./data/LibriSpeech/dev-clean --dest_dir ./data/dev-clean-wav --output_json ./data/dev-clean-wav.json
3. Directory Structure¶
After the download and processing steps above, the data directory should have the following structure
./data
├── dev-clean-wav
│ ├── 1272
│ ├── 1462
│ ├── 1673
│ ├── 174
│ ├── 1919
│ ├── 1988
│ ├── 1993
│ ├── 2035
│ ├── 2078
│ ├── 2086
│ ├── 2277
│ ├── 2412
│ ├── 2428
│ ├── 251
│ ├── 2803
│ ├── 2902
│ ├── 3000
│ ├── 3081
│ ├── 3170
│ ├── 3536
│ ├── 3576
│ ├── 3752
│ ├── 3853
│ ├── 422
│ ├── 5338
│ ├── 5536
│ ├── 5694
│ ├── 5895
│ ├── 6241
│ ├── 6295
│ ├── 6313
│ ├── 6319
│ ├── 6345
│ ├── 652
│ ├── 777
│ ├── 7850
│ ├── 7976
│ ├── 8297
│ ├── 84
│ └── 8842
├── dev-clean-wav.json
└── LibriSpeech
├── BOOKS.TXT
├── CHAPTERS.TXT
├── dev-clean
├── dev-clean.tar.gz
├── LICENSE.TXT
├── README.TXT
└── SPEAKERS.TXT
4.67. PASCAL Preprocessed Dataset Preparation Guide¶
Step¶
Download the images of PASCAL
Unzip pascal_images.zip and rename the extracted folder to images
unzip -d <data/PASCAL> <pascal_images.zip>
mv data/PASCAL/pascal_images data/PASCAL/images
The directory structure looks like:
data
└── PASCAL
    └── images
        ├── 2007_000272.jpg
        ├── 2007_000664.jpg
        ├── ...
        ├── 2011_003272.jpg
        └── 2011_003273.jpg
4.68. PASCAL Dataset Preparation¶
Step¶
Download PASCAL
Unzip the pascal_images.zip file and rename the extracted folder to images
unzip -d <data/PASCAL> <pascal_images.zip>
mv data/PASCAL/pascal_images data/PASCAL/images
The directory structure looks like:
data
└── PASCAL
    └── images
        ├── 2007_000272.jpg
        ├── 2007_000664.jpg
        ├── ...
        ├── 2011_003272.jpg
        └── 2011_003273.jpg
4.69. Criteo Dataset Preparation Guide¶
Step¶
Log in to Criteo and click the day_2.gz download button. Put the file under a directory.
Run the preprocessing commands from the repo below
git clone http://git.enflame.cn/sse_ard/Algo_internals/tree/master/scripts/34_hugectr
cd Algo_internals/scripts/34_hugectr/preprocess
bash preprocess.sh
Processed Data Structure¶
data/critero_data/
└── val
├── sparse_embedding0.data
├── sparse_embedding1.data
├── ...
└── sparse_embedding111.data
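The listing above runs from `sparse_embedding0.data` to `sparse_embedding111.data`. Assuming the indices in between are contiguous (the elided entries are not shown, so this is an assumption), the sketch below reports any shard that preprocessing failed to write. `missing_shards` and its `expected` default of 112 are hypothetical and only illustrate the check.

```python
import re
from pathlib import Path

SHARD_NAME = re.compile(r"sparse_embedding(\d+)\.data")


def missing_shards(val_dir, expected=112):
    """Return the shard indices in range(expected) with no matching file."""
    found = set()
    for path in Path(val_dir).iterdir():
        match = SHARD_NAME.fullmatch(path.name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(expected)) - found)
```

Run it against the `val` folder shown above; an empty list means all indices from 0 to 111 are present.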
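4.70. Criteo Dataset Preparation Guide¶
Step¶
Log in to Criteo and click the day_2.gz download button. Put the file under a directory.
Run the data processing commands from the repo below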
git clone http://git.enflame.cn/sse_ard/Algo_internals/tree/master/scripts/34_hugectr
cd Algo_internals/scripts/34_hugectr/preprocess
bash preprocess.sh
Processed Data Structure¶
data/critero_data/
└── val
├── sparse_embedding0.data
├── sparse_embedding1.data
├── ...
└── sparse_embedding111.data