Knover is a toolkit for knowledge-grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out efficient training/inference of large-scale dialogue generation models.
- July 2020: We are open-sourcing PLATO-2, a large-scale generative model with latent space for open-domain dialogue systems.
Carry out local training with a configuration file. You can specify the GPUs to use with export CUDA_VISIBLE_DEVICES=XXX in ./scripts/local/train.sh. You can also set other environment variables in that script.
./scripts/local/train.sh ${TRAIN_CONF}
An example training configuration file is ./package/dialog_en/plato/24L_train.conf. It contains three sections: job, task and training.
The job section defines:
- job_script: the main script of this task; use ./scripts/distributed/train.sh for training tasks.
The task section defines:
- model: the model class to use.
- task: the task name.
- vocab_path: the vocabulary path.
- tokenizer-related settings: spm_model_file for the SentencePiece tokenizer, and so on.
- dataset-related settings: train_file, valid_file, data_format and file_format.
- config_path: the model configuration file.
Choices of data_format:
- raw: untokenized TSV file in which each column is a field. Example: ./data/train.tsv
- tokenized: tokenized TSV file, generated by ./tools/pre_tokenized.sh. Example: ./data/train_tokenized.tsv
- numerical: each line contains numerical data (token_ids, type_ids and pos_ids, with role_ids optional), generated by ./tools/pre_numericalize.sh. Example: ./data/train.numerical.tsv
Files with a .gz suffix, compressed with the gzip command, are also supported.
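As a minimal sketch of preparing a compressed data file, the snippet below writes a tiny TSV file, compresses it with gzip, and checks that decompression gives back the same bytes. The column names and contents are hypothetical and only illustrate the "one field per column" layout; they are not taken from the Knover data files.

```shell
# Work in a throwaway directory so nothing in the repository is touched.
workdir=$(mktemp -d)

# Hypothetical two-column raw file; header names and text are illustrative only.
printf 'src\ttgt\nhello\thi there\n' > "$workdir/train.tsv"

# Compress with gzip, keeping the original so the two can be compared.
gzip -c "$workdir/train.tsv" > "$workdir/train.tsv.gz"

# Decompressing should give back exactly the same bytes.
if gzip -dc "$workdir/train.tsv.gz" | cmp -s - "$workdir/train.tsv"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

The resulting .gz path could then be used wherever an uncompressed data file path is expected.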
Choices of file_format:
- file: a single file.
- filelist: a file that lists multiple data files, one path per line. Example: ./data/train_filelist
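A filelist can be produced with ordinary shell tools. The sketch below is only an illustration: the shard names and their contents are hypothetical, and whether Knover expects absolute or relative paths in the list is an assumption not confirmed by the text above.

```shell
# Work in a throwaway directory so nothing in the repository is touched.
workdir=$(mktemp -d)

# Two hypothetical data shards (names and contents are placeholders).
printf '1 2 3\n' > "$workdir/part-0.tsv"
printf '4 5 6\n' > "$workdir/part-1.tsv"

# Write one path per line into the filelist.
printf '%s\n' "$workdir"/part-*.tsv > "$workdir/train_filelist"

# Count the entries and verify that every listed file exists.
num_files=$(wc -l < "$workdir/train_filelist" | tr -d ' ')
missing=0
while IFS= read -r path; do
    [ -f "$path" ] || missing=$((missing + 1))
done < "$workdir/train_filelist"
echo "$num_files files listed, $missing missing"
```

Checking for missing entries before launching a job is a cheap way to catch a stale or mistyped path early.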
The training section defines training-related settings:
- init_params: the parameters to initialize from.
- init_checkpoint: the checkpoint to initialize from (it contains not only the parameters of the model, but also the persistables of the optimizer). If you continue training from step 1000, you can also set train_args="--start_step 1000" for a better log display, but this is not necessary.
- batch_size, lr, num_epochs and so on.
- log_dir: the output path of training logs, including the log file of each GPU trainer (${log_dir}/workerlog.${DEV_ID}). If log_dir="", the output of all GPU trainers goes to standard output.
- save_path: the output path of saved parameters.
You can define other arguments for the training script, such as:
train_args="--max_src_len 384 --max_seq_len 512"
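Putting the three sections together, a configuration file might look roughly like the sketch below. It assumes the file uses shell-style key=value assignments, as the train_args and log_dir examples above suggest. Only job_script, the data_format and file_format choices, the example data path and the train_args value come from the text; every other value is a placeholder, not a copy of 24L_train.conf.

```shell
# job section
job_script="./scripts/distributed/train.sh"

# task section (all values below are placeholders unless noted)
model="YourModelClass"                     # placeholder model class name
task="YourTaskName"                        # placeholder task name
vocab_path="./path/to/vocab.txt"           # placeholder path
spm_model_file="./path/to/spm.model"       # placeholder path
train_file="./data/train.numerical.tsv"    # example path from the text above
valid_file="./path/to/valid.numerical.tsv" # placeholder path
data_format="numerical"                    # raw | tokenized | numerical
file_format="file"                         # file | filelist
config_path="./path/to/model_config.json"  # placeholder path

# training section (hyperparameter values are placeholders)
# init_params="./path/to/params"           # optional, placeholder path
# init_checkpoint="./path/to/checkpoint"   # optional, placeholder path
batch_size=8
lr=1e-5
num_epochs=1
log_dir="./log"                            # "" sends trainer output to stdout
save_path="./output"
train_args="--max_src_len 384 --max_seq_len 512"
```

Saved as, say, my_train.conf (a hypothetical name), such a file would be passed as ${TRAIN_CONF} to the training command shown earlier.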
This project aims to facilitate further research progress in dialogue generation. Baidu is not responsible for content generated by third parties with the pre-trained system.
For help or issues using Knover, please submit a GitHub issue.