Knover

Knover is a toolkit for knowledge-grounded dialogue generation based on PaddlePaddle. It lets researchers and developers carry out efficient training and inference of large-scale dialogue generation models.

What's New:

July 2020: We are releasing PLATO-2, a large-scale generative model with latent space for open-domain dialogue systems.

Basic usage:

Training

Carry out local training with a configuration file. To select GPUs, set export CUDA_VISIBLE_DEVICES=XXX in ./scripts/local/train.sh; other environment variables can be set in the same script.

./scripts/local/train.sh ${TRAIN_CONF}

An example training configuration file is ./package/dialog_en/plato/24L_train.conf. It contains three sections: job, task and training.
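As a rough orientation, a configuration with these three sections might look like the sketch below. It assumes the file is written as plain shell variable assignments, which the train_args and log_dir examples in this README suggest. The keys are the ones described in the sections that follow; every value marked as assumed or hypothetical (model and task names, paths, hyperparameters) is illustrative and is not the actual content of 24L_train.conf.

```shell
# Illustrative sketch only: keys come from this README, values are assumptions.

# job
job_script="./scripts/distributed/train.sh"

# task
model="Plato"                               # assumed model class name
task="DialogGeneration"                     # assumed task name
vocab_path="./path/to/vocab.txt"            # hypothetical path
spm_model_file="./path/to/spm.model"        # hypothetical path
train_file="./data/train_filelist"
valid_file="./data/valid_filelist"          # hypothetical path
data_format="numerical"
file_format="filelist"
config_path="./path/to/model_config.json"   # hypothetical path

# training
init_params="./path/to/init_params"         # hypothetical path
batch_size=8192                             # assumed value
lr=1e-5                                     # assumed value
num_epochs=20                               # assumed value
log_dir="./log"
save_path="./output"
train_args="--max_src_len 384 --max_seq_len 512"
```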

job

This section defines:

job_script: the main script of this job; use ./scripts/distributed/train.sh for a training task.

task

This section defines:

model: the model class to use

task: task name

vocab_path: vocabulary path

tokenizer-related settings: spm_model_file for the SentencePiece tokenizer, and so on.

dataset-related settings: train_file, valid_file, data_format and file_format.

config_path: model configuration file.

Choices of data_format:

raw: an untokenized TSV file in which each column is a field, e.g. ./data/train.tsv.
tokenized: a tokenized TSV file, e.g. ./data/train_tokenized.tsv, which is generated by ./tools/pre_tokenized.sh.
numerical: each line contains numerical data (token_ids, type_ids and pos_ids, plus optional role_ids), e.g. ./data/train.numerical.tsv, which is generated by ./tools/pre_numericalize.sh.

Files compressed with the gzip command (with a .gz suffix) are also supported.
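To make the raw format and the .gz support concrete, here is a minimal sketch that writes a one-line TSV, compresses it with gzip, and counts the fields when reading it back. The temporary directory, the sample sentences and the two-column layout are assumptions for illustration; the fields your task expects may differ.

```shell
# Create a scratch directory so nothing in ./data is touched.
tmp_dir=$(mktemp -d)

# One example per line, fields separated by a tab (sample text is made up).
printf 'hello , how are you ?\tfine , thanks .\n' > "${tmp_dir}/train.tsv"

# Compress with the gzip command; -c writes to stdout and keeps the original.
gzip -c "${tmp_dir}/train.tsv" > "${tmp_dir}/train.tsv.gz"

# Decompress on the fly and count the tab-separated fields on the first line.
n_fields=$(gzip -dc "${tmp_dir}/train.tsv.gz" | awk -F'\t' 'NR == 1 { print NF }')
echo "fields per line: ${n_fields}"
```

Going by the description above, either the plain file or its .gz counterpart could then be given as train_file with data_format set to raw.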

Choices of file_format:

file: a single data file.
filelist: a text file that lists multiple data files, one path per line, e.g. ./data/train_filelist.
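A filelist can be produced with ordinary shell tools. The sketch below is one possible way, using a scratch directory and made-up shard names (part-0.tsv, part-1.tsv) rather than real data files.

```shell
# Scratch directory with two empty shard files (names are made up).
tmp_dir=$(mktemp -d)
touch "${tmp_dir}/part-0.tsv" "${tmp_dir}/part-1.tsv"

# Write one file path per line, as the filelist format expects.
printf '%s\n' "${tmp_dir}"/part-*.tsv > "${tmp_dir}/train_filelist"

# Count the listed files as a quick sanity check.
n_files=$(wc -l < "${tmp_dir}/train_filelist" | tr -d ' ')
echo "files listed: ${n_files}"
```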

training

This section defines training-related settings:

init_params: the parameters used to initialize the model.

init_checkpoint: the checkpoint used to initialize training (it contains not only the parameters of the model, but also the persistables of the optimizer). If you continue training from step 1000, you can also set train_args="--start_step 1000" so that the log displays the correct step; this is optional.

batch_size, lr, num_epochs and so on.

log_dir: the output path of training logs, including the log file of each GPU trainer (${log_dir}/workerlog.${DEV_ID}). If log_dir="", the output of all GPU trainers is written to standard output.

save_path: the output path of saved parameters.
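For instance, continuing a run that stopped at step 1000 could be configured roughly as follows. The checkpoint directory name is hypothetical; only init_checkpoint, train_args and --start_step come from the description above.

```shell
# Hypothetical checkpoint location; substitute the directory you actually saved.
init_checkpoint="./output/step_1000"

# Optional: lets the logged step counter continue from 1000.
train_args="--start_step 1000"
```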

You can pass additional arguments to the training script, for example:

train_args="--max_src_len 384 --max_seq_len 512"

Disclaimer

This project aims to facilitate further research progress in dialogue generation. Baidu is not responsible for content generated by third parties using the pre-trained system.

Contact information

For help or issues using Knover, please submit a GitHub issue.
