Easily turn large English text datasets into Japanese text datasets using open LLMs.
Fig: Japanese translation of the Abirate/english_quotes dataset using the llm-jp/llm-jp-3-3.7b-instruct model.
text2dataset is a tool that converts a dataset by translating the data in the "txt" column with an open LLM, such as gemma2 running on vLLM, and adding a new column called "txt_ja" that contains the Japanese translation.
Because inference runs on vLLM, a fast LLM inference library, large English datasets can be translated into Japanese quickly. You can also use text2dataset for other text-to-text tasks (e.g., paraphrasing) by modifying the prompt template accordingly.
This tool is inspired by img2dataset.
- Save the intermediate results in shards:
  - By setting the number_sample_per_shard parameter, the dataset can be saved in shards as specified by the number of samples per shard.
- Resume from checkpoint:
  - By setting the resume_from_checkpoint parameter, the translation can be resumed from where it left off.
- Logging with wandb:
  - By setting the use_wandb parameter, the metrics such as examples_per_sec and count can be logged to wandb.
- Push to Hugging Face Hub:
  - By setting the push_to_hub parameter, the translated dataset can be pushed to the Hugging Face Hub.
- Custom Prompt Template:
  - By specifying the prompt_template_path parameter, you can customize the prompt template for any translation task (e.g., paraphrasing, summarization, etc.).
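To make the sharding and resume behavior concrete, here is a minimal Python sketch of the general idea: process the records one shard at a time, write each finished shard to its own file, and skip shards whose files already exist when resuming. This is an illustration under stated assumptions, not the tool's actual implementation. The function names, the zero-padded shard file names, and the `transform` callback (a stand-in for the LLM call) are all hypothetical; only the parameter names number_sample_per_shard and resume_from_checkpoint come from the list above.

```python
import json
from pathlib import Path


def shard_path(output_dir: Path, shard_id: int) -> Path:
    # Hypothetical naming scheme: 00000.json, 00001.json, ...
    return output_dir / f"{shard_id:05d}.json"


def process_in_shards(records, output_dir, number_sample_per_shard,
                      transform, resume_from_checkpoint=False):
    """Apply `transform` shard by shard and return the names of the files written."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    starts = range(0, len(records), number_sample_per_shard)
    for shard_id, start in enumerate(starts):
        path = shard_path(output_dir, shard_id)
        if resume_from_checkpoint and path.exists():
            continue  # this shard was completed by an earlier run
        chunk = records[start:start + number_sample_per_shard]
        results = [transform(record) for record in chunk]
        # Write to a temporary file first so an interrupted run
        # never leaves a half-written shard that looks complete.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
        tmp.replace(path)
        written.append(path.name)
    return written
```

With five records and number_sample_per_shard set to 2, the first run of this sketch writes three shard files; if one of them is removed and the function is called again with resume_from_checkpoint=True, only the missing shard is recomputed.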
$ git clone https://github.com/llm-jp/text2dataset.git
$ cd text2dataset
$ rye sync
$ python src/text2dataset/main.py \
--model_id llm-jp/llm-jp-3-3.7b-instruct \
--batch_size 16384 \
--input_path data/english_quotes.json \
--source_column text \
--target_column text_ja \
--push_to_hub True \
--push_to_hub_path speed/english_quotes_ja \
--output_dir data/english_quotes_ja \
--output_format json
Using the llm-jp/llm-jp-3-3.7b-instruct model on an A100 GPU, 2508 English quotes were translated into Japanese in just 21 seconds.
The resulting dataset is available at speed/english_quotes_ja.
You can also use text2dataset to paraphrase texts by pointing the prompt_template_path parameter at a different prompt template.
$ python src/text2dataset/main.py \
--model_id google/gemma-2-2b-it \
--batch_size 16384 \
--input_path data/english_quotes.json \
--source_column text \
--target_column text_paraphrase \
--push_to_hub True \
--push_to_hub_path speed/english_quotes_paraphrase \
--output_dir data/english_quotes_paraphrase \
--output_format json \
--prompt_template_path config/paraphrase.yaml
The resulting dataset is available at speed/english_quotes_paraphrase.
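Conceptually, switching from translation to paraphrasing only changes the instruction that wraps each source text before it is sent to the model. The Python sketch below shows that idea with the standard library's string.Template. It is only an illustration: the template wording, the `$text` placeholder, and the `build_prompts` helper are assumptions, and the actual schema of config/paraphrase.yaml is not described here.

```python
from string import Template

# Hypothetical templates; the real ones live in the YAML files passed
# via prompt_template_path and may be structured differently.
TRANSLATE = Template("Translate the following English text into Japanese.\n\n$text")
PARAPHRASE = Template("Paraphrase the following text.\n\n$text")


def build_prompts(rows, source_column, template):
    """Fill the template once per row, using the value in source_column."""
    return [template.substitute(text=row[source_column]) for row in rows]


rows = [{"text": "Be yourself; everyone else is already taken."}]
for prompt in build_prompts(rows, "text", PARAPHRASE):
    print(prompt)
```

Keeping the task-specific wording in a template file means the same batching, sharding, and upload code can serve translation, paraphrasing, or summarization without modification.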
Translation of neuralwork/arxiver dataset
You can directly translate datasets hosted on the Hugging Face Hub by specifying the dataset name in input_path.
In this example, the abstract column of the neuralwork/arxiver dataset is translated by setting input_path to neuralwork/arxiver and source_column to abstract.
$ python src/text2dataset/main.py \
--model_id google/gemma-2-2b-it \
--batch_size 16384 \
--input_path neuralwork/arxiver \
--source_column abstract \
--target_column abstract_ja \
--push_to_hub True \
--push_to_hub_path speed/arxiver_ja \
--output_dir data/arxiver_ja \
--output_format json \
--use_wandb True \
--wandb_run_name arxiver
The neuralwork/arxiver dataset contains 138k rows of abstracts, and it took 2.5 hours to translate them into Japanese using the google/gemma-2-2b-it model on an A100 GPU. The resulting dataset is available at speed/arxiver_ja.
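For a rough sense of throughput, the two timings reported in this README can be converted to examples per second, the same kind of quantity as the examples_per_sec metric. The short Python calculation below uses only the figures quoted above; the 138k row count is rounded, so the second number is approximate.

```python
def examples_per_sec(count: int, seconds: float) -> float:
    """Average number of rows processed per second."""
    return count / seconds


# 2508 quotes in 21 seconds (llm-jp/llm-jp-3-3.7b-instruct, A100)
quotes_rate = examples_per_sec(2508, 21)
# about 138k abstracts in 2.5 hours (google/gemma-2-2b-it, A100)
arxiver_rate = examples_per_sec(138_000, 2.5 * 3600)

print(f"english_quotes: {quotes_rate:.1f} examples/sec")  # 119.4
print(f"arxiver: {arxiver_rate:.1f} examples/sec")        # 15.3
```

The two rates are not directly comparable: the runs use different models, and abstracts are presumably much longer than quotes, so each row requires generating more tokens.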
- Translation on Multiple GPUs in Parallel
To run translations on multiple GPUs concurrently, split the input dataset into several shards (directories) and execute the translation for each shard in parallel. Remember to set the gpu_id parameter to the corresponding GPU ID for each shard.
Currently, the input dataset has to be split into shards manually, and the translation for each shard has to be launched separately in order to use multiple GPUs. A built-in feature that splits the input dataset automatically and runs the translation on multiple GPUs in parallel would be a welcome addition. If you have any ideas or suggestions, please feel free to open an issue or pull request.
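Until such a feature exists, the manual workflow can be scripted. The Python sketch below splits a list of records into one shard per GPU, writes each shard into its own directory, and builds one command line per shard. It rests on several assumptions: the directory layout (shard_0/data.json, ...) and helper names are hypothetical, the gpu_id parameter is assumed to be passed as --gpu_id like the other options, and whether main.py accepts this exact JSON layout should be checked against your input data. The commands are only constructed here, not executed.

```python
import json
from pathlib import Path


def split_into_shards(records, num_shards):
    """Split records into num_shards contiguous parts of near-equal size."""
    base, extra = divmod(len(records), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(records[start:start + size])
        start += size
    return shards


def write_shards(shards, root):
    """Write each shard to <root>/shard_<i>/data.json (hypothetical layout)."""
    paths = []
    for i, shard in enumerate(shards):
        path = Path(root) / f"shard_{i}" / "data.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(shard, ensure_ascii=False), encoding="utf-8")
        paths.append(path)
    return paths


def build_commands(shard_paths, model_id, source_column, target_column):
    """Build one command per shard, assigning GPU i to shard i."""
    commands = []
    for gpu_id, path in enumerate(shard_paths):
        commands.append([
            "python", "src/text2dataset/main.py",
            "--model_id", model_id,
            "--input_path", str(path),
            "--source_column", source_column,
            "--target_column", target_column,
            "--output_dir", str(path.parent / "out"),
            "--output_format", "json",
            "--gpu_id", str(gpu_id),  # assumed flag spelling
        ])
    return commands
```

Each command list could then be started with subprocess.Popen and waited on, so that all shards are translated concurrently, one process per GPU.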
When using this tool, please pay attention to the licenses of both the dataset being translated and the LLM you use.
Contributions are welcome! If you have any questions or suggestions, please feel free to open an issue or pull request.
$ git tag -a v0.x.x -m "version 0.x.x"
$ git push origin --tags
$ rye lint
$ rye format