This repository contains the source code of *Improving Colloquial Case Legal Judgment Prediction via Abstractive Text Summarization*.
PekoNet is a colloquial-case legal judgment prediction (LJP) framework based on abstractive text summarization (ATS). The framework aims to improve LJP performance on colloquial case facts and, in turn, the experience of ordinary, non-professional users of LJP services.
The framework consists of two modules: the Abstractive Text Summarization Module (ATSM) and the Legal Judgment Prediction Module (LJPM). We first trained the ATSM on a news summarization dataset (CNewSum). Then, we used the ATSM to convert Taiwanese criminal case facts from a formal to a colloquial register. Finally, we trained the LJPM on the colloquial case facts.
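The two-stage pipeline described above can be sketched as follows. This is a minimal illustration only, not the repository's actual code: the real ATSM is a trained summarization model and the real LJPM a trained classifier, while the callables below are toy stand-ins.

```python
# Minimal sketch of the PekoNet two-stage pipeline (illustrative only).
# The real ATSM is a BART-style summarizer and the real LJPM a BERT-style
# classifier; toy callables stand in for the trained modules here.

from typing import Callable, List, Tuple


class PekoNetSketch:
    """Chains an abstractive summarizer (ATSM) with a judgment predictor (LJPM)."""

    def __init__(self,
                 summarize: Callable[[str], str],
                 predict: Callable[[str], Tuple[str, List[str]]]):
        self.summarize = summarize  # formal fact -> colloquial summary
        self.predict = predict      # colloquial summary -> (charge, legal articles)

    def judge(self, formal_fact: str) -> Tuple[str, List[str]]:
        colloquial = self.summarize(formal_fact)
        return self.predict(colloquial)


# Toy components standing in for the trained modules.
def toy_summarizer(text: str) -> str:
    return text.split(".")[0]  # keep only the first sentence


def toy_predictor(text: str) -> Tuple[str, List[str]]:
    return ("theft", ["Article 320"])  # fixed toy prediction


model = PekoNetSketch(toy_summarizer, toy_predictor)
charge, articles = model.judge("The defendant took the wallet. More detail follows.")
print(charge, articles)  # theft ['Article 320']
```

The point of the sketch is the module boundary: whether the ATSM is frozen, fine-tuned, or trained independently (the model variants listed below), the LJPM always consumes the summarizer's colloquial output.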
Here are the details of the datasets and models:
- Datasets (each dataset has two data formats: `Orig. Fact` and `Summary`)
  - TCI Training Set: 305,240 examples, 147 charges, and 89 legal articles.
  - BART Testing Set: the `Summary` data are generated by the BART model; 30,289 examples, 101 charges, and 89 legal articles.
  - ChatGPT Testing Set: the `Summary` data are generated by ChatGPT; 30,289 examples, 101 charges, and 89 legal articles.
  - Human Testing Set: the `Summary` data are written by humans; 235 examples, 53 charges, and 78 legal articles.
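Assuming the examples are stored as JSON Lines with one field per data format, a record might look like the sketch below. The field names (`fact`, `summary`, `charge`, `articles`) are assumptions for illustration; check the released files for the actual schema.

```python
import json

# Hypothetical JSONL record layout for one example; the actual field names
# in the released datasets may differ ("fact"/"summary"/"charge"/"articles"
# are assumptions, not the repository's documented schema).
record_line = json.dumps({
    "fact": "Formal case fact written in legal register ...",      # Orig. Fact
    "summary": "Colloquial summary of the same case ...",          # Summary
    "charge": "theft",
    "articles": ["Article 320"],
}, ensure_ascii=False)

record = json.loads(record_line)
print(record["charge"], len(record["articles"]))  # theft 1
```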
- Models
  - `Baseline Model`: the PekoNet framework without the ATS module, trained on the formal data of the TCI Training Set.
  - `Independent Training Model`: the ATS and LJP modules are independent models, trained on CNewSum and the colloquial data of the TCI Training Set.
  - `ATS-Freezing Model`: the ATS module was frozen while training the LJP module; trained on CNewSum and the colloquial data of the TCI Training Set.
  - `ATS-Finetuning Model`: the ATS module was fine-tuned while training the LJP module; trained on CNewSum and the colloquial data of the TCI Training Set.
- Ubuntu 18.04.6 LTS
- Python 3.7.8
- OpenCC 1.1.4
- numpy 1.21.6
- scikit-learn 1.0.2
- tabulate 0.8.10
- torch 1.12.1
- tqdm 4.64.0
- transformers 4.21.2
Download the `models` folder and put it in the root directory of this repository.
These are the data we processed and used to train the models and run the experiments.
Create the `results` folder in the root directory of this repository. Then, download the `tvt_dataset` folder and put it in `results`.
These are the source data we used in this project. If you want to generate all data yourself, download the `data` folder, put it in the root directory of this repository, and use `generate.py` to process the data.
Create the `results` folder in the root directory of this repository. Then, download the `checkpoints` folder and put it in `results`.
Not yet updated.
If you have any questions, feel free to open an issue or contact me!
@article{clsr2023hong,
title = {Improving Colloquial Case Legal Judgment Prediction via Abstractive Text Summarization},
journal = {Computer Law \& Security Review},
volume = {51},
pages = {105863},
year = {2023},
issn = {0267-3649},
doi = {10.1016/j.clsr.2023.105863},
url = {https://www.sciencedirect.com/science/article/pii/S0267364923000730},
author = {Yu-Xiang Hong and Chia-Hui Chang},
keywords = {Legal judgment prediction, Legal text summarization, Abstractive text summarization, Legal artificial intelligence}
}