Automatic Audit Opinion Labelling for Going Concern Issues
This is the accompanying code to our work: "Identifying going concern issues in auditor opinions: link to bankruptcy events", done in the context of the 5th Financial Narrative Processing Workshop (FNP2023@IEEE-BigData2023).
In this work, our goal is to predict possible going concern issues from the auditor's opinion, which appears as free text in a company's 10-K report. The candidate labels are the going concern issues used by the AuditAnalytics platform. More details on these labels can be found in the AuditAnalytics 20-year review (p. 16 also lists some of them).
You can access a demo app that runs inference on a given audit opinion text at this link.
The streamlit_app folder contains sample code for deploying the demo app locally.
The requirements for this project can be installed by running:
conda env create -f environment.yml
Python version used: Python 3.9.16
The contents of the repo are the following:
- frozen_splits: Folder with the train/val/test opinions data.
- logs: Folder to save log files.
- results: Folder to save results.
- settings: Folder with settings configurations.
- enviroment.yml: Conda environment to create.
- Script used for fine-tuning the LM. The script cannot be run as is, because the labels are missing from the data here. The whole process is the same though.
- Script used for hyper-parameter tuning on the tf-idf + RF model. The script cannot be run as is, because the labels are missing from the data here. The whole process is the same though.
- Module for the early stopping functionality.
- Module implementing the weighted focal loss as used in the results Section 4.3.
- Module implementing the loading/splitting and preprocessing of the data.
- Module containing the Language Model + Classifier layer, along with routines for training and evaluation.
- Helper functions.
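To make the weighted focal loss mentioned in the list above more concrete, here is a minimal, framework-free sketch in plain Python. It is not the repo's implementation: the function name `weighted_focal_loss`, the per-label weighting scheme, and the default `gamma=2.0` are assumptions chosen for illustration, and the module in this repo may differ in all of them.

```python
import math


def weighted_focal_loss(probs, targets, weights, gamma=2.0, eps=1e-12):
    """Mean weighted binary focal loss over the labels of one example.

    probs:   predicted probability per label, each in [0, 1]
    targets: 0/1 ground truth per label
    weights: per-label weight (hypothetical scheme: one weight per label,
             applied to that label's term whatever the target is)
    gamma:   focusing parameter; gamma=0 reduces to weighted cross-entropy
    """
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        # Probability the model assigns to the true outcome for this label.
        p_t = p if y == 1 else 1.0 - p
        # Clamp to avoid log(0) on fully confident wrong predictions.
        p_t = min(max(p_t, eps), 1.0)
        # (1 - p_t)^gamma down-weights labels that are already well classified.
        total += -w * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # Two labels: a confident correct prediction and an uncertain one.
    print(weighted_focal_loss([0.9, 0.5], [1, 1], [1.0, 3.0]))
```

The focusing term means that easy, confidently correct labels contribute very little, so rare or hard labels dominate the gradient; the per-label weights give a second, explicit handle on class imbalance.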
One would run either:
python ./settings/wanted_settings_file.json
to train and evaluate a LM with the parameters as defined in the setting file or
to run hyper-parameter tuning on the tf-idf + RF pipeline used as a baseline.
Currently, the above would result in an error, because the data shared here are not annotated with the AuditAnalytics labels used in our analysis, due to property rights. For a subset of the labeled dataset, please contact the authors after publication. Nonetheless, the unlabeled frozen splits, which contain the auditor reports along with other metadata, are available in the data folder (they need to be unzipped).
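Since the labeled data are not shipped, the following is a hedged sketch of what hyper-parameter tuning on a tf-idf + RF pipeline can look like with scikit-learn, run on a handful of made-up sentences. The toy texts, the single binary label, the parameter grid, and the F1 scoring are all assumptions for illustration; they are not the data, label set, or search space used in the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-ins for audit opinions; 1 = a going concern issue is mentioned.
texts = [
    "The company has suffered recurring losses from operations.",
    "Substantial doubt exists about its ability to continue as a going concern.",
    "The company has a net capital deficiency and negative cash flows.",
    "Management plans do not alleviate the doubt about the going concern.",
    "The financial statements present fairly the financial position.",
    "We conducted our audit in accordance with the standards of the PCAOB.",
    "In our opinion the statements conform with accepted accounting principles.",
    "Our audit provides a reasonable basis for our opinion.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# tf-idf features feeding a random forest, as a single estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space; keys follow the "<step>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [10, 50],
}

# 2-fold stratified cross-validation is only sensible for this toy data size.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
```

Wrapping the vectorizer and the classifier in one `Pipeline` ensures the tf-idf vocabulary is refit inside each cross-validation fold, so no information from the held-out fold leaks into the features.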
You can use the best-performing model (distilbert weighted variants as proposed in the paper) from here.
To cite please use our publication:
Konstantinos Bougiatiotis, Elias Zavitsanos, Georgios Paliouras. "Identifying going concern issues in auditor opinions: link to bankruptcy events". In Proceedings of the 5th Financial Narrative Processing Workshop (FNP 2023) at the 2023 IEEE International Conference on Big Data (IEEE BigData 2023), Sorrento, Italy.