TO BE PRESENTED AT ACSAC 2024, DECEMBER 9-13, 2024
- Open your terminal.
- Use the following command to download the specific version of Anaconda3
curl -O https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
- Once the installer has been downloaded, run it using the following command
bash Anaconda3-2020.11-Linux-x86_64.sh
- Once the installation is complete, to ensure that Conda 4.9.2 has been installed correctly, restart your terminal and run the following command
conda --version
- You should see: conda 4.9.2
- After Conda is installed and initialized, open your terminal.
- Run the following command to create a new conda environment with Python 3.10.12
conda create --name DatasetQuality python=3.10.12
- Once the environment is created, activate it using the following command
conda activate DatasetQuality
- To make sure that the correct version of Python (3.10.12) is installed in the environment
python --version
- You should see: Python 3.10.12
- First, install pip in your environment (if not already installed)
conda install pip
- Clone our repository to your machine
git clone https://github.com/Anurag-Swarnim-Yadav/DatasetQuality.git
cd DatasetQuality
- Install the necessary packages using pip
- For NVIDIA A100-SXM4-80GB
pip install -r requirements.txt
- For other setups
pip install -r requirements-small.txt
- You can verify that the packages have been installed correctly by running
pip list
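The version checks above can be bundled into a small sanity-check script. This is only a sketch: the helper name check_version is ours, and the conda/python calls assume the DatasetQuality environment from the previous steps is active.

```shell
# check_version: compare an expected version string against the actual one
# and report OK or MISMATCH. (Helper name is ours; adapt as needed.)
check_version() {
  name="$1"; expected="$2"; actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "$name OK ($actual)"
  else
    echo "$name MISMATCH (expected $expected, got $actual)"
  fi
}

# Usage: feed in the versions this guide expects.
check_version "conda"  "4.9.2"   "$(conda --version 2>/dev/null | awk '{print $2}')"
check_version "python" "3.10.12" "$(python --version 2>&1 | awk '{print $2}')"
```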
Samples | Train | Test |
---|---|---|
Total Samples (TS) | 6,776 | 1,706 |
In-Set Duplicates (IS Dup) | 1,593 | 91 |
Samples Left (SL = TS - IS Dup) | 5,183 | 1,615 |
Cross-Set Duplicates (CS Dup) | 796 |
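The "Samples Left" row follows directly from the formula in its label (SL = TS - IS Dup); a quick sketch of the arithmetic with the counts copied from the table above:

```shell
# SL = TS - IS Dup, with the train/test counts from the table above.
TS_TRAIN=6776;  IS_DUP_TRAIN=1593
TS_TEST=1706;   IS_DUP_TEST=91
SL_TRAIN=$((TS_TRAIN - IS_DUP_TRAIN))
SL_TEST=$((TS_TEST - IS_DUP_TEST))
echo "Samples Left -> train: $SL_TRAIN, test: $SL_TEST"   # train: 5183, test: 1615
```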
Samples | Train | Test |
---|---|---|
Total Samples (TS) | 6,776 | 1,706 |
In-Set Duplicates (IS Dup) | 1,858 | 111 |
Samples Left (SL = TS - IS Dup) | 4,918 | 1,595 |
Cross-Set Duplicates (CS Dup) | 923 |
Samples | Train | Validation |
---|---|---|
Total Samples (TS) | 534,858 | 10,000 |
In-Set Duplicates (IS Dup) | 6,192 | 4 |
Samples Left (SL = TS - IS Dup) | 528,666 | 9,996 |
Cross-Set Duplicates (CS Dup) | 247 |
Note: The Bug-Fix dataset is available at
https://github.com/ASSERT-KTH/VRepair/releases/download/v20240223/BugFix.tar.bz2.
Thanks to the authors of VRepair.
In an attempt to provide robust performance evaluations, each result is reported as the mean performance of six networks trained using different random seeds.
The six random seeds are: 26312, 43511, 67732, 70757, 95541, and 123456
Samples | Train | Test | Comments |
---|---|---|---|
Total Samples (TS) | 6,776 | 1,706 | Contains IS and CS Duplicates |
Note: IS = In-Set Duplicates, CS = Cross-Set Duplicates
cd RQ1
- To train the VulRepair model, run the following command in your terminal
sh run_VulRepair_train.sh
- Once the model is trained, navigate to
cd RQ1-Code/VulRepair/
to see the VulRepair_train.log file and the new folder VulRepair_model, where the model will be saved.
- To test the VulRepair model, go back to
cd ../..
and run the following command in your terminal
sh run_VulRepair_test.sh
- Once finished, navigate to
cd RQ1-Code/VulRepair/
and you will see VulRepair_test.log as well as the new folder raw_predictions, which contains the model predictions.
- To train the CodeBERT model, run the following command in your terminal
sh run_CodeBERT_train.sh
- Once the model is trained, navigate to
cd RQ1-Code/CodeBERT/
to see the CodeBERT_train.log file and the new folder CodeBERT_model, where the model will be saved.
- To test the CodeBERT model, go back to
cd ../..
and run the following command in your terminal
sh run_CodeBERT_test.sh
- Once finished, navigate to
cd RQ1-Code/CodeBERT/
and you will see CodeBERT_test.log as well as the new folder raw_predictions, which contains the model predictions.
- To train the GraphCodeBERT model, run the following command in your terminal
sh run_GraphCodeBERT_train.sh
- Once the model is trained, navigate to
cd RQ1-Code/GraphCodeBERT/
to see the GraphCodeBERT_train.log file and the new folder GraphCodeBERT_model, where the model will be saved.
- To test the GraphCodeBERT model, go back to
cd ../..
and run the following command in your terminal
sh run_GraphCodeBERT_test.sh
- Once finished, navigate to
cd RQ1-Code/GraphCodeBERT/
and you will see GraphCodeBERT_test.log as well as the new folder raw_predictions, which contains the model predictions.
- To train the UniXcoder model, run the following command in your terminal
sh run_UniXcoder_train.sh
- Once the model is trained, navigate to
cd RQ1-Code/UniXcoder/
to see the UniXcoder_train.log file and the new folder UniXcoder_model, where the model will be saved.
- To test the UniXcoder model, go back to
cd ../..
and run the following command in your terminal
sh run_UniXcoder_test.sh
- Once finished, navigate to
cd RQ1-Code/UniXcoder/
and you will see UniXcoder_test.log as well as the new folder raw_predictions, which contains the model predictions.
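All four RQ1 models follow the same train-then-test pattern, so the commands above can be generated with one loop. A dry-run sketch (it only prints the commands; drop the echo, or eval each line, to actually run them):

```shell
# Print the RQ1 train/test commands for all four models (dry run).
rq1_commands() {
  for model in VulRepair CodeBERT GraphCodeBERT UniXcoder; do
    echo "sh run_${model}_train.sh"
    echo "sh run_${model}_test.sh"
  done
}
rq1_commands
```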
TO VERIFY OUR RESULTS, within the RQ1 folder, we have six different seed subfolders, each containing the raw prediction file for its respective model.
Models | PP Reported | PP Replicated | Change |
---|---|---|---|
VulRepair/CodeT5 | 44% (Fu et al.); 44.96% (Zhang et al.) | 40.42% | -3.58%; -4.54% |
CodeBERT | 31% (Fu et al.); 32.94% (Zhang et al.) | 33.20% | +2.20%; +0.26% |
GraphCodeBERT | 37.98% (Zhang et al.) | 38.51% | +0.53% |
UniXcoder | 40.62% (Zhang et al.) | 40.96% | +0.34% |
Note: The trained models will be released separately.
Samples | Train | Test |
---|---|---|
Unique Samples | 4,387 | 1,615 |
cd RQ2A
- To train
sh run_VulRepair_train.sh
- To test
sh run_VulRepair_test.sh
- To train
sh run_CodeBERT_train.sh
- To test
sh run_CodeBERT_test.sh
- To train
sh run_GraphCodeBERT_train.sh
- To test
sh run_GraphCodeBERT_test.sh
- To train
sh run_UniXcoder_train.sh
- To test
sh run_UniXcoder_test.sh
TO VERIFY OUR RESULTS, within the RQ2A folder, we have six different seed subfolders, each containing the raw prediction file for its respective model.
Models | PP RQ2A | % of Replication |
---|---|---|
VulRepair/CodeT5 | 8.91% | 22.0% (8.91/40.42) |
CodeBERT | 5.58% | 16.8% (5.58/33.20) |
GraphCodeBERT | 5.31% | 13.7% (5.31/38.51) |
UniXcoder | 4.82% | 11.8% (4.82/40.96) |
Note: PP RQ2A shows the perfect prediction score when running on the RQ2A dataset, and % of Replication shows the fraction of perfect predictions relative to our replicated results on the VulRepair dataset.
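The "% of Replication" column is just the RQ2A score divided by the corresponding replicated score from RQ1. A sketch of that division (the helper name replication_pct is ours):

```shell
# % of Replication = 100 * (PP on the RQ dataset) / (replicated PP on VulRepair data)
replication_pct() {
  awk -v pp="$1" -v base="$2" 'BEGIN { printf "%.1f%%\n", 100 * pp / base }'
}
replication_pct 8.91 40.42   # VulRepair/CodeT5 -> 22.0%
replication_pct 5.58 33.20   # CodeBERT -> 16.8%
```

Note that the last digit of a few table entries appears truncated rather than rounded (e.g., 100 * 5.31 / 38.51 is 13.79, reported as 13.7%).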
Samples | Train | Test |
---|---|---|
Unique Samples | 5,183 | 819 |
cd RQ2B
- To train
sh run_VulRepair_train.sh
- To test
sh run_VulRepair_test.sh
- To train
sh run_CodeBERT_train.sh
- To test
sh run_CodeBERT_test.sh
- To train
sh run_GraphCodeBERT_train.sh
- To test
sh run_GraphCodeBERT_test.sh
- To train
sh run_UniXcoder_train.sh
- To test
sh run_UniXcoder_test.sh
TO VERIFY OUR RESULTS, within the RQ2B folder, we have six different seed subfolders, each containing the raw prediction file for its respective model.
Models | PP RQ2B | % of Replication |
---|---|---|
VulRepair/CodeT5 | 13.17% | 33% (13.17/40.42) |
CodeBERT | 8.83% | 27% (8.83/33.20) |
GraphCodeBERT | 9.22% | 24% (9.22/38.51) |
UniXcoder | 9.10% | 22% (9.10/40.96) |
Samples | Train | Test |
---|---|---|
Unique Samples | 3,995 | 1,595 |
cd RQ3A
- To train
sh run_VulRepair_train.sh
- To test
sh run_VulRepair_test.sh
- To train
sh run_CodeBERT_train.sh
- To test
sh run_CodeBERT_test.sh
- To train
sh run_GraphCodeBERT_train.sh
- To test
sh run_GraphCodeBERT_test.sh
- To train
sh run_UniXcoder_train.sh
- To test
sh run_UniXcoder_test.sh
TO VERIFY OUR RESULTS, within the RQ3A folder, we have six different seed subfolders, each containing the raw prediction file for its respective model.
Models | PP RQ3A | % of Replication |
---|---|---|
VulRepair/CodeT5 | 7.14% | 17.7% (7.14/40.42) |
CodeBERT | 3.59% | 10.8% (3.59/33.20) |
GraphCodeBERT | 3.75% | 9.7% (3.75/38.51) |
UniXcoder | 4.11% | 10.0% (4.11/40.96) |
Samples | Train | Test |
---|---|---|
Unique Samples | 4,918 | 672 |
cd RQ3B
- To train
sh run_VulRepair_train.sh
- To test
sh run_VulRepair_test.sh
- To train
sh run_CodeBERT_train.sh
- To test
sh run_CodeBERT_test.sh
- To train
sh run_GraphCodeBERT_train.sh
- To test
sh run_GraphCodeBERT_test.sh
- To train
sh run_UniXcoder_train.sh
- To test
sh run_UniXcoder_test.sh
TO VERIFY OUR RESULTS, within the RQ3B folder, we have six different seed subfolders, each containing the raw prediction file for its respective model.
Models | PP RQ3B | % of Replication |
---|---|---|
VulRepair/CodeT5 | 10.27% | 25.4% (10.27/40.42) |
CodeBERT | 5.38% | 16.2% (5.38/33.20) |
GraphCodeBERT | 6.25% | 16.2% (6.25/38.51) |
UniXcoder | 6.18% | 15.0% (6.18/40.96) |
In this research question, we report the performance of each of the models studied on the top 10 CWEs, showing their performance when duplicate and inconsistent samples are removed from consideration.
In this research question, we first assessed whether the samples had the correct CWE tags. If a sample was found to have an incorrect CWE tag, we identified the correct tag through manual analysis of the sample. Additionally, we evaluated whether the corresponding fix was complete based on manual analysis.
Rank | CWE Type | Name | RQ2B Samples | Accurate | Complete | Accurate & Complete |
---|---|---|---|---|---|---|
1 | CWE-787 | Out-of-bounds Write | 33 | 15 | 18 | 12 |
2 | CWE-79 | Cross-site Scripting | 0 | 0 | 0 | 0 |
5 | CWE-78 | OS Command Injection | 1 | 0 | 0 | 0 |
6 | CWE-89 | SQL Injection | 1 | 1 | 1 | 1 |
7 | CWE-416 | Use After Free | 29 | 11 | 18 | 7 |
8 | CWE-22 | Path Traversal | 2 | 1 | 0 | 0 |
9 | CWE-352 | Cross-Site Request Forgery | 2 | 2 | 1 | 1 |
10 | CWE-434 | Unrestricted Upload of File with Dangerous Type | - | - | - | - |
Total | - | - | 68 | 30 | 38 | 21 |
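As a check, the Total row can be re-derived by summing the listed CWE rows; a one-line sketch (column order: RQ2B Samples, Accurate, Complete, Accurate & Complete):

```shell
# Column sums over the per-CWE rows above: samples, accurate, complete, both.
totals=$(awk 'BEGIN { print 33+0+1+1+29+2+2, 15+0+0+1+11+1+2, 18+0+0+1+18+0+1, 12+0+0+1+7+0+1 }')
echo "$totals"   # 68 30 38 21
```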
Samples | Train | Validation |
---|---|---|
Unique Samples | 528,419 | 9,996 |
We have released all the models at https://doi.org/10.5281/zenodo.11582874
Unzip the folder using unzip filename.zip
Pretraining is done on seed 26312.
Download
VulRepairRQ5_Seed26312
CodeBERTRQ5_Seed26312
GraphCodeBERTRQ5_Seed26312
UniXcoderRQ5_Seed26312
- To train
sh run_pretrain.sh
- To test
sh run_pretrain_test.sh
- To train
sh run_pretrain.sh
- To test
sh run_pretrain_test.sh
- To train
sh run_pretrain.sh
- To test
sh run_pretrain_test.sh
- To train
sh run_pretrain.sh
- To test
sh run_pretrain_test.sh
Transfer learning is done on six random seeds: 26312, 43511, 67732, 70757, 95541, and 123456
Download all the folders.
Testing is done on beam sizes: 1, 3, 5, 10, 20, 30, 40, 50
- To train
sh run_train.sh
- To test
sh run_test.sh
- To train
sh run_train.sh
- To test
sh run_test.sh
- To train
sh run_train.sh
- To test
sh run_test.sh
- To train
sh run_train.sh
- To test
sh run_test.sh
TO VERIFY OUR RESULTS, the RQ5 folder is organized into two subfolders: Bug-Fix and Transfer-Learning. The Bug-Fix subfolder contains one seed folder, while the Transfer-Learning subfolder includes six distinct seed folders, each holding the raw prediction file for the corresponding model.
Models | BF (Beam=1) | TL (Beam=1) | BF (Beam=3) | TL (Beam=3) | BF (Beam=5) | TL (Beam=5) | BF (Beam=50) | VR (Beam=50) | TL (Beam=50) |
---|---|---|---|---|---|---|---|---|---|
VulRepair | 3.6% | 13.5% | 7.4% | 19.0% | 7.6% | 20.2% | 6.55% | 10.27% | 18.67% |
CodeBERT | 3.0% | 12.5% | 4.6% | 17.3% | 5.36% | 18.9% | 11.76% | 5.38% | 24.55% |
GraphCodeBERT | 2.2% | 11.5% | 4.6% | 16.9% | 5.8% | 19.0% | 11.76% | 6.25% | 25.42% |
UniXcoder | 1.9% | 12.9% | 5.2% | 18.1% | 6.6% | 19.7% | 11.31% | 6.18% | 26.07% |
Note: BF = Bug-Fix, TL = Transfer Learning, VR = Vulnerability Repair
BF (Bug-Fix): The models are trained on the bug-fix dataset and tested on the RQ3B vulnerability dataset.
TL (Transfer Learning): The models are initially trained on the bug-fix dataset and subsequently fine-tuned on the RQ3B vulnerability dataset.
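A sketch quantifying what the beam-50 columns show, namely transfer learning's gain over bug-fix-only training in percentage points (values copied from the table above; the helper name tl_gain is ours):

```shell
# TL minus BF at beam 50, in percentage points.
tl_gain() { awk -v tl="$1" -v bf="$2" 'BEGIN { printf "%+.2f\n", tl - bf }'; }
tl_gain 18.67 6.55    # VulRepair -> +12.12
tl_gain 26.07 11.31   # UniXcoder -> +14.76
```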
VISION TRANSFORMER INSPIRED AUTOMATED VULNERABILITY REPAIR PAPER [VQM]
We conducted a brief analysis of both the pre-training bug-fix dataset and the VQM vulnerability fine-tuning dataset used in that paper. The pre-training dataset contains 21,246 training samples and 2,362 validation samples. Our review revealed 18,622 duplicated entries in the training set and 782 in the validation set. After removing those, we identified 1,579 cross-set duplicates, that is, validation code samples that also appear in the training set; this accounts for nearly the entire deduplicated validation set. Additionally, our analysis uncovered a substantial overlap between the bug-fix dataset and the VQM vulnerability fine-tuning dataset. Specifically, 511 entries in the test set, 243 in the validation set, and 1,747 in the training set of the vulnerability dataset overlapped with the bug-fix dataset.
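The dedup arithmetic in the paragraph above, spelled out (counts copied from the text):

```shell
TRAIN_LEFT=$((21246 - 18622))   # pre-training train samples left after in-set dedup
VAL_LEFT=$((2362 - 782))        # validation samples left after in-set dedup
echo "train left: $TRAIN_LEFT, validation left: $VAL_LEFT"
# 1,579 of the remaining validation samples also occur in the training set,
# i.e., nearly all of them.
```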
RESULT: DETAILED ANALYSIS AND REPORT
Zhang et al. investigated the impact of varying beam size values. To verify their findings, we utilized the same dataset provided by the authors and attempted to replicate the results. Our observations indicate that as the beam size increases, the %PP value goes up.
Samples | Train | Validation | Test |
---|---|---|---|
Total Samples (TS) | 5,937 | 839 | 1,706 |
Seed | Beam = 1 | Beam = 2 | Beam = 3 | Beam = 4 | Beam = 5 | Beam = 10 | Beam = 15 | Beam = 20 | Beam = 50 | Beam = 100 |
---|---|---|---|---|---|---|---|---|---|---|
26312 | 0.3130 | 0.3623 | 0.3816 | 0.3951 | 0.3992 | 0.4127 | 0.4185 | 0.4191 | 0.4220 | 0.4174 |
43511 | 0.2198 | 0.2708 | 0.2948 | 0.3019 | 0.3095 | 0.3259 | 0.3306 | 0.3271 | 0.3247 | 0.3288 |
67732 | 0.3212 | 0.3769 | 0.3992 | 0.4127 | 0.4174 | 0.4291 | 0.4332 | 0.4343 | 0.4297 | 0.4308 |
70757 | 0.2726 | 0.3359 | 0.3517 | 0.3634 | 0.3693 | 0.3851 | 0.3845 | 0.3851 | 0.3875 | 0.3845 |
95541 | 0.2655 | 0.3253 | 0.3453 | 0.3617 | 0.3664 | 0.3787 | 0.3810 | 0.3810 | 0.3834 | 0.3769 |
123456 | 0.2521 | 0.3013 | 0.3376 | 0.3470 | 0.3529 | 0.3681 | 0.3681 | 0.3681 | 0.3693 | 0.3681 |
Average PP | 27.33% | 32.83% | 35.17% | 36.33% | 36.83% | 38.33% | 38.67% | 38.50% | 38.67% | 38.50% |
For this experiment, we removed in-set and cross-set duplicates from the dataset. We reran the CodeT5 model and observed that beyond a beam size of 15, %PP decreases. We were unable to find this result reported in any previously published paper.
Samples | Train | Validation | Test |
---|---|---|---|
Total Samples (TS) | 5,937 | 839 | 1,706 |
In-Set Duplicates (IS Dup) | 1,418 | 27 | 111 |
Samples Left (SL = TS - IS Dup) | 4,519 | 812 | 1,595 |
Cross-Set Duplicates (TEST) | - | - | Tr: 815, V: 108 |
Cross-Set Duplicates (VALIDATION) | - | Tr: 413 | - |
Unique Samples (US = SL - CS Dup) | 4,519 | 399 | 672 |
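The last two rows of the table encode the formula US = SL - CS Dup; a sketch of that arithmetic with the table's counts:

```shell
US_VALIDATION=$((812 - 413))      # 413 validation samples duplicated in train
US_TEST=$((1595 - 815 - 108))     # test dups vs. train (815) and validation (108)
echo "unique validation: $US_VALIDATION, unique test: $US_TEST"
```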
Seed | Beam = 1 | Beam = 2 | Beam = 3 | Beam = 4 | Beam = 5 | Beam = 10 | Beam = 15 | Beam = 20 | Beam = 50 | Beam = 100 |
---|---|---|---|---|---|---|---|---|---|---|
26312 | 0.0536 | 0.0759 | 0.0818 | 0.0863 | 0.0908 | 0.0878 | 0.0893 | 0.0878 | 0.0848 | 0.0789 |
43511 | 0.0580 | 0.0833 | 0.0908 | 0.0952 | 0.0967 | 0.1057 | 0.1057 | 0.1027 | 0.0952 | 0.0923 |
67732 | 0.0625 | 0.0893 | 0.0938 | 0.1042 | 0.0982 | 0.0982 | 0.1042 | 0.1027 | 0.0982 | 0.0923 |
70757 | 0.0714 | 0.0952 | 0.1027 | 0.1071 | 0.1101 | 0.1146 | 0.1176 | 0.1176 | 0.1057 | 0.1057 |
95541 | 0.0536 | 0.0729 | 0.0804 | 0.0893 | 0.0908 | 0.0982 | 0.0997 | 0.0982 | 0.0938 | 0.0893 |
123456 | 0.0491 | 0.0699 | 0.0878 | 0.0923 | 0.0952 | 0.0982 | 0.0952 | 0.0952 | 0.0833 | 0.0804 |
Average PP | 5.83% | 8.17% | 9.00% | 9.50% | 9.67% | 10.00% | 10.17% | 10.00% | 9.33% | 9.00% |
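A sketch confirming the stated trend in the Average PP row: the score peaks at beam 15 and declines for larger beams (beam/average pairs copied from the table above):

```shell
# Find the beam size with the highest average PP.
peak=$(printf '%s\n' "1 5.83" "2 8.17" "3 9.00" "4 9.50" "5 9.67" \
                     "10 10.00" "15 10.17" "20 10.00" "50 9.33" "100 9.00" |
       awk '$2 > max { max = $2; best = $1 } END { print best, max }')
echo "$peak"   # 15 10.17
```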
Note: The source code, dataset, model prediction output, train log, and test log files are under CodeT5_Beam_Analysis