A Micro-benchmarking Framework for Python Type Inference Tools

📌 Features:

📜 Contains 154 code snippets to test and benchmark.
🏷 Offers 845 type annotations across a diverse set of Python functionalities.
📂 Organized into 18 distinct categories targeting various Python features.
🚢 Seamlessly manages the execution of containerized tools.
🔄 Efficiently transforms inferred types into a standardized format.
📊 Automatically produces meaningful metrics for in-depth assessment and comparison.

[New] TypeEvalPy Autogen

🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark.
📈 The autogen benchmark now contains:
- Python files: 7121
- Type annotations: 78373

🛠️ Supported Tools

| Supported ✅ | In-progress 🔧 | Planned 💡 |
| --- | --- | --- |
| HeaderGen | Intellij PSI | MonkeyType |
| Jedi | Pyre | Pyannotate |
| Pyright | PySonar2 | |
| HiTyper | Pytype | |
| Scalpel | TypeT5 | |
| Type4Py | | |
| GPT | | |
| Ollama | | |

🏆 TypeEvalPy Leaderboard

Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.

Rank	🛠️ Tool	Function Return Type	Function Parameter Type	Local Variable Type	Total
1	mistral-large-it-2407-123b	16701	728	57550	74979
2	qwen2-it-72b	16488	629	55160	72277
3	llama3.1-it-70b	16648	580	54445	71673
4	gemma2-it-27b	16342	599	49772	66713
5	codestral-v0.1-22b	16456	706	49379	66541
6	codellama-it-34b	15960	473	48957	65390
7	mistral-nemo-it-2407-12.2b	16221	526	48439	65186
8	mistral-v0.3-it-7b	16686	472	47935	65093
9	phi3-medium-it-14b	16802	467	45121	62390
10	llama3.1-it-8b	16125	492	44313	60930
11	codellama-it-13b	16214	479	43021	59714
12	phi3-small-it-7.3b	16155	422	38093	54670
13	qwen2-it-7b	15684	313	38109	54106
14	HeaderGen	14086	346	36370	50802
15	phi3-mini-it-3.8b	15908	320	30341	46569
16	phi3.5-mini-it-3.8b	15763	362	28694	44819
17	codellama-it-7b	13779	318	29346	43443
18	Jedi	13160	0	15403	28563
19	Scalpel	15383	171	18	15572
20	gemma2-it-9b	1611	66	5464	7141
21	Type4Py	3143	38	2243	5424
22	tinyllama-1.1b	1514	28	2699	4241
23	mixtral-v0.1-it-8x7b	3235	33	377	3645
24	phi3.5-moe-it-41.9b	3090	25	273	3388
25	gemma2-it-2b	1497	41	1848	3386

(Auto-generated based on the analysis run on 30 Aug 2024)
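
The leaderboard counts exact matches per annotation category. As a rough illustration of what such a count involves, the sketch below compares inferred types against ground truth. The record layout (`file`, `line_number`, `name`, `kind`, `type`) and the function name `count_exact_matches` are assumptions made for this example; they are not TypeEvalPy's actual schema or API.

```python
from collections import Counter


def _key(record):
    # Hypothetical identity of an annotation: where it is and what it annotates.
    return (record["file"], record["line_number"], record["name"], record["kind"])


def count_exact_matches(ground_truth, inferred):
    """Count inferred annotations whose type equals the expected type, per kind."""
    expected = {_key(r): r["type"] for r in ground_truth}
    matches = Counter()
    for record in inferred:
        # An exact match requires the same location, name, kind, and type string.
        if expected.get(_key(record)) == record["type"]:
            matches[record["kind"]] += 1
    matches["total"] = sum(matches.values())
    return dict(matches)


# Made-up sample data for illustration only.
ground_truth = [
    {"file": "main.py", "line_number": 1, "name": "add", "kind": "return", "type": "int"},
    {"file": "main.py", "line_number": 1, "name": "x", "kind": "parameter", "type": "int"},
    {"file": "main.py", "line_number": 2, "name": "total", "kind": "local", "type": "int"},
]
inferred = [
    {"file": "main.py", "line_number": 1, "name": "add", "kind": "return", "type": "int"},
    {"file": "main.py", "line_number": 1, "name": "x", "kind": "parameter", "type": "str"},
    {"file": "main.py", "line_number": 2, "name": "total", "kind": "local", "type": "int"},
]

print(count_exact_matches(ground_truth, inferred))
```

In this sample the parameter `x` is inferred as `str` instead of `int`, so only the return type and the local variable count as exact matches.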

🐳 Running with Docker

1️⃣ Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

2️⃣ Build Docker image

docker build -t typeevalpy .

3️⃣ Run TypeEvalPy

🕒 The first run takes about 30 minutes because the Docker containers have to be built.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.
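
Since every run gets its own folder, a small helper can locate the most recent one. This is a minimal sketch: the function name `latest_results_dir` is hypothetical, and it sorts by modification time rather than parsing the folder name, because the timestamp format is not specified here.

```python
from pathlib import Path


def latest_results_dir(results_root="results"):
    """Return the most recently modified subfolder of results_root, or None."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    # Sorting by modification time avoids assuming a timestamp format in the name.
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


print(latest_results_dir())
```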

Correlation of CSV Files Generated to Tables in ICSE Paper

Here is how the auto-generated CSV tables relate to the paper's tables:

Table 1 in the paper is derived from three auto-generated CSV tables:
- paper_table_1.csv - details Exact matches by type category.
- paper_table_2.csv - lists Exact matches for 18 micro-benchmark categories.
- paper_table_3.csv - provides Sound and Complete values for tools.
Table 2 in the paper is based on the following CSV table:
- paper_table_5.csv - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

- paper_table_4.csv - containing Sound and Complete values for 18 micro-benchmark categories.
- paper_table_6.csv - featuring Sensitivity analysis.
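
To inspect these tables programmatically, something like the following may help. It is a sketch under two assumptions: the CSV files sit directly inside a run's results folder, and each has a header row. The function name `load_paper_tables` is hypothetical.

```python
import csv
from pathlib import Path


def load_paper_tables(run_dir):
    """Read every paper_table_*.csv in run_dir into a list of row dicts."""
    tables = {}
    for csv_path in sorted(Path(run_dir).glob("paper_table_*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # DictReader takes column names from the header row, so no
            # column names need to be hard-coded here.
            tables[csv_path.name] = list(csv.DictReader(handle))
    return tables


# Usage (the path is a placeholder for one timestamped run folder):
# tables = load_paper_tables("results/<run-folder>")
# for name, rows in tables.items():
#     print(name, len(rows), "rows")
```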

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy

🔧 Optionally, run analysis on specific tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel

📊 Run analysis on custom benchmarks:

For example, to run HeaderGen on the autogen benchmark:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy \
      --runners headergen \
      --custom_benchmark_dir /app/autogen_typeevalpy_benchmark

🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl
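
When scripting several runs, it can be convenient to assemble the `docker run` invocation in Python and reject misspelled runner names before Docker starts. The helper below is a sketch, and `build_docker_command` is a hypothetical name. It only uses the flags shown above. The `ollama` entry is included because the LLM section of this README passes it to `--runners`.

```python
import shlex
from pathlib import Path

# Runner names taken from this README; "ollama" comes from the LLM section.
AVAILABLE_RUNNERS = {
    "headergen", "pyright", "scalpel", "jedi",
    "hityper", "type4py", "hityperdl", "ollama",
}


def build_docker_command(runners=(), results_dir="results", custom_benchmark_dir=None):
    """Build the argument list for a TypeEvalPy docker run."""
    unknown = sorted(set(runners) - AVAILABLE_RUNNERS)
    if unknown:
        raise ValueError(f"Unknown runners: {', '.join(unknown)}")
    command = [
        "docker", "run",
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        # Resolve to an absolute host path for the bind mount.
        "-v", f"{Path(results_dir).resolve()}:/app/results",
        "typeevalpy",
    ]
    if runners:
        command += ["--runners", *runners]
    if custom_benchmark_dir:
        command += ["--custom_benchmark_dir", custom_benchmark_dir]
    return command


print(shlex.join(build_docker_command(["headergen", "scalpel"])))
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Using an absolute host path is a cautious choice, since some Docker versions do not accept relative paths in `-v`.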

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:

Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.

In config.yaml, set the following:

- openai_key: your key for accessing OpenAI's models.
- ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. See the Ollama documentation to get started.
- prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
- ollama_models: a list of model tags from the Ollama library. For smoother operation, make sure each model is pre-downloaded with the ollama pull command.
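
Before starting a long run, it may be worth checking that the configuration has all four settings. The sketch below works on an already-parsed mapping (for instance, the result of loading config.yaml with a YAML library). The expected value types and the function name `check_config` are assumptions, and the example values are placeholders rather than recommended settings.

```python
# Assumed value types for the four settings described above.
REQUIRED_KEYS = {
    "openai_key": str,
    "ollama_url": str,
    "prompt_id": str,
    "ollama_models": list,
}


def check_config(config):
    """Return a list of problems found in a parsed config.yaml mapping."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be a {expected_type.__name__}")
    return problems


# Placeholder values for illustration only.
example_config = {
    "openai_key": "YOUR_OPENAI_KEY",
    "ollama_url": "http://localhost:11434",  # Ollama's default port
    "prompt_id": "questions_based_2",
    "ollama_models": ["llama3.1:8b"],
}

print(check_config(example_config))
```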

With the config.yaml configured, run the following command:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama

Running From Source...

1. 📥 Installation

Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

Install Dependencies and Set Up Virtual Environment

Run the following commands to create and activate a virtual environment, then install the dependencies.
```
python3 -m venv .env
```
```
source .env/bin/activate
```
```
pip install -r requirements.txt
```

2. 🚀 Usage: Running the Analysis

Navigate to the src Directory
```
cd src
```
Execute the Analyzer

Run the following command to start the benchmarking process on all tools:
```
python main_runner.py
```
or

Run analysis on specific tools
```
python main_runner.py --runners headergen scalpel
```

Running TypeEvalPy Autogen

To generate an extended version of the original TypeEvalPy benchmark to include many more Python types, run the following commands:

Navigate to the autogen Directory
```
cd autogen
```
Execute the Generation Script

Run the following command to start the generation process:
```
python generate_typeevalpy_dataset.py
```

This generates a folder in the repo root, stamped with the current date, that contains the autogen benchmark.
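
To give a feel for what scaling a benchmark from existing snippets can look like, here is a toy illustration. It is not the logic of generate_typeevalpy_dataset.py: the template, the literal list, the ground-truth layout, and the function name `generate_variants` are all invented for this example.

```python
import ast

# Illustrative only: a tiny snippet template with one placeholder.
TEMPLATE = "def get_value():\n    value = {literal}\n    return value\n"

# Literals to substitute, chosen to cover several built-in types.
LITERALS = ["1", "1.5", "'text'", "[1, 2]", "{'a': 1}"]


def generate_variants(template, literals):
    """Produce one snippet plus matching ground truth per literal."""
    variants = []
    for literal in literals:
        source = template.format(literal=literal)
        # Evaluate the literal safely to learn which type the snippet yields.
        type_name = type(ast.literal_eval(literal)).__name__
        ground_truth = [
            {"function": "get_value", "type": [type_name]},
            {"function": "get_value", "variable": "value", "type": [type_name]},
        ]
        variants.append({"source": source, "ground_truth": ground_truth})
    return variants


for variant in generate_variants(TEMPLATE, LITERALS):
    print(variant["ground_truth"][0]["type"], variant["source"].splitlines()[1].strip())
```

Each substituted literal yields a new snippet whose expected types are known by construction, which is the general idea behind generating ground truth alongside code.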

🤝 Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, please submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md

⭐️ Show Your Support

Give a ⭐️ if this project helped you!
