A Micro-benchmarking Framework for Python Type Inference Tools

secure-software-engineering/TypeEvalPy

📌 Features:

  • 📜 Contains 154 code snippets to test and benchmark.
  • 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
  • 📂 Organized into 18 distinct categories targeting various Python features.
  • 🚢 Seamlessly manages the execution of containerized tools.
  • 🔄 Efficiently transforms inferred types into a standardized format.
  • 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.

[New] TypeEvalPy Autogen

  • 🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark.
  • 📈 The autogen benchmark now contains:
    • Python files: 7121
    • Type annotations: 78373

🛠️ Supported Tools

Supported ✅ | In-progress 🔧 | Planned 💡
HeaderGen | Intellij PSI | MonkeyType
Jedi | Pyre | Pyannotate
Pyright | PySonar2 |
HiTyper | Pytype |
Scalpel | TypeT5 |
Type4Py | |
GPT | |
Ollama | |

🏆 TypeEvalPy Leaderboard

Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.

Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total
1 | mistral-large-it-2407-123b | 16701 | 728 | 57550 | 74979
2 | qwen2-it-72b | 16488 | 629 | 55160 | 72277
3 | llama3.1-it-70b | 16648 | 580 | 54445 | 71673
4 | gemma2-it-27b | 16342 | 599 | 49772 | 66713
5 | codestral-v0.1-22b | 16456 | 706 | 49379 | 66541
6 | codellama-it-34b | 15960 | 473 | 48957 | 65390
7 | mistral-nemo-it-2407-12.2b | 16221 | 526 | 48439 | 65186
8 | mistral-v0.3-it-7b | 16686 | 472 | 47935 | 65093
9 | phi3-medium-it-14b | 16802 | 467 | 45121 | 62390
10 | llama3.1-it-8b | 16125 | 492 | 44313 | 60930
11 | codellama-it-13b | 16214 | 479 | 43021 | 59714
12 | phi3-small-it-7.3b | 16155 | 422 | 38093 | 54670
13 | qwen2-it-7b | 15684 | 313 | 38109 | 54106
14 | HeaderGen | 14086 | 346 | 36370 | 50802
15 | phi3-mini-it-3.8b | 15908 | 320 | 30341 | 46569
16 | phi3.5-mini-it-3.8b | 15763 | 362 | 28694 | 44819
17 | codellama-it-7b | 13779 | 318 | 29346 | 43443
18 | Jedi | 13160 | 0 | 15403 | 28563
19 | Scalpel | 15383 | 171 | 18 | 15572
20 | gemma2-it-9b | 1611 | 66 | 5464 | 7141
21 | Type4Py | 3143 | 38 | 2243 | 5424
22 | tinyllama-1.1b | 1514 | 28 | 2699 | 4241
23 | mixtral-v0.1-it-8x7b | 3235 | 33 | 377 | 3645
24 | phi3.5-moe-it-41.9b | 3090 | 25 | 273 | 3388
25 | gemma2-it-2b | 1497 | 41 | 1848 | 3386

(Auto-generated based on the analysis run on 30 Aug 2024)
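
The Total column is the sum of the three per-category exact-match counts, and tools are ranked by that total. A minimal Python sketch that recomputes this for a few rows copied from the table above (the `ROWS` structure and the `rank_by_total` helper are illustrative names, not part of TypeEvalPy):

```python
# Rows copied from the leaderboard: (tool, return, parameter, local variable).
ROWS = [
    ("HeaderGen", 14086, 346, 36370),
    ("mistral-large-it-2407-123b", 16701, 728, 57550),
    ("Scalpel", 15383, 171, 18),
    ("Jedi", 13160, 0, 15403),
]


def rank_by_total(rows):
    """Return (tool, total) pairs sorted by total exact matches, highest first."""
    totals = [(tool, ret + param + local) for tool, ret, param, local in rows]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for position, (tool, total) in enumerate(rank_by_total(ROWS), start=1):
        print(position, tool, total)
```

Note how the per-category counts matter: Scalpel's function return count is close to the top models, but its total is low because it matches very few local variable types.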


🐳 Running with Docker

1️⃣ Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

2️⃣ Build Docker image

docker build -t typeevalpy .

3️⃣ Run TypeEvalPy

🕒 The first run takes about 30 minutes because the Docker containers have to be built.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.

Correlation of CSV Files Generated to Tables in ICSE Paper

Here is how the auto-generated CSV tables relate to the paper's tables:

  • Table 1 in the paper is derived from three auto-generated CSV tables:
    • paper_table_1.csv - details Exact matches by type category.
    • paper_table_2.csv - lists Exact matches for 18 micro-benchmark categories.
    • paper_table_3.csv - provides Sound and Complete values for tools.
  • Table 2 in the paper is based on the following CSV table:
    • paper_table_5.csv - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

  • paper_table_4.csv - containing Sound and Complete values for 18 micro-benchmark categories.
  • paper_table_6.csv - featuring Sensitivity analysis.

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy
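
Because each run writes to its own timestamped folder, a small helper can pick out the most recent run and list its CSV files. The Python sketch below assumes only that each run is a subdirectory of results; it selects by modification time rather than parsing the folder name, since the exact timestamp format is not specified here, and it searches recursively because the location of the CSV files inside a run folder is also an assumption. The function names are hypothetical.

```python
from pathlib import Path


def latest_run(results_dir="results"):
    """Return the most recently modified run folder, or None if there is none."""
    root = Path(results_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def paper_tables(run_dir):
    """Map CSV file names such as paper_table_1.csv to their paths in a run."""
    return {p.name: p for p in sorted(Path(run_dir).rglob("paper_table_*.csv"))}


if __name__ == "__main__":
    run = latest_run()
    if run is None:
        print("No runs found in ./results")
    else:
        for name, path in paper_tables(run).items():
            print(name, "->", path)
```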

🔧 Optionally, run analysis on specific tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel

📊 Run analysis on custom benchmarks:

For example, to run HeaderGen on the autogen benchmark:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy \
      --runners headergen \
      --custom_benchmark_dir /app/autogen_typeevalpy_benchmark

🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl
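
The docker run variants above differ only in their trailing arguments, so they can be assembled programmatically when scripting several runs. A minimal Python sketch, assuming the image tag typeevalpy and the volume mounts shown above; `build_command` and `KNOWN_RUNNERS` are illustrative names, and ollama is included in the set only because it is used as a runner in the LLM section of this README:

```python
import shlex

# Runner names that appear in this README; "ollama" comes from the LLM section.
KNOWN_RUNNERS = {
    "headergen", "pyright", "scalpel", "jedi",
    "hityper", "type4py", "hityperdl", "ollama",
}


def build_command(runners=(), custom_benchmark_dir=None):
    """Build the docker run argument list for the given runners."""
    unknown = sorted(set(runners) - KNOWN_RUNNERS)
    if unknown:
        raise ValueError(f"Unknown runner(s): {', '.join(unknown)}")
    cmd = [
        "docker", "run",
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        "-v", "./results:/app/results",
        "typeevalpy",
    ]
    if runners:
        cmd += ["--runners", *runners]
    if custom_benchmark_dir:
        cmd += ["--custom_benchmark_dir", custom_benchmark_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_command(["headergen"], "/app/autogen_typeevalpy_benchmark")
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Rejecting unknown runner names up front avoids waiting for containers to start before discovering a typo.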

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:

  • Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.

In config.yaml, configure the following:

  • openai_key: your key for accessing OpenAI's models.
  • ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
  • prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
  • ollama_models: a list of model tags from the Ollama library. For smoother operation, pre-download each model with the ollama pull command.
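
Before launching a long run, it can help to check that the four settings above are present. The Python sketch below validates an already-parsed configuration; it assumes the keys sit at the top level of config.yaml (check config_template.yaml for the actual layout), and the sample values, including the Ollama URL and the model tag, are placeholders. Parsing the file itself would need a YAML library such as PyYAML (`yaml.safe_load`). The `check_config` helper is hypothetical and not part of TypeEvalPy.

```python
REQUIRED_KEYS = ("openai_key", "ollama_url", "prompt_id", "ollama_models")


def check_config(config):
    """Return a list of problems found in a parsed config mapping."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    url = config.get("ollama_url")
    if isinstance(url, str) and not url.startswith(("http://", "https://")):
        problems.append("ollama_url should start with http:// or https://")
    models = config.get("ollama_models")
    if models is not None and (not isinstance(models, list) or not models):
        problems.append("ollama_models should be a non-empty list of model tags")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = {
        "openai_key": "sk-...",
        "ollama_url": "http://localhost:11434",
        "prompt_id": "questions_based_2",
        "ollama_models": ["llama3.1:8b"],
    }
    print(check_config(sample) or "config looks complete")
```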

With the config.yaml configured, run the following command:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama

Running From Source...

1. 📥 Installation

  1. Clone the repo

    git clone https://github.com/secure-software-engineering/TypeEvalPy.git
  2. Install Dependencies and Set Up Virtual Environment

    Run the following commands to create and activate a virtual environment, then install the dependencies.

    python3 -m venv .env
    source .env/bin/activate
    pip install -r requirements.txt

2. 🚀 Usage: Running the Analysis

  1. Navigate to the src Directory

    cd src
  2. Execute the Analyzer

    Run the following command to start the benchmarking process on all tools:

    python main_runner.py

    or

    Run analysis on specific tools

    python main_runner.py --runners headergen scalpel
    

Running TypeEvalPy Autogen

To generate an extended version of the original TypeEvalPy benchmark that covers many more Python types, run the following commands:

  1. Navigate to the autogen Directory

    cd autogen
  2. Execute the Generation Script

    Run the following command to start the generation process:

    python generate_typeevalpy_dataset.py

This generates a folder in the repo root, labeled with the current date, that contains the autogen benchmark.


🤝 Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, please submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


⭐️ Show Your Support

Give a ⭐️ if this project helped you!
