Annotates files and lines of diffs (patches) with their purpose and type, and performs statistical analysis on the generated annotation data.
Use the package manager pip to install diffannotator.
To avoid dependency conflicts, it is strongly recommended to create a virtual environment first, activate it, and install diffannotator into this environment. See also "Virtual environment" subsection below.
To install the most recent version, use
python -m pip install diffannotator@git+https://github.com/ncusi/python-diff-annotator#egg=main
or (assuming that you can clone the repository with SSH)
python -m pip install diffannotator@git+ssh://[email protected]/ncusi/python-diff-annotator.git#egg=main
This package installs scripts (currently three) that you can run to generate patches, annotate them, and extract their statistics. Every script name starts with the diff-* prefix. Each script and subcommand supports the --help option.
- diff-generate: generates patches (*.patch and *.diff files) from a given repository, in a format suitable for later analysis; this step is not strictly necessary.
  Usage: diff-generate [OPTIONS] REPO_PATH [REVISION_RANGE...]
  (where REVISION_RANGE... is passed as arguments to the git log command)
- diff-annotate: annotates an existing dataset (patch files in subdirectories), or annotates a selected subset of commits (of changes in commits) in the given repository.
  Usage: diff-annotate [OPTIONS] COMMAND [ARGS]...
  - diff-annotate patch [OPTIONS] PATCH_FILE RESULT_JSON: annotate a single PATCH_FILE, writing results to RESULT_JSON,
  - diff-annotate dataset [OPTIONS] DATASETS...: annotate all bugs in provided DATASETS,
  - diff-annotate from-repo [OPTIONS] REPO_PATH [REVISION_RANGE...]: create annotation data for commits from a local Git repository (with REVISION_RANGE... passed as arguments to the git log command);
- diff-gather-stats: computes various statistics and metrics from patch annotation data generated by the diff-annotate script.
  Usage: diff-gather-stats [OPTIONS] COMMAND [ARGS]...
  - diff-gather-stats purpose-counter [--output JSON_FILE] DATASETS...: calculate count of purposes from all bugs in provided datasets,
  - diff-gather-stats purpose-per-file [OPTIONS] RESULT_JSON DATASETS...: calculate per-file count of purposes from all bugs in provided datasets,
  - diff-gather-stats lines-stats [OPTIONS] OUTPUT_FILE DATASETS...: calculate per-bug and per-file count of line types in provided datasets,
  - diff-gather-stats timeline [OPTIONS] OUTPUT_FILE DATASETS...: calculate timeline of bugs with per-bug count of different types of lines;
- ...
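As a rough illustration of the usage strings listed above, the following Python sketch assembles command lines for the three scripts and refuses to run one that is not on PATH. The helper names (generate_cmd, annotate_patch_cmd, purpose_counter_cmd, run) and the example paths are hypothetical, and only arguments documented above are used; where each script writes its output, and which further options it accepts, should be checked with --help.

```python
import shlex
import shutil
import subprocess


def generate_cmd(repo_path, *revision_range):
    # Usage: diff-generate [OPTIONS] REPO_PATH [REVISION_RANGE...]
    return ["diff-generate", str(repo_path), *revision_range]


def annotate_patch_cmd(patch_file, result_json):
    # Usage: diff-annotate patch [OPTIONS] PATCH_FILE RESULT_JSON
    return ["diff-annotate", "patch", str(patch_file), str(result_json)]


def purpose_counter_cmd(datasets, output=None):
    # Usage: diff-gather-stats purpose-counter [--output JSON_FILE] DATASETS...
    cmd = ["diff-gather-stats", "purpose-counter"]
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd + [str(dataset) for dataset in datasets]


def run(cmd):
    # Fail early with a readable message if the script is not installed.
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH; is diffannotator installed?")
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, printed only; pass a command to run() to execute it.
    for cmd in (
        generate_cmd("path/to/repo", "HEAD~5..HEAD"),
        annotate_patch_cmd("example.patch", "example.json"),
        purpose_counter_cmd(["dataset_dir"], output="purposes.json"),
    ):
        print(shlex.join(cmd))
```

Building the commands as argument lists (rather than shell strings) avoids quoting problems with paths and revision ranges.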
This repository also includes some examples demonstrating how this project works, and what it can be used for.
You can set up the environment for using this project, following the recommended practices (described in the "Development" section of this document), by running the examples-init.bash Bash script and following its instructions.
Note that this script assumes that it is run on Linux or a Linux-like system. For other operating systems, you are probably better off following the steps described in this document manually.
This script includes a configuration section at its beginning; you can change the parameters to better fit your environment:
- DVCSTORE_DIR: directory with local DVC remote
- PYTHON: Python 3.x executable (before activating virtual environment)
This project uses the DVC (Data Version Control) tool to track and version annotations and metrics data. DVC makes it possible to store large files and large directories outside of the Git repository while still keeping them under version control. They can be stored locally or in the cloud.
The examples-init.bash script also configures local DVC storage (see the next subsection).
To provide reproducibility, and to make it possible to version data files separately from the code, this project uses the DVC (Data Version Control) tool for its examples.
DVC pipelines are versioned using Git, and allow you to better organize projects and reproduce complete workflows and results at will. The pipeline is defined in the dvc.yaml file.
You can re-run the whole pipeline, after installing DVC, with the dvc repro command. It runs only those pipeline stages that need it, by checking whether stage dependencies (defined in dvc.yaml) have changed. The results are saved in the DVC cache, and you can push them to a DVC remote with dvc push, if you have one configured (the examples-init.bash script from the previous subsection configures a DVC remote with storage on the local filesystem).
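The idea of re-running only stages whose dependencies changed can be sketched in a few lines of Python. This is a toy illustration of hash-based change detection, not DVC's actual implementation: the function names, the stage names, and the choice of SHA-256 are assumptions made for the example, and unlike DVC it does not propagate changes to downstream stages.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return a hex digest of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stages_to_run(stages, recorded):
    """Return names of stages with at least one changed dependency.

    stages:   dict mapping stage name -> list of dependency file paths
    recorded: dict mapping file path -> digest saved after the previous run
    """
    stale = []
    for name, deps in stages.items():
        # A dependency counts as changed if it was never recorded
        # or its current digest differs from the recorded one.
        if any(recorded.get(str(dep)) != file_digest(dep) for dep in deps):
            stale.append(name)
    return stale
```

A stage whose dependencies all match their recorded digests is skipped, which is why repeated runs on unchanged inputs are cheap.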
In the future, the example data for demos will also be available from a DVC remote, perhaps on DAGsHub, so that it can be downloaded and used without the need to recompute it.
The notebooks/ directory contains Jupyter Notebooks with data exploration, data analysis, etc. See notebooks/README.md for details.
To avoid dependency conflicts, it is strongly recommended to create a virtual environment, for example with:
python -m venv .venv
This needs to be done only once, from the top directory of the project. For each session, you should activate the environment:
source .venv/bin/activate
Using a virtual environment, either directly as shown above or by using pipx, might be required if you cannot install system packages and Python is configured as an externally managed environment, in which case pip reports:
error: externally-managed-environment
× This environment is externally managed
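If you prefer to script this step, the standard library offers the same functionality as python -m venv. The sketch below is a minimal example with hypothetical helper names (in_virtualenv, create_env); note that venv.create() does not install pip unless with_pip=True is passed, whereas the command-line form installs it by default.

```python
import sys
import venv
from pathlib import Path


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_env(env_dir=".venv", with_pip=True):
    """Create a virtual environment, similar to `python -m venv ENV_DIR`.

    Returns the path of the pyvenv.cfg file that marks the new environment.
    """
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    # Only report the current state; call create_env() explicitly to create one.
    if in_virtualenv():
        print("Running inside a virtual environment:", sys.prefix)
    else:
        print("Not inside a virtual environment; consider calling create_env()")
```

Creating the environment does not activate it; you still need to run source .venv/bin/activate in each shell session.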
To install the project in editable mode (from the top directory of this repo):
python -m pip install -e .
To also be able to run the tests, use:
python -m pip install --editable .[dev]
This project uses the pytest framework. Note that pytest requires Python 3.8+ or PyPy3.
To run the tests, use the following command:
pytest
or
python -m pytest
See TODO.md.
Here are some related projects that can also be used to extract development statistics from a project or a repository.
Command line and terminal interface tools:
- git-quick-stats is a simple and efficient way to access various statistics in a git repository
- git-stats provides local git statistics, including GitHub-like contributions calendars
- git_dash.sh is a command-line shell script for generating a Git metrics dashboard directly in your terminal
- heatwave visualizes your git commits with a heat map in the terminal, similar to how GitHub's heat map looks
- statscat is a CLI tool to get statistics of all your git repositories
- hxtools by Jan Engelhardt is a collection of small tools and scripts, which include git-author-stat (commit author statistics of a git repository), git-blame-stat (per-line author statistics), and git-revert-stats (reverting statistics)
- git-fame (in Python) and git-fame-rb (in Ruby) are command-line tools to pretty-print Git repository collaborators sorted by contributions
- git-of-theseus is a set of scripts to analyze how a Git repo grows over time.
  - See The half-life of code & the ship of Theseus by Erik Bernhardsson (2016).
- GitHub Linguist can also be used from the command line, using the github-linguist executable to generate a repository's language stats (the language breakdown by percentage and file size), also for a selected revision
- git-metrics tool is a set of util scripts to scrape data from git repositories to help teams improve (metrics such as lead time and open branches)
Tools to generate HTML dashboard, or providing an interactive web application:
- GitStats is an open source GitHub contribution analyzer, providing live dashboard; note that gitstats.me no longer works (the domain is parked for sale)
- repostat is a Git repository analyser and HTML-report generator with NVD3-driven interactive metrics visualisations; note that demo site https://repostat.imfast.io/ no longer works
- Repositorch is a Git repository analysis engine written in C#; it recommends using Docker Compose to install (Repositorch on Docker Hub); no demo site, but there is "How to use Repositorch" video on YouTube
- cregit is a tool for helping to find and analyse code credits (unify identities, find contribution by token, extract metadata into a SQLite database, etc.)
- Githru is an interactive visual analytics system that enables developers to effectively understand the context of development history through the interactive exploration of Git and GitHub metadata (demo). It uses novel techniques (paper) (graph reconstruction, clustering, and Context-Preserving Squash Merge (CSM) methods) to abstract a large-scale Git commit graph.
- Assayo is a dashboard providing visualization and analysis of git commit statistics. Requires exporting data from Git. Has a homepage with demo. Its use is described in The visualization and analysis of git commit statistics for IT team leaders.
Visualizations for a specific repository:
- A Git history visualization page by Jeff Palmer shows "An Interactive Development History" of Git: project and contributor statistics, relative cumulative contributions by contributor, and aggregated commits by contributor by month with milestone annotations. Jeff wrote an associated blog post about how he created the visualization.
- gitdm (the "git data miner") is the tool that Greg KH and Jonathan Corbet have used to create statistics on where kernel patches come from. Written in Python. Original at git://git.lwn.net/gitdm.git
Web applications that demonstrate some MSR tool:
- GitHub offers GitHub Insights for repositories (see for example Contributors to qtile/qtile). This includes the following subpages:
- Pulse (with configurable period of 1 month, 1 week, 3 days, 24 hours) shows information about pull requests and issues, and summary of changes as text (N authors pushed X commits to master, and Y to all branches. On master, M files were changed and there have been A additions and D deletions).
- Contributions per week to master, excluding merge commits {as smoothed (!) line/area plot}, for whole project, and for up to 100 authors (with configurable period of all, last month, last 3 months, last 6 months, last 12 months, last 24 months; with configurable type of contributions: commits, additions, deletions). For each author we also have summary of their contributions as text (N commits, A ++, D --).
- Commits shows two plots: bar plot of commits per week over time for the last year {without any explanation, except for information shown on mouse hover}, and line plot with days of the week on x-axis {no explanation, no information on hover (!)}. No configuration.
- Code frequency over the history of the project: additions and deletions per week (where additions use green solid lines, and deletions use red dashed lines and are plotted upside-down). No configuration.
- other pages related to GitHub specifically, or the project as whole but not its history (like Community Standards, Dependency graph, Forks, or Action Usage Metrics).
- GitHub also offers Developer Overview, which among others include the following chart:
- N contributions in last year / in YYYY, showing heatmap using 5-color discrete colormap, with year worth of weeks on x-axis, and day of the week (Sun to Sat) on the y-axis. You can switch between the years with a "radio button" (though there is no 'last year' entry). Contributions are timestamped according to Coordinated Universal Time (UTC) rather than contributor's local time zone.
- Assayo has a homepage with a demo where you can provide the output of a given Git CLI command run in your repo to create the demo for your repo; you can also view a demo with mock data. Written in JavaScript with React.
- Githru has an interactive demo, where you can select one of the following two GitHub repositories to visualize: vuejs/vue and realm/realm-java. Written in JavaScript with React, D3, dagre.
- GitVision, a 3D repository graph visualization tool, has a live demo with visualization for more than 20 repositories (ranging from tiny to large), and where you can visualize your own repository by uploading the result of running the GitVision script. The demo is written in JavaScript using Vue and deployed with Vite.
- GitBug-Java, a reproducible Java benchmark of recent bugs (tool accompanying the GitBug-Java: A Reproducible Java Benchmark of Recent Bugs paper (on arXiv)), has web app visualizing the dataset. No source code for the web app; it seems to be in JavaScript using Angular, with the help of Chart.js and diff2html.
- Defects4J Dissection is an open-source web app that presents data to help researchers and practitioners to better understand the Defects4J bug dataset. Includes table view (the default) and charts. It is the open-science appendix of "Dissection of a bug dataset: anatomy of 395 patches from Defects4J" paper. Written in Python and JavaScript, under MIT license.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.