Tool to scan through a given source code base and detect, process, analyze, and catalog, all source code name identifiers, then determine if the those identifier names contain preambles.
This tool is the implmentation of the Graduate Capstone Project: Automated Detection and Analysis of Common Preambles and Their Meanings in Source Code Identifiers
The interpretation of identifier names is a significant problem in software development. Programmers must interpret identifier names before performing any software development or maintenance task, code search and analysis techniques use identifier names to provide useful services to developers. Having high quality identifier names reduces the amount of time developers spend reading code and the increases the accuracy of techniques that use natural language information to support developers. Preambles are a particular type of token found in identifiers. Unlike typical tokens, Preambles add no new information to the meaning of am identifier’s name—but instead specify certain types of behavior (e.g., pointers) or help namespace (e.g., in the case of the C programming language) identifiers to a specify module. Because preambles add no new information, they should be removed from, or at least identified in, identifiers before code analysis tries to interpret identifier name meaning. The goal of my project is to validate and augment the types of preambles described in prior work and create a technique that can automatically detect preambles.
Before using preamble_collection_tool, the following dependencies must be installed:
It is important to note that as of now, preamble_collection_tool, is not platform independent, and will only run on Unix based operating systems (ie. Linux, or OSX)
After properly installing all necessary dependencies, create 3 new directories/folders in the main working directory:
projects/
reports/
analysis/
After creating these directories, clone/download source code projects that you desire to analyze into the newly created projects
directory.
NOTE: This tool currently only detects identifiers from source code files written in C, C++, C#, and Java
The tool is run through the following command: python3 preamble_tool.py [OPTION]
There are a number of commands to run for operation:
-
preamble_tool.py --help
orpreamble_tool.py -h
- Displays the help/options menu for reference. -
preamble_tool.py --collect-project data
orpreamble_tool.py -cpd
- Collects the name identifiers from all source files in allprojects
directory. Collected raw data from projects will be put into thereports
directory. -
preamble_tool.py --collect-first-terms
orpreamble_tool.py -cft
- Scans all files in thereports
directory and collects the most commonly used first terms of identifiers in all projects. Outputs two CSV files into theanalysis
directory. One of these files,dictionary_first_terms.csv
contains first terms that are in the English dictionary, whileother_first_terms.csv
contains all others. These files contain all individual first terms of identifiers used in all analyzed projects, as well as how often these first terms are used in all instances, as well as an example identifier, and location of said example. -
preamble_tool.py --analyze-project-data
orpreamble_tool.py -apd
- Analyzes all gathered data in thereports
directory, and runs full analysis to determine which identifiers in all projects contain preambles. Outputs final analysis results intopreamble_identifiers.csv
, located in theanalysis
directory.
The heuristics and scientific basis from which this tool collects, gathers, and analyzes source code name identifiers is based on research conducted by various scientific sources.
These sources are: