Parsers

astminer supports multiple parsers for various programming languages. Here we describe the integrated parsers and their peculiarities.

ANTLR

ANTLR provides an infrastructure to generate lexers and parsers for languages based on grammars. For now, astminer supports ANTLR-based parsers for Java, Python, JS, and PHP.

GumTree

GumTree is a framework to work with source code as trees and to compute the differences between the trees in different versions of code. It also builds language-agnostic representations of code. For now, astminer supports GumTree-based parsers for Java and Python.

python-parser

Running GumTree with Python requires python-parser. You can set it up as follows:

  1. Download the sources from GitHub.
  2. Install the dependencies:
     pip install -r requirements.txt
  3. Make the python-parser script executable:
     chmod +x src/main/python/pythonparser/pythonparser_3.py
  4. Add python-parser to PATH. Copy the script to a file named pythonparser, then add the directory that contains it to PATH (PATH entries are directories, not files):
     cp src/main/python/pythonparser/pythonparser_3.py src/main/python/pythonparser/pythonparser
     export PATH="<path>/src/main/python/pythonparser:${PATH}"
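
To confirm that the setup worked, it can help to check that the name pythonparser resolves through PATH in the shell that will launch astminer. The snippet below is a minimal sketch. `is_on_path` is a hypothetical helper, not part of astminer or GumTree, and it assumes the parser is looked up by the name `pythonparser`, as the copy step suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of astminer or GumTree): succeeds when the
# given command name can be resolved through PATH.
is_on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Check the executable name created by the copy step above.
if is_on_path pythonparser; then
    echo "pythonparser found at: $(command -v pythonparser)"
else
    echo "pythonparser is not on PATH; check the export line" >&2
fi
```

If the check fails right after the export, the usual cause is a PATH entry that points at the script file instead of the directory that contains it.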

srcML backend

GumTree supports many additional languages through its srcML backend, so astminer exposes GumTree with srcML as a separate parser. Running it requires installing srcML: https://www.srcml.org/

If you run into problems during installation, check the Dockerfile in the project root.

Fuzzy

Originally fuzzyc2cpg, Fuzzy is now part of codepropertygraph. astminer uses it to parse C/C++ code. g++ is required for this parser.

JavaParser

JavaParser is a parser for Java. It is used to obtain trees for the code2seq and code2vec models, and many other studies use it to collect and process trees. When working with JavaParser, astminer implements an algorithm similar to the one in the JavaExtractor module of the code2vec repository, so that the resulting trees are similar.

Other languages and parsers

Support for a new programming language can be implemented in a few simple steps.

If there is an ANTLR grammar for the language:

  1. Add the corresponding ANTLR4 grammar file to the antlr directory.
  2. Run the generateGrammarSource Gradle task to generate the parser.
  3. Implement a small wrapper around the generated parser. See JavaParser or PythonParser for an example of such a wrapper.
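
Steps 1 and 2 above can be sketched as a small shell function. This is an illustration under assumptions, not the project's own tooling: `add_antlr_grammar` and `MyLang.g4` are hypothetical names, the antlr directory is assumed to be `src/main/antlr` (the default source location of the Gradle ANTLR plugin), and the project root is assumed to contain the Gradle wrapper `./gradlew`.

```shell
#!/usr/bin/env bash
# Sketch of steps 1 and 2. Assumptions (not confirmed by this document):
#   - the "antlr directory" is src/main/antlr, the Gradle ANTLR plugin default;
#   - the project root contains the Gradle wrapper script ./gradlew.
# add_antlr_grammar is a hypothetical helper used for illustration only.
add_antlr_grammar() {
    local grammar_file="$1"
    local project_root="$2"
    local antlr_dir="${project_root}/src/main/antlr"

    if [ ! -f "${grammar_file}" ]; then
        echo "grammar file not found: ${grammar_file}" >&2
        return 1
    fi

    # Step 1: put the grammar where the build looks for ANTLR sources.
    mkdir -p "${antlr_dir}" || return 1
    cp "${grammar_file}" "${antlr_dir}/" || return 1

    # Step 2: generate the lexer and parser from the grammar.
    (cd "${project_root}" && ./gradlew generateGrammarSource)
}
```

A call such as `add_antlr_grammar MyLang.g4 <path-to-astminer>` would copy the grammar and run the generation task. Step 3, the wrapper around the generated parser, still has to be written by hand.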

If the language has a parsing tool that is available as a Java library:

  1. Add the library as a dependency in build.gradle.kts.
  2. Implement a wrapper for the parsing tool. See FuzzyCppParser for an example of such a wrapper.