Description

This module contains an algorithm for finding code quality issues in pre-written code templates using frequency analysis. Motivation, why this algorithm is needed, can be found here.

The algorithm has been tested on the JetBrains Academy platform. As a tool for evaluating code quality, Hyperstyle was used, which is currently used on the platform by default. More information about the platform and the use of Hyperstyle can be found here.

Search template issues based on frequency

Firstly we need to indicate issues which remains in all submissions of concrete user for concrete step (which student do not during all attempts to pass step).

The algorithm pipeline is the following:

The goal of our approach is to automatically detect code quality issues in pre-written task templates to help task creators detect and fix them. Firstly, we run Hyperstyle on solutions for problems containing a pre-written template (1 in figure with pipeline), and then 2 only keep the issues that are present in all of the successful attempts. This way, we focus on issues that the students did not fix and that could thus not be introduced by them. Next, 3 we take the original template and 4 use a matching function to match code quality issues from the student solutions to it. For each issue, we have the position in the solution (line number), which allows us to extract the student's code lines and compare them with the template. After running this matching function, 5 we get the final table with the list of raw template issues. Finally, it is possible to run a special converter to present the results in a user-friendly format. (see the postprocessing section)

Usage

Execute one of the following commands with necessary arguments:

poetry run search_by_freq [arguments]

or

docker run hyperstyle-analysis-prod:<VERSION> poetry run search_by_freq [arguments]

Required arguments:

submissions_path — Path to .csv file with submissions. The file must contain the following columns: id, lang, step_id, code, group, attempt, hyperskill_issues/qodana_issues (please, use preprocess_submissions.py script to get group and attempt columns).
steps_path — Path to .csv file with steps. The file must contain the following columns: id, and code_template OR code_templates.
issues_column — Column name in .csv file with submissions where issues are stored (can be hyperstyle_issues ot qodana_issues).

Optional arguments:

Argument	Description
‑ic, ‑‑ignore-trailing-comments	Ignore trailing (in the end of line) comments while comparing two code lines.
‑iw, ‑‑ignore-trailing-whitespaces	Ignore trailing whitespaces while comparing two code lines.
‑equal	Function for lines comparing. Possible functions: `edit_distance`, `edit_ratio`, `substring`. The default value is `edit_distance`.
‑output-path	Path .csv file with repetitive issues search result. If no value was passed, the output will be printed into the console.

Output format

Example for output csv with repetitive issues:

step_id - id of step where repetitive issue found
name - issue name
description - the message about this issue which student see
pos_in_template - position of issue in template code (can be null if not detected)
line - example of line of code where repetitive issue is detected
frequency - % of submission series with such repetitive issue
count - number of submission series with such repetitive issue
total_count - number of all submission series for such step
groups - ids of submission series (groups) with such repetitive issue

step_id	name	description	pos_in_template	line	frequency	count	total_count	groups
2262	WhitespaceAfterCheck	"if' is not followed by whitespace ..."		"\tif(x < y) {"	0.09967585	123	1234	"[30, 33, 36]"
5203	IndentationCheck	"for' has incorrect indentation ..."	10	"\t\t\t\tfor(int k = i; k > 0; k--){"	0.80433251	4567	5678	"[130, 133, 136]"

Postprocessing

This script allows processing the results of the repetitive issues search algorithm to convert into more user-friendly format:

First, we apply basic issues filtering by frequency, because if a repetitive issue is found in a small percentage of submissions series, it is most likely not an issue in the template and the students just don't want (or know how to) to fix it. We have chosen 10% as such a threshold. This threshold was chosen empirically.

If within the same task there are several same code quality issues, with different frequencies, but with the None position in the template, keep the most frequent of them.

The results of the algorithm are divided into several tables:

template errors (Template type, frequency of at least 51%)
common typical errors (Typical type, frequency from 25% to 51%)
rare typical errors (Typical type, frequency 10% to 25%)

Also, additional supporting information can be received:

random student solutions containing a given issue in a given task
all submissions group ids which given repetitive issue
line of code with given issues
issue description

Usage

Execute one of the following commands with necessary arguments:

poetry run postprocess_by_freq [arguments]

or

docker run hyperstyle-analysis-prod:<VERSION> poetry run postprocess_by_freq [arguments]

Required arguments:

repetitive_issues_path — Path to resulting .csv file with repetitive issues.
submissions_path — Path to .csv file with submissions. The file must contain the following columns: id, lang, step_id, code, group, attempt, hyperskill_issues/qodana_issues (please, use preprocess_submissions.py script to get group and attempt columns).
issues_column — Column name in .csv file with submissions where issues are stored (can be hyperstyle_issues ot qodana_issues).

Optional arguments:

Argument	Description
‑fr, ‑‑freq-to-remove	The threshold of frequency to remove issues in the final table. The default value is 10.
‑fs, ‑‑freq-to-separate-rare-and-common-issues	The threshold of frequency to separate typical issues into rare and common. The default value is 25.
‑ft, ‑‑freq-to-separate-template-issues	The threshold of frequency to keep issues in the final table. The default value is 51.
‑‑with-additional-info	Generate samples with repetitive issues and add urls to steps. By default this value is disabled.
‑n, ‑‑solutions-number	Tne number of random students solutions that should be gathered for each task. The default value is 5.
‑url, ‑‑base-task-url	Base url to the tasks on an education platform. The default value is https://hyperskill.org/learn/step.
‑‑output-path	Path to resulting folder with processed issues. If no value was passed, the output will be printed into the console.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Description

Search template issues based on frequency

Usage

Output format

Postprocessing

Usage

Files

README.md

Latest commit

History

README.md

File metadata and controls

Description

Search template issues based on frequency

Usage

Output format

Postprocessing

Usage