This module contains an algorithm for finding code quality issues in pre-written code templates using frequency analysis. Motivation, why this algorithm is needed, can be found here.
The algorithm has been tested on the JetBrains Academy platform. As a tool for evaluating code quality, Hyperstyle was used, which is currently used on the platform by default. More information about the platform and the use of Hyperstyle can be found here.
Firstly we need to indicate issues which remains in all submissions of concrete user for concrete step (which student do not during all attempts to pass step).
The algorithm pipeline is the following:
The goal of our approach is to automatically detect code quality issues in pre-written task templates to help task creators detect and fix them. Firstly, we run Hyperstyle on solutions for problems containing a pre-written template (1 in figure with pipeline), and then 2 only keep the issues that are present in all of the successful attempts. This way, we focus on issues that the students did not fix and that could thus not be introduced by them. Next, 3 we take the original template and 4 use a matching function to match code quality issues from the student solutions to it. For each issue, we have the position in the solution (line number), which allows us to extract the student's code lines and compare them with the template. After running this matching function, 5 we get the final table with the list of raw template issues. Finally, it is possible to run a special converter to present the results in a user-friendly format. (see the postprocessing section)
Execute one of the following commands with necessary arguments:
poetry run search_by_freq [arguments]
or
docker run hyperstyle-analysis-prod:<VERSION> poetry run search_by_freq [arguments]
Required arguments:
submissions_path
— Path to .csv file with submissions. The file must contain the following columns:id
,lang
,step_id
,code
,group
,attempt
,hyperskill_issues
/qodana_issues
(please, use preprocess_submissions.py script to getgroup
andattempt
columns).steps_path
— Path to .csv file with steps. The file must contain the following columns:id
, andcode_template
ORcode_templates
.issues_column
— Column name in .csv file with submissions where issues are stored (can behyperstyle_issues
otqodana_issues
).
Optional arguments:
Argument | Description |
---|---|
‑ic, ‑‑ignore-trailing-comments | Ignore trailing (in the end of line) comments while comparing two code lines. |
‑iw, ‑‑ignore-trailing-whitespaces | Ignore trailing whitespaces while comparing two code lines. |
‑equal | Function for lines comparing. Possible functions: edit_distance , edit_ratio , substring . The default value is edit_distance . |
‑output-path | Path .csv file with repetitive issues search result. If no value was passed, the output will be printed into the console. |
Example for output csv with repetitive issues:
step_id
- id of step where repetitive issue foundname
- issue namedescription
- the message about this issue which student seepos_in_template
- position of issue in template code (can be null if not detected)line
- example of line of code where repetitive issue is detectedfrequency
- % of submission series with such repetitive issuecount
- number of submission series with such repetitive issuetotal_count
- number of all submission series for such stepgroups
- ids of submission series (groups) with such repetitive issue
step_id | name | description | pos_in_template | line | frequency | count | total_count | groups |
---|---|---|---|---|---|---|---|---|
2262 | WhitespaceAfterCheck | "if' is not followed by whitespace ..." | "\tif(x < y) {" | 0.09967585 | 123 | 1234 | "[30, 33, 36]" | |
5203 | IndentationCheck | "for' has incorrect indentation ..." | 10 | "\t\t\t\tfor(int k = i; k > 0; k--){" | 0.80433251 | 4567 | 5678 | "[130, 133, 136]" |
This script allows processing the results of the repetitive issues search algorithm to convert into more user-friendly format:
First, we apply basic issues filtering by frequency, because if a repetitive issue is found in a small percentage of submissions series, it is most likely not an issue in the template and the students just don't want (or know how to) to fix it. We have chosen 10% as such a threshold. This threshold was chosen empirically.
If within the same task there are several same code quality issues, with different frequencies, but with the None position in the template, keep the most frequent of them.
The results of the algorithm are divided into several tables:
- template errors (Template type, frequency of at least 51%)
- common typical errors (Typical type, frequency from 25% to 51%)
- rare typical errors (Typical type, frequency 10% to 25%)
Also, additional supporting information can be received:
- random student solutions containing a given issue in a given task
- all submissions group ids which given repetitive issue
- line of code with given issues
- issue description
Execute one of the following commands with necessary arguments:
poetry run postprocess_by_freq [arguments]
or
docker run hyperstyle-analysis-prod:<VERSION> poetry run postprocess_by_freq [arguments]
Required arguments:
repetitive_issues_path
— Path to resulting .csv file with repetitive issues.submissions_path
— Path to .csv file with submissions. The file must contain the following columns:id
,lang
,step_id
,code
,group
,attempt
,hyperskill_issues
/qodana_issues
(please, use preprocess_submissions.py script to getgroup
andattempt
columns).issues_column
— Column name in .csv file with submissions where issues are stored (can behyperstyle_issues
otqodana_issues
).
Optional arguments:
Argument | Description |
---|---|
‑fr, ‑‑freq-to-remove | The threshold of frequency to remove issues in the final table. The default value is 10. |
‑fs, ‑‑freq-to-separate-rare-and-common-issues | The threshold of frequency to separate typical issues into rare and common. The default value is 25. |
‑ft, ‑‑freq-to-separate-template-issues | The threshold of frequency to keep issues in the final table. The default value is 51. |
‑‑with-additional-info | Generate samples with repetitive issues and add urls to steps. By default this value is disabled. |
‑n, ‑‑solutions-number | Tne number of random students solutions that should be gathered for each task. The default value is 5. |
‑url, ‑‑base-task-url | Base url to the tasks on an education platform. The default value is https://hyperskill.org/learn/step. |
‑‑output-path | Path to resulting folder with processed issues. If no value was passed, the output will be printed into the console. |