Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract Config file parsing in R/config.R #230

Open
carlosparadis opened this issue Sep 20, 2023 · 48 comments
Open

Abstract Config file parsing in R/config.R #230

carlosparadis opened this issue Sep 20, 2023 · 48 comments
Assignees
Labels
type:refactoring Code changes that improve maintenability, performance, etc

Comments

@carlosparadis
Copy link
Member

Currently, modifying the /conf file schema requires checking every Notebook in Kaiaulu to make sure it still parses the project's yaml file correctly. The logic behind understanding the config file should be abstracted in a set of get methods, to de-couple all the Notebooks from it.

The coupling doesn't occur if it is adding fields, but any substantial re-organization will cause a mass co-change cascade. This should also help document the config file as R functions, and in the future write some utilities to analyze config files (e.g. a paper's project demographics could go here, since it would borrow info from the config to compute the latex table).

@carlosparadis carlosparadis added the type:refactoring Code changes that improve maintenability, performance, etc label Sep 20, 2023
@carlosparadis carlosparadis self-assigned this Sep 20, 2023
@anthonyjlau
Copy link
Collaborator

anthonyjlau commented Mar 15, 2024

1. Purpose

The purpose of this issue is to create get functions for configuration files. The motivation for this is that in each notebook in the vignettes folder, it loads in the config files by using a variable assignment (i.e. mbox_path <- conf[["mailing_list"]][["mbox"]]). If we were to change the structure of the config file (i.e. rename "mailing_list" to something else or move the variable elsewhere), then we would have to go though each notebook that uses it and change it so that it goes to the right value. To solve this, I am creating the file R/config.R to centralize the process of getting values from config files.

2. Process

This will be done by creating get methods for all of the config files that will be used in the notebooks. Then, I replace the direct variable assignment with the get methods in R/config.R.

3. Task List

  • Find out which config files I need to make get methods for.
  • Create R/config.R and make the get methods for each config file.
  • Replace the variable assignments in each notebook with my getter methods.

Updated API Nomenclature

All get functions now take in a parameter called project_name. The variable should be a string that contains the name of the config file WITHOUT the .yml after it.

Ex: get_project_code_language("camel"): This will return the code language field in the camel.yml config file.

@carlosparadis
Copy link
Member Author

@anthonyjlau Looks great!

@carlosparadis carlosparadis changed the title Abstract Config file parsing in R/config.R Abstract Config file parsing in R/config.R (Milestone 2) Mar 18, 2024
@carlosparadis
Copy link
Member Author

@anthonyjlau One more thing: Make sure you post here the function API nomenclature (i.e. update the first issue following the template). It would be a lot of time waste on your end if you went with a function nomenclature I would like to modify, as you would need to modify every notebook to fix that thereafter.

anthonyjlau added a commit that referenced this issue Mar 23, 2024
anthonyjlau added a commit that referenced this issue Mar 26, 2024
Added get  functions for mbox, jira,        and github.
anthonyjlau added a commit that referenced this issue Mar 28, 2024
Functions that require something from a list (ie. conf[["version_control"]][["branch"]][4]) now have an index parameter. (ie. conf[["version_control"]][["branch"]][branch_index]) Also finished the rest of the get methods for all notebooks.
anthonyjlau added a commit that referenced this issue Mar 28, 2024
I normalized the comments by having it follow a format. I added if wrappers to catch null values.
anthonyjlau added a commit that referenced this issue Apr 3, 2024
I changed the documentation of the JIRA functions to return a list of project keys and storage paths (HADOOP case).

I also changed the documentation of the mbox function to return a list of storage paths.
@anthonyjlau
Copy link
Collaborator

@Ssunoo2 @ian-lastname Ensure that the configs that are used on this project are in the final version and work with config.R. This includes the projects for the mailing lists (@ian-lastname), the ones used for Jira and Github (@Ssunoo2) and the ones for Bugzilla (@anthonyjlau).

anthonyjlau added a commit that referenced this issue Apr 29, 2024
@beydlern beydlern self-assigned this Sep 8, 2024
@crepesAlot crepesAlot self-assigned this Sep 12, 2024
@crepesAlot
Copy link
Collaborator

crepesAlot commented Sep 12, 2024

1. Purpose


Currently, the Notebooks (.Rmd files in the /vignettes folder) have hard coded file paths connected to the .yml files in the /conf folder.
This means that by refactoring the folder or files within it, a problem of needing to manually go in each and every Notebook and update any of the changed values.
To decouple the code, it needs to be configured such that the R/config.R file contains a set of get() functions to replace the hardcoded specific paths in the config files.

2. Process


The process will be to follow the previous solution, creating get() functions within the R/config.R file and then replace the direct variable assignments within the Notebooks with the R/config.R get() functions. I (@crepesAlot) will be primarily focusing on documentation work, which includes updating the notebooks and ensuring that the formatting for the conf/ files match that of the ones from issue #286.

3. Task List


  • Understand the structure of the conf/ files
  • Ensure all files within the conf/ folder follow the new specifications, following in line with the refresher cheatsheet and solutions to issue Downloaders data storage organization #286

conf/ specifications

An example of a line for the issue tracker for github would be:
issue: ../../rawdata/project_name/github/issue/owner_repo/
An for jira:
issue: ../../rawdata/project_name/jira/issue/project_key/

The projectName being listed before github would allow the user to save, find, or send all of the data of the project easier, rather than having to search in each data source for files related to the project.
i.e. Having to search the github folder for data related to Kaiaulu, then needing to check if the jira folder if it also has the data. Instead you just access the Kaiaulu project folder and can check the data.

The concern with Anomaly Case 1 was that the refresher function may break if it assumes that all files within a folder belong together (ex: if it were to assume that all issues in the issue folder belong to the same issue_id)
The concern with Anomaly Case 2 was how the downloader might react if issues were migrated from jira to github.

Both Anomaly Case 1 & 2 would be addressed by adding an extra depth to the file hierarchy to specify each project key.

For the conf/kaiaulu.yml file, this would look like:

# mailing_list:
#   mod_mbox: 
#     mail_key_1:
#        archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
#        mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
#        mailing_list: geronimo-dev
#        archive_type: apache
#     mail_key_2:
#        archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
#        mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
#        mailing_list: geronimo-user
#        archive_type: apache
#   pipermail:
#     mail_key_1:
#        archive_url: http://some/pipermail/url
#        mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh/
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      issue_search: ../..rawdata/kaiaulu/github/issue_search/sailuh_kaiaulu/
      pull_request: ../../kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
    #   # Obtained from the project's GitHub URL
    #   owner: ssunoo2
    #   repo: kaiaulu
    #   # Download using `download_github_comments.Rmd`
    #   issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu/
    #   issue: ../../rawdata/kaiaulu/github/issue/ssunoo2_kaiaulu/
    #   refresh_issues: ../..rawdata/kaiaulu/github/refresh_issues/ssunoo2_kaiaulu
    #   pull_request: ../../kaiaulu/github/pull_request/ssunoo2_kaiaulu/
    #   commit: ../../rawdata/kaiaulu/github/commit/ssunoo2_kaiaulu/
  # bugzilla:
  #   project_key: kaiaulu

The unused portions are commented out while still maintaining the formatting.

@beydlern
Copy link
Collaborator

beydlern commented Sep 12, 2024

1. Purpose

The Notebook files (.Rmd files in the /vignettes folder) have hard-coded file paths connected to the config files in the /conf folder. For example, notebooks load in config files using variable assignment, such as git_repo_path <- conf[["version_control"]][["log"]]. Restructuring, extending, or editing the config files in /conf may sever the hard-coded connections in the Notebook files. In the example, if the name in the config version_control were changed to something else, then we would have to manually go into each notebook and edit the name to match the correct one to keep this variable assignment working properly. To decouple these hard-coded connections, a file R/config.R will contain a set of get() functions that will serve as a replacement for the hard-coded paths in the Notebooks. For example, git_repo_path <- config_file[["version_control"]][["log"]] will be changed to a call to a get function such as get_git_repo_path(config_file) where config_file is the name of the parsed config file. Thus, if all of the conf/ files require updating, then only the implementation of the get() functions in R/config.R need to be changed with respect to the changes made in the config files, and the notebooks do not need to be changed, because the get() function definition and return are the same as before. Implementation of these get() functions will employ the principle of segregation, information hiding, by protecting other parts of the program from substantial modification if the design decisions are changed (conf/ files are changed).

2. Process

The process will be to follow the previous solution, creating get() functions within the R/config.R file and then replace the direct variable assignments within the Notebooks with the R/config.R get() functions. I will primarily focus on implementing the get() functions in R/config.R and updating the notebooks (.Rmd files) in vignettes/ with the get() functions from R/config.R.

3. Task List

Step 1

  • Confirm the required get() methods

Step 2

  • Add any missing get() methods

API Nomenclature

The API nomenclature is as follows: most get() functions will take in a parameter config_file where this variable represents the parsed specified config file (config_file is obtained through a call to the get_conf(config_path) get function where config_path is the specified configuration file path such as conf/geronimo.yml). In addition, some get() functions will require an additional parameter project_key_index where this variable represents the index of the project key to discern from the different projects if the user were to run multi-project analysis. The function names will follow snake case, where every word is lowercase and they are separated by an underscore with the first word always get followed by a succinct descriptive few words that describe the get() function at a glance.

Ex: get_jira_issues_path(config_file, project_key_1): this get function will return the path to the jira issues in the geronimo.yml config file.

Step 3

  • Update notebooks (.Rmd files) in vignettes/ with new get() functions in R/config.R file

@carlosparadis
Copy link
Member Author

@beydlern

I now realize the other half of the issue lies on #286, which is why this task requires two of you to work on it, sorry for the confusion. You need to read through that to get a better idea of why this task requires two of you.

The purpose statement in #230 (comment) is a bit fuzzy. For example:

This means that by refactoring the folder or files within it, a lot of problems could occur unless you were to manually go in each Notebook and update any of the changed values.

What is the problem that occurs by hardcoding paths instead of using get() from a config.R?

R/config.R file will be able to automatically update any Notebooks should any changes occur.

I am not sure what you mean by config.R automatically updating the Notebooks. How would this occur?

Consider the purpose for Anthony's task to refine your specification on purpose. As for the get methods you need, which Anthony implemented a couple,


To put in a nutshell: The other issue mentions that the way Kaiaulu specify project configuration files should be extended, to account for more edge cases. To extend the specification of configs, that means all configs on conf/ should be updated. This is purely a documentation task, no code involved.

What you will notice, however, is that the moment you update these config files in conf/ another problem cascades: Because the paths in every notebook are hardcoded to specific paths in the config files, now every notebook has to be updated. If we ever need to extend the specification of the config again (which will happen to include other tools parameters), then once more, every notebook has to be updated. This is not a good organization.

To address this "config specification change cascades into every notebook being updated", we want to instead call the get() functions of conf.R at the start of the notebooks, which is the goal of this issue, instead of hardcode the content of the config files. This means if all conf/ files have to be updated again, then only the implementation of the get() functions in conf.R have to be updated for the portions of the config files that are changed (so a subset of the functions), the notebooks do not need to do so, since the get functions implementation change, but the function definition and return are the same.

The lesson here is Information Hiding. See: https://en.wikipedia.org/wiki/Information_hiding

With that understanding, I would like you both to have a task specification edited above, with what each of you will be working on. Your purpose statement is the same, but you should make it more clear who wil work on conf.R and who will update the notebooks (or you may split this in half and half, it is your call). Note @daomcgill and @RavenMarQ will need to understand your work, as they will have a notebook for the mail downloader, and one for scitools understand that depend on the new config format and your get functions.

p.s.: I would be mindful of pinging others on this issue who are not working on the task, as that may trigger e-mail notifications for them!

@beydlern
Copy link
Collaborator

@carlosparadis May you review the updated specifications?

@carlosparadis
Copy link
Member Author

@beydlern sorry, edits do not send notifications, did you just update or did I miss this for a few days?

@beydlern
Copy link
Collaborator

I edited my specifications today, would you like me to revert the edit and post it as new one?

@carlosparadis
Copy link
Member Author

No, it's alright. We can actually see the edits if we click the "edited" word in the message. Keep them in one comment. I will get back to you on it then, wanted to make sure I did not overlook!

@crepesAlot
Copy link
Collaborator

crepesAlot commented Oct 9, 2024

Unfortunately I attempted that, but it seems there was a limit to how low I'm allowed to expand the google images, and am unable to add anything below the tools section. I was hesitant to start shrinking everything down to make it fit, but if you'd prefer that, I can do so.
@beydlern is waiting on my commit to create and update get functions for bugzilla and scitools understanding.
I wanted to make sure that the current folder organization is good and do the scitools understanding config section before making the commit.

@beydlern
Copy link
Collaborator

beydlern commented Oct 9, 2024

@carlosparadis The GitHub PR #289 notebooks were updated to use the getter functions in config.R. These notebooks are currently using the new specification. Currently waiting for @crepesAlot to finish the config specification for Bugzilla so I may ensure that the Bugzilla notebook in the Bugzilla PR #293 follows the correct new specification.

@carlosparadis
Copy link
Member Author

@beydlern thank you for the update. Somewhere in the issue sea of comments I recall answering a question about the difference of issue search and just issue, was that clear or still confusing? Maybe it was @crepesAlot question.

crepesAlot added a commit that referenced this issue Oct 9, 2024
- config.R functions are organized to follow the same formatting as the conf files
- All conf files issue_trackers have been updated.
beydlern added a commit that referenced this issue Oct 9, 2024
- get_mbox_mailing_list() and get_mbox_archive_type() were removed as the keys that they are grabbing are scheduled for removal.
beydlern added a commit that referenced this issue Oct 9, 2024
- added two new bugzilla getter functions get_bugzilla_issue_comment_path and get_bugzilla_issue_path
beydlern added a commit that referenced this issue Oct 10, 2024
- Implemented two pipermail mailing list getters get_pipermail_domain and get_pipermail_path
beydlern added a commit that referenced this issue Oct 10, 2024
- Implemented two more getter functions: get_mbox_input_file() and get_pipermail_input_file()
beydlern added a commit that referenced this issue Oct 10, 2024
- Refactored the download_mail.Rmd notebook to expect the use of the getters from R/config.R (i #230 contains the getter functions in R/config.R).
crepesAlot added a commit that referenced this issue Oct 10, 2024
- Added Scitools understand section to tools
- Updated mailing_list
beydlern added a commit that referenced this issue Oct 10, 2024
- Implemented 4 Getter functions: get_understand_code_language(), get_understand_keep_dependencies_type(), get_understand_project_path(), and get_understand_output_path().
- Changed get_code_language() and get_keep_dependencies_type() to get_depends_code_language() and get_depends_keep_dependencies_type(), respectively as well as updated the appropriate notebooks that used these getters.
beydlern added a commit that referenced this issue Oct 10, 2024
- Refactored the understand_showcase.Rmd notebook to expect the use of the getters from R/config.R (i #230 contains the getter functions in R/config.R).
crepesAlot added a commit that referenced this issue Oct 11, 2024
- Updated NEWS.md
- Minor change to mailing_list
@carlosparadis
Copy link
Member Author

@crepesAlot the updated #230 (comment) is not here yet, did you create it?

@crepesAlot
Copy link
Collaborator

@carlosparadis Here is a screenshot of the powerpoint version of the folder organization. The google drawings version has also been updated.
Kaiaulu Folder Organization
Here is a link to the powerpoint slide:
https://1drv.ms/p/c/abff49bde83c4541/EVMVEyD7CxVPpX7DcvCKv7MB6yoQKqB3UINnOuXcBWfqrg?e=nb9QKG

beydlern added a commit that referenced this issue Oct 18, 2024
- The project configuration sections of each notebook was incorrectly using the project directory (kaiaulu/) as their working directory rather than the directory that they reside in (/vignettes/) as their working directory.
- Uncommented the output_filepath line for pattern4 tool in junit5.yml configuration file.
- Uncommented the mod_mbox mailing list section for mailing_list in openssl.yml configuration file.
@carlosparadis carlosparadis changed the title Abstract Config file parsing in R/config.R (Milestone 2) Abstract Config file parsing in R/config.R Nov 11, 2024
carlosparadis added a commit that referenced this issue Nov 12, 2024
Originally, it was assumed that .git could reside
folder-less to be accessed. While this is true, it
adds inconvenience to the user since when data is obtained
with .git, it is gererally through git clone, which
also adds the source code. The consequential effect is that
the path will be project/git_repo/project/.git rather than
project/git_repo/.git.

Not all configs were updated to adjust for this, only the
ones used in the notebooks.

Signed-off-by: Carlos Paradis <[email protected]>
carlosparadis pushed a commit that referenced this issue Nov 12, 2024
This commits perform a major refactoring of how Kaiaulu interface with config files, and the suggested folder organization to store rawdata and analysis. 

The configuration files are generalized to account for anomaly cases when performing project analysis. For instance, long-lived projects may contain multiple repositories, issue trackers, mailing list, etc. The new template of the configuration file allows to account for this information. 

Moreover, changes to the config template cascaded in changes to all notebooks, as the access to the config was hardcoded to the file organization. A new set of get_ functions should make this the last commit that change in template cascades into notebooks. All actively maintained notebooks  (not prefixed by underline under vignettes/) have been updated to use the get functions. Future changes, therefore, will only affect the get() functions in R/config.R.

The folder organization of the filepaths has also been modified. Previously, filepaths assumed as default in the versioned config files suggested organizing code as rawdata/git_repo/projectX ; rawdata/jira/projectY. This organization was not practical for sharing data manually, as the user would need to zip several folders individually. The new organization is now rawdata/projectX/git_repo ; rawdata/projectX/jira. This means users only need to zip projectX and that will contain all the data wanted to be shared.

A minor typo on graph.R was also fixed for merge function calls from `sorted=` to `sort=`.
carlosparadis added a commit that referenced this issue Nov 12, 2024
This commits perform a major refactoring of how Kaiaulu interface with config files, and the suggested folder organization to store rawdata and analysis.

The configuration files are generalized to account for anomaly cases when performing project analysis. For instance, long-lived projects may contain multiple repositories, issue trackers, mailing list, etc. The new template of the configuration file allows to account for this information.

Moreover, changes to the config template cascaded in changes to all notebooks, as the access to the config was hardcoded to the file organization. A new set of get_ functions should make this the last commit that change in template cascades into notebooks. All actively maintained notebooks  (not prefixed by underline under vignettes/) have been updated to use the get functions. Future changes, therefore, will only affect the get() functions in R/config.R.

The folder organization of the filepaths has also been modified. Previously, filepaths assumed as default in the versioned config files suggested organizing code as rawdata/git_repo/projectX ; rawdata/jira/projectY. This organization was not practical for sharing data manually, as the user would need to zip several folders individually. The new organization is now rawdata/projectX/git_repo ; rawdata/projectX/jira. This means users only need to zip projectX and that will contain all the data wanted to be shared.

A minor typo on graph.R was also fixed for merge function calls from `sorted=` to `sort=`.

Signed-off-by: Carlos Paradis <[email protected]>
Co-authored-by: Carlos Paradis <[email protected]>
Co-authored-by: Anthony Lau <[email protected]>
Co-authored-by: Nicholas Beydler <[email protected]>
Co-authored-by: Mark Burgess <[email protected]>
@carlosparadis
Copy link
Member Author

@crepesAlot I had to make a few modifications on the actual config. Could you take a look on Kaiaulu and Helix config on my last two commits to see the differences? Here's two examples. It has mostly to do with the inner folders. Also the current use of /.git towards the end without it being in a folder with git log. While editing I realized that would not work well, because when a user downloads .git data, they generally use git clone. That means the source code comes along with the .git, which in turn means the folder of the source code always sit above one level of .git (I do not have a screenshot for this here, if not clear you can ask me on call).

In addition to that you can see I reverted the deepest folder:

Image

Image

We may need to do some edits on the diagram to reflect this. Let me know what you think. @beydlern pinging you too on this since you worked on this on M1 to make sure we are on the same page. This may also affect @daomcgill PR. Will take a look on it next.

This may need another pass on the configs to fix the consistency (I only had time to fix the ones Kaiaulu needs on the notebooks that run).

crepesAlot added a commit that referenced this issue Nov 13, 2024
- Moved .git file into a subfolder within the git_repo folder, named after the project.
- Made changes for class_folder_path in junit5.yml consistent with all configs.
- Made changes for issue_tracker folder paths to be consistent across all configs.
@crepesAlot
Copy link
Collaborator

@carlosparadis I pushed a commit where I matched the changes across all configs to make them consistent.
I also updated this comment with the diagram, if you could take a look at it before I try updating the powerpoint version too.
I do have one question about one of the changes you made.
In the junit5.yml configuration file, you updated a line in the pattern4 tool section to:

class_folder_path: ../../rawdata/junit5/git_repo/junit5/junit-platform-engine/build/classes/java/main/org/junit/platform/engine/

The part I am particularly confused on is the: /junit5/git_repo/junit5/ section, I get why it is placed in the git_repo folder, but is there a specific reason as to why it repeats the junit5 a second time? Can I change that to /junit5/git_repo/pattern4/ to match the diagram? Should I change the diagram to match your version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:refactoring Code changes that improve maintenability, performance, etc
Projects
None yet
Development

No branches or pull requests

4 participants