
Add last changes from feedback
kequach committed Oct 10, 2022
1 parent ee2d07a commit c6b57d8
Showing 8 changed files with 28 additions and 97 deletions.
Binary file modified Master_thesis.pdf
Binary file not shown.
3 changes: 1 addition & 2 deletions tex/chapters/2_literature.tex
@@ -10,7 +10,7 @@ \section{Technical background knowledge}
This is usually done by \textit{forking} a repository, which creates a copy of it under the contributor's own account. The contributor then \textit{commits} and \textit{pushes} changes to the forked repository and \textit{merges} them into the original one via pull requests. GitHub also offers the option to follow users to get notified about their activity.
GitHub also offers many other features for free, like \textit{GitHub Actions}, which enables integrated \acrfull{cicd} pipelines, and automatic detection of citation and license files, leading to machine-readable metadata. There is also a \textit{REST API} \cite{github_rest}, which allows us to extract information about users and repositories together with all available metadata. Such REST APIs use the HTTP protocol to retrieve and modify data. To make things easier, we use ghapi \cite{noauthor_ghapi_nodate}, a Python wrapper for the GitHub REST API, in the \acrshort{swordsuu} framework.
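As a minimal sketch of this extraction step, the snippet below pulls a few metadata fields out of a repository record. The response here is a mocked dictionary with invented values, not real SWORDS-UU output; a real call would go through ghapi (e.g. `GhApi().repos.get(owner=..., repo=...)`) and return the same field names.

```python
# Mocked excerpt of a GitHub REST API repository response.
# A real call would be, e.g.:
#   from ghapi.all import GhApi
#   repo = GhApi().repos.get(owner="some-owner", repo="some-repo")
mock_response = {
    "id": 123456,                              # invented repository id
    "full_name": "example-org/example-repo",   # hypothetical repository
    "license": {"spdx_id": "MIT"},
    "forks_count": 4,
    "stargazers_count": 17,
}

def extract_metadata(repo: dict) -> dict:
    """Keep only the fields relevant for the later analysis."""
    return {
        "repository_id": repo["id"],
        "owner": repo["full_name"].split("/")[0],
        "has_license": repo.get("license") is not None,
        "forks": repo["forks_count"],
        "stargazers": repo["stargazers_count"],
    }

print(extract_metadata(mock_response))
```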

% explain difference metric / variable / FAIR variable

In the remainder of the thesis, we will use the terms \textit{metric} and \textit{(\acrshort{fair}) variable}. A variable is any kind of captured data, such as the owner of the repository or the repository ID. A \acrshort{fair} variable relates to concepts of FAIRness, openness, and sustainability; examples are the availability of a license or of citation information. A \textit{metric} is any numeric variable, like the number of issues, forks, or stargazers. A metric is always a variable and can also be a \acrshort{fair} variable, but a (\acrshort{fair}) variable is not necessarily a metric.
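This terminology can be illustrated with a toy example (all variable names and the FAIR labeling below are invented for illustration): every metric is a variable, and a variable may additionally be a FAIR variable.

```python
# Invented example variables captured for one repository.
variables = {
    "owner": "example-user",   # variable, but not a metric
    "has_license": True,       # boolean FAIR variable
    "n_forks": 4,              # metric (numeric variable)
    "n_contributors": 3,       # metric; also labeled FAIR in this sketch
}

# Assumed FAIR labeling for the sketch.
FAIR_VARIABLES = {"has_license", "n_contributors"}

def is_metric(value) -> bool:
    """A metric is any numeric variable (booleans are excluded)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

metrics = {name for name, value in variables.items() if is_metric(value)}
fair = {name for name in variables if name in FAIR_VARIABLES}
```

Note that `n_contributors` lands in both sets, while `owner` is in neither, mirroring the relationship described above.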

The \acrfull{fair4rs}-subgroup3 reviewed definitions of research software and provided their own definition, which will be used for the remainder of the study \cite{gruenpeter_defining_2021}:
@@ -63,7 +63,6 @@ \section{FAIR principles for Research Software}
\item Definition of research software to provide context \cite{gruenpeter_defining_2021}. This definition is used throughout the study.
\item Review of new research related to \acrshort{fair} Software since the release of the paper “Towards \acrshort{fair} principles for research software”, as well as the paper itself \cite{lamprecht_towards_2020, wg_fair4rs_2021, chue_hong_what_2021}.
\end{enumerate}
% The results of the subgroups’ work were discussed by Katz et al. \cite{katz_fair4rs_2021-1}, which included a community consultation. A summary of the findings follows in \autoref{sec:lit:fair4rs}.
Three additional subgroups were launched afterwards in September 2021 \cite{chue_hong_fair4rs_2022} to review adoption guidelines for \acrshort{fair4rs} principles \cite{martinez_survey_2022}, (early) adoption support \cite{martinez-ortiz_fair4rs_2022}, and governance \cite{honeyman_subgroup_nodate}.
There are also related initiatives whose main focus is not \acrshort{fair} research software. However, since \acrshort{fair} research software is related to \acrshort{fair} data, they have also contributed to this subject:
\begin{itemize}
7 changes: 0 additions & 7 deletions tex/chapters/3_researchmethod.tex
@@ -52,8 +52,6 @@ \subsection{User collection}
\label{fig:phase1-results}}
\end{figure}

% TODO Note: It would be a possible improvement to parse the mail in addition to name as most mails are based on either a firstname.lastname or lastname.firstname schema

\newpage
\subsection{Repository collection}

@@ -172,8 +170,6 @@ \subsection{Variable collection}
}


\vspace{-.25cm}
% \newpage
\section{Data analysis}
\label{sec:dataexplore}
To answer subquestion 2, we examined the Jaccard similarity coefficient \cite{kosub_note_2016} of the howfairis variables and the new \acrshort{fair} variables derived from the literature, as well as their percentages for research and non-research software.
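The Jaccard coefficient of two boolean variables can be computed over the sets of repositories for which each variable is true, $J(A,B) = |A \cap B| / |A \cup B|$. A minimal sketch on mock repository IDs (the sets below are invented, not thesis data):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Mock sets: repositories for which each boolean FAIR variable is True.
has_license = {"repo1", "repo2", "repo3"}
has_citation = {"repo2", "repo3", "repo4"}

print(jaccard(has_license, has_citation))  # 2 shared / 4 total = 0.5
```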
@@ -184,8 +180,6 @@ \section{Data analysis}
The Kruskal-Wallis test is the non-parametric alternative to one-way \acrfull{anova}.
An \acrshort{anova} tests equality of means, while a Kruskal-Wallis test compares mean ranks; it is more general but less powerful. However, \acrshort{anova} assumes normally distributed data, an assumption our variables do not meet. Dunn's test then compares the groups pairwise and determines which of them differ significantly.
An alternative approach would be a non-parametric multivariate analysis of variance, of which multiple methods exist \cite{anderson_new_2001, katz_multivariate_1980}. There are different justifications for choosing either approach, and they address different research questions \cite{huberty_multivariate_1989}. Drawbacks of multiple univariate tests are that they ignore the increased precision of pooled variance estimates, decreasing inference reliability, as well as the correlated error structure, which a multivariate model takes into account \cite{alexis_answer_2015}. However, since our interest lies in identifying the metrics in which differences exist, a multiple univariate approach is more applicable. Huberty and Morris \cite{huberty_multivariate_1989} also note that multiple univariate analyses are suitable for exploratory research.
% Ordering of variables would require taking intercorrelation into account
% degrees of freedom

Subquestion 5 was answered with the help of two machine learning classification models: \textit{logistic lasso regression} and \textit{random forest}. These were trained on 80\% of the data and tested on the remaining 20\%. The models included all metrics and \acrshort{fair} variables. Metrics were scaled to values between zero and one, which allowed us to compare variables via the logistic regression coefficients, as all variables are then on the same scale. We used the Python package scikit-learn \cite{scikit-learn} for this part of the analysis.
For describing the models, we will use the term \textit{features} for independent variables and \textit{class} for the categorization of whether a repository is considered research software or not.
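A hedged sketch of this setup, using synthetic data in place of the repository features (the variables, target, and random seeds below are invented; only the pipeline shape — min-max scaling, an 80/20 split, logistic lasso regression, and a random forest — follows the description above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.random((200, 5))                   # mock metrics / FAIR variables
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # mock "is research software" class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale metrics to [0, 1] so logistic coefficients are comparable.
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# L1 penalty gives the "lasso" variant of logistic regression.
lasso = LogisticRegression(penalty="l1", solver="liblinear")
lasso.fit(X_train_s, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train_s, y_train)

print(lasso.score(X_test_s, y_test), forest.score(X_test_s, y_test))
```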
@@ -211,7 +205,6 @@ \section{Data analysis}
\newpage
\section{Validity}
\label{sec:valid}
% Validity: Threats are search process and selection criteria, focus only on gitHub; Dataset size
Several threats to validity concern this study.
Contributions to research software not owned by \acrshort{uu}-affiliated users are not captured, meaning we could not observe whether a researcher contributed to existing open research software.
The user search process may also be flawed, resulting in a biased capture of \acrshort{uu} employees.
4 changes: 0 additions & 4 deletions tex/chapters/4_results.tex
@@ -176,7 +176,6 @@ \section{Repository characteristics}
\vspace{-0.3cm}

Descriptive statistics of all metrics can be seen in \autoref{tab:all_faculties}. The skewness and kurtosis show us that all metrics are heavily skewed and long-tailed, albeit to different degrees. The statistics per faculty can be seen in \autoref{app:stats_faculty}. These show us that the median for all metrics except life span is relatively similar across all faculties.
% TODO: write more?

\input{tables/repo_stats_all_faculties}

@@ -200,7 +199,6 @@ \section{\acrshort{fair} variables}
\autoref{fig:heatmap_fair_booleans} shows the heatmaps for the average percentage of each boolean \acrshort{fair} variable per faculty. The upper plot shows the averages for research software, while the lower one shows averages for non-research software. We can immediately see from this that every \acrshort{fair} variable has a higher percentage for research software than for non-research software. Large absolute differences can be seen in \textit{has license, correct vcs usage, has install instructions, has tests, and version identifiable}.
Social Sciences has the highest percentage of licenses across both classes for research software, while Geosciences has the lowest. Social Sciences also has a high percentage of repositories that are registered, have citation enabled, or contain a checklist, compared to other faculties. Geosciences has the lowest average percentage of correct vcs usage, meaning that more than 20\% of its research software repositories had commits only within a single day. Social Sciences and Support departments have a high percentage of active repositories, while other faculties are generally less active. Support departments have by far the highest percentage of install instructions and usage examples. For contribution guidelines, Social Sciences and Support departments again share a similarly high percentage, while other faculties have only a minuscule amount. Geosciences and Humanities, the two faculties with the lowest percentage, are also the ones with the fewest contributors. Interestingly, the Science faculty has the fewest tests and identifiable versions on average.

% TODO: --> followup for discussion, Geo might need more support for actual use of GitHub and its features (licensing, version control usage)
\begin{figure}[h!]
\centerline{
\includegraphics[scale=0.53]{figures_results/heatmap_fair_booleans.png}}
@@ -247,7 +245,6 @@ \section{Research software classification}
\autoref{fig:stats_confusion_matrices} shows the confusion matrices for both model predictions and chance predictions for the research software classification. We can see from this that random forest performed better than logistic regression since there are fewer false negatives and false positives.

\begin{figure}[h!]
% \vspace*{-1cm}
\centerline{
\includegraphics[scale=0.5]{figures_results/stats_confusion_matrices.png}}
\caption{Confusion matrices for logistic regression, random forest, and chance.
@@ -259,7 +256,6 @@ \section{Research software classification}
Based on the confusion matrix, further performance measures are calculated in \autoref{fig:stats_barplot_scores}, which shows the performance measures for both model and chance predictions. Both models outperform the chance classification in accuracy and precision. Random forest achieves a higher score than logistic regression and chance in all performance measures except recall, which is expected. A random forest can exploit complex relationships between variables better than a logistic regression, in which an increase in a variable's value always increases or decreases the probability of a positive classification. This shows that the variables used improve classification compared to a chance classifier, that we can improve research software identification accuracy by 16 percentage points, and that a random forest is more suitable than logistic regression for this classification task.
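How these measures follow from the confusion matrix can be sketched on mock predictions (1 = research software; the labels below are invented, not the thesis results):

```python
# Mock ground truth and predictions for ten repositories.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Confusion-matrix cells.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)                 # 0.8
precision = tp / (tp + fp)                         # 0.75
recall = tp / (tp + fn)                            # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print(accuracy, precision, recall, f1)
```

Fewer false positives raise precision and fewer false negatives raise recall, which is why the model with fewer of both cells outperforms the other across most measures.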

\begin{figure}[h!]
% \vspace*{-1cm}
\centerline{
\includegraphics[scale=0.5]{figures_results/stats_barplot_scores.png}}
\caption{Performance measures of logistic regression, random forest, and chance.
