diff --git a/Master_thesis.pdf b/Master_thesis.pdf
index 4dfffce..2116ca2 100644
Binary files a/Master_thesis.pdf and b/Master_thesis.pdf differ
diff --git a/tex/chapters/2_literature.tex b/tex/chapters/2_literature.tex
index 42c3c68..7ac26ba 100644
--- a/tex/chapters/2_literature.tex
+++ b/tex/chapters/2_literature.tex
@@ -10,7 +10,7 @@ \section{Technical background knowledge}
 This is usually done by \textit{forking} a repository, which essentially means cloning it. Afterwards, the contributor \textit{commits} and \textit{pushes} changes on the forked repository and \textit{merges} the changes into the original one via pull requests. The platform also offers the opportunity to follow people to get notified about their activity. GitHub also offers many other features for free, like \textit{GitHub Actions}, which allows for integrated \acrfull{cicd} pipelines, and automatic detection of citation and license files, leading to machine-readable metadata. There is also a \textit{REST API} \cite{github_rest} which allows us to extract information about users and repositories, including all accompanying metadata that is available. Such REST APIs use the HTTP protocol to receive and modify data. To make things easier, we use ghapi \cite{noauthor_ghapi_nodate} in the \acrshort{swordsuu} framework, a Python wrapper for the GitHub REST API.
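+As a minimal sketch of how such metadata can be retrieved (assuming the ghapi package is installed; the queried repository is merely an illustrative public example, not the framework's actual collection code):
+\begin{verbatim}
+from ghapi.all import GhApi
+
+# Anonymous access works for public data; pass token=... for higher rate limits.
+api = GhApi()
+
+# Fetch repository metadata for an illustrative public owner/repo pair.
+repo = api.repos.get(owner="UtrechtUniversity", repo="SWORDS-UU")
+print(repo.stargazers_count, repo.created_at, repo.license)
+\end{verbatim}
+The returned object exposes the metadata fields of the REST API response, so variables such as stargazers, creation date, or license information can be read directly.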
-% explain difference metric / variable / FAIR variable
+In the remainder of the thesis, we will use the terms \textit{metric} and \textit{(\acrshort{fair}) variable}. A variable is any kind of captured data, like the owner of the repository or the repository id. A \acrshort{fair} variable is related to concepts of FAIRness, openness, and sustainability. Examples of \acrshort{fair} variables are the availability of a license or citation information. A \textit{metric} is any kind of numeric variable, like the number of issues, forks, or stargazers. A metric is always a variable and can also be a \acrshort{fair} variable, but a (\acrshort{fair}) variable does not always have to be a metric.
 
 The \acrfull{fair4rs}-subgroup3 reviewed definitions of research software and provided their own definition, which will be used for the remainder of the study \cite{gruenpeter_defining_2021}:
@@ -63,7 +63,6 @@ \section{FAIR principles for Research Software}
 \item Definition of research software to provide context \cite{gruenpeter_defining_2021}. This definition is used throughout the study.
 \item Review of new research related to \acrshort{fair} Software since the release of the paper “Towards \acrshort{fair} principles for research software”, as well as the paper itself \cite{lamprecht_towards_2020, wg_fair4rs_2021, chue_hong_what_2021}.
 \end{enumerate}
-% The results of the subgroups’ work were discussed by Katz et al. \cite{katz_fair4rs_2021-1}, which included a community consultation.
 A summary of the findings follows in \autoref{sec:lit:fair4rs}. Three additional subgroups were launched afterwards in September 2021 \cite{chue_hong_fair4rs_2022} to review adoption guidelines for \acrshort{fair4rs} principles \cite{martinez_survey_2022}, (early) adoption support \cite{martinez-ortiz_fair4rs_2022}, and governance \cite{honeyman_subgroup_nodate}. There are also related initiatives whose main focus does not lie on \acrshort{fair} research software. However, since \acrshort{fair} research software is related to \acrshort{fair} data, they also contributed to this subject:
 \begin{itemize}
diff --git a/tex/chapters/3_researchmethod.tex b/tex/chapters/3_researchmethod.tex
index 6f386f5..8c1ed7b 100644
--- a/tex/chapters/3_researchmethod.tex
+++ b/tex/chapters/3_researchmethod.tex
@@ -52,8 +52,6 @@ \subsection{User collection}
 \label{fig:phase1-results}}
 \end{figure}
 
-% TODO Note: It would be a possible improvement to parse the mail in addition to name as most mails are based on either a firstname.lastname or lastname.firstname schema
-
 \newpage
 \subsection{Repository collection}
@@ -172,8 +170,6 @@ \subsection{Variable collection}
 }
 
-\vspace{-.25cm}
-% \newpage
 \section{Data analysis}
 \label{sec:dataexplore}
 To validate subquestion 2, we looked at the Jaccard similarity coefficient \cite{kosub_note_2016} of the howfairis variables and the new \acrshort{fair} variables derived from literature, as well as their percentages for research and non-research software.
@@ -184,8 +180,6 @@ \section{Data analysis}
 The Kruskal-Wallis test represents the non-parametric alternative to one-way \acrfull{anova}. An \acrshort{anova} tests equality of means, while a Kruskal-Wallis test compares mean ranks. It is a more general but less powerful test. However, \acrshort{anova} assumes normally distributed data, an assumption our variables do not fulfill. Dunn’s test compares the mean ranks of the groups pairwise and determines which groups are significantly different. An alternative approach would be to use a non-parametric multivariate analysis of variance, of which multiple methods exist \cite{anderson_new_2001, katz_multivariate_1980}. There are different justifications for choosing either approach, and they address different research questions \cite{huberty_multivariate_1989}. Some drawbacks are that multiple univariate tests ignore the increased precision of pooled variance estimates, which decreases inference reliability, and that they do not estimate the correlated error structure, which a multivariate model takes into account \cite{alexis_answer_2015}. However, since we are interested in knowing in which metrics the differences exist, a multiple univariate approach is more applicable. Huberty and Morris \cite{huberty_multivariate_1989} also mention that multiple univariate analysis is applicable for exploratory research.
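+As a minimal sketch of how these tests can be run in Python, assuming pandas and SciPy are available and taking Dunn's test from the scikit-posthocs package (the dataframe and its column names are illustrative, not the study's actual data):
+\begin{verbatim}
+import pandas as pd
+from scipy.stats import kruskal
+import scikit_posthocs as sp
+
+# Illustrative data: one row per repository, with a faculty label and a metric.
+repos = pd.DataFrame({
+    "faculty": ["Science", "Science", "Geosciences",
+                "Geosciences", "Humanities", "Humanities"],
+    "contributors": [3, 7, 1, 1, 2, 4],
+})
+
+# Kruskal-Wallis: do the mean ranks of the metric differ across faculties?
+groups = [g["contributors"].values for _, g in repos.groupby("faculty")]
+h_statistic, p_value = kruskal(*groups)
+
+# Dunn's test: which pairs of faculties differ significantly?
+pairwise_p = sp.posthoc_dunn(repos, val_col="contributors",
+                             group_col="faculty", p_adjust="bonferroni")
+\end{verbatim}
+The p-value adjustment method here is an assumption; any multiple-comparison correction supported by the package could be used instead.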
-% Ordering of variables would require taking intercorrelation into account
-% degrees of freedom
 Subquestion 5 was answered with the help of two machine learning classification models: \textit{logistic lasso regression} and \textit{random forest}. These were trained and tested on 80\% and 20\% of the data, respectively. The models included all metrics and \acrshort{fair} variables. Metrics were scaled to values between zero and one. This allowed us to draw comparisons between variables from the logistic regression coefficients, as all variables are then on the same scale. We used the Python package scikit-learn \cite{scikit-learn} for this part of the analysis. For describing the models, we will use the term \textit{features} for independent variables and \textit{class} for the categorization of whether a repository is considered research software or not.
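+A minimal sketch of this setup with scikit-learn follows; the synthetic data, hyperparameters, and random seed are illustrative stand-ins, not the exact values used in the study:
+\begin{verbatim}
+from sklearn.datasets import make_classification
+from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import MinMaxScaler
+from sklearn.linear_model import LogisticRegression
+from sklearn.ensemble import RandomForestClassifier
+
+# Synthetic stand-in: in the study, X holds the metrics and boolean FAIR
+# variables, y the class (1 = research software, 0 = not research software).
+X, y = make_classification(n_samples=200, n_features=8, random_state=42)
+
+# 80% training data, 20% test data.
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42)
+
+# Scale features to [0, 1] so logistic regression coefficients are comparable.
+scaler = MinMaxScaler().fit(X_train)
+X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
+
+# Logistic lasso regression: an L1 penalty requires a compatible solver.
+lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X_train, y_train)
+forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
+\end{verbatim}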
@@ -211,7 +205,6 @@ \section{Data analysis}
 \newpage
 \section{Validity}
 \label{sec:valid}
-% Validity: Threats are search process and selection criteria, focus only on gitHub; Dataset size
 There were several threats to the validity of this study. Contributions to research software that \acrshort{uu}-affiliated users do not own are not captured. This means we could not capture whether a researcher contributed to existing open research software. The user search process may be flawed, resulting in a biased capture of \acrshort{uu} employees.
diff --git a/tex/chapters/4_results.tex b/tex/chapters/4_results.tex
index 7421af7..e7eae50 100644
--- a/tex/chapters/4_results.tex
+++ b/tex/chapters/4_results.tex
@@ -176,7 +176,6 @@ \section{Repository characteristics}
 \vspace{-0.3cm}
 Descriptive statistics of all metrics can be seen in \autoref{tab:all_faculties}. The skewness and kurtosis show us that all metrics are heavily skewed and long-tailed, albeit to different degrees. The statistics per faculty can be seen in \autoref{app:stats_faculty}. These show us that the median for all metrics except life span is relatively similar across all faculties.
-% TODO: write more?
 
 \input{tables/repo_stats_all_faculties}
@@ -200,7 +199,6 @@ \section{\acrshort{fair} variables}
 \autoref{fig:heatmap_fair_booleans} shows the heatmaps for the average percentage of each boolean \acrshort{fair} variable per faculty. The upper plot shows the averages for research software, while the lower one shows averages for non-research software. We can immediately see from this that every \acrshort{fair} variable has a higher percentage for research software than for non-research software. Large absolute increases can be seen in \textit{has license, correct vcs usage, has install instructions, has tests, and version identifiable}. Social Sciences have the highest percentage of licensed research software repositories across both classes, while Geosciences have the lowest. Social Sciences also have a high percentage of repositories that are registered, have citation enabled, or contain a checklist, compared to other faculties. Geosciences have the lowest average percentage of correct vcs usage, meaning that more than 20\% of the research software repositories had commits within only a single day. Social Sciences and Support departments have a high percentage of active repositories, while other faculties are generally less active. Support departments, by far, have the highest percentage of install instructions and usage examples. For contribution guidelines, Social Sciences and Support departments are again at a similarly high percentage, while other faculties show only minuscule percentages. Geosciences and Humanities, the two faculties with the lowest percentage, are also the ones with the fewest contributors. Interestingly, the Science faculty has the lowest average percentage of tests and identifiable versions.
-% TODO: --> followup for discussion, Geo might need more support for actual use of GitHub and its features (licensing, version control usage)
 \begin{figure}[h!]
 \centerline{ \includegraphics[scale=0.53]{figures_results/heatmap_fair_booleans.png}}
@@ -247,7 +245,6 @@ \section{Research software classification}
 \autoref{fig:stats_confusion_matrices} shows the confusion matrices for both model predictions and chance predictions for the research software classification. We can see from this that random forest performed better than logistic regression, since there are fewer false negatives and false positives.
 \begin{figure}[h!]
-% \vspace*{-1cm}
 \centerline{ \includegraphics[scale=0.5]{figures_results/stats_confusion_matrices.png}}
 \caption{Confusion matrices for logistic regression, random forest, and chance.
@@ -259,7 +256,6 @@ \section{Research software classification}
 Based on the confusion matrix, further performance measures are calculated in \autoref{fig:stats_barplot_scores}, which shows the performance measures for both model and chance predictions. Both models outperform the chance classification in accuracy and precision. Random forest achieves higher scores than logistic regression and chance in all performance measures except recall, which is expected, as the majority-class chance prediction trivially achieves perfect recall. A random forest can better utilize complex variables than a logistic regression, in which an increase in a variable value always increases or decreases the probability of a positive classification. This shows that the used variables improve classification compared to a chance classifier, that we can improve research software identification accuracy by 16 percentage points, and that a random forest is more suitable than logistic regression for this classification task.
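+Continuing the illustrative sketch from the method chapter, such measures can, for example, be derived with scikit-learn (the objects \texttt{lasso}, \texttt{forest}, \texttt{X\_test}, and \texttt{y\_test} refer to that earlier sketch, not to the study's actual models):
+\begin{verbatim}
+from sklearn.metrics import (accuracy_score, confusion_matrix,
+                             precision_score, recall_score)
+
+# Evaluate both fitted models on the held-out 20% test split.
+for name, model in [("lasso", lasso), ("random forest", forest)]:
+    y_pred = model.predict(X_test)
+    print(name, confusion_matrix(y_test, y_pred), sep="\n")
+    print(accuracy_score(y_test, y_pred),
+          precision_score(y_test, y_pred),
+          recall_score(y_test, y_pred))
+\end{verbatim}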
 \begin{figure}[h!]
-% \vspace*{-1cm}
 \centerline{ \includegraphics[scale=0.5]{figures_results/stats_barplot_scores.png}}
 \caption{Performance measures of logistic regression, random forest, and chance.
diff --git a/tex/chapters/5_discussion.tex b/tex/chapters/5_discussion.tex
index 5d556ef..1fed5db 100644
--- a/tex/chapters/5_discussion.tex
+++ b/tex/chapters/5_discussion.tex
@@ -1,34 +1,27 @@
 \vspace{-1cm}
 \chapter{Discussion}
 \label{chap:discussion}
-% show why the results "solve" the problem and to what extent,  describe what your research or design adds to the field's knowledge, what remains to be done and what other problems may be triggered; give hints for further study
-% \begin{itemize}
-% \item How can information about open source publications on GitHub be used to infer actionable recommendations for \acrshort{rse} practice to improve the research software landscape of an organization?
-% \end{itemize}
-
 We discuss how the research questions are answered based on the results in \autoref{chap:results}. Subquestion 1 is related to the literature review and will not be further discussed. The \acrshort{fair} variables that were retrieved from the literature review for subquestion 2 are validated in \autoref{sec:disc:sq2}. \autoref{sec:disc:sq3_sq4} answers subquestions 3 and 4, which relate to subpopulation characteristics and support for the application of \acrshort{fair} variables. The last subquestion, which is about research software classification, is discussed in \autoref{sec:disc:sq5}. Lastly, we look at the limitations of our study in \autoref{sec:disc:limitations}.
-% \vspace{-0.3cm}
 \section{Validation of additional \acrshort{fair} variables}
 \label{sec:disc:sq2}
-% \item How can FAIR principles be supplemented with additional variables?
 To validate \textbf{subquestion 2} (\textit{how can FAIR principles be supplemented with additional variables?}), we looked at the Jaccard similarity coefficient of the \acrshort{fair} variables in \autoref{fig:jaccard} and the percentages for research and non-research software in \autoref{fig:heatmap_fair_booleans}. Since only a tiny percentage of repositories have information about registration, citation, and checklists, it is helpful to have additional measures of FAIRness. While all repositories should at least include license and citation information, it is not always sensible to register research software or scripts in a registry or to create a checklist. However, it can be argued that \acrshort{fair} variables should be applicable for all kinds of research software.
 Unlike the other proposed \acrshort{fair} variables, the commit-related variables \textit{correct vcs usage} and \textit{repository active} do not have a higher similarity with licensed repositories. This might indicate that these are not a good measure of FAIRness, since licensing published software is fundamental to it. Additionally, while it might be good practice in software development to commit frequently and use version control correctly, this aspect relates only marginally to the \acrshort{fair} principles, whose primary focus lies on measuring openness and sustainability. However, \textit{correct vcs usage} might serve as a good measure to indicate which researchers might need support with the usage of software tools. The other \acrshort{fair} variables relate more strongly to the \acrshort{fair} principles: findability is improved through clear version identifiability, accessibility through contribution guidelines, and reusability through install instructions, example usage, and tests. As such, we consider them valuable additions to the set of \acrshort{fair} variables. It does not seem useful to define faculty-specific \acrshort{fair} variables. For example, even if Geosciences differ significantly from all other faculties regarding the number of contributors, it still makes sense to include contribution guidelines in case someone wants to contribute. Having such a guideline improves FAIRness nonetheless.
-% \item How can the application of FAIR variables for research software be supported?
-% \item Are there different characteristics for different subpopulations in the data?
-
-% \vspace{-0.3cm}
 \section{Subpopulations and supporting application of \acrshort{fair} variables}
 \label{sec:disc:sq3_sq4}
 To answer \textbf{subquestion 3} (\textit{are there different characteristics for different subpopulations in the data?}) and \textbf{subquestion 4} (\textit{how can the application of FAIR variables for research software be supported?}), we looked at descriptive statistics, licenses, languages, topics, and statistical test analysis. We first address \textbf{subquestion 3}.
-\autoref{fig:heatmap_numeric_variables} provided an overview of how available metrics differed between the faculties and repository types. Further faculty-specific differences could then be seen for both license and language usage. Social Sciences use R in more than 70\% of their repositories. In contrast, other faculties use mainly Python to a lesser degree and a mix of other languages. \autoref{fig:heatmap_fair_booleans} revealed more differences regarding \acrshort{fair} variables, with \autoref{fig:fair_score} revealing a potential order of measured FAIRness across the faculties. This also empowers the argument that the \acrshort{rse} community is in a perfect position to support the improvement of FAIRness \cite{hasselbring_fair_2020}. They are part of the \textit{Support department}, which scored highest on average.
+
+
+\autoref{fig:heatmap_numeric_variables} provided an overview of how available metrics differed between the faculties and repository types. Further faculty-specific differences could then be seen for both license and language usage.
+Social Sciences use R in more than 70\% of their repositories. In contrast, the other faculties mainly use Python, though to a lesser degree, alongside a mix of other languages. However, Python and R comprise most of the research software created at \acrshort{uu}, which shows that support for these languages should be prioritized. Fostering local programming communities for these popular languages, as the \textit{R Data Café} \cite{noauthor_r_nodate} does, is useful for creating informal support structures. For that reason, we recommend creating a similar environment for Python, the most popular language at \acrshort{uu}, or incorporating Python into the R Data Café.
+\autoref{fig:heatmap_fair_booleans} revealed more differences regarding \acrshort{fair} variables, with \autoref{fig:fair_score} revealing a potential order of measured FAIRness across the faculties. This also empowers the argument that the \acrshort{rse} community is in a perfect position to support the improvement of FAIRness \cite{hasselbring_fair_2020}. They are part of the \textit{Support department}, which scored highest on average. As such, it would be wise to incorporate them into the informal support structures. It should be noted that the \acrshort{fair} score from \autoref{fig:fair_score} is currently designed such that each \acrshort{fair} variable has an equal weight. We have already discussed the importance of having a license, which should therefore be weighted higher than having a checklist. Pico et al. \cite{pico_fairsoft_2022} implemented such a weighting for their analysis.
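+As a minimal sketch of how such a weighting could look (the variable names and weights below are illustrative assumptions, not a validated scheme):
+\begin{verbatim}
+# Illustrative weights: licensing weighted highest, checklist lowest.
+WEIGHTS = {
+    "has_license": 3.0,
+    "has_citation": 2.0,
+    "version_identifiable": 1.5,
+    "has_checklist": 0.5,
+}
+
+def weighted_fair_score(repo):
+    """Weighted fraction of satisfied boolean FAIR variables, in [0, 1]."""
+    achieved = sum(w for var, w in WEIGHTS.items() if repo.get(var))
+    return achieved / sum(WEIGHTS.values())
+
+print(weighted_fair_score({"has_license": True, "has_citation": False,
+                           "version_identifiable": True, "has_checklist": False}))
+\end{verbatim}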
 The statistical tests confirmed that all metrics except life span are significantly different. This is particularly interesting since Hasselbring et al. \cite{hasselbring_open_2020} found the life span median to be vastly different across the two subpopulations they compared. However, they only presented the median value without a statistical test for significance. The tests also showed that Geosciences differ significantly from all other faculties regarding contributors, which means that they usually have less expertise available, possibly leading to the lowest \acrshort{fair} score.
@@ -38,26 +31,21 @@ \section{Subpopulations and supporting application of \acrshort{fair} variables}
 This finding becomes relevant in answering \textbf{subquestion 4}. In order to support the application of FAIR variables, we need to consider the FAIRness maturity level of the researchers. As we have seen, this seems to vary across the faculties. We can specifically examine Geosciences in different regards. Looking at \textit{licenses} in \autoref{fig:heatmap_licenses}, we see that most repositories in this faculty had no license, while other faculties performed considerably better in that regard. \autoref{fig:heatmap_fair_booleans} also showed that Geosciences have a lower percentage of correct vcs usage. As we have previously discussed regarding subquestion 2, this variable relates to good practices in software engineering. As such, this might indicate that Geosciences require more fundamental training in this topic and hands-on usage of GitHub. Such training should still be open for anybody to participate in but could be geared more towards the academic needs of Geosciences. A document- or video-based tutorial would possibly also be worth considering. However, while this may save resources, it also comes with drawbacks; a training seems best suited for this purpose, as it allows inexperienced researchers to ask questions and receive appropriate feedback. Understanding the tools that are used during the process of creating research software is a prerequisite to creating \acrshort{fair} research software.
 Coming back to the licenses, there would ideally be no repository without a license, since unlicensed software cannot be legally reused for any purpose. Therefore, awareness of this topic should be raised across all faculties, as it concerns them equally. However, it is also noteworthy that research-related repositories have more licenses on average compared to non-research-related repositories and to previously conducted analyses such as that of Russell et al. \cite{russell_large-scale_2018}.
-While the significance of other FAIR-related aspects is less extreme, they follow a similar logic in being relevant for all faculties. For the \acrshort{fair} variables in this dataset, this concerns \textit{has citation, has install instructions, has example usage, has contribution guidelines, has tests, version identifiable}. We see varying percentages across the variables and faculties, ranging from 1\% to 66\%. As such, these are aspects that can be improved everywhere. This could be accomplished by faculty-agnostic training or a document regarding best \acrshort{fair} practices. A document has the advantage of being easily editable, such that new additions or changes in best \acrshort{fair} practices can be incorporated with little effort to avoid the provision of outdated information.
-In addition, a yearly report on FAIRness that analyzes changes per faculty, similar to this study, might be helpful in tracking the effect of the implemented measures. It might also incentivize researchers to improve FAIRness, as such a report should provide transparent criteria on how to achieve this. This has been a big hurdle in improving FAIRness for many researchers, as stated in our findings from \acrshort{swordsuu} consultations and presentations in \autoref{sec:lit_fairrs}. Such a yearly report also becomes feasible since this study provides a labelled dataset and code for the analysis. If the phases are repeated, only new repositories need to be labelled, which is a minor effort compared to having no previously labelled data.
+While the significance of other FAIR-related aspects is less extreme, they follow a similar logic in being relevant for all faculties. For the \acrshort{fair} variables in this dataset, this concerns \textit{has citation, has install instructions, has example usage, has contribution guidelines, has tests, version identifiable}. We see varying percentages across the variables and faculties, ranging from 1\% to 66\%. As such, these are aspects that can be improved everywhere. Making research software more \acrshort{fair} could be accomplished by faculty-agnostic training or a document regarding best \acrshort{fair} practices. A document has the advantage of being easily editable, such that new additions or changes in best \acrshort{fair} practices can be incorporated with little effort to avoid the provision of outdated information.
-% Re: correlation metrics
-% TODO: discussion --> using topics for findability to increase contributions might be helpful
+Furthermore, \acrshort{fair} research software with high impact can be featured, for example in existing newsletters, to serve as a reference. High impact can be approximated with a high number of stargazers. We then additionally look at the \acrshort{fair} score as a second selection criterion and highlight aspects that show how these repositories have achieved not only high impact but also FAIRness.
+In addition, a yearly report on FAIRness that analyzes changes per faculty, similar to this study, might be helpful in tracking the effect of the implemented measures. It might also incentivize researchers to improve FAIRness, as such a report should provide transparent criteria on how to achieve this. This has been a big hurdle in improving FAIRness for many researchers, as stated in our findings from \acrshort{swordsuu} consultations and presentations in \autoref{sec:lit_fairrs}. Such a yearly report also becomes feasible since this study provides a labelled dataset and code for the analysis. If the phases are repeated, only new repositories need to be labelled, which is a minor effort compared to having no previously labelled data.
-% \item How can information about open source publications on GitHub be used to infer actionable recommendations for \acrshort{rse} practice to improve the research software landscape of an organization?
-% \vspace{-0.3cm}
 \section{Identifying research software}
 \label{sec:disc:sq5}
-% \item How well can we identify research software with available data?
 \autoref{sec:classification} answers \textbf{subquestion 5} (\textit{how well can we identify research software with available data?}). We used logistic regression and random forest to predict the class of the repositories. Our results showed that both models outperform the majority class prediction. Since random forest performs better than logistic regression across all performance measures, we assume this model is better suited for our purposes, dataset size, and available variables. Using such a classification model can be useful for future tasks. Relating back to the yearly report, it would first be necessary to label newly collected repositories each time. Using the model for prediction as a first automated labelling step reduces manual labor and will help in making such a yearly report even more feasible. The manual labelling process is time-consuming and error-prone. It should therefore be supported by automated solutions as much as possible.
 The features used in our models were all numeric or boolean variables. Further development of classifier models should consider including categorical variables, such as the given license or most used language. Also, it would be beneficial for future classification to develop more variables. One example would be to include the readme text as n-grams or in other forms. The classification could additionally be done for each subpopulation. However, considering the number of currently available repositories, we decided against this. This might be more suitable if future work can also incorporate previous university employees.
-% \vspace{-0.3cm}
 \section{Limitations of study}
 \label{sec:disc:limitations}
 This study only considers current researchers due to the API restrictions of the employee pages. If there were public access to previous employees, it would be possible to widen this study to include them as well.
diff --git a/tex/chapters/6_conclusion.tex b/tex/chapters/6_conclusion.tex
index bef2e69..76e8bf9 100644
--- a/tex/chapters/6_conclusion.tex
+++ b/tex/chapters/6_conclusion.tex
@@ -1,20 +1,6 @@
 %!TEX root = ../dissertation.tex
 \chapter{Conclusion}
 \label{chap:conclusion}
-% 1. Restate your thesis statement. Rephrase it so that slightly different from the thesis statement presented in the introduction and does not sound repetitive.
-% 2. Reiterate the key points of your work. To do this, go back to your thesis and extract the topic sentences of each main paragraph/argument. Rephrase these sentences and use them in your conclusion.
-% 3. Explain the relevance and significance of your work. These should include the larger implications of your work and showcase the impact it will have on society.
-% 4. End with a take-home message, such as a call to action or future direction.
-
-
-
-%%%%%%%%%%%%%%%%%%%
-% Answer main RQ
-% Summarize findings and implications for others
-% Summarize what others can take from the work
-% Outline future work
-%%%%%%%%%%%%%%%%%%%
-
 This study aimed to identify how open source publications on GitHub can be used to infer actionable recommendations for \acrshort{rse} practice to improve the research software landscape of an organization. In order to do so, we reviewed the \acrshort{fair} principles, identified suitable variables to measure FAIRness, and conducted an exploratory data analysis. The quantitative analysis was applied to \acrshort{uu} to determine different characteristics in the faculties, support for the application of \acrshort{fair} variables, and how well research software can be identified.
@@ -24,50 +10,8 @@ \chapter{Conclusion}
 Among other findings, we found that 57\% of Geosciences software is unlicensed, while the next highest percentage is much lower at 35\% for the Humanities. There is also a clear difference in language usage between Social Sciences, who primarily use R, and the other faculties, who primarily use Python. We additionally provided first models for classifying research software to facilitate future research software identification, achieving an accuracy of 70\%. The \acrshort{fair} data analysis for GitHub allows \acrshort{uu} to make data-based decisions, as a first analysis of the research software landscape at \acrshort{uu} and a first labelled dataset for reuse were provided.
-This has not only confirmed and refuted beliefs in existing literature, but also provided novel findings and laid the foundation for repeated analyses of this kind.
+This has not only confirmed and refuted beliefs in existing literature, but also provided novel findings and recommendations and laid the foundation for repeated analyses of this kind. The recommendations include expanding the R Data Café, creating \acrshort{fair} reference documents, featuring high-impact and FAIR research software, and creating yearly reports.
 Additionally, the \acrshort{swordsuu} framework was extended with additional validated \acrshort{fair} variables.
 There are several topics for future work, which are explained in detail in \autoref{chap:discussion}. Data collection can be improved in several ways, analysis can be extended and applied to more variations of subpopulations, and classification of research software can be refined.
-We conclude that the conducted analysis allows us to infer actionable recommendations for \acrshort{rse} practice and encourage others to reuse and improve the method.
-
-
-
-
-
-
-
-
-% Step 1: Answer your research question
-% Your conclusion should begin with the main question that your thesis or dissertation aimed to address. This is your final chance to show that you’ve done what you set out to do, so make sure to formulate a clear, concise answer.
-
-
-
-% Step 2: Summarize and reflect on your research
-% Your conclusion is an opportunity to remind your reader why you took the approach you did, what you expected to find, and how well the results matched your expectations.
-% To avoid repetition, consider writing more reflectively here, rather than just writing a summary of each preceding section. Consider mentioning the effectiveness of your methodology, or perhaps any new questions or unexpected insights that arose in the process.
-% You can also mention any limitations of your research, but only if you haven’t already included these in the discussion. Don’t dwell on them at length, though—focus on the positives of your work.
-% Example: Summarization sentence
-% While x limits the generalizability of the results, this approach provides new insight into y.
-% This research clearly illustrates x, but it also raises the question of y.
-
-
-
-% Step 3: Make future recommendations
-% You may already have made a few recommendations for future research in your discussion section, but the conclusion is a good place to elaborate and look ahead, considering the implications of your findings in both theoretical and practical terms.
-% Example: Recommendation sentence
-% Based on these conclusions, practitioners should consider …
-% To better understand the implications of these results, future studies could address …
-% Further research is needed to determine the causes of/effects of/relationship between …
-% When making recommendations for further research, be sure not to undermine your own work. Relatedly, while future studies might confirm, build on, or enrich your conclusions, they shouldn’t be required for your argument to feel complete. Your work should stand alone on its own merits.
-% Just as you should avoid too much self-criticism, you should also avoid exaggerating the applicability of your research. If you’re making recommendations for policy, business, or other practical implementations, it’s generally best to frame them as “shoulds” rather than “musts.” All in all, the purpose of academic research is to inform, explain, and explore—not to demand.
-
-
-% Step 4: Emphasize your contributions to your field
-% Make sure your reader is left with a strong impression of what your research has contributed to the state of your field.
-
-% Some strategies to achieve this include:
-
-% Returning to your problem statement to explain how your research helps solve the problem
-% Referring back to the literature review and showing how you have addressed a gap in knowledge
-% Discussing how your findings confirm or challenge an existing theory or assumption
-% Again, avoid simply repeating what you’ve already covered in the discussion in your conclusion. Instead, pick out the most important points and sum them up succinctly, situating your project in a broader context.
\ No newline at end of file
+We conclude that the conducted analysis allows us to infer actionable recommendations for \acrshort{rse} practice and encourage others to reuse and improve the method.
\ No newline at end of file
diff --git a/tex/frontmatter/abstract.tex b/tex/frontmatter/abstract.tex
index f31f843..59ba930 100644
--- a/tex/frontmatter/abstract.tex
+++ b/tex/frontmatter/abstract.tex
@@ -12,9 +12,10 @@
 Research software enables data processing and plays a vital role in academia and industry. As such, it is essential to have \acrfull{fair} research software. However, what precisely the landscape of research software looks like is unknown. Thus, we would like to understand the research software landscape better and utilize this information to infer actionable recommendations for the \acrfull{rse} practice.
 This study provides insights into the research software landscape at Utrecht University through an exploratory analysis while also considering the different scientific domains. We achieve this by collecting GitHub data and analyzing repository FAIRness and characteristics through heatmaps, histograms, statistical tables, and tests. Our method retrieved 176 users with 1521 repositories, of which 823 are considered research software. Others can adopt the proposed method to gain insights into their specific organization, as it is designed to be reproducible and reusable.
-The analysis showed significant differences between faculty characteristics and how to support the application of \acrshort{fair} variables. Among other things, our results showed that Geosciences have the highest percentage of unlicensed repositories with 57\%. Also, Social Sciences are an outlier in language usage, as they are the only faculty to primarily use R. Other faculties primarily use Python.
-A first classification model is developed that achieves 70\% accuracy in identifying research software that can be used for future labelling tasks.
-We conclude that our labelled GitHub dataset allows us to infer actionable recommendations on \acrshort{rse} practice.
-
+The analysis showed significant differences between faculty characteristics and indicated how the application of \acrshort{fair} variables can be supported. Among other things, our results showed that Geosciences have the highest percentage of unlicensed repositories with 57\%. Also, Social Sciences are an outlier in language usage, as they are the only faculty to primarily use R, while other faculties primarily use Python.
+A first classification model, which can be used for future labelling tasks, is developed and achieves 70\% accuracy in identifying research software.
+Our recommendations include expanding the R Data Café, creating \acrshort{fair} reference documents, featuring high-impact and FAIR research software, and creating yearly reports.
+We conclude that our labelled GitHub dataset allows us to infer actionable recommendations on \acrshort{rse} practice.
 % Among other findings, we found out that Geosciences have 57\% of unlicensed software, while the next highest percentage is much lower with 35\% for the Humanities. There is also a clear difference in language usage between Social Sciences, who primarily use R, and the other faculties, who primarily use Python.
\ No newline at end of file
diff --git a/tex/masterthesis.bib b/tex/masterthesis.bib
index 7dd8194..a4ff91e 100644
@@ -1542,8 +1542,6 @@ @article{scikit-learn
 year={2011}
 }
 
-
-
 @misc{noauthor_badgeapp_nodate,
 title = {{BadgeApp}},
 url = {https://bestpractices.coreinfrastructure.org/en},
@@ -1551,3 +1549,15 @@ @misc{noauthor_badgeapp_nodate
 note = {(accessed 2022-10-05)},
 file = {BadgeApp:C\:\\Users\\beld\\Zotero\\storage\\PUXP42YH\\en.html:text/html},
 }
+
+
+@misc{noauthor_r_nodate,
+  title = {R {Data} {Café} - {Applied} {Data} {Science} - {Utrecht} {University}},
+  url = {https://www.uu.nl/en/research/applied-data-science/r-data-cafe},
+  abstract = {During this informal meetup, we will help each other with R programming, data handling and cleaning challenges.},
+  language = {en},
+  urldate = {2022-10-08},
+  note = {(accessed 2022-10-08)},
+  file = {Snapshot:C\:\\Users\\beld\\Zotero\\storage\\9EQC3V52\\r-data-cafe.html:text/html},
+}
+