From 36c310c3e7a9bbcc2ad38240f7f512086280f3a8 Mon Sep 17 00:00:00 2001 From: lcolladotor Date: Tue, 27 Aug 2024 12:33:11 -0400 Subject: [PATCH] =?UTF-8?q?Add=20links=20to=20RStudio=20Conf=202019=20and?= =?UTF-8?q?=202020=20talks=20by=20Brooke=20Watson=20and=20Mar=C3=ADa=20Ter?= =?UTF-8?q?esa=20Ort=C3=ADz?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../index/execute-results/html.json | 4 ++-- .../index/execute-results/html.json | 4 ++-- posts/01-welcome/index.qmd | 21 ++++++++++++++++--- .../index.qmd | 6 ++++++ 4 files changed, 28 insertions(+), 7 deletions(-) diff --git a/_freeze/posts/01-welcome/index/execute-results/html.json b/_freeze/posts/01-welcome/index/execute-results/html.json index 3a6dfd4..f294a1f 100644 --- a/_freeze/posts/01-welcome/index/execute-results/html.json +++ b/_freeze/posts/01-welcome/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "4010d68cfaaffa4bc053ee472abced1e", + "hash": "b540e06a944e2e37c24b4320dc682388", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"01 - Welcome!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Overview course information for BSPH Biostatistics 140.776\"\ncategories: [course-admin, module 1, week 1]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/01-welcome/index.qmd).*\n\nWelcome! I am very excited to have you in our one-term (i.e. 
half a semester) course on Statistical Computing course number (140.776) offered by the [Department of Biostatistics](https://publichealth.jhu.edu/departments/biostatistics) at the [Johns Hopkins Bloomberg School of Public Health](https://publichealth.jhu.edu).\n\n\n\n

I'm excited to be back this year to teach "140.776 Statistical Computing" at @jhubiostat @JohnsHopkinsSPH 😊

This year, I decided to start with some music 🎶. I like part of the lyrics of this song, which talks about overcoming challenges, [...]https://t.co/H5Uq6QhH4D

1/3 🧵 pic.twitter.com/pkRjefcokc

— 🇲🇽 Leonardo Collado-Torres (@lcolladotor) August 27, 2024
\n\nThis course is designed for ScM and PhD students at Johns Hopkins Bloomberg School of Public Health. I am pretty flexible about permitting outside students, but I want everyone to be aware of the goals and assumptions so no one feels like they are surprised by how the class works.\n\n::: callout-note\nThe primary goal of the course is to teach you practical programming and computational skills required for the research and application of statistical methods.\n:::\n\nThis class is not designed to teach the theoretical aspects of statistical or computational methods, but rather the goal is to help with the **practical issues** related to setting up a statistical computing environment for data analyses, developing high-quality R packages, conducting reproducible data analyses, best practices for data visualization and writing code, and creating websites for personal or project use.\n\n### Assumptions and pre-requisites\n\nThe course is designed for students in the Johns Hopkins Biostatistics Masters and PhD programs. However, we do not assume a significant background in statistics. Specifically we assume:\n\n#### 1. You know the **basics** of at least one programming language (e.g. R or Python)\n\n- If it's not R, we assume that you are willing to spend the time to learn R\n- You have heard of things such as control structures, functions, loops, etc\n- Know the difference between different data types (e.g. character, numeric, etc)\n- Know the basics of plotting (e.g. what is a scatterplot, histogram, etc)\n\n#### 2. You know the **basics** of computing environments\n\n- You have access to a computing environment (i.e. locally on a laptop or working in the cloud)\n- You generally feel comfortable with installing and working with software\n\n#### 3. 
You know the **basics** of statistics\n\n- The central dogma (estimates, standard errors, basic distributions, etc.)\n- Key statistical terms and methods\n- Differences between estimation vs testing vs prediction\n- Know how to fit and interpret **basic** statistical models (e.g. linear models)\n\n#### 4. You know the **basics** of reproducible research\n\n- Difference between replication and reproducible. If not, check this excellent paper by Prasad Patil et al.: .\n- Know how to cite references (e.g. like in a publication)\n- Somewhat familiar with tools that enable reproducible research (In complete transparency, we will briefly cover these topics in the first week, but depending on your comfort level with them, this may impact whether you choose to continue with the course).\n\nSince the target audience for this course is advanced students in statistics we will not be able to spend significant time covering these concepts and technologies. To give you some idea about how these prerequisites will impact your experience in the course, we will be using R for nearly all classes, we will be turning in all assignments via R Markdown documents and you will be encouraged (not required) to use git/GitHub to track changes to your code over time. The majority of the assignments will involve learning the practical issues around performing data analyses, building software packages, building websites, etc all using the R programming language. Data analyses you will perform will also often involve significant data extraction, cleaning, and transformation. 
We will learn about tools to do all of this, but hopefully most of this sounds familiar to you so you can focus on the concepts we will be teaching around best practices for statistical computing.\n\n::: callout-tip\nSome resources that may be useful if you feel you may be missing pieces of this background:\n\n- **Statistics** - [Mathematical Biostatistics Bootcamp I (Coursera)](https://www.coursera.org/learn/biostatistics); [Mathematical Biostatistics Bootcamp II (Coursera)](https://www.coursera.org/learn/biostatistics-2)\n- **Basic Data Science** - [Cloud Data Science (Leanpub)](https://leanpub.com/universities/set/jhu/cloud-based-data-science); [Data Science Specialization (Coursera)](https://www.coursera.org/specializations/jhu-data-science)\n- **Version Control** - [Github Learning Lab](https://lab.github.com/); [Happy Git and Github for the useR](https://happygitwithr.com/)\n- **Rmarkdown** - [Rmarkdown introduction](https://rmarkdown.rstudio.com/lesson-1.html)\n- **R 101** [*LIBD rstats club*](https://research.libd.org/rstatsclub/) blog post: \n- **Introductory R videos** from the [*LIBD rstats club*](https://research.libd.org/rstatsclub/) such as these videos:\n\n\n\n\n:::\n\n### Getting set up\n\nYou must install [R](https://cran.r-project.org) and [RStudio](https://rstudio.com) on your computing environment in order to complete this course.\n\nThese are two **different** applications that must be installed separately before they can be used together:\n\n- `R` is the core underlying programming language and computing engine that we will be learning in this course\n\n- `RStudio` is an interface into R that makes many aspects of using and programming R simpler\n\nBoth `R` and `RStudio` are available for Windows, macOS, and most flavors of Unix and Linux. Please download the version that is suitable for your computing setup.\n\nThroughout the course, we will make use of numerous R add-on packages that must be installed over the Internet. 
Packages can be installed using the `install.packages()` function in R. For example, to install the `tidyverse` package, you can run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nin the R console.\n\n#### How to Download R for Windows\n\nGo to and\n\n1. Click the link to \"Download R for Windows\"\n\n2. Click on \"base\"\n\n3. Click on \"Download R 4.4.1 for Windows\"\n\n::: callout-warning\nThe version in the video is not the latest version of R. Please download the latest version.\n:::\n\n![Video Demo for Downloading R for Windows](../../videos/downloadRWindows.gif){width=\"80%\"}\n\n#### How to Download R for the Mac\n\nGoto and\n\n1. Click the link to \"Download R for (Mac) OS X\".\n\n2. Click on \"R-4.4.1.pkg\"\n\n::: callout-warning\nThe version in the video is not the latest version of R. Please download the latest version.\n:::\n\n![Video Demo for Downloading R for the Mac](../../videos/downloadRMac.gif){width=\"80%\"}\n\n#### How to Download RStudio\n\nGoto and\n\n1. Click on \"Products\" in the top menu\n\n2. Then click on \"RStudio\" in the drop down menu\n\n3. Click on \"RStudio Desktop\"\n\n4. Click the button that says \"DOWNLOAD RSTUDIO DESKTOP\"\n\n5. Click the button under \"RStudio Desktop\" Free\n\n6. 
Under the section \"All Installers\" choose the file that is appropriate for your operating system.\n\n::: callout-warning\nNOTE: The video shows how to download RStudio for the Mac but you should download RStudio for whatever computing setup you have\n:::\n\n![Video Demo for Downloading RStudio](../../videos/downloadRStudio.gif){width=\"80%\"}\n\n#### How to Download Git\n\nInstall `Git` on your computer following the detailed instructions at , which will depend on your operating system.\n\n#### Create your GitHub account\n\nCreate a personal GitHub account following the instructions at .\n\nFollowing [Jeff Leek](https://twitter.com/jtleek)'s advice on [**How to be a modern scientist**](https://leanpub.com/modernscientist) that was part of my team's bootcamps at , try to choose a username that will be the same one you use on your email and other work-related social media platforms such as Twitter, Bluesky, etc.\n\n\n\n# Learning Objectives\n\nThe goal is by the end of the class, students will be able to:\n\n1. Install and configure software necessary for a statistical programming environment\n2. Discuss generic programming language concepts as they are implemented in a high-level statistical language\n3. Write and debug code in base R and the tidyverse (and integrate code from Python modules)\n4. Build basic data visualizations using R and the tidyverse\n5. Discuss best practices for coding and reproducible research, basics of data ethics, basics of working with special data types, and basics of storing data\n\n# Course logistics\n\nThe only option is to take the course is in-person (140.776.01).\n\n- \n\nAll communication for the course is going to take place on one of three platforms:\n\n- **Courseplus**: for discussion, sharing resources, collaborating, and announcements\n\n- **Github**: for getting access to course materials (e.g. 
lectures, project assignments)\n\n - Course Github: \n\n- **Lectures**: lectures will be in person\n\n - All in person lectures to be recorded and posted online after class ends\n - If for some reason I am sick or not capable of coming onsite, I will send out a Zoom link for everyone to attend remotely for that day.\n\nThe primary communication for the class will go through Courseplus. That is where we will post course announcements, answer common questions, and as the primary means of communication between course participants and course instructors.\n\n::: callout-important\nIf you are registered for the course, you should have access to Courseplus now. Once you have access you will also be able to find all material and dates/times of drop-in hours. Any Zoom links will be posted on Courseplus.\n:::\n\n### Course Staff\n\nThe course instructor this year is [Leonardo Collado Torres](http://lcolladotor.github.io/) who is an Investigator at the [Lieber Institute for Brain Development](https://www.libd.org/) and an Assistant Professor in the [Department of Biostatistics](https://publichealth.jhu.edu/departments/biostatistics) at the [Johns Hopkins Bloomberg School of Public Health](https://publichealth.jhu.edu). I was actually a student of this course in Fall 2012 when [Roger D. Peng](https://twitter.com/rdpeng) was the instructor of this course and [Jenna Krall](https://twitter.com/kralljr) was the only teaching assistant. I first taught this course in 2023. I myself started learning and [teaching](http://lcolladotor.github.io/#teaching) about R in 2008 and we have a lot of material in the [LIBD rstats club](https://research.libd.org/rstatsclub/) that you might be interested in.\n\nAt the Lieber Institute for Brain Development ([LIBD](https://www.libd.org/)), my group works on understanding the roots and signatures of disease (particularly psychiatric disorders) by zooming in across dimensions of gene activity. 
We achieve this by studying gene expression at all expression feature levels (genes, exons, exon-exon junctions, and un-annotated regions) and by using different gene expression measurement technologies (bulk RNA-seq, single cell/nucleus RNA-seq, and spatial transcriptomics) that provide finer biological resolution and localization of gene expression. We work closely with collaborators from LIBD as well as from Johns Hopkins University (JHU) and other institutions, which reflects the cross-disciplinary approach and diversity in expertise needed to further advance our understanding of high throughput biology.\n\nEvery day I use [R](http://cran.r-project.org/) and [Bioconductor](http://www.bioconductor.org/), and on some days I [write R packages](https://lcolladotor.github.io/pkgs/). Occasionally I write [blog posts](http://lcolladotor.github.io/#blog) about them and other tools. I'm a co-founder of the [LIBD rstats club](http://LieberInstitute.github.io/rstatsclub/) and the [CDSB community](https://comunidadbioinfo.github.io) of R and Bioconductor developers in Mexico and Latin America, that we described at the [R Consortium website](https://www.r-consortium.org/blog/2020/03/18/cdsb-diversity-and-outreach-hotspot-in-mexico). In the past, I also served on the [Bioconductor Community Advisory Board](http://bioconductor.org/about/community-advisory-board/).\n\nIf you want, you can find me on [Twitter](https://twitter.com/lcolladotor) or [Bluesky](https://bsky.app/profile/lcolladotor.bsky.social).\n\n#### Teaching Assistants\n\nWe also have three amazing TAs this year:\n\n- Jiaxin (Jessi) Huang ([jhuan206\\@jh.edu](mailto:jhuan206@jh.edu)): I am a second-year master student in the Biostatistics program, specializing in longitudinal data analysis. My current research focuses on exploring and identifying factors associated with long asymptomatic malaria infections. 
In my free time, I enjoy traveling, cooking, outdoor activities and watching shows.\n- Phyllis Wei ([ywei43\\@jhu.edu](mailto:ywei43@jhu.edu)). She is a fourth year Ph.D. student in Biostatistics. She develops methods to help understand the genetic basis of complex traits and diseases and enhance disease risk prediction models through data integration. Outside of biostatistics, she enjoys hiking, baking, and visiting museums.\n- Yu Lu ([ylu136\\@jh.edu](mailto:ylu136@jh.edu)): I am a third-year Ph.D. student in Biostatistics. My work is focused on developing functional data analysis methods inspired by physical activity data. In my free time, I enjoy the outdoor but also cuddling with my cat, Finland. (Didn’t manage to make him wear a harness and take a walk outside yet)\n\n### Assignment Due Dates\n\nAll course assignment due dates appear on the **Schedule** and **Syllabus**.\n\n### Grading\n\n#### Philosophy\n\nWe believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. We are super excited to talk to you about ideas, work out solutions with you, and help you to figure out how to produce professional data analyses. We do not think that graduate school grades are important for this purpose. 
This means that we do not care very much about graduate student grades.\n\nThat being said, we have to give you a grade so they will be:\n\n- A - Excellent - 90%+\n- B - Passing - 80%+\n- C - Needs improvement - 70%+\n\nWe rarely give out grades below a C and if you consistently submit work, and do your best you are very likely to get an A or a B in the course.\n\nWhen I was a JHBSPH student in 2011-2016, I had a scholarship from my country 🇲🇽 which had specific grade requirements, so I recognize that while most employers won't care about your grades over your grad school projects, you might have strong reasons for aiming for high grades.\n\n#### Relative weights\n\nThe grades are based on three projects (plus one entirely optional project to help you get set up). The breakdown of grading will be\n\n- 33% for Project 1\n- 33% for Project 2\n- 34% for Project 3\n\nIf you submit an project solution, it is your own work, and it meets a basic level of completeness and effort you will get 100% for that project. If you submit a project solution, but it doesn't meet basic completeness and effort you will receive 50%. If you do not submit an solution you will receive 0%.\n\n### Submitting assignments\n\nPlease write up your project solutions using R Markdown. In some cases, you will compile a R Markdown file into an HTML file and submit your HTML file to the dropbox on Courseplus. In other cases, you may create an R package or website. In all of the above, when applicable, show all your code and provide as much explanation / documentation as you can.\n\nFor each project, we will provide a time when we download the materials. We will assume whatever version we download at that time is what you are turning in.\n\n### Reproducibility\n\nWe will talk about reproducibility a bit during class, and it will be a part of the homework assignments as well. 
Reproducibility of scientific code is very challenging, so the faculty and TAs completely understand difficulties that arise. But we think that it is important that you practice reproducible research. In particular, your project assignments should perform the tasks that you are asked to do and create the figures and tables you are asked to make as a part of the compilation of your document. We will have some pointers for some issues that have come up as we announce the projects.\n\n### Code of Conduct\n\nWe are committed to providing a welcoming, inclusive, and harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, religion (or lack thereof), political beliefs/leanings, or technology choices. We do not tolerate harassment of course participants in any form. Sexual language and imagery is not appropriate for any work event, including group meetings, conferences, talks, parties, Twitter, and other online media. This code of conduct applies to all course participants, including instructors and TAs, and applies to all modes of interaction, both in-person and online, including GitHub project repos, Slack channels, and Twitter.\n\nI was also part of the Bioconductor Code of Conduct committee for a few years. You might find the Bioconductor Code of Conduct useful as it is translated into different languages by native speakers of said languages: .\n\nCourse participants violating these rules will be referred to leadership of the Department of Biostatistics and the Title IX coordinator at JHU and may face expulsion from the class.\n\nAll class participants agree to:\n\n- **Be considerate** in speech and actions, and actively seek to acknowledge and respect the boundaries of other members.\n- **Be respectful**. Disagreements happen, but do not require poor behavior or poor manners. 
Frustration is inevitable, but it should never turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. Course participants should be respectful both of the other course participants and those outside the course.\n- **Refrain from demeaning, discriminatory, or harassing behavior and speech**. Harassment includes, but is not limited to: deliberate intimidation; stalking; unwanted photography or recording; sustained or willful disruption of talks or other events; inappropriate physical contact; use of sexual or discriminatory imagery, comments, or jokes; and unwelcome sexual attention. If you feel that someone has harassed you or otherwise treated you inappropriately, please alert Leonardo Collado Torres.\n- **Take care of each other**. Refrain from advocating for, or encouraging, any of the above behavior. And, if someone asks you to stop, then stop. Alert Leonardo Collado Torres if you notice a dangerous situation, someone in distress, or violations of this code of conduct, even if they seem inconsequential.\n\n### Need Help?\n\nPlease speak with Leonardo Collado Torres or one of the TAs. 
You can also reach out to Elizabeth Stuart, chair of the department of Biostatistics or Margaret Taub, Ombudsman for the Department of Biostatistics.\n\nYou may also reach out to any Hopkins resource for sexual harassment, discrimination, or misconduct:\n\n- JHU Sexual Assault Helpline, 410-516-7333 (confidential)\n- [University Sexual Assault Response and Prevention website](http://sexualassault.jhu.edu/?utm_source=JHU+Broadcast+Messages+-+Synced+List&utm_campaign=c9030551f7-EMAIL_CAMPAIGN_2017_12_11&utm_medium=email&utm_term=0_af6859b027-c9030551f7-69248741)\n- [Johns Hopkins Compliance Hotline](https://johnshopkinsspeak2us.tnwreports.com/?utm_source=JHU+Broadcast+Messages+-+Synced+List&utm_campaign=c9030551f7-EMAIL_CAMPAIGN_2017_12_11&utm_medium=email&utm_term=0_af6859b027-c9030551f7-69248741), 844-SPEAK2US (844-733-2528)\n- [Hopkins Policies Online](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=8a667a12dd&e=b1124f7c17)\n- [JHU Office of Institutional Equity](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=928bcfb8a9&e=b1124f7c17) 410-516-8075 (nonconfidential)\n- [Johns Hopkins Student Assistance Program](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=98f4091f97&e=b1124f7c17) (JHSAP), 443-287-7000\n- [University Health Services](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=d51077694c&e=b1124f7c17), 410-955-1892\n- [The Faculty and Staff Assistance Program](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=af1f20bd97&e=b1124f7c17) (FASAP), 443-997-7000\n\n### Feedback\n\nWe welcome feedback on this Code of Conduct.\n\n### License and attribution\n\nThis Code of Conduct is distributed under a Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Portions of above text comprised of language from the Codes of Conduct adopted by rOpenSci and Django, which are licensed by CC BY-SA 4.0 and CC BY 3.0. 
This work was further inspired by Ada Initiative's ''how to design a code of conduct for your community'' and Geek Feminism's Code of conduct evaluations and expanded by Ashley Johnson and Shannon Ellis in the Jeff Leek group.\n\n### Academic Ethics\n\nStudents enrolled in the Bloomberg School of Public Health of The Johns Hopkins University assume an obligation to conduct themselves in a manner appropriate to the University's mission as an institution of higher education. A student is obligated to refrain from acts which he or she knows, or under the circumstances has reason to know, impair the academic integrity of the University. Violations of academic integrity include, but are not limited to: cheating; plagiarism; knowingly furnishing false information to any agent of the University for inclusion in the academic record; violation of the rights and welfare of animal or human subjects in research; and misconduct as a member of either School or University committees or recognized groups or organizations.\n\nStudents should be familiar with the policies and procedures specified under Policy and Procedure Manual Student-01 (Academic Ethics), available on the school's [portal](https://my.jhsph.edu/Pages/Faculty.aspx).\n\nThe faculty, staff and students of the Bloomberg School of Public Health and the Johns Hopkins University have the shared responsibility to conduct themselves in a manner that upholds the law and respects the rights of others. Students enrolled in the School are subject to the Student Conduct Code (detailed in Policy and Procedure Manual Student-06) and assume an obligation to conduct themselves in a manner which upholds the law and respects the rights of others. 
They are responsible for maintaining the academic integrity of the institution and for preserving an environment conducive to the safe pursuit of the School's educational, research, and professional practice missions.\n\n## Disability Support Service\n\nStudents requiring accommodations for disabilities should register with [Student Disability Service (SDS)](https://publichealth.jhu.edu/about/inclusion-diversity-anti-racism-and-equity-idare/student-disability-services). It is the responsibility of the student to register for accommodations with SDS. Accommodations take effect upon approval and apply to the remainder of the time for which a student is registered and enrolled at the Bloomberg School of Public Health. Once you are f a student in your class has approved accommodations you will receive formal notification and the student will be encouraged to reach out. If you have questions about requesting accommodations, please contact `BSPH.dss@jhu.edu`.\n\n## Previous versions of the class\n\n- \n- \n- \n\n## Typos and corrections\n\nFeel free to submit typos/errors/etc via the GitHub repository associated with the class: . 
You will have the thanks of your grateful instructor!\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-08-27\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", + "markdown": "---\ntitle: \"01 - Welcome!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of 
Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Overview course information for BSPH Biostatistics 140.776\"\ncategories: [course-admin, module 1, week 1]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/01-welcome/index.qmd).*\n\nWelcome! I am very excited to have you in our one-term (i.e. half a semester) course on Statistical Computing course number (140.776) offered by the [Department of Biostatistics](https://publichealth.jhu.edu/departments/biostatistics) at the [Johns Hopkins Bloomberg School of Public Health](https://publichealth.jhu.edu).\n\n\n\n
\n\n

\n\nI'm excited to be back this year to teach \"140.776 Statistical Computing\" at @jhubiostat @JohnsHopkinsSPH 😊

This year, I decided to start with some music 🎶. I like part of the lyrics of this song, which talks about overcoming challenges, \\[...\\]https://t.co/H5Uq6QhH4D

1/3 🧵 pic.twitter.com/pkRjefcokc\n\n

\n\n— 🇲🇽 Leonardo Collado-Torres (@lcolladotor) August 27, 2024\n\n
\n\n\n```{=html}\n\n```\n\nThis course is designed for ScM and PhD students at Johns Hopkins Bloomberg School of Public Health. I am pretty flexible about permitting outside students, but I want everyone to be aware of the goals and assumptions so no one feels like they are surprised by how the class works.\n\n::: callout-note\nThe primary goal of the course is to teach you practical programming and computational skills required for the research and application of statistical methods.\n:::\n\nThis class is not designed to teach the theoretical aspects of statistical or computational methods, but rather the goal is to help with the **practical issues** related to setting up a statistical computing environment for data analyses, developing high-quality R packages, conducting reproducible data analyses, best practices for data visualization and writing code, and creating websites for personal or project use.\n\n### Assumptions and pre-requisites\n\nThe course is designed for students in the Johns Hopkins Biostatistics Masters and PhD programs. However, we do not assume a significant background in statistics. Specifically we assume:\n\n#### 1. You know the **basics** of at least one programming language (e.g. R or Python)\n\n- If it's not R, we assume that you are willing to spend the time to learn R\n- You have heard of things such as control structures, functions, loops, etc\n- Know the difference between different data types (e.g. character, numeric, etc)\n- Know the basics of plotting (e.g. what is a scatterplot, histogram, etc)\n\n#### 2. You know the **basics** of computing environments\n\n- You have access to a computing environment (i.e. locally on a laptop or working in the cloud)\n- You generally feel comfortable with installing and working with software\n\n#### 3. 
You know the **basics** of statistics\n\n- The central dogma (estimates, standard errors, basic distributions, etc.)\n- Key statistical terms and methods\n- Differences between estimation vs testing vs prediction\n- Know how to fit and interpret **basic** statistical models (e.g. linear models)\n\n#### 4. You know the **basics** of reproducible research\n\n- Difference between replication and reproducible. If not, check this excellent paper by Prasad Patil et al.: .\n- Know how to cite references (e.g. like in a publication)\n- Somewhat familiar with tools that enable reproducible research (In complete transparency, we will briefly cover these topics in the first week, but depending on your comfort level with them, this may impact whether you choose to continue with the course).\n\nSince the target audience for this course is advanced students in statistics we will not be able to spend significant time covering these concepts and technologies. To give you some idea about how these prerequisites will impact your experience in the course, we will be using R for nearly all classes, we will be turning in all assignments via R Markdown documents and you will be encouraged (not required) to use git/GitHub to track changes to your code over time. The majority of the assignments will involve learning the practical issues around performing data analyses, building software packages, building websites, etc all using the R programming language. Data analyses you will perform will also often involve significant data extraction, cleaning, and transformation. 
We will learn about tools to do all of this, but hopefully most of this sounds familiar to you so you can focus on the concepts we will be teaching around best practices for statistical computing.\n\n::: callout-tip\nSome resources that may be useful if you feel you may be missing pieces of this background:\n\n- **Statistics** - [Mathematical Biostatistics Bootcamp I (Coursera)](https://www.coursera.org/learn/biostatistics); [Mathematical Biostatistics Bootcamp II (Coursera)](https://www.coursera.org/learn/biostatistics-2)\n- **Basic Data Science** - [Cloud Data Science (Leanpub)](https://leanpub.com/universities/set/jhu/cloud-based-data-science); [Data Science Specialization (Coursera)](https://www.coursera.org/specializations/jhu-data-science)\n- **Version Control** - [Github Learning Lab](https://lab.github.com/); [Happy Git and Github for the useR](https://happygitwithr.com/)\n- **Rmarkdown** - [Rmarkdown introduction](https://rmarkdown.rstudio.com/lesson-1.html)\n- **R 101** [*LIBD rstats club*](https://research.libd.org/rstatsclub/) blog post: \n- **Introductory R videos** from the [*LIBD rstats club*](https://research.libd.org/rstatsclub/) such as these videos:\n\n\n\n\n:::\n\n### Getting set up\n\nYou must install [R](https://cran.r-project.org) and [RStudio](https://rstudio.com) (available from [Posit](https://posit.co/)) on your computing environment in order to complete this course.\n\nThese are two **different** applications that must be installed separately before they can be used together:\n\n- `R` is the core underlying programming language and computing engine that we will be learning in this course\n\n- `RStudio` is an interface into R that makes many aspects of using and programming R simpler\n\nBoth `R` and `RStudio` are available for Windows, macOS, and most flavors of Unix and Linux. 
Please download the version that is suitable for your computing setup.\n\nThroughout the course, we will make use of numerous R add-on packages that must be installed over the Internet. Packages can be installed using the `install.packages()` function in R. For example, to install the `tidyverse` package, you can run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nin the R console.\n\n#### How to Download R for Windows\n\nGo to and\n\n1. Click the link to \"Download R for Windows\"\n\n2. Click on \"base\"\n\n3. Click on \"Download R 4.4.1 for Windows\"\n\n::: callout-warning\nThe version in the video is not the latest version of R. Please download the latest version.\n:::\n\n![Video Demo for Downloading R for Windows](../../videos/downloadRWindows.gif){width=\"80%\"}\n\n#### How to Download R for the Mac\n\nGo to and\n\n1. Click the link to \"Download R for (Mac) OS X\".\n\n2. Click on \"R-4.4.1.pkg\"\n\n::: callout-warning\nThe version in the video is not the latest version of R. Please download the latest version.\n:::\n\n![Video Demo for Downloading R for the Mac](../../videos/downloadRMac.gif){width=\"80%\"}\n\n#### How to Download RStudio\n\nGo to and\n\n1. Click on \"Products\" in the top menu\n\n2. Then click on \"RStudio\" in the drop-down menu\n\n3. Click on \"RStudio Desktop\"\n\n4. Click the button that says \"DOWNLOAD RSTUDIO DESKTOP\"\n\n5. Click the button under \"RStudio Desktop\" Free\n\n6. 
Under the section \"All Installers\" choose the file that is appropriate for your operating system.\n\n::: callout-warning\nNOTE: The video shows how to download RStudio for the Mac but you should download RStudio for whatever computing setup you have\n:::\n\n![Video Demo for Downloading RStudio](../../videos/downloadRStudio.gif){width=\"80%\"}\n\n#### How to Download Git\n\nInstall `Git` on your computer following the detailed instructions at , which will depend on your operating system.\n\n#### Create your GitHub account\n\nCreate a personal GitHub account following the instructions at .\n\nFollowing [Jeff Leek](https://twitter.com/jtleek)'s advice on [**How to be a modern scientist**](https://leanpub.com/modernscientist) that was part of my team's bootcamps at , try to choose a username that will be the same one you use for your email and other work-related social media platforms such as Twitter, Bluesky, etc.\n\n\n\n# Learning Objectives\n\nThe goal is that by the end of the class, students will be able to:\n\n1. Install and configure software necessary for a statistical programming environment\n2. Discuss generic programming language concepts as they are implemented in a high-level statistical language\n3. Write and debug code in base R and the tidyverse (and integrate code from Python modules)\n4. Build basic data visualizations using R and the tidyverse\n5. Discuss best practices for coding and reproducible research, basics of data ethics, basics of working with special data types, and basics of storing data\n\n# Course logistics\n\nThe only option is to take the course in person (140.776.01).\n\n- \n\nAll communication for the course is going to take place on one of three platforms:\n\n- **Courseplus**: for discussion, sharing resources, collaborating, and announcements\n\n- **Github**: for getting access to course materials (e.g. 
lectures, project assignments)\n\n - Course Github: \n\n- **Lectures**: lectures will be in person\n\n - All in-person lectures will be recorded and posted online after class ends\n - If for some reason I am sick or not capable of coming onsite, I will send out a Zoom link for everyone to attend remotely for that day.\n\nThe primary communication for the class will go through Courseplus. That is where we will post course announcements and answer common questions, and it will serve as the primary means of communication between course participants and course instructors.\n\n::: callout-important\nIf you are registered for the course, you should have access to Courseplus now. Once you have access you will also be able to find all material and dates/times of drop-in hours. Any Zoom links will be posted on Courseplus.\n:::\n\n### Course Staff\n\nThe course instructor this year is [Leonardo Collado Torres](http://lcolladotor.github.io/) who is an Investigator at the [Lieber Institute for Brain Development](https://www.libd.org/) and an Assistant Professor in the [Department of Biostatistics](https://publichealth.jhu.edu/departments/biostatistics) at the [Johns Hopkins Bloomberg School of Public Health](https://publichealth.jhu.edu). I was actually a student of this course in Fall 2012 when [Roger D. Peng](https://twitter.com/rdpeng) was the instructor of this course and [Jenna Krall](https://twitter.com/kralljr) was the only teaching assistant. I first taught this course in 2023. I myself started learning and [teaching](http://lcolladotor.github.io/#teaching) about R in 2008 and we have a lot of material in the [LIBD rstats club](https://research.libd.org/rstatsclub/) that you might be interested in.\n\nAt the Lieber Institute for Brain Development ([LIBD](https://www.libd.org/)), my group works on understanding the roots and signatures of disease (particularly psychiatric disorders) by zooming in across dimensions of gene activity. 
We achieve this by studying gene expression at all expression feature levels (genes, exons, exon-exon junctions, and un-annotated regions) and by using different gene expression measurement technologies (bulk RNA-seq, single cell/nucleus RNA-seq, and spatial transcriptomics) that provide finer biological resolution and localization of gene expression. We work closely with collaborators from LIBD as well as from Johns Hopkins University (JHU) and other institutions, which reflects the cross-disciplinary approach and diversity in expertise needed to further advance our understanding of high throughput biology.\n\nEvery day I use [R](http://cran.r-project.org/) and [Bioconductor](http://www.bioconductor.org/), and on some days I [write R packages](https://lcolladotor.github.io/pkgs/). Occasionally I write [blog posts](http://lcolladotor.github.io/#blog) about them and other tools. I'm a co-founder of the [LIBD rstats club](http://LieberInstitute.github.io/rstatsclub/) and the [CDSB community](https://comunidadbioinfo.github.io) of R and Bioconductor developers in Mexico and Latin America, that we described at the [R Consortium website](https://www.r-consortium.org/blog/2020/03/18/cdsb-diversity-and-outreach-hotspot-in-mexico). In the past, I also served on the [Bioconductor Community Advisory Board](http://bioconductor.org/about/community-advisory-board/).\n\nIf you want, you can find me on [Twitter](https://twitter.com/lcolladotor) or [Bluesky](https://bsky.app/profile/lcolladotor.bsky.social).\n\n#### Teaching Assistants\n\nWe also have three amazing TAs this year:\n\n- Jiaxin (Jessi) Huang ([jhuan206\\@jh.edu](mailto:jhuan206@jh.edu)): I am a second-year master student in the Biostatistics program, specializing in longitudinal data analysis. My current research focuses on exploring and identifying factors associated with long asymptomatic malaria infections. 
In my free time, I enjoy traveling, cooking, outdoor activities and watching shows.\n- Phyllis Wei ([ywei43\\@jhu.edu](mailto:ywei43@jhu.edu)). She is a fourth-year Ph.D. student in Biostatistics. She develops methods to help understand the genetic basis of complex traits and diseases and enhance disease risk prediction models through data integration. Outside of biostatistics, she enjoys hiking, baking, and visiting museums.\n- Yu Lu ([ylu136\\@jh.edu](mailto:ylu136@jh.edu)): I am a third-year Ph.D. student in Biostatistics. My work is focused on developing functional data analysis methods inspired by physical activity data. In my free time, I enjoy the outdoors but also cuddling with my cat, Finland. (Didn’t manage to make him wear a harness and take a walk outside yet)\n\n### Assignment Due Dates\n\nAll course assignment due dates appear on the **Schedule** and **Syllabus**.\n\n### Grading\n\n#### Philosophy\n\nWe believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. We are super excited to talk to you about ideas, work out solutions with you, and help you to figure out how to produce professional data analyses. We do not think that graduate school grades are important for this purpose. 
This means that we do not care very much about graduate student grades.\n\nThat being said, we have to give you a grade, so the grades will be:\n\n- A - Excellent - 90%+\n- B - Passing - 80%+\n- C - Needs improvement - 70%+\n\nWe rarely give out grades below a C and if you consistently submit work and do your best, you are very likely to get an A or a B in the course.\n\nWhen I was a JHBSPH student in 2011-2016, I had a scholarship from my country 🇲🇽 which had specific grade requirements, so I recognize that while most employers won't care about your grades over your grad school projects, you might have strong reasons for aiming for high grades.\n\n#### Relative weights\n\nThe grades are based on three projects (plus one entirely optional project to help you get set up). The breakdown of grading will be\n\n- 33% for Project 1\n- 33% for Project 2\n- 34% for Project 3\n\nIf you submit a project solution, it is your own work, and it meets a basic level of completeness and effort you will get 100% for that project. If you submit a project solution, but it doesn't meet basic completeness and effort you will receive 50%. If you do not submit a solution you will receive 0%.\n\n### Submitting assignments\n\nPlease write up your project solutions using R Markdown. In some cases, you will compile an R Markdown file into an HTML file and submit your HTML file to the dropbox on Courseplus. In other cases, you may create an R package or website. In all of the above, when applicable, show all your code and provide as much explanation / documentation as you can.\n\nFor each project, we will provide a time when we download the materials. We will assume whatever version we download at that time is what you are turning in.\n\n### Reproducibility\n\nWe will talk about reproducibility a bit during class, and it will be a part of the homework assignments as well. 
Reproducibility of scientific code is very challenging, so the faculty and TAs completely understand difficulties that arise. But we think that it is important that you practice reproducible research. In particular, your project assignments should perform the tasks that you are asked to do and create the figures and tables you are asked to make as a part of the compilation of your document. We will have some pointers for some issues that have come up as we announce the projects.\n\n### Code of Conduct\n\nWe are committed to providing a welcoming, inclusive, and harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, religion (or lack thereof), political beliefs/leanings, or technology choices. We do not tolerate harassment of course participants in any form. Sexual language and imagery is not appropriate for any work event, including group meetings, conferences, talks, parties, Twitter, and other online media. This code of conduct applies to all course participants, including instructors and TAs, and applies to all modes of interaction, both in-person and online, including GitHub project repos, Slack channels, and Twitter.\n\nI was also part of the Bioconductor Code of Conduct committee for a few years. You might find the Bioconductor Code of Conduct useful as it is translated into different languages by native speakers of said languages: .\n\nCourse participants violating these rules will be referred to leadership of the Department of Biostatistics and the Title IX coordinator at JHU and may face expulsion from the class.\n\nAll class participants agree to:\n\n- **Be considerate** in speech and actions, and actively seek to acknowledge and respect the boundaries of other members.\n- **Be respectful**. Disagreements happen, but do not require poor behavior or poor manners. 
Frustration is inevitable, but it should never turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. Course participants should be respectful both of the other course participants and those outside the course.\n- **Refrain from demeaning, discriminatory, or harassing behavior and speech**. Harassment includes, but is not limited to: deliberate intimidation; stalking; unwanted photography or recording; sustained or willful disruption of talks or other events; inappropriate physical contact; use of sexual or discriminatory imagery, comments, or jokes; and unwelcome sexual attention. If you feel that someone has harassed you or otherwise treated you inappropriately, please alert Leonardo Collado Torres.\n- **Take care of each other**. Refrain from advocating for, or encouraging, any of the above behavior. And, if someone asks you to stop, then stop. Alert Leonardo Collado Torres if you notice a dangerous situation, someone in distress, or violations of this code of conduct, even if they seem inconsequential.\n\n### Need Help?\n\nPlease speak with Leonardo Collado Torres or one of the TAs. 
You can also reach out to Elizabeth Stuart, chair of the Department of Biostatistics, or Margaret Taub, Ombudsman for the Department of Biostatistics.\n\nYou may also reach out to any Hopkins resource for sexual harassment, discrimination, or misconduct:\n\n- JHU Sexual Assault Helpline, 410-516-7333 (confidential)\n- [University Sexual Assault Response and Prevention website](http://sexualassault.jhu.edu/?utm_source=JHU+Broadcast+Messages+-+Synced+List&utm_campaign=c9030551f7-EMAIL_CAMPAIGN_2017_12_11&utm_medium=email&utm_term=0_af6859b027-c9030551f7-69248741)\n- [Johns Hopkins Compliance Hotline](https://johnshopkinsspeak2us.tnwreports.com/?utm_source=JHU+Broadcast+Messages+-+Synced+List&utm_campaign=c9030551f7-EMAIL_CAMPAIGN_2017_12_11&utm_medium=email&utm_term=0_af6859b027-c9030551f7-69248741), 844-SPEAK2US (844-733-2528)\n- [Hopkins Policies Online](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=8a667a12dd&e=b1124f7c17)\n- [JHU Office of Institutional Equity](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=928bcfb8a9&e=b1124f7c17) 410-516-8075 (nonconfidential)\n- [Johns Hopkins Student Assistance Program](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=98f4091f97&e=b1124f7c17) (JHSAP), 443-287-7000\n- [University Health Services](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=d51077694c&e=b1124f7c17), 410-955-1892\n- [The Faculty and Staff Assistance Program](https://jhu.us5.list-manage.com/track/click?u=bd75ef1a5cad0cbfd522412c4&id=af1f20bd97&e=b1124f7c17) (FASAP), 443-997-7000\n\n### Feedback\n\nWe welcome feedback on this Code of Conduct.\n\n### License and attribution\n\nThis Code of Conduct is distributed under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Portions of the above text are comprised of language from the Codes of Conduct adopted by rOpenSci and Django, which are licensed under CC BY-SA 4.0 and CC BY 3.0. 
This work was further inspired by Ada Initiative's ''how to design a code of conduct for your community'' and Geek Feminism's Code of conduct evaluations and expanded by Ashley Johnson and Shannon Ellis in the Jeff Leek group.\n\n### Academic Ethics\n\nStudents enrolled in the Bloomberg School of Public Health of The Johns Hopkins University assume an obligation to conduct themselves in a manner appropriate to the University's mission as an institution of higher education. A student is obligated to refrain from acts which he or she knows, or under the circumstances has reason to know, impair the academic integrity of the University. Violations of academic integrity include, but are not limited to: cheating; plagiarism; knowingly furnishing false information to any agent of the University for inclusion in the academic record; violation of the rights and welfare of animal or human subjects in research; and misconduct as a member of either School or University committees or recognized groups or organizations.\n\nStudents should be familiar with the policies and procedures specified under Policy and Procedure Manual Student-01 (Academic Ethics), available on the school's [portal](https://my.jhsph.edu/Pages/Faculty.aspx).\n\nThe faculty, staff and students of the Bloomberg School of Public Health and the Johns Hopkins University have the shared responsibility to conduct themselves in a manner that upholds the law and respects the rights of others. Students enrolled in the School are subject to the Student Conduct Code (detailed in Policy and Procedure Manual Student-06) and assume an obligation to conduct themselves in a manner which upholds the law and respects the rights of others. 
They are responsible for maintaining the academic integrity of the institution and for preserving an environment conducive to the safe pursuit of the School's educational, research, and professional practice missions.\n\n## Disability Support Service\n\nStudents requiring accommodations for disabilities should register with [Student Disability Service (SDS)](https://publichealth.jhu.edu/about/inclusion-diversity-anti-racism-and-equity-idare/student-disability-services). It is the responsibility of the student to register for accommodations with SDS. Accommodations take effect upon approval and apply to the remainder of the time for which a student is registered and enrolled at the Bloomberg School of Public Health. Once a student in your class has approved accommodations, you will receive formal notification and the student will be encouraged to reach out. If you have questions about requesting accommodations, please contact `BSPH.dss@jhu.edu`.\n\n## Previous versions of the class\n\n- \n- \n- \n\n## Typos and corrections\n\nFeel free to submit typos/errors/etc via the GitHub repository associated with the class: . 
You will have the thanks of your grateful instructor!\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-08-27\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json b/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json index d8b63d5..8fec5e4 100644 --- 
a/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json +++ b/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "1eae53fab21845a9fe3c7f78e44cee15", + "hash": "5ba4dca443337c86e30c6e59aa2c70e8", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"02 - Introduction to R and RStudio!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Let's dig into the R programming language and the RStudio integrated developer environment\"\nimage: https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png\ncategories: [module 1, week 1, R, programming, RStudio]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/02-introduction-to-r-and-rstudio/index.qmd).*\n\n> There are only two kinds of languages: the ones people complain about and the ones nobody uses. ---*Bjarne Stroustrup*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [An overview and history of R](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger Peng\n2. [Installing R and RStudio](https://rafalab.github.io/dsbook/installing-r-rstudio.html) from Rafael Irizarry\n3. 
[Getting Started in R and RStudio](https://rafalab.github.io/dsbook/getting-started.html) from Rafael Irizarry\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Learn about (some of) the history of R.\n- Identify some of the strengths and weaknesses of R.\n- Install R and Rstudio on your computer.\n- Know how to install and load R packages.\n:::\n\n# Overview and history of R\n\nBelow is a very quick introduction to R, to get you set up and running. We'll go deeper into R and coding later.\n\n### tl;dr (R in a nutshell)\n\nLike every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:\n\n- R is open-source, freely accessible, and cross-platform (multiple OS).\n- R is a [\"high-level\" programming language](https://en.wikipedia.org/wiki/High-level_programming_language), relatively easy to learn.\n - While \"Low-level\" programming languages (e.g. 
Fortran, C, etc) often have more efficient code, they can also be harder to learn because they are designed to be close to machine language.\n - In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.\n- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!\n- R integrates easily with document preparation systems like $\\LaTeX$, but R files can also be used to create `.docx`, `.pdf`, `.html`, `.ppt` files with integrated R code output and graphics.\n- The R Community is very dynamic, helpful and welcoming.\n - Check out the [#rstats](https://twitter.com/search?q=%23rstats) or [#rtistry](https://twitter.com/search?q=%23rtistry) on Twitter, [TidyTuesday](https://www.tidytuesday.com) podcast and community activity in the [R4DS Online Learning Community](https://www.rfordatasci.com), and [r/rstats](https://www.reddit.com/r/rstats/) subreddit.\n - If you are looking for more local resources, check out [R-Ladies Baltimore](https://www.meetup.com/rladies-baltimore/).\n- Through R packages, it is easy to get lots of state-of-the-art algorithms.\n- Documentation and help files for R are generally good.\n\nWhile we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).\n\nDepending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. 
So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.\n\nWith the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.\n\n![Artwork by Allison Horst on learning R](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/r_first_then.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Basic Features of R\n\nToday R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.\n\nOne nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically at the end of April, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed. The frequent releases and regular release cycle indicate active development of the software and ensure that bugs will be addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new features, bug fixes, or both.\n\nAnother key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R's ability to create \"publication quality\" graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. 
R's *base* graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like `lattice` (not as widely used nowadays) and `ggplot2` (very widely used now) allow for complex and sophisticated visualizations of high-dimensional data.\n\nR has maintained the original S philosophy (see box below), which is that **it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools**. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.\n\n::: callout-tip\nFor a great discussion on an overview and history of R and the S programming language, read through [this chapter](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger D. Peng.\n:::\n\nFinally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like GitHub, [Posit (RStudio) Community](https://community.rstudio.com/), [Bioconductor Support](https://support.bioconductor.org/), Stack Overflow, Twitter [#rstats](https://twitter.com/search?q=%23rstats), [#rtistry](https://twitter.com/search?q=%23rtistry), and [Reddit](https://www.reddit.com/r/rstats/).\n\n### Free Software\n\nA major advantage that R has over many other statistical packages is that it's free in the sense of free software (it's also free in the sense of free beer). 
The copyright for the primary source code for R is held by the [R Foundation](http://www.r-project.org/foundation/) and is published under the [GNU General Public License version 2.0](http://www.gnu.org/licenses/gpl-2.0.html).\n\nAccording to the Free Software Foundation, with *free software*, you are granted the following [four freedoms](http://www.gnu.org/philosophy/free-sw.html)\n\n- The freedom to run the program, for any purpose (freedom 0).\n\n- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.\n\n- The freedom to redistribute copies so you can help your neighbor (freedom 2).\n\n- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.\n\n::: callout-tip\nYou can visit the [Free Software Foundation's web site](http://www.fsf.org) to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and [Stallman's personal web site](https://stallman.org) is an interesting read if you happen to have some spare time.\n:::\n\n### Design of the R System\n\nThe primary R system is available from the [Comprehensive R Archive Network](http://cran.r-project.org), also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.\n\nThe R system is divided into 2 conceptual parts:\n\n1. The \"base\" R system that you download from CRAN:\n\n- [Linux](http://cran.r-project.org/bin/linux/)\n- [Windows](http://cran.r-project.org/bin/windows/)\n- [Mac](http://cran.r-project.org/bin/macosx/)\n\n2. 
Everything else.\n\nR functionality is divided into a number of *packages*.\n\n- The \"base\" R system contains, among other things, the `base` package which is required to run R and contains the most fundamental functions.\n\n- The other packages contained in the \"base\" system include `utils`, `stats`, `datasets`, `graphics`, `grDevices`, `grid`, `methods`, `tools`, `parallel`, `compiler`, `splines`, `tcltk`, `stats4`.\n\n- There are also \"Recommended\" packages: `boot`, `class`, `cluster`, `codetools`, `foreign`, `KernSmooth`, `lattice`, `mgcv`, `nlme`, `rpart`, `survival`, `MASS`, `spatial`, `nnet`, `Matrix`.\n\nWhen you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:\n\n- There are over 21,163 packages on CRAN that have been developed by users and programmers around the world.\n\n- There are also over 2,300 packages associated with the [Bioconductor project](http://bioconductor.org).\n\n- People often make packages available on GitHub (very common) and their personal websites (not so common nowadays); there is no reliable way to keep track of how many packages are available in this fashion.\n\n![Slide from 2012 by Roger D. Peng](/images/140.776_2012_websites.png){fig-alt=\"Screenshot from a slide from 2012 showing that R developers share their packages on personal websites\" fig-align=\"center\"}\n\n::: callout-note\n## Questions\n\n1. How many R packages are on CRAN today?\n2. How many R packages are on Bioconductor today?\n3. How many R packages are on GitHub today?\n:::\n\nWant to learn more about Bioconductor? Check this video:\n\n\n\n### Limitations of R\n\nNo programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on **almost 50 year old technology**, going back to the original S system developed at Bell Labs. 
There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the \"old days\").\n\nAnother commonly cited limitation of R is that **objects must generally be stored in physical memory** (though this is increasingly not true anymore). This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.\n\nAt a higher level one \"limitation\" of R is that **its functionality is based on consumer demand and (voluntary) user contributions**. If no one feels like implementing your favorite method, then it's *your* job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.\n\n# Using R and RStudio\n\n> If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. 
--- *Nicholas Tierney*\n\n\\[[Source](https://rmd4sci.njtierney.com)\\]\n\nThe RStudio layout has the following features:\n\n- On the upper left, something called a Rmarkdown script\n- On the lower left, the R console\n- On the lower right, the view for files, plots, packages, help, and viewer.\n- On the upper right, the environment / history pane\n\n![A screenshot of the RStudio integrated developer environment (IDE) -- aka the working environment](https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png)\n\nThe R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we'll learn about these files later).\n\nThe file/plot/pkg viewer is a handy browser for your current files, like Finder, or File Explorer, plots are where your plots appear, you can view packages, see the help files. And the environment / history pane contains the list of things you have created, and the past commands that you have run.\n\n### Installing R and RStudio\n\n- If you have not already, [install R first](http://cran.r-project.org). If you already have R installed, make sure it is a fairly recent version, version 4.4.1 or newer. If yours is older that version 4.3.0, I suggest you update (install a new R version).\n- Once you have R installed, install the free version of [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/). Again, make sure it's a recent version.\n\n::: callout-tip\nInstalling R and RStudio should be fairly straightforward. 
However, a great set of detailed instructions is in Rafael Irizarry's `dsbook`\n\n- \n:::\n\nIf things don't work, ask for help in the Courseplus discussion board.\n\nI have both a macOS and a winOS computer, and have used Linux (Ubuntu) in the past too, but I might be more limited in how much I can help you on Linux.\n\n### RStudio default options\n\nTo first get set up, I highly recommend changing the following setting\n\nTools \\> Global Options (or `Cmd + ,` on macOS)\n\nUnder the **General** tab:\n\n- For **workspace**\n - Uncheck restore .RData into workspace at startup\n - Save workspace to .RData on exit : \"Never\"\n- For **History**\n - Uncheck \"Always save history (even when not saving .RData)\n - Uncheck \"Remove duplicate entries in history\"\n\nThis means that you won't save the objects and other things that you create in your R session and reload them. This is important for two reasons\n\n1. **Reproducibility**: you don't want to have objects from last week cluttering your session\n2. **Privacy**: you don't want to save private data or other things to your session. You only want to read these in.\n\nYour \"history\" is the commands that you have entered into R.\n\nAdditionally, not saving your history means that you won't be relying on things that you typed in the last session, which is a good habit to get into!\n\n### Installing and loading R packages\n\nAs we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.\n\nThe \"official\" place for packages is the [CRAN website](https://cran.r-project.org/web/packages/available_packages_by_name.html). 
If you are interested in packages on a specific topic, the [CRAN task views](http://cran.r-project.org/web/views/) provide curated descriptions of packages sorted by topic.\n\nTo install an R package from CRAN, one can simply call the `install.packages()` function and pass the name of the package as an argument. For example, to install the `ggplot2` package from CRAN: open RStudio,go to the R prompt (the `>` symbol) in the lower-left corner and type\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n\n## Below is an example for installing more than one package at a time:\n\n## Install R packages for project 0\ninstall.packages(\n c(\"postcards\", \"usethis\", \"gitcreds\")\n)\n```\n:::\n\n\nand the appropriate version of the package will be installed.\n\nOften, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.\n\n::: callout-note\n## Questions\n\n1. As you installed the `ggplot2` package, what other packages were installed?\n2. What happens if you tried to install `GGplot2`?\n:::\n\nIt could be that you already have all packages required by `ggplot2` installed. In that case, you will not see any other packages installed. To see which of the packages above `ggplot2` needs (and thus installs if it is not present), type into the R console:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntools::package_dependencies(\"ggplot2\")\n```\n:::\n\n\nIn RStudio, you can also install (and update/remove) packages by clicking on the 'Packages' tab in the bottom right window.\n\nIt is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. 
Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can often be buggy and not working right. To install packages from GitHub, you need to install the `remotes` package and then use the following function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github()\n```\n:::\n\n\nWe will not do that now, but it is quite likely that at one point later in this course we will.\n\nYou only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the `library()` command. For instance to load the `ggplot2` package, type\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"ggplot2\")\n```\n:::\n\n\nYou may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.\n\nThis was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.\n\n### Getting started in RStudio\n\nWhile one can use R and do pretty much every task, including all the ones we cover in this class, without using RStudio, RStudio is very useful, has lots of features that make your R coding life easier and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. 
A good resource to learn more about RStudio are the [R Studio Essentials](https://resources.rstudio.com/) collection of videos.\n\n::: callout-tip\nFor more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the `dsbook`:\n\n- \n\nThis chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n## Questions\n\n1. If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the \"Four Freedoms\" of Free Software?\n\n2. What is an R package and what is it used for?\n\n3. What function in R can be used to install packages from CRAN?\n\n4. What is a limitation of the current R system?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [R for Data Science (2e)](https://r4ds.hadley.nz/) by Wickham & Grolemund (2017, 2e is from July 18th 2023). Covers most of the basics of using R for data analysis.\n\n- [Advanced R](https://adv-r.hadley.nz) by Wickham (2014). 
Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.\n\n- [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/rstudio-ide.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-08-27\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", + "markdown": "---\ntitle: \"02 - Introduction to R and RStudio!\"\nauthor:\n - name: Leonardo Collado Torres\n 
url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Let's dig into the R programming language and the RStudio integrated developer environment\"\nimage: https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png\ncategories: [module 1, week 1, R, programming, RStudio]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/02-introduction-to-r-and-rstudio/index.qmd).*\n\n> There are only two kinds of languages: the ones people complain about and the ones nobody uses. ---*Bjarne Stroustrup*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [An overview and history of R](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger Peng\n2. [Installing R and RStudio](https://rafalab.github.io/dsbook/installing-r-rstudio.html) from Rafael Irizarry\n3. 
[Getting Started in R and RStudio](https://rafalab.github.io/dsbook/getting-started.html) from Rafael Irizarry\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Learn about (some of) the history of R.\n- Identify some of the strengths and weaknesses of R.\n- Install R and Rstudio on your computer.\n- Know how to install and load R packages.\n:::\n\n# Overview and history of R\n\nBelow is a very quick introduction to R, to get you set up and running. We'll go deeper into R and coding later.\n\n### tl;dr (R in a nutshell)\n\nLike every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:\n\n- R is open-source, freely accessible, and cross-platform (multiple OS).\n- R is a [\"high-level\" programming language](https://en.wikipedia.org/wiki/High-level_programming_language), relatively easy to learn.\n - While \"Low-level\" programming languages (e.g. 
Fortran, C, etc) often have more efficient code, they can also be harder to learn because it is designed to be close to a machine language.\n - In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.\n- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!\n- R integrates easily with document preparation systems like $\\LaTeX$, but R files can also be used to create `.docx`, `.pdf`, `.html`, `.ppt` files with integrated R code output and graphics.\n- The R Community is very dynamic, helpful and welcoming.\n - Check out the [#rstats](https://twitter.com/search?q=%23rstats) or [#rtistry](https://twitter.com/search?q=%23rtistry) on Twitter, [TidyTuesday](https://www.tidytuesday.com) podcast and community activity in the [R4DS Online Learning Community](https://www.rfordatasci.com), and [r/rstats](https://www.reddit.com/r/rstats/) subreddit.\n - If you are looking for more local resources, check out [R-Ladies Baltimore](https://www.meetup.com/rladies-baltimore/).\n- Through R packages, it is easy to get lots of state-of-the-art algorithms.\n- Documentation and help files for R are generally good.\n\nWhile we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).\n\nDepending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. 
So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.\n\nWith the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.\n\n![Artwork by Allison Horst on learning R](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/r_first_then.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Basic Features of R\n\nToday R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.\n\nOne nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically at the end of April, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed. The frequent releases and regular release cycle indicates active development of the software and ensures that bugs will be addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new feature, bug fixes, or both.\n\nAnother key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R's ability to create \"publication quality\" graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. 
R's *base* graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like `lattice` (not as used nowadays) and `ggplot2` (very widely used now) allow for complex and sophisticated visualizations of high-dimensional data.\n\nR has maintained the original S philosophy (see box below), which is that **it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools**. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.\n\n::: callout-tip\nFor a great discussion on an overview and history of R and the S programming language, read through [this chapter](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger D. Peng.\n:::\n\nFinally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like GitHub, [Posit (RStudio) Community](https://community.rstudio.com/), [Bioconductor Support](https://support.bioconductor.org/), Stack Overflow, Twitter [#rstats](https://twitter.com/search?q=%23rstats), [#rtistry](https://twitter.com/search?q=%23rtistry), and [Reddit](https://www.reddit.com/r/rstats/).\n\n### Free Software\n\nA major advantage that R has over many other statistical packages and is that it's free in the sense of free software (it's also free in the sense of free beer). 
The copyright for the primary source code for R is held by the [R Foundation](http://www.r-project.org/foundation/) and is published under the [GNU General Public License version 2.0](http://www.gnu.org/licenses/gpl-2.0.html).\n\nAccording to the Free Software Foundation, with *free software*, you are granted the following [four freedoms](http://www.gnu.org/philosophy/free-sw.html)\n\n- The freedom to run the program, for any purpose (freedom 0).\n\n- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.\n\n- The freedom to redistribute copies so you can help your neighbor (freedom 2).\n\n- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.\n\n::: callout-tip\nYou can visit the [Free Software Foundation's web site](http://www.fsf.org) to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and [Stallman's personal web site](https://stallman.org) is an interesting read if you happen to have some spare time.\n:::\n\n### Design of the R System\n\nThe primary R system is available from the [Comprehensive R Archive Network](http://cran.r-project.org), also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.\n\nThe R system is divided into 2 conceptual parts:\n\n1. The \"base\" R system that you download from CRAN:\n\n- [Linux](http://cran.r-project.org/bin/linux/)\n- [Windows](http://cran.r-project.org/bin/windows/)\n- [Mac](http://cran.r-project.org/bin/macosx/)\n\n2. 
Everything else.\n\nR functionality is divided into a number of *packages*.\n\n- The \"base\" R system contains, among other things, the `base` package which is required to run R and contains the most fundamental functions.\n\n- The other packages contained in the \"base\" system include `utils`, `stats`, `datasets`, `graphics`, `grDevices`, `grid`, `methods`, `tools`, `parallel`, `compiler`, `splines`, `tcltk`, `stats4`.\n\n- There are also \"Recommended\" packages: `boot`, `class`, `cluster`, `codetools`, `foreign`, `KernSmooth`, `lattice`, `mgcv`, `nlme`, `rpart`, `survival`, `MASS`, `spatial`, `nnet`, `Matrix`.\n\nWhen you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:\n\n- There are over 21,163 packages on CRAN that have been developed by users and programmers around the world.\n\n- There are also over 2,300 packages associated with the [Bioconductor project](http://bioconductor.org).\n\n- People often make packages available on GitHub (very common) and their personal websites (not so common nowadays); there is no reliable way to keep track of how many packages are available in this fashion.\n\n![Slide from 2012 by Roger D. Peng](/images/140.776_2012_websites.png){fig-alt=\"Screenshot from a slide from 2012 showing that R developers share their packages on personal websites\" fig-align=\"center\"}\n\n::: callout-note\n## Questions\n\n1. How many R packages are on CRAN today?\n2. How many R packages are on Bioconductor today?\n3. How many R packages are on GitHub today?\n:::\n\nWant to learn more about Bioconductor? Check this video:\n\n\n\n### Limitations of R\n\nNo programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on **almost 50 year old technology**, going back to the original S system developed at Bell Labs. 
There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the \"old days\").\n\nAnother commonly cited limitation of R is that **objects must generally be stored in physical memory** (though this is increasingly not true anymore). This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.\n\nAt a higher level one \"limitation\" of R is that **its functionality is based on consumer demand and (voluntary) user contributions**. If no one feels like implementing your favorite method, then it's *your* job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.\n\n# Using R and RStudio\n\n> If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. 
--- *Nicholas Tierney*\n\n\\[[Source](https://rmd4sci.njtierney.com)\\]\n\nThe RStudio layout has the following features:\n\n- On the upper left, something called a Rmarkdown script\n- On the lower left, the R console\n- On the lower right, the view for files, plots, packages, help, and viewer.\n- On the upper right, the environment / history pane\n\n![A screenshot of the RStudio integrated developer environment (IDE) -- aka the working environment](https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png)\n\nThe R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we'll learn about these files later).\n\nThe file/plot/pkg viewer is a handy browser for your current files, like Finder, or File Explorer, plots are where your plots appear, you can view packages, see the help files. And the environment / history pane contains the list of things you have created, and the past commands that you have run.\n\n### Installing R and RStudio\n\n- If you have not already, [install R first](http://cran.r-project.org). If you already have R installed, make sure it is a fairly recent version, version 4.4.1 or newer. If yours is older that version 4.3.0, I suggest you update (install a new R version).\n- Once you have R installed, install the free version of [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/). Again, make sure it's a recent version.\n\n::: callout-tip\nInstalling R and RStudio should be fairly straightforward. 
However, a great set of detailed instructions is in Rafael Irizarry's `dsbook`\n\n- \n:::\n\nIf things don't work, ask for help in the Courseplus discussion board.\n\nI have both a macOS and a winOS computer, and have used Linux (Ubuntu) in the past too, but I might be more limited in how much I can help you on Linux.\n\n### RStudio default options\n\nTo first get set up, I highly recommend changing the following setting\n\nTools \\> Global Options (or `Cmd + ,` on macOS)\n\nUnder the **General** tab:\n\n- For **workspace**\n - Uncheck restore .RData into workspace at startup\n - Save workspace to .RData on exit : \"Never\"\n- For **History**\n - Uncheck \"Always save history (even when not saving .RData)\n - Uncheck \"Remove duplicate entries in history\"\n\nThis means that you won't save the objects and other things that you create in your R session and reload them. This is important for two reasons\n\n1. **Reproducibility**: you don't want to have objects from last week cluttering your session\n2. **Privacy**: you don't want to save private data or other things to your session. You only want to read these in.\n\nYour \"history\" is the commands that you have entered into R.\n\nAdditionally, not saving your history means that you won't be relying on things that you typed in the last session, which is a good habit to get into!\n\n### Installing and loading R packages\n\nAs we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.\n\nThe \"official\" place for packages is the [CRAN website](https://cran.r-project.org/web/packages/available_packages_by_name.html). 
If you are interested in packages on a specific topic, the [CRAN task views](http://cran.r-project.org/web/views/) provide curated descriptions of packages sorted by topic.\n\nTo install an R package from CRAN, one can simply call the `install.packages()` function and pass the name of the package as an argument. For example, to install the `ggplot2` package from CRAN: open RStudio,go to the R prompt (the `>` symbol) in the lower-left corner and type\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n\n## Below is an example for installing more than one package at a time:\n\n## Install R packages for project 0\ninstall.packages(\n c(\"postcards\", \"usethis\", \"gitcreds\")\n)\n```\n:::\n\n\nand the appropriate version of the package will be installed.\n\nOften, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.\n\n::: callout-note\n## Questions\n\n1. As you installed the `ggplot2` package, what other packages were installed?\n2. What happens if you tried to install `GGplot2`?\n:::\n\nIt could be that you already have all packages required by `ggplot2` installed. In that case, you will not see any other packages installed. To see which of the packages above `ggplot2` needs (and thus installs if it is not present), type into the R console:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntools::package_dependencies(\"ggplot2\")\n```\n:::\n\n\nIn RStudio, you can also install (and update/remove) packages by clicking on the 'Packages' tab in the bottom right window.\n\nIt is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. 
Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can often be buggy and not working right. To install packages from GitHub, you need to install the `remotes` package and then use the following function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github()\n```\n:::\n\n\nWe will not do that now, but it is quite likely that at one point later in this course we will.\n\nYou only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the `library()` command. For instance to load the `ggplot2` package, type\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"ggplot2\")\n```\n:::\n\n\nYou may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.\n\nThis was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.\n\n### Getting started in RStudio\n\nWhile one can use R and do pretty much every task, including all the ones we cover in this class, without using RStudio, RStudio is very useful, has lots of features that make your R coding life easier and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. 
A good resource to learn more about RStudio are the [R Studio Essentials](https://resources.rstudio.com/) collection of videos.\n\n::: callout-tip\nFor more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the `dsbook`:\n\n- \n\nThis chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.\n:::\n\n# Post-lecture materials\n\n### Links to inspiring use cases\n\n- [RStudio Conf 2019](https://www.rstudio.com/resources/rstudioconf-2019/) talk by [Brooke Watson Madubuonwu](https://x.com/brookLYNevery1) on \"[R at the ACLU: Joining tables to to reunite families](https://posit.co/resources/videos/r-at-the-aclu-joining-tables-to-to-reunite-families/)\"\n\n- [RStudio Conf 2020](https://resources.rstudio.com/resources/rstudioconf-2020/) talk by [María Teresa Ortíz Mancera](https://x.com/TeresaOM) on \"[Mexican electoral quick count night with R](https://posit.co/resources/videos/mexican-electoral-quick-count-night-with-r-maria-ortiz-mancera/)\"\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n## Questions\n\n1. If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the \"Four Freedoms\" of Free Software?\n\n2. What is an R package and what is it used for?\n\n3. What function in R can be used to install packages from CRAN?\n\n4. What is a limitation of the current R system?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [R for Data Science (2e)](https://r4ds.hadley.nz/) by Wickham & Grolemund (2017, 2e is from July 18th 2023). Covers most of the basics of using R for data analysis.\n\n- [Advanced R](https://adv-r.hadley.nz) by Wickham (2014). 
Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.\n\n- [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/rstudio-ide.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-08-27\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/posts/01-welcome/index.qmd 
b/posts/01-welcome/index.qmd index b682585..0940d14 100644 --- a/posts/01-welcome/index.qmd +++ b/posts/01-welcome/index.qmd @@ -18,10 +18,25 @@ categories: [course-admin, module 1, week 1] Welcome! I am very excited to have you in our one-term (i.e. half a semester) course on Statistical Computing course number (140.776) offered by the [Department of Biostatistics](https://publichealth.jhu.edu/departments/biostatistics) at the [Johns Hopkins Bloomberg School of Public Health](https://publichealth.jhu.edu). - + + + + +```{=html} + +``` This course is designed for ScM and PhD students at Johns Hopkins Bloomberg School of Public Health. I am pretty flexible about permitting outside students, but I want everyone to be aware of the goals and assumptions so no one feels like they are surprised by how the class works. ::: callout-note @@ -82,7 +97,7 @@ Some resources that may be useful if you feel you may be missing pieces of this ### Getting set up -You must install [R](https://cran.r-project.org) and [RStudio](https://rstudio.com) on your computing environment in order to complete this course. +You must install [R](https://cran.r-project.org) and [RStudio](https://rstudio.com) (available from [Posit](https://posit.co/)) on your computing environment in order to complete this course. 
These are two **different** applications that must be installed separately before they can be used together: diff --git a/posts/02-introduction-to-r-and-rstudio/index.qmd b/posts/02-introduction-to-r-and-rstudio/index.qmd index 1b8c966..8c63b18 100644 --- a/posts/02-introduction-to-r-and-rstudio/index.qmd +++ b/posts/02-introduction-to-r-and-rstudio/index.qmd @@ -310,6 +310,12 @@ This chapter gives some tips, shortcuts, and ideas that might be of interest eve # Post-lecture materials +### Links to inspiring use cases + +- [RStudio Conf 2019](https://www.rstudio.com/resources/rstudioconf-2019/) talk by [Brooke Watson Madubuonwu](https://x.com/brookLYNevery1) on "[R at the ACLU: Joining tables to to reunite families](https://posit.co/resources/videos/r-at-the-aclu-joining-tables-to-to-reunite-families/)" + +- [RStudio Conf 2020](https://resources.rstudio.com/resources/rstudioconf-2020/) talk by [María Teresa Ortíz Mancera](https://x.com/TeresaOM) on "[Mexican electoral quick count night with R](https://posit.co/resources/videos/mexican-electoral-quick-count-night-with-r-maria-ortiz-mancera/)" + ### Final Questions Here are some post-lecture questions to help you think about the material discussed.