diff --git a/_freeze/posts/25-python-for-r-users/index/execute-results/html.json b/_freeze/posts/25-python-for-r-users/index/execute-results/html.json index 705176d..cbba398 100644 --- a/_freeze/posts/25-python-for-r-users/index/execute-results/html.json +++ b/_freeze/posts/25-python-for-r-users/index/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "039cc021ce1a179f9f83e00a467f7b3e", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"25 - Python for R Users\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to using Python in R and the reticulate package\"\ncategories: [week 8, module 6, python, R, programming]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/25-python-for-r-users/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. [The Python Tutorial](https://docs.python.org/3/tutorial)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n1. Install the `reticulate` R package on your machine (I'm assuming you have python installed already)\n2. Learn about `reticulate` to work interoperability between Python and R\n3. 
Be able to translate between R and Python objects\n:::\n\n# Python for R Users\n\nAs the number of computational and statistical methods for the analysis data continue to increase, you will find many will be implemented in other languages.\n\nOften **Python is the language of choice**.\n\nPython is incredibly powerful and I increasingly interact with it on very frequent basis these days. To be able to leverage software tools implemented in Python, today I am giving an overview of using Python from the perspective of an R user.\n\n## Overview\n\nFor this lecture, we will be using the [`reticulate` R package](https://rstudio.github.io/reticulate), which provides a set of tools for interoperability between Python and R. The package includes facilities for:\n\n- **Calling Python from R** in a variety of ways including (i) R Markdown, (ii) sourcing Python scripts, (iii) importing Python modules, and (iv) using Python interactively within an R session.\n- **Translation between R and Python objects** (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).\n\n![](https://rstudio.github.io/reticulate/images/reticulated_python.png){preview=\"TRUE\"}\n\n\\[**Source**: [Rstudio](https://rstudio.github.io/reticulate/index.html)\\]\n\n::: callout-tip\n### Pro-tip for installing python\n\n**Installing python**: If you would like recommendations on installing python, I like these resources:\n\n- Py Pkgs: \n- Using conda environments with mini-forge: \n- from `reticulate`: \n\n**What's happening under the hood?**: `reticulate` embeds a Python session within your R session, enabling seamless, high-performance interoperability.\n\nIf you are an R developer that uses Python for some of your work or a member of data science team that uses both languages, `reticulate` can make your life better!\n\n- If you make an R package with Python dependencies, you might want to use `basilisk` \n:::\n\n## Install `reticulate`\n\nLet's try it out. 
Before we get started, you will need to install the packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"reticulate\")\n```\n:::\n\n\nWe will also load the `here` and `tidyverse` packages for our lesson:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\nlibrary(tidyverse)\nlibrary(reticulate)\n```\n:::\n\n\n## python path\n\nIf python is not installed on your computer, you can use the `install_python()` function from `reticulate` to install it.\n\n- \n\nIf python is already installed, by default, `reticulate` uses the version of Python found on your `PATH`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nSys.which(\"python3\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n python3 \n\"/usr/bin/python3\" \n```\n\n\n:::\n:::\n\n\nThe `use_python()` function enables you to specify an alternate version, for example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nuse_python(\"/usr///local/bin/python\")\n```\n:::\n\n\nFor example, I can define the path explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nuse_python(\"/opt/homebrew/Caskroom/miniforge/base/bin/python\")\n```\n:::\n\n\nYou can confirm that `reticulate` is using the correct version of python that you requested using the `py_discover_config` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npy_discover_config()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\npython: /usr/bin/python3\nlibpython: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/config-3.9-darwin/libpython3.9.dylib\npythonhome: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9:/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9\nversion: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]\nnumpy: /Users/leocollado/Library/Python/3.9/lib/python/site-packages/numpy\nnumpy_version: 2.0.1\n\nNOTE: Python version was forced by PATH\n```\n\n\n:::\n:::\n\n\n## Calling Python in R\n\nThere are a 
variety of ways to integrate Python code into your R projects:\n\n1. **Python in R Markdown** --- A new Python language engine for R Markdown that supports bi-directional communication between R and Python (R chunks can access Python objects and vice-versa).\n\n2. **Importing Python modules** --- The `import()` function enables you to import any Python module and call its functions directly from R.\n\n3. **Sourcing Python scripts** --- The `source_python()` function enables you to source a Python script the same way you would `source()` an R script (Python functions and objects defined within the script become directly available to the R session).\n\n4. **Python REPL** --- The `repl_python()` function creates an interactive Python console within R. Objects you create within Python are available to your R session (and vice-versa).\n\nBelow I will focus on introducing the first and last one. However, before we do that, let's introduce a bit about Python basics.\n\n# Python basics\n\nPython is a **high-level**, **object-oriented programming** language useful to know for anyone analyzing data.\n\nThe most important thing to know before learning Python, is that in Python, **everything is an object**.\n\n- There is no compiling and no need to define the type of variables before using them.\n- No need to allocate memory for variables.\n- The code is easy to learn and easy to read (syntax).\n\nThere is a large scientific community contributing to Python. Some of the most widely used libraries in Python are `numpy`, `scipy`, `pandas`, and `matplotlib`.\n\n## start python\n\nThere are two modes you can write Python code in: **interactive mode** or **script mode**. If you open up a UNIX command window and have a command-line interface, you can simply type `python` (or `python3`) in the shell:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npython3\n```\n:::\n\n\nand the **interactive mode** will open up. 
You can write code in the interactive mode and Python will *interpret* the code using the **python interpreter**.\n\nAnother way to pass code to Python is to store code in a file ending in `.py`, and execute the file in the **script mode** using\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npython3 myscript.py\n```\n:::\n\n\nTo check what version of Python you are using, type the following in the shell:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npython3 --version\n```\n:::\n\n\n## R or python via terminal\n\n(Demo in class)\n\n## objects in python\n\nEverything in Python is an object. Think of an object as a data structure that contains both data as well as functions. These objects can be variables, functions, and modules which are all objects. We can operate on these objects with what are called **operators** (e.g. addition, subtraction, concatenation or other operations), define/apply functions, test/apply for conditionals statements, (e.g. `if`, `else` statements) or iterate over the objects.\n\nNot all objects are required to have **attributes** and **methods** to operate on the objects in Python, but **everything is an object** (i.e. all objects can be assigned to a variable or passed as an argument to a function). A user can work with built-in defined classes of objects or can create new classes of objects. 
Using these objects, a user can perform operations on the objects by modifying / interacting with them.\n\n## variables\n\nVariable names are case sensitive, can contain numbers and letters, can contain underscores, cannot begin with a number, cannot contain illegal characters and cannot be one of the 31 keywords in Python:\n\n> \"and, as, assert, break, class, continue, def, del, elif, else, except, exec, finally, for, from, global, if, import, in, is, lambda, not, or, pass, print, raise, return, try, while, with, yield\"\n\n## operators\n\n- Numeric operators are `+`, `-`, `*`, `/`, `**` (exponent), `%` (modulus if applied to integers)\n- String and list operators: `+` and `*` .\n- Assignment operator: `=`\n- The augmented assignment operator `+=` (or `-=`) can be used like `n += x` which is equal to `n = n + x`\n- Boolean relational operators: `==` (equal), `!=` (not equal), `>`, `<`, `>=` (greater than or equal to), `<=` (less than or equal to)\n- Boolean expressions will produce `True` or `False`\n- Logical operators: `and`, `or`, and `not`. e.g. `x > 1 and x <= 5`\n\n\n::: {.cell}\n\n```{.python .cell-code}\n2 ** 3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n8\n```\n\n\n:::\n\n```{.python .cell-code}\nx = 3 \nx > 1 and x <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTrue\n```\n\n\n:::\n:::\n\n\nAnd in R, the execution changes from Python to R seamlessly\n\n\n::: {.cell}\n\n```{.r .cell-code}\n2^3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 8\n```\n\n\n:::\n\n```{.r .cell-code}\nx <- 3\nx > 1 & x <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n## format operators\n\nIf `%` is applied to strings, this operator is the **format operator**. It tells Python how to format a list of values in a string. 
For example,\n\n- `%d` says to format the value as an integer\n- `%g` says to format the value as an float\n- `%s` says to format the value as an string\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint('In %d days, I have eaten %g %s.' % (5, 3.5, 'cupcakes'))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nIn 5 days, I have eaten 3.5 cupcakes.\n```\n\n\n:::\n:::\n\n\n## functions\n\nPython contains a small list of very useful **built-in functions**.\n\nAll other functions need defined by the user or need to be imported from modules.\n\n::: callout-tip\n### Pro-tip\n\nFor a more detailed list on the built-in functions in Python, see [Built-in Python Functions](https://docs.python.org/2/library/functions.html).\n:::\n\nThe first function we will discuss, `type()`, reports the type of any object, which is very useful when handling multiple data types (remember, everything in Python is an object). Here are some the main types you will encounter:\n\n- integer (`int`)\n- floating-point (`float`)\n- string (`str`)\n- list (`list`)\n- dictionary (`dict`)\n- tuple (`tuple`)\n- function (`function`)\n- module (`module`)\n- boolean (`bool`): e.g. True, False\n- enumerate (`enumerate`)\n\nIf we asked for the type of a string \"Let's go Ravens!\"\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(\"Let's go Ravens!\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n```\n\n\n:::\n:::\n\n\nThis would return the `str` type.\n\nYou have also seen how to use the `print()` function. 
The function print will accept an argument and print the argument to the screen.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(\"Let's go Ravens!\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLet's go Ravens!\n```\n\n\n:::\n:::\n\n\n## new functions\n\nNew functions can be `def`ined using one of the 31 keywords in Python: `def`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef new_world(): \n return 'Hello world!'\n \nprint(new_world())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nHello world!\n```\n\n\n:::\n:::\n\n\nThe first line of the function (the header) must start with `def`, the name of the function (which can contain underscores), parentheses (with any arguments inside of it) and a colon. The arguments can be specified in any order.\n\nThe rest of the function (the body) always has an indentation of four spaces. If you define a function in the interactive mode, the interpreter will print ellipses (`...`) to let you know the function is not complete. To complete the function, enter an empty line (not necessary in a script).\n\nTo return a value from a function, use `return`. 
The function will immediately terminate and not run any code written past this point.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef squared(x):\n \"\"\" Return the square of a \n value \"\"\"\n return x ** 2\n\nprint(squared(4))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n16\n```\n\n\n:::\n:::\n\n\n::: callout-tip\n### Note\n\npython has its version of `...` (also from docs.python.org)\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef concat(*args, sep=\"/\"):\n return sep.join(args) \n\nconcat(\"a\", \"b\", \"c\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'a/b/c'\n```\n\n\n:::\n:::\n\n:::\n\n## iteration\n\n**Iterative loops** can be written with the `for`, `while` and `break` statements.\n\nDefining a **`for` loop** is similar to defining a new function.\n\n- The header ends with a colon and the body is indented.\n- The function `range(n)` takes in an integer `n` and creates a set of values from `0` to `n - 1`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfor i in range(3):\n print('Baby shark, doo doo doo doo doo doo!')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nBaby shark, doo doo doo doo doo doo!\nBaby shark, doo doo doo doo doo doo!\nBaby shark, doo doo doo doo doo doo!\n```\n\n\n:::\n\n```{.python .cell-code}\nprint('Baby shark!')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nBaby shark!\n```\n\n\n:::\n:::\n\n\n`for` loops are not just for counters, but they can iterate through many types of objects such as strings, lists and dictionaries.\n\nThe **function `len()`** can be used to:\n\n- Calculate the length of a string\n- Calculate the number of elements in a list\n- Calculate the number of items (key-value pairs) in a dictionary\n- Calculate the number elements in the tuple\n\n\n::: {.cell}\n\n```{.python .cell-code}\nx = 'Baby shark!'\nlen(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n11\n```\n\n\n:::\n:::\n\n\n## methods for each type of object (dot notation)\n\nFor strings, lists and 
dictionaries, there are set of methods you can use to manipulate the objects. In general, the notation for methods is the **dot notation**.\n\nThe syntax is the **name of the object** followed by a **dot** (or period) followed by the **name of the method**.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nx = \"Hello Baltimore!\"\nx.split()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['Hello', 'Baltimore!']\n```\n\n\n:::\n:::\n\n\n## Data structures\n\nWe have already seen lists. Python has other **data structures** built in.\n\n- Sets `{\"a\", \"a\", \"a\", \"b\"}` (unique elements)\n- Tuples `(1, 2, 3)` (a lot like lists but not mutable, i.e. need to create a new to modify)\n- Dictionaries\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndict = {\"a\" : 1, \"b\" : 2}\ndict['a']\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1\n```\n\n\n:::\n\n```{.python .cell-code}\ndict['b']\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2\n```\n\n\n:::\n:::\n\n\nMore about data structures can be founds [at the python docs](https://docs.python.org/3/tutorial/datastructures.html)\n\n# `reticulate`\n\n## Python engine within R Markdown\n\nThe `reticulate` package includes a Python engine for R Markdown with the following features:\n\n1. Run **Python chunks in a single Python session embedded within your R session** (shared variables/state between Python chunks)\n\n2. **Printing of Python output**, including graphical output from `matplotlib`.\n\n3. **Access to objects created within Python chunks from R** using the `py` object (e.g. `py$x` would access an `x` variable created within Python from R).\n\n4. **Access to objects created within R chunks from Python** using the `r` object (e.g. 
`r.x` would access to `x` variable created within R from Python)\n\n::: callout-tip\n### Conversions\n\nBuilt in conversion for many Python object types is provided, including [NumPy](https://numpy.org) arrays and [Pandas](https://pandas.pydata.org) data frames.\n:::\n\n## From Python to R\n\nAs an example, you can use Pandas to read and manipulate data then easily plot the Pandas data frame using `ggplot2`:\n\nLet's first create a `flights.csv` dataset in R and save it using `write_csv` from `readr`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# checks to see if a folder called \"data\" exists; if not, it installs it\nif (!file.exists(here(\"data\"))) {\n dir.create(here(\"data\"))\n}\n\n# checks to see if a file called \"flights.csv\" exists; if not, it saves it to the data folder\nif (!file.exists(here(\"data\", \"flights.csv\"))) {\n readr::write_csv(nycflights13::flights,\n file = here(\"data\", \"flights.csv\")\n )\n}\n\nnycflights13::flights %>%\n head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 19\n year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n \n1 2013 1 1 517 515 2 830 819\n2 2013 1 1 533 529 4 850 830\n3 2013 1 1 542 540 2 923 850\n4 2013 1 1 544 545 -1 1004 1022\n5 2013 1 1 554 600 -6 812 837\n6 2013 1 1 554 558 -4 740 728\n# ℹ 11 more variables: arr_delay , carrier , flight ,\n# tailnum , origin , dest , air_time , distance ,\n# hour , minute , time_hour \n```\n\n\n:::\n:::\n\n\nNext, we **use Python to read in the file** and do some data wrangling\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport pandas\nflights_path = \"/Users/leocollado/Dropbox/Code/jhustatcomputing/data/flights.csv\"\nflights = pandas.read_csv(flights_path)\nflights = flights[flights['dest'] == \"ORD\"]\nflights = flights[['carrier', 'dep_delay', 'arr_delay']]\nflights = flights.dropna()\nflights\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n carrier dep_delay arr_delay\n5 UA -4.0 12.0\n9 AA -2.0 8.0\n25 MQ 8.0 
32.0\n38 AA -1.0 14.0\n57 AA -4.0 4.0\n... ... ... ...\n336645 AA -12.0 -37.0\n336669 UA -7.0 -13.0\n336675 MQ -7.0 -11.0\n336696 B6 -5.0 -23.0\n336709 AA -13.0 -38.0\n\n[16566 rows x 3 columns]\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(py$flights)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n carrier dep_delay arr_delay\n5 UA -4 12\n9 AA -2 8\n25 MQ 8 32\n38 AA -1 14\n57 AA -4 4\n70 UA 9 20\n```\n\n\n:::\n\n```{.r .cell-code}\npy$flights_path\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing/data/flights.csv\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(py$flights)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"data.frame\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(py$flights_path)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\nNext, we can use R to **visualize the Pandas** `DataFrame`.\n\nThe data frame is **loaded in as an R object now** stored in the variable `py`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(py$flights, aes(x = carrier, y = arr_delay)) +\n geom_point() +\n geom_jitter()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-26-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `reticulate` Python engine is enabled by default within R Markdown whenever `reticulate` is installed.\n:::\n\n### From R to Python\n\nUse R to read and manipulate data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nflights <- read_csv(here(\"data\", \"flights.csv\")) %>%\n filter(dest == \"ORD\") %>%\n select(carrier, dep_delay, arr_delay) %>%\n na.omit()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 336776 Columns: 19\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): carrier, tailnum, origin, dest\ndbl (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, 
...\ndttm (1): time_hour\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\nflights\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 16,566 × 3\n carrier dep_delay arr_delay\n \n 1 UA -4 12\n 2 AA -2 8\n 3 MQ 8 32\n 4 AA -1 14\n 5 AA -4 4\n 6 UA 9 20\n 7 UA 2 21\n 8 AA -6 -12\n 9 MQ 39 49\n10 B6 -2 15\n# ℹ 16,556 more rows\n```\n\n\n:::\n:::\n\n\n### Use Python to print R dataframe\n\nIf you recall, we can **access objects created within R chunks from Python** using the `r` object (e.g. `r.x` would access to `x` variable created within R from Python).\n\nWe can then ask for the first ten rows using the `head()` function in python.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nr.flights.head(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n carrier dep_delay arr_delay\n0 UA -4.0 12.0\n1 AA -2.0 8.0\n2 MQ 8.0 32.0\n3 AA -1.0 14.0\n4 AA -4.0 4.0\n5 UA 9.0 20.0\n6 UA 2.0 21.0\n7 AA -6.0 -12.0\n8 MQ 39.0 49.0\n9 B6 -2.0 15.0\n```\n\n\n:::\n:::\n\n\n## import python modules\n\nYou can use the `import()` function to import any Python module and call it from R. 
For example, this code imports the Python `os` module in python and calls the `listdir()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nos <- import(\"os\")\nos$listdir(\".\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"index.R\" \"index.qmd\" \"index_files\" \"index.rmarkdown\"\n```\n\n\n:::\n:::\n\n\nFunctions and other data within Python modules and classes can be accessed via the `$` operator (analogous to the way you would interact with an R list, environment, or reference class).\n\nImported Python modules support code completion and inline help:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Using reticulate tab completion](https://rstudio.github.io/reticulate/images/reticulate_completion.png){fig-align='center'}\n:::\n:::\n\n\n\\[**Source**: [Rstudio](https://rstudio.github.io/reticulate)\\]\n\nSimilarly, we can import the pandas library:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npd <- import(\"pandas\")\ntest <- pd$read_csv(here(\"data\", \"flights.csv\"))\nhead(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n1 2013 1 1 517 515 2 830 819\n2 2013 1 1 533 529 4 850 830\n3 2013 1 1 542 540 2 923 850\n4 2013 1 1 544 545 -1 1004 1022\n5 2013 1 1 554 600 -6 812 837\n6 2013 1 1 554 558 -4 740 728\n arr_delay carrier flight tailnum origin dest air_time distance hour minute\n1 11 UA 1545 N14228 EWR IAH 227 1400 5 15\n2 20 UA 1714 N24211 LGA IAH 227 1416 5 29\n3 33 AA 1141 N619AA JFK MIA 160 1089 5 40\n4 -18 B6 725 N804JB JFK BQN 183 1576 5 45\n5 -25 DL 461 N668DN LGA ATL 116 762 6 0\n6 12 UA 1696 N39463 EWR ORD 150 719 5 58\n time_hour\n1 2013-01-01T10:00:00Z\n2 2013-01-01T10:00:00Z\n3 2013-01-01T10:00:00Z\n4 2013-01-01T10:00:00Z\n5 2013-01-01T11:00:00Z\n6 2013-01-01T10:00:00Z\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"data.frame\"\n```\n\n\n:::\n:::\n\n\nor the 
scikit-learn python library:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nskl_lr <- import(\"sklearn.linear_model\")\nskl_lr\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nModule(sklearn.linear_model)\n```\n\n\n:::\n:::\n\n\n## Calling python scripts\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource_python(\"secret_functions.py\")\nsubject_1 <- read_subject(\"secret_data.csv\")\n```\n:::\n\n\n## Calling the python repl\n\nIf you want to work with Python interactively you can call the `repl_python()` function, which provides a Python REPL embedded within your R session.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrepl_python()\n```\n:::\n\n\nObjects created within the Python REPL can be accessed from R using the `py` object exported from `reticulate`. For example:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Using the repl_python() function](https://rstudio.github.io/reticulate/images/python_repl.png){fig-align='center'}\n:::\n:::\n\n\n\\[**Source**: [Rstudio](https://rstudio.github.io/reticulate)\\]\n\ni.e. objects do have permenancy in R after exiting the python repl.\n\nSo typing `x = 4` in the repl will put `py$x` as 4 in R after you exit the repl.\n\nEnter exit within the Python REPL to return to the R prompt.\n\n# Community\n\n*Sharing the Recipe for rOpenSci's Unconf Ice Breaker* is a great activity you can use.\n\n\"Todos los caminos llevan a Roma\" (all roads lead to Rome)... or `R`\n\nYet, we are all unique. You might have had some privileges, you likely faced obstacles, you might have made mistakes, you likely were made to feel unwelcome at times; ultimately, you have accumulated many experiences. (Here's a bit of [my own history](https://lcolladotor.github.io/2018/11/06/a-knot-of-threads-from-cshl-to-lcg-unam-to-aldo-barrientos-to-diversity-scholarship-opportunities/)). You are the best person to help others like you. And you are not alone. 
Also, you can belong to more than one community.\n\n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n\nRUGS (R User Groups) Program from the R Consortium: . Get \\$200 to \\$1000 USD for supporting your group.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Try to use tab completion for a function.\n2. Try to install and load a different python module in R using `import()`.\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-10-16\n pandoc 3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.5.0 2024-09-20 [1] CRAN (R 4.4.1)\n bit64 4.5.2 2024-09-22 [1] CRAN (R 4.4.1)\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)\n crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)\n digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)\n dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1)\n fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)\n farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)\n ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 
4.4.0)\n glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)\n gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)\n lattice 0.22-6 2024-03-20 [1] CRAN (R 4.4.1)\n lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)\n lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)\n Matrix 1.7-0 2024-04-26 [1] CRAN (R 4.4.1)\n munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)\n nycflights13 1.0.2 2021-04-12 [1] CRAN (R 4.4.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)\n png 0.1-8 2022-11-29 [1] CRAN (R 4.4.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)\n Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.0)\n readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)\n reticulate * 1.39.0 2024-09-05 [1] CRAN (R 4.4.1)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.0)\n rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)\n stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)\n tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)\n timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)\n utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)\n vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)\n vroom 1.6.5 2023-12-05 [1] CRAN (R 4.4.0)\n withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.0)\n xfun 0.48 2024-10-03 [1] CRAN (R 4.4.1)\n 
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n─ Python configuration ───────────────────────────────────────────────────────────────────────────────────────────────\n python: /usr/bin/python3\n libpython: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/config-3.9-darwin/libpython3.9.dylib\n pythonhome: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9:/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9\n version: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]\n numpy: /Users/leocollado/Library/Python/3.9/lib/python/site-packages/numpy\n numpy_version: 2.0.1\n \n NOTE: Python version was forced by PATH\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", + "markdown": "---\ntitle: \"25 - Python for R Users\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to using Python in R and the reticulate package\"\ncategories: [week 8, module 6, python, R, programming]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/25-python-for-r-users/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. [The Python Tutorial](https://docs.python.org/3/tutorial)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n1. Install the `reticulate` R package on your machine (I'm assuming you have python installed already)\n2. Learn how `reticulate` supports interoperability between Python and R\n3. Be able to translate between R and Python objects\n:::\n\n# Python for R Users\n\nAs the number of computational and statistical methods for the analysis of data continues to increase, you will find that many are implemented in other languages.\n\nOften **Python is the language of choice**.\n\nPython is incredibly powerful and I interact with it more and more frequently these days. To be able to leverage software tools implemented in Python, today I am giving an overview of using Python from the perspective of an R user.\n\n## Overview\n\nFor this lecture, we will be using the [`reticulate` R package](https://rstudio.github.io/reticulate), which provides a set of tools for interoperability between Python and R. 
The package includes facilities for:\n\n- **Calling Python from R** in a variety of ways including (i) R Markdown, (ii) sourcing Python scripts, (iii) importing Python modules, and (iv) using Python interactively within an R session.\n- **Translation between R and Python objects** (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).\n\n![](https://rstudio.github.io/reticulate/images/reticulated_python.png){preview=\"TRUE\"}\n\n\\[**Source**: [Rstudio](https://rstudio.github.io/reticulate/index.html)\\]\n\n::: callout-tip\n### Pro-tip for installing python\n\n**Installing python**: If you would like recommendations on installing python, I like these resources:\n\n- Py Pkgs: \n- Using conda environments with mini-forge: \n- from `reticulate`: \n\n**What's happening under the hood?**: `reticulate` embeds a Python session within your R session, enabling seamless, high-performance interoperability.\n\nIf you are an R developer that uses Python for some of your work or a member of data science team that uses both languages, `reticulate` can make your life better!\n\n- If you make an R package with Python dependencies, you might want to use `basilisk` \n:::\n\n## Install `reticulate`\n\nLet's try it out. 
Before we get started, you will need to install the packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"reticulate\")\n```\n:::\n\n\nWe will also load the `here` and `tidyverse` packages for our lesson:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\nlibrary(tidyverse)\nlibrary(reticulate)\n```\n:::\n\n\n## python path\n\nIf python is not installed on your computer, you can use the `install_python()` function from `reticulate` to install it.\n\n- \n\nIf python is already installed, by default, `reticulate` uses the version of Python found on your `PATH`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nSys.which(\"python3\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n python3 \n\"/usr/bin/python3\" \n```\n\n\n:::\n:::\n\n\nThe `use_python()` function enables you to specify an alternate version, for example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nuse_python(\"/usr///local/bin/python\")\n```\n:::\n\n\nFor example, I can define the path explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nuse_python(\"/opt/homebrew/Caskroom/miniforge/base/bin/python\")\n```\n:::\n\n\nYou can confirm that `reticulate` is using the correct version of python that you requested using the `py_discover_config` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npy_discover_config()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\npython: /usr/bin/python3\nlibpython: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/config-3.9-darwin/libpython3.9.dylib\npythonhome: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9:/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9\nversion: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]\nnumpy: /Users/leocollado/Library/Python/3.9/lib/python/site-packages/numpy\nnumpy_version: 2.0.1\n\nNOTE: Python version was forced by PATH\n```\n\n\n:::\n:::\n\n\n## Calling Python in R\n\nThere are a 
variety of ways to integrate Python code into your R projects:\n\n1. **Python in R Markdown** --- A new Python language engine for R Markdown that supports bi-directional communication between R and Python (R chunks can access Python objects and vice-versa).\n\n2. **Importing Python modules** --- The `import()` function enables you to import any Python module and call its functions directly from R.\n\n3. **Sourcing Python scripts** --- The `source_python()` function enables you to source a Python script the same way you would `source()` an R script (Python functions and objects defined within the script become directly available to the R session).\n\n4. **Python REPL** --- The `repl_python()` function creates an interactive Python console within R. Objects you create within Python are available to your R session (and vice-versa).\n\nBelow I will focus on introducing the first and last one. However, before we do that, let's introduce a bit about Python basics.\n\n# Python basics\n\nPython is a **high-level**, **object-oriented programming** language useful to know for anyone analyzing data.\n\nThe most important thing to know before learning Python, is that in Python, **everything is an object**.\n\n- There is no compiling and no need to define the type of variables before using them.\n- No need to allocate memory for variables.\n- The code is easy to learn and easy to read (syntax).\n\nThere is a large scientific community contributing to Python. Some of the most widely used libraries in Python are `numpy`, `scipy`, `pandas`, and `matplotlib`.\n\n## start python\n\nThere are two modes you can write Python code in: **interactive mode** or **script mode**. If you open up a UNIX command window and have a command-line interface, you can simply type `python` (or `python3`) in the shell:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npython3\n```\n:::\n\n\nand the **interactive mode** will open up. 
You can write code in the interactive mode and Python will *interpret* the code using the **python interpreter**.\n\nAnother way to pass code to Python is to store code in a file ending in `.py`, and execute the file in the **script mode** using\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npython3 myscript.py\n```\n:::\n\n\nTo check what version of Python you are using, type the following in the shell:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npython3 --version\n```\n:::\n\n\n## R or python via terminal\n\n(Demo in class)\n\n## objects in python\n\nEverything in Python is an object. Think of an object as a data structure that contains both data and functions. Variables, functions, and modules are all objects. We can operate on these objects with **operators** (e.g. addition, subtraction, concatenation), define and apply functions, test them in conditional statements (e.g. `if`, `else`), or iterate over them.\n\nNot all objects are required to have **attributes** and **methods** to operate on the objects in Python, but **everything is an object** (i.e. all objects can be assigned to a variable or passed as an argument to a function). A user can work with built-in classes of objects or create new classes of objects. 
Using these objects, a user can perform operations on the objects by modifying / interacting with them.\n\n## variables\n\nVariable names are case sensitive, can contain letters, numbers, and underscores, cannot begin with a number, cannot contain illegal characters, and cannot be one of Python's reserved keywords. In Python 3.9 (the version used here) there are 35 of them; `exec` and `print` were keywords only in Python 2:\n\n> False, None, True, and, as, assert, async, await, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield\n\n## operators\n\n- Numeric operators are `+`, `-`, `*`, `/`, `**` (exponent), `%` (modulus if applied to integers)\n- String and list operators: `+` and `*` .\n- Assignment operator: `=`\n- The augmented assignment operator `+=` (or `-=`) can be used like `n += x` which is equal to `n = n + x`\n- Boolean relational operators: `==` (equal), `!=` (not equal), `>`, `<`, `>=` (greater than or equal to), `<=` (less than or equal to)\n- Boolean expressions will produce `True` or `False`\n- Logical operators: `and`, `or`, and `not`. e.g. `x > 1 and x <= 5`\n\n\n::: {.cell}\n\n```{.python .cell-code}\n2 ** 3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n8\n```\n\n\n:::\n\n```{.python .cell-code}\nx = 3 \nx > 1 and x <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTrue\n```\n\n\n:::\n:::\n\n\nAnd in R, the execution changes from Python to R seamlessly\n\n\n::: {.cell}\n\n```{.r .cell-code}\n2^3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 8\n```\n\n\n:::\n\n```{.r .cell-code}\nx <- 3\nx > 1 & x <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n## format operators\n\nIf `%` is applied to strings, this operator is the **format operator**. It tells Python how to format a list of values in a string. 
For example,\n\n- `%d` says to format the value as an integer\n- `%g` says to format the value as a float\n- `%s` says to format the value as a string\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint('In %d days, I have eaten %g %s.' % (5, 3.5, 'cupcakes'))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nIn 5 days, I have eaten 3.5 cupcakes.\n```\n\n\n:::\n:::\n\n\n## functions\n\nPython contains a small list of very useful **built-in functions**.\n\nAll other functions need to be defined by the user or imported from modules.\n\n::: callout-tip\n### Pro-tip\n\nFor a more detailed list of the built-in functions in Python, see [Built-in Python Functions](https://docs.python.org/3/library/functions.html).\n:::\n\nThe first function we will discuss, `type()`, reports the type of any object, which is very useful when handling multiple data types (remember, everything in Python is an object). Here are some of the main types you will encounter:\n\n- integer (`int`)\n- floating-point (`float`)\n- string (`str`)\n- list (`list`)\n- dictionary (`dict`)\n- tuple (`tuple`)\n- function (`function`)\n- module (`module`)\n- boolean (`bool`): e.g. True, False\n- enumerate (`enumerate`)\n\nIf we asked for the type of a string \"Let's go Ravens!\"\n\n\n::: {.cell}\n\n```{.python .cell-code}\ntype(\"Let's go Ravens!\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n<class 'str'>\n```\n\n\n:::\n:::\n\n\nThis would return the `str` type.\n\nYou have also seen how to use the `print()` function. 
The `print()` function accepts an argument and prints it to the screen.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nprint(\"Let's go Ravens!\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLet's go Ravens!\n```\n\n\n:::\n:::\n\n\n## new functions\n\nNew functions can be `def`ined using the Python keyword `def`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef new_world():\n    return 'Hello world!'\n\nprint(new_world())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nHello world!\n```\n\n\n:::\n:::\n\n\nThe first line of the function (the header) must start with `def`, followed by the name of the function (which can contain underscores), parentheses (with any arguments inside them) and a colon. When calling a function, arguments passed by name (keyword arguments) can be given in any order.\n\nThe rest of the function (the body) must be indented consistently; four spaces is the convention. If you define a function in the interactive mode, the interpreter will print ellipses (`...`) to let you know the function is not complete. To complete the function, enter an empty line (not necessary in a script).\n\nTo return a value from a function, use `return`. 
The function will immediately terminate and not run any code written past this point.\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef squared(x):\n    \"\"\" Return the square of a\n        value \"\"\"\n    return x ** 2\n\nprint(squared(4))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n16\n```\n\n\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nPython has its own version of R's `...`, written `*args` (example also from docs.python.org)\n\n\n::: {.cell}\n\n```{.python .cell-code}\ndef concat(*args, sep=\"/\"):\n    return sep.join(args)\n\nconcat(\"a\", \"b\", \"c\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'a/b/c'\n```\n\n\n:::\n:::\n\n:::\n\n## iteration\n\n**Iterative loops** can be written with the `for` and `while` statements, and exited early with `break`.\n\nDefining a **`for` loop** is similar to defining a new function.\n\n- The header ends with a colon and the body is indented.\n- The function `range(n)` takes in an integer `n` and produces the sequence of integers from `0` to `n - 1`.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nfor i in range(3):\n    print('Baby shark, doo doo doo doo doo doo!')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nBaby shark, doo doo doo doo doo doo!\nBaby shark, doo doo doo doo doo doo!\nBaby shark, doo doo doo doo doo doo!\n```\n\n\n:::\n\n```{.python .cell-code}\nprint('Baby shark!')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nBaby shark!\n```\n\n\n:::\n:::\n\n\n`for` loops are not just for counters; they can iterate through many types of objects such as strings, lists and dictionaries.\n\nThe **function `len()`** can be used to:\n\n- Calculate the length of a string\n- Calculate the number of elements in a list\n- Calculate the number of items (key-value pairs) in a dictionary\n- Calculate the number of elements in a tuple\n\n\n::: {.cell}\n\n```{.python .cell-code}\nx = 'Baby shark!'\nlen(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n11\n```\n\n\n:::\n:::\n\n\n## methods for each type of object (dot notation)\n\nFor strings, lists and 
dictionaries, there is a set of methods you can use to manipulate the objects. In general, the notation for methods is the **dot notation**.\n\nThe syntax is the **name of the object** followed by a **dot** (or period) followed by the **name of the method**.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nx = \"Hello Baltimore!\"\nx.split()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n['Hello', 'Baltimore!']\n```\n\n\n:::\n:::\n\n\n## Data structures\n\nWe have already seen lists. Python has other **data structures** built in.\n\n- Sets `{\"a\", \"a\", \"a\", \"b\"}` (unique elements)\n- Tuples `(1, 2, 3)` (a lot like lists but not mutable, i.e. you need to create a new one to modify it)\n- Dictionaries\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmy_dict = {\"a\" : 1, \"b\" : 2}\nmy_dict['a']\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1\n```\n\n\n:::\n\n```{.python .cell-code}\nmy_dict['b']\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n2\n```\n\n\n:::\n:::\n\n\nMore about data structures can be found [at the python docs](https://docs.python.org/3/tutorial/datastructures.html)\n\n# `reticulate`\n\n## Python engine within R Markdown\n\nThe `reticulate` package includes a Python engine for R Markdown with the following features:\n\n1. Run **Python chunks in a single Python session embedded within your R session** (shared variables/state between Python chunks)\n\n2. **Printing of Python output**, including graphical output from `matplotlib`.\n\n3. **Access to objects created within Python chunks from R** using the `py` object (e.g. `py$x` would access an `x` variable created within Python from R).\n\n4. **Access to objects created within R chunks from Python** using the `r` object (e.g. 
`r.x` would access the `x` variable created within R from Python)\n\n::: callout-tip\n### Conversions\n\nBuilt-in conversion for many Python object types is provided, including [NumPy](https://numpy.org) arrays and [Pandas](https://pandas.pydata.org) data frames.\n:::\n\n## From Python to R\n\nAs an example, you can use Pandas to read and manipulate data, then easily plot the Pandas data frame using `ggplot2`:\n\nLet's first create a `flights.csv` dataset in R and save it using `write_csv` from `readr`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# checks to see if a folder called \"data\" exists; if not, it creates it\nif (!file.exists(here(\"data\"))) {\n dir.create(here(\"data\"))\n}\n\n# checks to see if a file called \"flights.csv\" exists; if not, it saves it to the data folder\nif (!file.exists(here(\"data\", \"flights.csv\"))) {\n readr::write_csv(nycflights13::flights,\n file = here(\"data\", \"flights.csv\")\n )\n}\n\nnycflights13::flights %>%\n head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 19\n year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n \n1 2013 1 1 517 515 2 830 819\n2 2013 1 1 533 529 4 850 830\n3 2013 1 1 542 540 2 923 850\n4 2013 1 1 544 545 -1 1004 1022\n5 2013 1 1 554 600 -6 812 837\n6 2013 1 1 554 558 -4 740 728\n# ℹ 11 more variables: arr_delay , carrier , flight ,\n# tailnum , origin , dest , air_time , distance ,\n# hour , minute , time_hour \n```\n\n\n:::\n:::\n\n\nNext, we **use Python to read in the file** and do some data wrangling\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport pandas\nflights_path = \"/Users/leocollado/Dropbox/Code/jhustatcomputing/data/flights.csv\"\nflights = pandas.read_csv(flights_path)\nflights = flights[flights['dest'] == \"ORD\"]\nflights = flights[['carrier', 'dep_delay', 'arr_delay']]\nflights = flights.dropna()\nflights\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n carrier dep_delay arr_delay\n5 UA -4.0 12.0\n9 AA -2.0 8.0\n25 MQ 8.0 
32.0\n38 AA -1.0 14.0\n57 AA -4.0 4.0\n... ... ... ...\n336645 AA -12.0 -37.0\n336669 UA -7.0 -13.0\n336675 MQ -7.0 -11.0\n336696 B6 -5.0 -23.0\n336709 AA -13.0 -38.0\n\n[16566 rows x 3 columns]\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(py$flights)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n carrier dep_delay arr_delay\n5 UA -4 12\n9 AA -2 8\n25 MQ 8 32\n38 AA -1 14\n57 AA -4 4\n70 UA 9 20\n```\n\n\n:::\n\n```{.r .cell-code}\npy$flights_path\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing/data/flights.csv\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(py$flights)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"data.frame\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(py$flights_path)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\nNext, we can use R to **visualize the Pandas** `DataFrame`.\n\nThe data frame is **loaded in as an R object now** stored in the variable `py`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(py$flights, aes(x = carrier, y = arr_delay)) +\n geom_point() +\n geom_jitter()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-26-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `reticulate` Python engine is enabled by default within R Markdown whenever `reticulate` is installed.\n:::\n\n### From R to Python\n\nUse R to read and manipulate data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nflights <- read_csv(here(\"data\", \"flights.csv\")) %>%\n filter(dest == \"ORD\") %>%\n select(carrier, dep_delay, arr_delay) %>%\n na.omit()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 336776 Columns: 19\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): carrier, tailnum, origin, dest\ndbl (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, 
...\ndttm (1): time_hour\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n```{.r .cell-code}\nflights\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 16,566 × 3\n carrier dep_delay arr_delay\n \n 1 UA -4 12\n 2 AA -2 8\n 3 MQ 8 32\n 4 AA -1 14\n 5 AA -4 4\n 6 UA 9 20\n 7 UA 2 21\n 8 AA -6 -12\n 9 MQ 39 49\n10 B6 -2 15\n# ℹ 16,556 more rows\n```\n\n\n:::\n:::\n\n\n### Use Python to print R dataframe\n\nIf you recall, we can **access objects created within R chunks from Python** using the `r` object (e.g. `r.x` would access the `x` variable created within R from Python).\n\nWe can then ask for the first ten rows using the `head()` method of the resulting Pandas data frame.\n\n\n::: {.cell}\n\n```{.python .cell-code}\nr.flights.head(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n carrier dep_delay arr_delay\n0 UA -4.0 12.0\n1 AA -2.0 8.0\n2 MQ 8.0 32.0\n3 AA -1.0 14.0\n4 AA -4.0 4.0\n5 UA 9.0 20.0\n6 UA 2.0 21.0\n7 AA -6.0 -12.0\n8 MQ 39.0 49.0\n9 B6 -2.0 15.0\n```\n\n\n:::\n:::\n\n\n## import python modules\n\nYou can use the `import()` function to import any Python module and call it from R. 
For example, this code imports the Python `os` module in python and calls the `listdir()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nos <- import(\"os\")\nos$listdir(\".\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"index.R\" \"index.qmd\" \"index_files\" \"index.rmarkdown\"\n```\n\n\n:::\n:::\n\n\nFunctions and other data within Python modules and classes can be accessed via the `$` operator (analogous to the way you would interact with an R list, environment, or reference class).\n\nImported Python modules support code completion and inline help:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Using reticulate tab completion](https://rstudio.github.io/reticulate/images/reticulate_completion.png){fig-align='center'}\n:::\n:::\n\n\n\\[**Source**: [Rstudio](https://rstudio.github.io/reticulate)\\]\n\nSimilarly, we can import the pandas library:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npd <- import(\"pandas\")\ntest <- pd$read_csv(here(\"data\", \"flights.csv\"))\nhead(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n1 2013 1 1 517 515 2 830 819\n2 2013 1 1 533 529 4 850 830\n3 2013 1 1 542 540 2 923 850\n4 2013 1 1 544 545 -1 1004 1022\n5 2013 1 1 554 600 -6 812 837\n6 2013 1 1 554 558 -4 740 728\n arr_delay carrier flight tailnum origin dest air_time distance hour minute\n1 11 UA 1545 N14228 EWR IAH 227 1400 5 15\n2 20 UA 1714 N24211 LGA IAH 227 1416 5 29\n3 33 AA 1141 N619AA JFK MIA 160 1089 5 40\n4 -18 B6 725 N804JB JFK BQN 183 1576 5 45\n5 -25 DL 461 N668DN LGA ATL 116 762 6 0\n6 12 UA 1696 N39463 EWR ORD 150 719 5 58\n time_hour\n1 2013-01-01T10:00:00Z\n2 2013-01-01T10:00:00Z\n3 2013-01-01T10:00:00Z\n4 2013-01-01T10:00:00Z\n5 2013-01-01T11:00:00Z\n6 2013-01-01T10:00:00Z\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"data.frame\"\n```\n\n\n:::\n:::\n\n\nor the 
scikit-learn python library:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nskl_lr <- import(\"sklearn.linear_model\")\nskl_lr\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nModule(sklearn.linear_model)\n```\n\n\n:::\n:::\n\n\n## Calling python scripts\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource_python(\"secret_functions.py\")\nsubject_1 <- read_subject(\"secret_data.csv\")\n```\n:::\n\n\n## Calling the python repl\n\nIf you want to work with Python interactively you can call the `repl_python()` function, which provides a Python REPL embedded within your R session.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrepl_python()\n```\n:::\n\n\nObjects created within the Python REPL can be accessed from R using the `py` object exported from `reticulate`. For example:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Using the repl_python() function](https://rstudio.github.io/reticulate/images/python_repl.png){fig-align='center'}\n:::\n:::\n\n\n\\[**Source**: [Rstudio](https://rstudio.github.io/reticulate)\\]\n\nThat is, objects created in the Python REPL persist and remain available in R after you exit the REPL.\n\nSo typing `x = 4` in the REPL makes `py$x` equal to 4 in R after you exit the REPL.\n\nEnter `exit` within the Python REPL to return to the R prompt.\n\n# Community\n\n*Sharing the Recipe for rOpenSci's Unconf Ice Breaker* is a great activity you can use.\n\n\"Todos los caminos llevan a Roma\" (all roads lead to Rome)... or `R`\n\nYet, we are all unique. You might have had some privileges, you likely faced obstacles, you might have made mistakes, you likely were made to feel unwelcome at times; ultimately, you have accumulated many experiences. (Here's a bit of [my own history](https://lcolladotor.github.io/2018/11/06/a-knot-of-threads-from-cshl-to-lcg-unam-to-aldo-barrientos-to-diversity-scholarship-opportunities/)). You are the best person to help others like you. And you are not alone. 
Also, you can belong to more than one community.\n\n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n- \n\nRUGS (R User Groups) Program from the R Consortium: . Get \\$200 to \\$1000 USD for supporting your group.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Try to use tab completion for a function.\n2. Try to install and load a different python module in R using `import()`.\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-10-17\n pandoc 3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.5.0 2024-09-20 [1] CRAN (R 4.4.1)\n bit64 4.5.2 2024-09-22 [1] CRAN (R 4.4.1)\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)\n crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)\n digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)\n dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1)\n fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)\n farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)\n ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 
4.4.0)\n glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)\n gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)\n lattice 0.22-6 2024-03-20 [1] CRAN (R 4.4.1)\n lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)\n lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)\n Matrix 1.7-0 2024-04-26 [1] CRAN (R 4.4.1)\n munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)\n nycflights13 1.0.2 2021-04-12 [1] CRAN (R 4.4.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)\n png 0.1-8 2022-11-29 [1] CRAN (R 4.4.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)\n Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.0)\n readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)\n reticulate * 1.39.0 2024-09-05 [1] CRAN (R 4.4.1)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.0)\n rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)\n stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)\n tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)\n timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)\n utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)\n vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)\n vroom 1.6.5 2023-12-05 [1] CRAN (R 4.4.0)\n withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.0)\n xfun 0.48 2024-10-03 [1] CRAN (R 4.4.1)\n 
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n─ Python configuration ───────────────────────────────────────────────────────────────────────────────────────────────\n python: /usr/bin/python3\n libpython: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/config-3.9-darwin/libpython3.9.dylib\n pythonhome: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9:/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9\n version: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]\n numpy: /Users/leocollado/Library/Python/3.9/lib/python/site-packages/numpy\n numpy_version: 2.0.1\n \n NOTE: Python version was forced by PATH\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/25-python-for-r-users/index/figure-html/unnamed-chunk-26-1.png b/_freeze/posts/25-python-for-r-users/index/figure-html/unnamed-chunk-26-1.png index 9592c14..4c5565b 100644 Binary files a/_freeze/posts/25-python-for-r-users/index/figure-html/unnamed-chunk-26-1.png and b/_freeze/posts/25-python-for-r-users/index/figure-html/unnamed-chunk-26-1.png differ