notebook 5: update deprecated arguments and fix typos
robinengler committed Mar 3, 2023
1 parent 360ec31 commit b0c5d7c
88 changes: 52 additions & 36 deletions notebooks/05_module_pandas.ipynb
@@ -186,7 +186,7 @@
"metadata": {},
"outputs": [],
"source": [
"# The \"shape\" attribute retruns a tuple with row and column count:\n",
"# The \"shape\" attribute returns a tuple with row and column count:\n",
"df.shape"
]
},
@@ -326,7 +326,7 @@
"* **`pd.read_excel()`**: import data from Excel files.\n",
"* ... see [here for an exhaustive list of pandas reader and writer functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).\n",
"\n",
"To illustrate the `read_table()` function, let's try to load the `data/titanic.csv` file. As its name suggest, this table contains data about the ill-fated [Titanic](https://en.wikipedia.org/wiki/Titanic) passengers, travelling from England to New York in April 1912.\n",
"To illustrate the `read_table()` function, let's try to load the `data/titanic.csv` file. As its name suggest, this table contains data about the ill-fated [Titanic](https://en.wikipedia.org/wiki/Titanic) passengers, traveling from England to New York in April 1912.\n",
"\n",
"**Tip:** when working with large datasets, it is convenient to be able to look at a fraction of the data only. For this, the methods [**`head()`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) and [**`tail()`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) are very helpful. Without any argument `head()`/`tail()` display the first/last 5 lines of a DataFrame. A custom number of lines can be displayed by passing a number: e.g. `head(10)` will display the first 10 lines.\n",
"\n",
@@ -351,7 +351,7 @@
"source": [
"Take a look above at how the data have been read. By default, `read_table()` expects the input data to be **tab-delimited**, but since this is not the case of the `titanic.csv` file, each line was treated as a single field (column), thus creating a DataFrame with a single column.\n",
"\n",
"As implied by its `.csv` extension (for \"comma-separeted values\"), the `titanic.csv` file contains **comma-delimited** values. To load a CSV file, we can either:\n",
"As implied by its `.csv` extension (for \"comma-separated values\"), the `titanic.csv` file contains **comma-delimited** values. To load a CSV file, we can either:\n",
"* Specify the separator value in `read_table(sep=\",\")`.\n",
"* Use `read_csv()`, a function that will use comma as separator by default.\n",
"\n",
@@ -457,7 +457,7 @@
"\n",
"<br>\n",
"\n",
"To prevent pandas from wrongly using the values from the first line of the file as column name, we must explicitely tell it that the data contains no header by passing the `header=None` argument:\n",
"To prevent pandas from wrongly using the values from the first line of the file as column name, we must explicitly tell it that the data contains no header by passing the `header=None` argument:\n",
"\n",
"<div>"
]
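A hedged sketch of the `header=None` usage just described — the file name and column names below are illustrative assumptions, not from the notebook:

```python
import pandas as pd

# Hypothetical headerless file -- the name is illustrative only:
df = pd.read_csv("data/some_headerless_file.csv", header=None)

# Column names can also be supplied explicitly via the "names" argument:
df = pd.read_csv("data/some_headerless_file.csv", header=None, names=["a", "b", "c"])
df.head()
```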
@@ -651,7 +651,7 @@
"### Accessing, editing and adding columns <a id='11'></a>\n",
"\n",
"DataFrame columns can be accessed, added and modified using the following syntax (here illustrated with a DataFrame is named `df`):\n",
"* `df[\"column name\"]`: returns the content of the specifed column (as a Series).\n",
"* `df[\"column name\"]`: returns the content of the specified column (as a Series).\n",
"* `df[\"new column name\"] = value`: creates a new column with the specified values. If the column already\n",
" exists, its values are updated.\n",
" * When a **single value** is passed as `values`, all rows get that same value.\n",
@@ -1417,7 +1417,7 @@
"[Back to ToC](#toc)\n",
"\n",
"### Conditional row selection (row filtering) <a id='18'></a>\n",
"The **`.loc[]`** indexer allows **row selection based on a bloolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n",
"The **`.loc[]`** indexer allows **row selection based on a boolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n",
"* Testing a condition on a **DataFrame** column returns a boolean **Series**: `df[\"age\"] < 35`.\n",
"* This Series can then be used to filter the DataFrame: `df.loc[df[\"age\"] < 35, :]`.\n",
"* Several **condition can be combined** with the **`&`** (and) and **`|`** (or) operators, e.g.:\n",
@@ -1533,7 +1533,7 @@
" ```python\n",
" df.loc[df[\"Age\"] <= 35, df.columns[1:3]]\n",
" ```\n",
"* If the **index has the same values are row postions (0, 1, 2, ...)**, the `.index` attribute can be \n",
"* If the **index has the same values are row positions (0, 1, 2, ...)**, the `.index` attribute can be \n",
" used to get row positions and use them with `.iloc[]`:\n",
" ```python\n",
" df.iloc[df[df[\"Age\"] <= 35].index, 1:3]\n",
@@ -1592,8 +1592,8 @@
"Using the `df` data frame:\n",
"\n",
"* Select all passengers from the `Barber` family.\n",
"* Select passenger that are either amercian, or older than 30 years.\n",
"* **If you have time:** select british passengers that are either women or men travelling 1st class. The passenger class info is found in the `Pclass` column.\n",
"* Select passenger that are either american, or older than 30 years.\n",
"* **If you have time:** select british passengers that are either women or men traveling 1st class. The passenger class info is found in the `Pclass` column.\n",
"\n",
"<div>"
]
@@ -1626,7 +1626,7 @@
},
"outputs": [],
"source": [
"# Select passenger that are either amercian, or older than 30 years ...\n"
"# Select passenger that are either american, or older than 30 years ...\n"
]
},
{
@@ -1995,7 +1995,7 @@
"## Grouping data by factor <a id='26'></a>\n",
"---------------------------------\n",
"\n",
"When analysing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n",
"When analyzing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n",
"\n",
"For instance, we earlier computed the proportions of women and men that survived by subsetting the original DataFrame. Using **`groupby()`** can make this a lot easier.\n",
"\n",
@@ -2013,6 +2013,18 @@
"df.head()"
]
},
+{
+"cell_type": "markdown",
+"metadata": {
+"scrolled": true
+},
+"source": [
+"* Here we compute mean values of all numeric columns by gender (i.e. the mean value is computed separately\n",
+"  for \"female\" and \"male\"). \n",
+"  *Note:* since a mean value can only be computed for numeric values, the argument `numeric_only` must be\n",
+"  set to `True`."
+]
+},
{
"cell_type": "code",
"execution_count": null,
@@ -2021,8 +2033,7 @@
},
"outputs": [],
"source": [
"# Compute means of all numeric columns by gender:\n",
"df.groupby(\"Sex\").mean()"
"df.groupby(\"Sex\").mean(numeric_only=True)"
]
},
{
@@ -2043,7 +2054,7 @@
"outputs": [],
"source": [
"# Compute mean values by gender and passenger class:\n",
"df.groupby([\"Sex\", \"Pclass\"]).mean()"
"df.groupby([\"Sex\", \"Pclass\"]).mean(numeric_only=True)"
]
},
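As a hedged aside (not part of this commit), the grouped summary can also be restricted to explicitly chosen columns, which sidesteps the `numeric_only` question altogether:

```python
# Mean age and survival rate per gender and passenger class -- a sketch
# assuming the titanic columns Sex, Pclass, Age and Survived:
df.groupby(["Sex", "Pclass"])[["Age", "Survived"]].mean()
```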
{
@@ -2115,9 +2126,9 @@
"\n",
"### About the example dataset used in the Additional Theory section\n",
"\n",
"To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor coactivator-2).\n",
"To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor co-activator-2).\n",
"\n",
"The dataset is in the \"tab\" delimited file `data/mouse_heart_gene_expresssion.tsv` and is structured as follows:\n",
"The dataset is in the \"tab\" delimited file `data/mouse_heart_gene_expression.tsv` and is structured as follows:\n",
"* Rows contain the expression values of a particular gene (higher values = gene is more expressed).\n",
"* Columns corresponds to one sample/condition and contains the expression of values of all genes in that sample.\n",
"* The sample names are given in the first row (header). \n",
@@ -2130,7 +2141,7 @@
"\n",
"\n",
"\n",
"Based on the names, we can guess that we have gene expression values for heart tissue of two types: \"WT\" (wildtype) and \"KO\" (knock out), and four replicates for each condition:\n",
"Based on the names, we can guess that we have gene expression values for heart tissue of two types: \"WT\" (wild type) and \"KO\" (knock out), and four replicates for each condition:\n",
"\n",
"Heart_WT_1 Heart_WT_2 Heart_WT_3 Heart_WT_4 Heart_KO_1 Heart_KO_2 Heart_KO_3 Heart_KO_4"
]
@@ -2143,7 +2154,7 @@
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/mouse_heart_gene_expresssion.tsv\", sep='\\t')\n",
"df = pd.read_csv(\"data/mouse_heart_gene_expression.tsv\", sep='\\t')\n",
"df.head()"
]
},
@@ -2189,8 +2200,8 @@
"metadata": {},
"outputs": [],
"source": [
"myslice = df['Heart_WT_1']>250\n",
"print(type(myslice))"
"my_slice = df['Heart_WT_1']>250\n",
"print(type(my_slice))"
]
},
{
@@ -2199,7 +2210,7 @@
"metadata": {},
"outputs": [],
"source": [
"myslice.head()"
"my_slice.head()"
]
},
{
@@ -2215,8 +2226,8 @@
"metadata": {},
"outputs": [],
"source": [
"mymysteriousobj = df[df['Heart_WT_1']>250]\n",
"print(type(mymysteriousobj))"
"my_mysterious_obj = df[df['Heart_WT_1']>250]\n",
"print(type(my_mysterious_obj))"
]
},
{
@@ -2439,7 +2450,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Finding the gene with mininum expression, but on the first 3 rows only\n",
"# Finding the gene with minimum expression, but on the first 3 rows only.\n",
"df[0:3].apply(my_filter)"
]
},
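`my_filter` is defined earlier in the notebook and does not appear in this diff; purely as an assumption consistent with the comment above, a minimal version might look like:

```python
# Hypothetical sketch -- the notebook's actual my_filter is not shown in this diff.
def my_filter(column):
    """Return the row label (gene) holding the minimum value of a column."""
    return column.idxmin()

# df[0:3].apply(my_filter) would then return, for each column, the label of
# the smallest value among the first 3 rows.
```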
@@ -2469,10 +2480,10 @@
"metadata": {},
"outputs": [],
"source": [
"dfavg = pd.DataFrame()\n",
"dfavg['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
"dfavg['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
"dfavg.head()"
"df_avg = pd.DataFrame()\n",
"df_avg['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
"df_avg['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
"df_avg.head()"
]
},
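A hedged alternative (not in the commit) that scales to any number of replicates, assuming the `Heart_WT_*`/`Heart_KO_*` column names:

```python
import pandas as pd

df_avg = pd.DataFrame()
# filter(like=...) keeps all columns whose name contains the substring;
# mean(axis=1) then averages across those columns, row by row:
df_avg["Heart_WT_avg"] = df.filter(like="Heart_WT").mean(axis=1)
df_avg["Heart_KO_avg"] = df.filter(like="Heart_KO").mean(axis=1)
df_avg.head()
```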
{
@@ -2483,12 +2494,12 @@
},
"outputs": [],
"source": [
"dfavg = pd.DataFrame()\n",
"dfavg['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
"dfavg['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
"df_avg = pd.DataFrame()\n",
"df_avg['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
"df_avg['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
"\n",
"dfall = pd.concat([df, dfavg], axis=1)\n",
"dfall.head()"
"df_all = pd.concat([df, df_avg], axis=1)\n",
"df_all.head()"
]
},
{
@@ -2504,8 +2515,8 @@
"metadata": {},
"outputs": [],
"source": [
"df['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
"df['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
"df['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
"df['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
"df.head()"
]
},
@@ -2616,7 +2627,7 @@
"\n",
"The [`merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`join()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) methods allow to combine DataFrames, linking their rows based on their keys. \n",
"\n",
"Here's how we construct a dataframe from a dictionary data structure, where dictionary keys are treated as column names, list of values associated with a key is treated as list of elements in the corresponding column, and rows are contructed based on the index of elements within the list of elements in the column (note however that all columns should have the same length):"
"Here's how we construct a dataframe from a dictionary data structure, where dictionary keys are treated as column names, list of values associated with a key is treated as list of elements in the corresponding column, and rows are constructed based on the index of elements within the list of elements in the column (note however that all columns should have the same length):"
]
},
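To make the dictionary construction and key-based merging concrete, a minimal sketch with hypothetical data (not from the notebook):

```python
import pandas as pd

# Dictionary keys become column names; each list becomes a column
# (all lists must have the same length):
left = pd.DataFrame({"gene": ["a", "b", "c"], "expression": [10, 250, 40]})
right = pd.DataFrame({"gene": ["a", "b", "d"], "chromosome": ["chr1", "chr2", "chrX"]})

# merge() links rows that share the same "gene" key; how="inner" keeps
# only keys present in both frames:
merged = left.merge(right, on="gene", how="inner")
print(merged)
```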
{
@@ -2906,6 +2917,11 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,