Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent eb0f804, commit 837709a
Showing 164 changed files with 16,356 additions and 29,648 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "ff2c37c4574981983ec8f602dae77924", | ||
"result": { | ||
"markdown": "---\ndescription-meta: |\n These are the guidelines for all the technical decisions\n taken during the creation of packages for the PIP\n workflow.\n---\n\n\n\n\n\n# Welcome! {.unnumbered}\n\nThese are the internal guidelines for the technical process of the PIP project. Visit the [GitHub repository for this book](https://github.com/PIP-Technical-Team/PIPmanual). \n\n## Team {-}\n\nThe PIP technical team is composed of .... \n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "50c95cc9c1c27a6c760d0bf8007e0cba", | ||
"result": { | ||
"markdown": "# Introduction {#intro}\n\n[NOTE:Andres, finish chapter]{style=\"color:red\"}\n\n## Objectives\n\nThis book covers the following:\n\n1. An overview of the project from a technical perspective.\n2. The interaction between R packages developed to manage the data and do the calculations.\n3. The different types of data in the PIP project and the interaction between them.\n4. How the poverty calculator, the table maker, and the Statistics Online (SOL) platform are updated.\n5. Standalone technical procedures necessary for the execution of some parts of the project.\n\n## Technical requirements\n\n\n\n\n\n<!-- ::: {.rmdbox .rmdwarning} -->\n\n<!-- DISCLAIMER: This book contains information relevant exclusively for the internal -->\n\n<!-- Technical team of the PIP project. It has been made available to the public for -->\n\n<!-- the sake of transparency, but its information has no use outside the internal -->\n\n<!-- team. -->\n\n<!-- ::: -->\n\nYou need to make sure that the `bookdown` package is installed on your local computer.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"bookdown\")\n\n# or the development version\n devtools::install_github(\"rstudio/bookdown\")\n```\n:::\n\n\nRemember that each Rmd file contains one and only one chapter, and a chapter is defined by the first-level heading `#`.\n\nMake sure the PIP R packages are installed by typing the following. Note that this code installs only the packages that are missing; it does not update packages that are already installed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipkg <- utils::installed.packages()[,1]\n\npip_install <- function(pkg, ipkg) {\n if (isFALSE(pkg %in% ipkg)) {\n gitcall <- paste0(\"PIP-Technical-Team/\", pkg)\n remotes::install_github(gitcall, dependencies = TRUE)\n TRUE\n } else {\n FALSE\n }\n}\n\npkgs <- c(\"pipload\", \"pipaux\", \"wbpip\", \"piptb\", \"pipdm\", \"pipapi\")\n\n\npurrr::walk(pkgs, pip_install, ipkg = ipkg)\n```\n:::\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "751c6eda1408876ec1d1f463213c1b1d", | ||
"result": { | ||
"markdown": "# Joining Data\n\nSince the PIP project comprises several databases at very different domain levels, this chapter provides all the guidelines to join data frames correctly. The first thing to understand is that the reporting measures in PIP (e.g., poverty and inequality) are uniquely identified by four variables: *country*, *year*, *domain*, and *welfare type*.\n\n- *Country* refers to independent economies that conduct independent household\n surveys. For instance, China and Taiwan are treated as **two different\n economies** by the World Bank, and hence by PIP, even though under some\n criteria some people think that Taiwan is part of China.\n\n- *Year* refers to reporting year rather than the actual calendar years over\n which the survey was conducted. Some household surveys like India 2011/2012\n are conducted over the course of two calendar years, but the welfare aggregate is\n deflated to the reporting year, 2011.\n\n- *Domain* refers to the smallest geographical disaggregation for which it is\n possible to deflate and line up to PPP values the welfare aggregate of a\n household survey. The criteria to determine the reporting domain of a\n household survey are still under consideration, but ideally it is one for which\n there is CPI, PPP, and population auxiliary data, as well as a household\n survey representative at that level. There are some exceptions to this\n criterion like China or the Philippines, but these cases are explained in\n detail in Section \\@ref(special-data-cases). As of today\n (August 24, 2023), most country/years are reported at\n the national domain and few are reported at the urban/rural domain. However,\n the PIP technical infrastructure has been designed to incorporate other\n domain levels if, at some point in time, they are needed.\n\n- Finally, the *welfare type* specifies whether the welfare aggregate is based\n on income or on consumption. 
For the latter case, though some household\n surveys capture expenditure instead, they are still considered\n consumption-based surveys.\n\nThe challenge of joining different data frames in PIP is that these four variables that uniquely identify the reporting measures [**are not**]{style=\"color:red\"} available in any of the PIP data files---with the exception of the cache files that we discuss below. This challenge is easily addressed by having a clear understanding of the Price FrameWork (pfw) data frame. This file not only contains valuable metadata, but can also be considered the anchor among all PIP data.\n\n## The Price FrameWork (pfw) data {#pfw-join}\n\nAs always, this file can be loaded by typing,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(data.table)\n#pfw <- pipload::pip_load_aux(\"pfw\")\n#joyn::is_id(pfw, by = c(\"country_code\", \"surveyid_year\", \"survey_acronym\"))\n```\n:::\n\n\nFirst of all, notice that `pfw` is uniquely identified by country code, survey\nyear, and survey acronym. The reason for this is that pfw aims at providing a\nlink between every single household survey and all the other auxiliary data.\nSince welfare data is stored following the naming convention of the\n[International Household Survey Network (IHSN)](http://ihsn.org/), data is\nstored according to country, survey year, acronym of the survey, and vintage\ncontrol of master and alternative versions. The vintage control of the master\nand alternative version of the data is not relevant for joining data because PIP\nuses, by default, the most recent version.\n\nKeep in mind that PIP estimates are reported at the country, year, domain, and\nwelfare type level, but the last two of these are found neither in the survey\nID nor among the unique identifiers of the pfw. 
To solve this problem, the pfw data\nmakes use of the variables `welfare_type`, `aaa_domain`, and `aaa_domain_var`.\n\nAs the name suggests, `welfare_type` indicates the **main** welfare aggregate\ntype (i.e., income or consumption) of the variable `welfare` in the GMD datasets\nthat correspond to the survey ID formed by concatenating variables\n`country_code`, `surveyid_year`, and `survey_acronym`. For example, the\n`welfare_type` of the `welfare` variable in the datasets of\n***COL_2018_GEIH**\\_V01_M\\_V03_A\\_GMD* is income.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pfw[ country_code == \"COL\"\n# & surveyid_year == 2018\n# & survey_acronym == \"GEIH\", # Not necessary since it is the only one\n# unique(welfare_type)]\n```\n:::\n\n\nThe prefix `aaa` in variables `aaa_domain` and `aaa_domain_var` refers to the\nidentification code of any of the auxiliary data. Thus, you will find a\n`gdp_domain`, `cpi_domain`, `ppp_domain` and several others. All `aaa_domain`\nvariables contain the *lowest* level of geographical disaggregation of the\ncorresponding `aaa` auxiliary data. There are only three possible levels of\ndisaggregation,\n\n\n::: {.cell}\n\n:::\n\n\nAs of now, no survey or auxiliary data is broken down at level 3 (i.e.,\nsubnational), but it is important to know that the PIP internal code takes that\npossibility into account for future cases.\n\nDepending on the country, the domain level of each auxiliary data might be\ndifferent. In Indonesia, for instance, the CPI domain is national, whereas the\nPPP domain is \"urban/rural.\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pfw[ country_code == \"IDN\" & surveyid_year == 2018, \n# .(cpi = unique(cpi_domain), \n# ppp = unique(ppp_domain))]\n```\n:::\n\n\nFinally, [and this is really important]{style=\"color:red\"}, the variables\n`aaa_domain_var` contain the name of the variable in the GMD dataset that uniquely\nidentifies the household survey *in the corresponding `aaa`* *auxiliary data*. 
In\nother words, `aaa_domain_var` contains *the name* of the variable in GMD that\nmust be used as *key* *to join* GMD to `aaa`. You may ask: does the\nvariable in the `aaa` auxiliary data have the same name as the GMD variable\nspecified in `aaa_domain_var`? No, it does not. Since the domain level to\nidentify observations in the `aaa` auxiliary data is unique, there is only one\nvariable in auxiliary data used to merge any welfare data, `aaa_data_level`.\nSince this whole process is a little cumbersome, the\n{[pipdp](https://github.com/PIP-Technical-Team/pipdp)} Stata package, during the\nprocess of cleaning GMD databases to PIP databases, creates as many\n`aaa_data_level` variables as needed in order to make the join of welfare data\nand auxiliary data simpler. You can see the lines of code that create these\nvariables in [this\nsection](https://github.com/PIP-Technical-Team/pipdp/blob/9c4e32636dd1c71954816f8a7c8ad743349959a6/pipdp_md_clean.ado#L225-L294)\nof the file \"pipdp_md_clean.ado.\"[^joining_data-1]\n\n[^joining_data-1]: You can find more information about the conversion from GMD\n to PIP databases in Section \\@ref(welfare-data)\n\n### Joining data example\n\nLet's see the case of Indonesia above. The pfw says that the CPI domain is\n\"national\" and the PPP domain is \"urban/rural.\" That means that the welfare data\njoins to each of these auxiliary data with two different variables,\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# domains <- \n# pfw[ country_code == \"IDN\" & surveyid_year == 2018, \n# .(cpi = unique(cpi_domain_var), \n# ppp = unique(ppp_domain_var))][]\n```\n:::\n\n\nThis says that the variable in the welfare data used to join PPP data is\ncalled `urban`, but there does not seem to be a variable name in GMD to join the\nCPI data. When the name of the variable is missing, it indicates that the\nwelfare data is not split by any variable to merge CPI data. 
That is, it is at\nthe national level.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# ccode <- \"CHN\"\n# cpi <- pipload::pip_load_aux(\"cpi\")\n# ppp <- pipload::pip_load_aux(\"ppp\")\n# \n# CHN <- pipload::pip_load_data(country = ccode, \n# year = 2015)\n# \n# dt <- joyn::merge(CHN, cpi,\n# by = c(\"country_code\", \"survey_year\",\n# \"survey_acronym\", \"cpi_data_level\"),\n# match_type = \"m:1\", \n# keep = \"left\")\n```\n:::\n\n\n## Special data cases\n\n[NOTE: Andres. Add this section]{style=\"color:red\"}\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "08ef23a7c3a8022d84a5e724f666d359", | ||
"result": { | ||
"markdown": "# Load microdata and Auxiliary data {#load}\n\nMake sure you have all the packages installed and loaded into memory. Given that they are hosted on GitHub, the code below makes sure that any package in the PIP workflow can be installed correctly. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# ## First specify the packages of interest\n# packages = c(\"pipaux\", \"pipload\")\n# \n# ## Now load or install&load all\n# package.check <- lapply(\n# packages,\n# FUN = function(x) {\n# if (!require(x, character.only = TRUE)) {\n# pck_name <- paste0(\"PIP-Technical-Team/\", x)\n# devtools::install_github(pck_name)\n# library(x, character.only = TRUE)\n# }\n# }\n# )\n\n```\n:::\n\n\n## Auxiliary data\n\n\n\n\nEven though `pipaux` has several functions, most of its features \ncan be executed by using only the `pipaux::load_aux` and `pipaux::update_aux` functions. \n\n\n### Update data\n\n\n\n\nThe main function of the `pipaux` package is `update_aux`. The first argument of this function is `measure` and it refers to the measure data to be loaded. The measures available are ****.\n\n::: {.cell}\n\n```{.r .cell-code}\n# pipaux::update_aux(measure = \"cpi\")\n```\n:::\n\n\n### Load data\nLoading auxiliary data is the job of the package `pipload` through the function `pipload::pip_load_aux()`, though `pipaux` also provides `pipaux::load_aux()` for the same purpose. Notice that, although both functions do exactly the same thing, the loading function from `pipload` has the prefix `pip_` to distinguish it from the one in `pipaux`. However, we are going to limit the work of `pipaux` to updating auxiliary data and the work of `pipload` to loading data. Thus, all the examples below use `pipload` for loading either microdata or auxiliary data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# df <- pipload::pip_load_aux(measure = \"cpi\")\n# head(df)\n```\n:::\n\n\n\n## Microdata \n\nLoading PIP microdata is the most practical action in the `pipload` package. 
However, it is important to understand the logic of microdata. \n\nPIP microdata has several characteristics, \n\n* There could be more than one survey for each country/year. This happens when there is more than one welfare variable available, such as income and consumption. \n* Some countries, like Mexico, have the two different welfare types in the same survey for the same country/year. This adds a layer of complexity when the objective is to know which one is the default. \n* There are multiple versions of the same harmonized survey. These versions are organized in a two-type vintage control. It is possible to have a new version of the data because the raw data--the one provided by the official NSO--has been updated, or because there has been an update in the harmonization process.\n* Each survey could be used for more than one analytic tool in PIP (e.g., Poverty Calculator, Table Maker, or SOL). Thus, the data to be loaded depends on the tool in which it is going to be used. \n\nThus, in order to make the process of finding and loading data efficient, `pipload` works as a three-step process.\n\n### Inventory file\nThe inventory file resides in `y:/PIP-Data/_inventory/inventory.fst`. This file is a data frame with all the microdata available in the PIP structure. It has two main variables, `orig` and `filename`. The former refers to the full directory path of the database, whereas the latter is only the file name. The other variables in this data frame are derived from these two. \n\nThe inventory file is used to speed up the file searching process in `pipload`. In previous packages, each time the user wanted to find a particular database, it was necessary to look into the folder structure and extract the names of all the files that meet particular criteria. This is time-consuming and inefficient. The advantage of this method, though, is that, by construction, it finds all the data available. 
By contrast, the inventory file method is much faster than the \"searching\" method, as it only requires loading a light file with all the data available, filtering the data, and returning the required information. The drawback, however, is that it needs to be kept up to date as data changes constantly. \n\nTo update the inventory file, you need to use the function `pip_update_inventory`. If you don't provide any argument, it will update the whole inventory, which may take around 10 to 15 min--the function will warn you about it. If you provide the country or countries you want to update, the process is much faster. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# # update one country\n# pip_update_inventory(\"MEX\")\n# \n# # Load inventory file\n# df <- pip_load_inventory()\n# head(df[, \"filename\"])\n\n```\n:::\n\n\n\n### Finding data \n\nEvery dataset in the PIP microdata repository is identified by seven variables: country code, survey year, survey acronym, master version, alternative version, tool, and source. Giving the user the responsibility to know all the different combinations of each file is a heavy burden. Thus, the data finder, `pip_find_data()`, will provide the names of all the files available that meet the criteria in the arguments provided by the user. For instance, if the user wants to know all the files available for Paraguay, she could type, \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pip_find_data(country = \"PRY\")[[\"filename\"]]\n```\n:::\n\n\nYet, if the user needs to be more precise in her request, she can add information to the different arguments of the function. For example, this is the data available for 2012, \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# pip_find_data(country = \"PRY\", \n# year = 2012)[[\"filename\"]]\n```\n:::\n\n\n### Loading data\n\nFunction `pip_load_data` takes care of loading the data. The very first instruction within `pip_load_data` is to find the data available in the repository by using `pip_load_inventory()`. 
The difference, however, is twofold. First, `pip_load_data` will load the default and/or most recent version of the country/year combination available. Second, it gives the user the possibility to load different datasets in either list or dataframe form. For instance, if the user wants to load the Paraguay data in 2014 and 2015 used in the Poverty Calculator tool, she may type, \n\n\n::: {.cell hash='load_md_aux_cache/html/load-data_04ffc979fa06b4fc509eb96be7c9cc97'}\n\n```{.r .cell-code}\n\n# df <- pip_load_data(country = \"PRY\",\n# year = c(2014, 2015), \n# tool = \"PC\")\n# \n# janitor::tabyl(df, survey_id)\n```\n:::\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |