Merge pull request #102 from Roche/dev
version 1.0.7
ofajardo authored Jan 8, 2021
2 parents 2906533 + b9feddf commit dbcaeb0
Showing 22 changed files with 2,675 additions and 1,609 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -10,3 +10,4 @@
.DS_Store
dist/
test_data/write/
.vscode
62 changes: 53 additions & 9 deletions README.md
@@ -49,8 +49,9 @@ the original applications in this regard.**
+ [More writing options](#more-writing-options)
- [File specific options](#file-specific-options)
- [Writing value labels](#writing-value-labels)
- [Writing user defined missing values](writing-user-defined-missing-values)
- [Variable type conversion](variable-type-conversion)
- [Writing user defined missing values](#writing-user-defined-missing-values)
- [Setting variable formats](#setting-variable-formats)
- [Variable type conversion](#variable-type-conversion)
* [Roadmap](#roadmap)
* [Known limitations](#known-limitations)
* [Python 2.7 support.](#python-27-support)
@@ -419,8 +420,8 @@ function. The original values will be replaced by the values in the catalog.
```python
import pyreadstat

# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column.
df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True)
# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. There is also formats_as_ordered_category to get an ordered category; it is False by default.
df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True, formats_as_ordered_category=False)
```

If you prefer to read the sas7bcat file separately, you can apply the formats later with the function set_catalog_to_sas.
@@ -434,8 +435,9 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')
# read_sas7bcat returns an empty data frame and the catalog
df_empty, catalog = pyreadstat.read_sas7bcat('/path/to/a/file.sas7bcat')
# enrich the dataframe with the catalog
# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column.
df_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog, formats_as_category=True)
# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
df_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog,
formats_as_category=True, formats_as_ordered_category=False)
```

For SPSS and STATA files, the value labels are included in the files. You can choose to replace the values by the labels
@@ -445,8 +447,9 @@ when reading the file using the option apply_value_formats, ...
import pyreadstat

# apply_value_formats is by default False, so you have to set it to True manually if you want the labels
# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column.
df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=True, formats_as_category=True)
# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=True,
formats_as_category=True, formats_as_ordered_category=False)
```

... or to do it later with the function set_value_labels:
@@ -457,7 +460,7 @@ import pyreadstat
# This time no value labels.
df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=False)
# now let's add them to a second copy
df_enriched = pyreadstat.set_value_labels(df, meta, formats_as_category=True)
df_enriched = pyreadstat.set_value_labels(df, meta, formats_as_category=True, formats_as_ordered_category=False)
```
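
The docstrings added in this commit state that formats_as_ordered_category takes precedence over formats_as_category. The following is a minimal sketch of that rule in plain Python. `resolve_category_mode` is a hypothetical helper written only for illustration; it is not part of pyreadstat and does not call it.

```python
def resolve_category_mode(formats_as_category=True, formats_as_ordered_category=False):
    """Describe what happens to columns whose values were replaced by labels.

    Hypothetical helper: it mirrors the documented precedence only.
    """
    if formats_as_ordered_category:
        # Documented behaviour: takes effect irrespective of formats_as_category.
        return "ordered category"
    if formats_as_category:
        return "unordered category"
    return "no category conversion"


# Print the outcome for every combination of the two flags.
for as_category in (True, False):
    for as_ordered in (True, False):
        print(as_category, as_ordered, "->", resolve_category_mode(as_category, as_ordered))
```

With the defaults (True, False) the sketch reports an unordered category, which matches the defaults described in the comments above.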

Internally each variable is associated with a label set. This information is stored in meta.variable_to_label. Each
@@ -719,6 +722,47 @@ path = "/path/to/somefile.sav"
pyreadstat.write_sav(df, path, missing_ranges=missing_ranges, variable_value_labels=variable_value_labels)
```

#### Setting variable formats

Numeric types in SPSS, SAS and STATA can have formats that affect how those values are displayed to the user
in the application. Pyreadstat automatically sets the formatting in some cases, as for example when translating
dates or datetimes (which in SPSS/SAS/STATA are just numbers with a special format). The user can however specify custom formats
for their columns with the argument "variable_format", which is
a dictionary with the column name as key and a string with the format as value:

```python
import pandas as pd
import pyreadstat

path = "path/where/to/write/file.sav"
df = pd.DataFrame({'restricted':[1023, 10], 'integer':[1,2]})
formats = {'restricted':'N4', 'integer':'F1.0'}
pyreadstat.write_sav(df, path, variable_format=formats)
```
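
Since variable_format is described as a dictionary of column names to format strings, a quick sanity check before writing may help catch typos. The sketch below assumes nothing about how pyreadstat itself reacts to a malformed dictionary; `check_variable_format` is a hypothetical helper, not a pyreadstat function.

```python
def check_variable_format(column_names, variable_format):
    """Return a list of problems found in a variable_format dictionary.

    Hypothetical helper: checks only the shape described in the docs
    (keys are column names, values are strings).
    """
    problems = []
    for name, fmt in variable_format.items():
        if name not in column_names:
            problems.append(f"{name!r} is not a column of the data frame")
        if not isinstance(fmt, str):
            problems.append(
                f"format for {name!r} must be a string, got {type(fmt).__name__}"
            )
    return problems


columns = ["restricted", "integer"]
# A well-formed dictionary yields no problems.
print(check_variable_format(columns, {"restricted": "N4", "integer": "F1.0"}))
# A non-string format and an unknown column are both reported.
print(check_variable_format(columns, {"restricted": 4, "other": "F1.0"}))
```

With a pandas data frame, `list(df.columns)` could be passed as `column_names`.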

The appropriate formats to use are beyond the scope of this documentation. You probably want to read a file
produced in the original application and use meta.original_variable_types to get the formats. Otherwise, consult
the documentation of the original application.

##### SPSS

In the case of SPSS we have some presets for some formats:
* restricted_integer: with leading zeros, equivalent to N + variable width (e.g. N4)
* integer: numeric with no decimal places, equivalent to F + variable width + ".0" (0 decimal positions). A
pandas column of type integer will also be translated into this format automatically.

```python
import pandas as pd
import pyreadstat

path = "path/where/to/write/file.sav"
df = pd.DataFrame({'restricted':[1023, 10], 'integer':[1,2]})
formats = {'restricted':'restricted_integer', 'integer':'integer'}
pyreadstat.write_sav(df, path, variable_format=formats)
```
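
The equivalence between the presets and the explicit format strings can be sketched as follows. `expand_spss_preset` is a hypothetical helper for illustration only. How pyreadstat determines the variable width internally is not described here, so the width is an explicit argument in this sketch.

```python
def expand_spss_preset(preset, width):
    """Expand a preset name into the explicit SPSS format string it stands for.

    Hypothetical helper, not part of pyreadstat.
    """
    if preset == "restricted_integer":
        # N + variable width: integer shown with leading zeros.
        return f"N{width}"
    if preset == "integer":
        # F + variable width + ".0": numeric with 0 decimal positions.
        return f"F{width}.0"
    # Anything else is assumed to already be an explicit format string.
    return preset


print(expand_spss_preset("restricted_integer", 4))
print(expand_spss_preset("integer", 1))
```

Under these assumptions the two calls give the same strings used in the earlier explicit example, N4 and F1.0.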

There is some information about the possible formats [here](https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html).

#### Variable type conversion

The following rules are used in order to convert from pandas/numpy/python types to the target file types:
4 changes: 4 additions & 0 deletions change_log.md
@@ -1,3 +1,7 @@
# 1.0.7 (github, pypi and conda 2021.01.09)
* Added formats_as_ordered_category to get an ordered category.
* Added variable_format in order to be able to set the variable format
when writing.
# 1.0.6 (github, pypi and conda 2020.12.17)
* Updated Readstat to version 1.1.5, this fixes: reading sas7bdat file labels,
reading newer por files date-like columns, and few others.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 91713509866076208214b3c819ac7b2d
config: 1a49789c316ca66f9eda1c9c11bf6954
tags: 645f666f9bcd5a90fca523b33c5a78b7
2 changes: 1 addition & 1 deletion docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '1.0.6',
VERSION: '1.0.7',
LANGUAGE: 'None',
COLLAPSE_INDEX: false,
BUILDER: 'html',
2 changes: 1 addition & 1 deletion docs/_build/html/genindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Index &mdash; pyreadstat 1.0.6 documentation</title>
<title>Index &mdash; pyreadstat 1.0.7 documentation</title>



32 changes: 31 additions & 1 deletion docs/_build/html/index.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.6 documentation</title>
<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.7 documentation</title>



@@ -215,6 +215,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
if any appropriate are found.</p></li>
<li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed
for their formatted version will be transformed into pandas categories.</p></li>
<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
It has precedence over formats_as_category, meaning that if this is True, it will take effect irrespective of
the value of formats_as_category.</p></li>
<li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
iconv-compatible name</p></li>
<li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -307,6 +310,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
if any appropriate are found.</p></li>
<li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed
for their formatted version will be transformed into pandas categories.</p></li>
<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
It has precedence over formats_as_category, meaning that if this is True, it will take effect irrespective of
the value of formats_as_category.</p></li>
<li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
iconv-compatible name</p></li>
<li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -377,6 +383,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
of the sas7bdat and set_catalog_to_sas to apply the resulting format into sas7bdat files.</p></li>
<li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – Will take effect only if the catalog_file was specified. If True the variables whose values were replaced
by the formats will be transformed into pandas categories.</p></li>
<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
It has precedence over formats_as_category, meaning that if this is True, it will take effect irrespective of
the value of formats_as_category.</p></li>
<li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
iconv-compatible name</p></li>
<li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -420,6 +429,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
if any appropriate are found.</p></li>
<li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed
for their formatted version will be transformed into pandas categories.</p></li>
<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
It has precedence over formats_as_category, meaning that if this is True, it will take effect irrespective of
the value of formats_as_category.</p></li>
<li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
iconv-compatible name</p></li>
<li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -493,6 +505,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>sas_metadata</strong> (<em>pyreadstat metadata object</em>) – resulting from parsing a sas7bdat file</p></li>
<li><p><strong>catalog_metadata</strong> (<em>pyreadstat metadata object</em>) – resulting from parsing a sas7bcat (catalog) file</p></li>
<li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to True. If True the variables having formats will be transformed into pandas categories.</p></li>
<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
It has precedence over formats_as_category, meaning that if this is True, it will take effect irrespective of
the value of formats_as_category.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
@@ -518,6 +533,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>dataframe</strong> (<em>pandas dataframe</em>) – resulting from parsing a file</p></li>
<li><p><strong>metadata</strong> (<em>dictionary</em>) – resulting from parsing a file</p></li>
<li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to True. If True the variables having formats will be transformed into pandas categories.</p></li>
<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
It has precedence over formats_as_category, meaning that if this is True, it will take effect irrespective of
the value of formats_as_category.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
@@ -549,6 +567,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>missing_user_values</strong> (<em>dict</em><em>, </em><em>optional</em>) – user defined missing values for numeric variables. Must be a dictionary with keys being variable
names and values being a list of missing values. Missing values must be a single character
between a and z.</p></li>
<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
values being strings defining the format. See README, setting variable formats section,
for more information.</p></li>
</ul>
</dd>
</dl>
@@ -566,6 +587,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>file_label</strong> (<em>str</em><em>, </em><em>optional</em>) – a label for the file</p></li>
<li><p><strong>column_labels</strong> (<em>list</em><em>, </em><em>optional</em>) – list of labels for columns (variables), must be the same length as the number of columns. Variables with no
labels must be represented by None.</p></li>
<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
values being strings defining the format. See README, setting variable formats section,
for more information.</p></li>
</ul>
</dd>
</dl>
@@ -599,6 +623,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
values being integers.</p></li>
<li><p><strong>variable_measure</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the measure type for a variable. Must be a dictionary with keys being variable names and
values being strings one of “nominal”, “ordinal”, “scale” or “unknown” (default).</p></li>
<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
values being strings defining the format. See README, setting variable formats section,
for more information.</p></li>
</ul>
</dd>
</dl>
@@ -621,6 +648,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
labels must be represented by None.</p></li>
<li><p><strong>table_name</strong> (<em>str</em><em>, </em><em>optional</em>) – name of the dataset, by default DATASET</p></li>
<li><p><strong>file_format_version</strong> (<em>int</em><em>, </em><em>optional</em>) – XPORT file version, either 8 or 5, default is 8</p></li>
<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
values being strings defining the format. See README, setting variable formats section,
for more information.</p></li>
</ul>
</dd>
</dl>
2 changes: 1 addition & 1 deletion docs/_build/html/py-modindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Python Module Index &mdash; pyreadstat 1.0.6 documentation</title>
<title>Python Module Index &mdash; pyreadstat 1.0.7 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/search.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Search &mdash; pyreadstat 1.0.6 documentation</title>
<title>Search &mdash; pyreadstat 1.0.7 documentation</title>


