Merge pull request #102 from Roche/dev

version 1.0.7
Roche · Jan 8, 2021 · dbcaeb0
2 parents 2906533 + b9feddf
commit dbcaeb0
Showing 22 changed files with 2,675 additions and 1,609 deletions.
diff --git a/.gitignore b/.gitignore
@@ -10,3 +10,4 @@
 .DS_Store
 dist/
 test_data/write/
+.vscode
diff --git a/README.md b/README.md
@@ -49,8 +49,9 @@ the original applications in this regard.**
   + [More writing options](#more-writing-options)
     - [File specific options](#file-specific-options)
     - [Writing value labels](#writing-value-labels)
-    - [Writing user defined missing values](writing-user-defined-missing-values)
-    - [Variable type conversion](variable-type-conversion)
+    - [Writing user defined missing values](#writing-user-defined-missing-values)
+    - [Setting variable formats](#setting-variable-formats)
+    - [Variable type conversion](#variable-type-conversion)
 * [Roadmap](#roadmap)
 * [Known limitations](#known-limitations)
 * [Python 2.7 support.](#python-27-support)
@@ -419,8 +420,8 @@ function. The original values will be replaced by the values in the catalog.
 ```python
 import pyreadstat
 
-# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column.
-df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True)
+# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. There is also formats_as_ordered_category to get an ordered category, this by default is False.
+df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True, formats_as_ordered_category=False)
 ```
 
 If you prefer to read the sas7bcat file separately, you can apply the formats later with the function set_catalog_to_sas.
@@ -434,8 +435,9 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')
 # read_sas7bdat returns an emtpy data frame and the catalog
 df_empty, catalog = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bcat')
 # enrich the dataframe with the catalog
-# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column.
-df_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog, formats_as_category=True)
+# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
+df_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog, 
+                             formats_as_category=True, formats_as_ordered_category=False)
 ```
 
 For SPSS and STATA files, the value labels are included in the files. You can choose to replace the values by the labels
@@ -445,8 +447,9 @@ when reading the file using the option apply_value_formats, ...
 import pyreadstat
 
 # apply_value_formats is by default False, so you have to set it to True manually if you want the labels
-# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column.
-df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=True, formats_as_category=True)
+# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
+df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=True, 
+                                formats_as_category=True, formats_as_ordered_category=False)
 ```
 
 ... or to do it later with the function set_value_labels:
@@ -457,7 +460,7 @@ import pyreadstat
 # This time no value labels.
 df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=False)
 # now let's add them to a second copy
-df_enriched = pyreadstat.set_value_labels(df, meta, formats_as_category=True)
+df_enriched = pyreadstat.set_value_labels(df, meta, formats_as_category=True, formats_as_ordered_category=False)
 ```
 
 Internally each variable is associated with a label set. This information is stored in meta.variable_to_label. Each
@@ -719,6 +722,47 @@ path = "/path/to/somefile.sav"
 pyreadstat.write_sav(df, path, missing_ranges=missing_ranges, variable_value_labels=variable_value_labels)
 ```
 
+#### Setting variable formats
+
+Numeric types in SPSS, SAS and STATA can have formats that affect how those values are displayed to the user
+in the application. Pyreadstat automatically sets the formatting in some cases, as for example when translating
+dates or datetimes (which in SPSS/SAS/STATA are just numbers with a special format). The user can however specify custom formats
+for their columns with the argument "variable_format", which is
+a dictionary with the column name as key and a string with the format as values:
+
+```python
+import pandas as pd
+import pyreadstat
+
+path = "path/where/to/write/file.sav"
+df = pd.DataFrame({'restricted':[1023, 10], 'integer':[1,2]})
+formats = {'restricted':'N4', 'integer':'F1.0'}
+pyreadstat.write_sav(df, path, variable_format=formats)
+```
+
+The appropriate formats to use are beyond the scope of this documentation. You will probably want to read a file
+produced in the original application and use meta.original_value_formats to get the formats. Otherwise, consult
+the documentation of the original application.
+
+##### SPSS
+
+In the case of SPSS we have some presets for some formats:
+* restricted_integer: with leading zeros, equivalent to N + variable width (e.g N4)
+* integer: Numeric with no decimal places, equivalent to F + variable width + ".0" (0 decimal positions). A 
+  pandas column of type integer will also be translated into this format automatically.
+
+```python
+import pandas as pd
+import pyreadstat
+
+path = "path/where/to/write/file.sav"
+df = pd.DataFrame({'restricted':[1023, 10], 'integer':[1,2]})
+formats = {'restricted':'restricted_integer', 'integer':'integer'}
+pyreadstat.write_sav(df, path, variable_format=formats)
+```
+
+There is some information about the possible formats [here](https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html).
+
 #### Variable type conversion
 
 The following rules are used in order to convert from pandas/numpy/python types to the target file types:
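
Note on the "Setting variable formats" hunk above: the README describes the SPSS format strings as `N` + width for zero-padded integers and `F` + width + `.0` for plain integers. A minimal sketch of how such a `variable_format` dictionary could be assembled from column data follows. The helpers `spss_numeric_format` and `digits_needed` are hypothetical names introduced only for this illustration and are not part of pyreadstat; the sketch uses plain Python and does not write any file.

```python
def spss_numeric_format(width, decimals=0, leading_zeros=False):
    """Build an SPSS-style numeric format string (hypothetical helper).

    leading_zeros=True  -> 'N<width>'            (e.g. 'N4')
    leading_zeros=False -> 'F<width>.<decimals>' (e.g. 'F1.0')
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    if leading_zeros:
        return f"N{width}"
    return f"F{width}.{decimals}"


def digits_needed(values):
    """Number of digits of the widest integer value (hypothetical helper)."""
    return max(len(str(abs(int(v)))) for v in values)


# Same toy data as in the README example of this diff.
data = {"restricted": [1023, 10], "integer": [1, 2]}

formats = {
    "restricted": spss_numeric_format(digits_needed(data["restricted"]), leading_zeros=True),
    "integer": spss_numeric_format(digits_needed(data["integer"])),
}

print(formats)
```

The resulting dictionary has the same shape as the `formats` dictionary passed as `variable_format` to `pyreadstat.write_sav` in the README example. Whether a given format string is accepted depends on the target application, as the README itself points out.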

diff --git a/change_log.md b/change_log.md
@@ -1,3 +1,7 @@
+# 1.0.7 (github, pypi and conda 2021.01.09)
+* Added formats_as_ordered_category to get an ordered category.
+* Added variable_format in order to be able to set the variable format
+  when writing.
 # 1.0.6 (github, pypi and conda 2020.12.17)
 * Updated Readstat to version 1.1.5, this fixes: reading sas7bdat file labels, 
   reading newer por files date-like columns, and few others.
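
Note on the `formats_as_ordered_category` entry above: the practical difference between a plain and an ordered pandas category can be sketched with pandas alone. The labels and codes below are made up for illustration, and the assumption that category order follows the sorted numeric codes is ours; this diff does not state how pyreadstat derives the order.

```python
import pandas as pd

# Hypothetical value labels, keyed by the numeric code stored in the file.
labels = {1: "low", 2: "medium", 3: "high"}
codes = pd.Series([2, 1, 3, 1])

# Plain category: labels replace the codes but carry no ordering.
unordered = codes.map(labels).astype("category")

# Ordered category: in this sketch the label order follows the sorted codes
# (an assumption made for illustration).
dtype = pd.CategoricalDtype(
    categories=[labels[c] for c in sorted(labels)], ordered=True
)
ordered = codes.map(labels).astype(dtype)

# Comparisons and min/max are only meaningful on the ordered version.
above_low = ordered > "low"
print(ordered.cat.categories.tolist())
print(above_low.tolist())
```

With an ordered category, sorting and comparisons respect the label order rather than alphabetical order, which is likely the reason to opt into the new flag for ordinal variables.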

diff --git a/docs/_build/doctrees/environment.pickle b/docs/_build/doctrees/environment.pickle
diff --git a/docs/_build/doctrees/index.doctree b/docs/_build/doctrees/index.doctree
diff --git a/docs/_build/html/.buildinfo b/docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 91713509866076208214b3c819ac7b2d
+config: 1a49789c316ca66f9eda1c9c11bf6954
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/_build/html/_static/documentation_options.js b/docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
 var DOCUMENTATION_OPTIONS = {
     URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
-    VERSION: '1.0.6',
+    VERSION: '1.0.7',
     LANGUAGE: 'None',
     COLLAPSE_INDEX: false,
     BUILDER: 'html',

diff --git a/docs/_build/html/genindex.html b/docs/_build/html/genindex.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Index &mdash; pyreadstat 1.0.6 documentation</title>
+  <title>Index &mdash; pyreadstat 1.0.7 documentation</title>
 
 
 

diff --git a/docs/_build/html/index.html b/docs/_build/html/index.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.6 documentation</title>
+  <title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.7 documentation</title>
 
 
 
@@ -215,6 +215,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 if any appropiate are found.</p></li>
 <li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed
 for their formatted version will be transformed into pandas categories.</p></li>
+<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
+it has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of
+the value of formats_as_category.</p></li>
 <li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
 iconv-compatible name</p></li>
 <li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -307,6 +310,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 if any appropiate are found.</p></li>
 <li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed
 for their formatted version will be transformed into pandas categories.</p></li>
+<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
+it has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of
+the value of formats_as_category.</p></li>
 <li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
 iconv-compatible name</p></li>
 <li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -377,6 +383,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 of the sas7bdat and set_catalog_to_sas to apply the resulting format into sas7bdat files.</p></li>
 <li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – Will take effect only if the catalog_file was specified. If True the variables whose values were replaced
 by the formats will be transformed into pandas categories.</p></li>
+<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
+it has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of
+the value of formats_as_category.</p></li>
 <li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
 iconv-compatible name</p></li>
 <li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -420,6 +429,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 if any appropiate are found.</p></li>
 <li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed
 for their formatted version will be transformed into pandas categories.</p></li>
+<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
+it has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of
+the value of formats_as_category.</p></li>
 <li><p><strong>encoding</strong> (<em>str</em><em>, </em><em>optional</em>) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an
 iconv-compatible name</p></li>
 <li><p><strong>usecols</strong> (<em>list</em><em>, </em><em>optional</em>) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!</p></li>
@@ -493,6 +505,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <li><p><strong>sas_metadata</strong> (<em>pyreadstat metadata object</em>) – resulting from parsing a sas7bdat file</p></li>
 <li><p><strong>catalog_metadata</strong> (<em>pyreadstat metadata object</em>) – resulting from parsing a sas7bcat (catalog) file</p></li>
 <li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to True. If True the variables having formats will be transformed into pandas categories.</p></li>
+<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
+it has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of
+the value of formats_as_category.</p></li>
 </ul>
 </dd>
 <dt class="field-even">Returns</dt>
@@ -518,6 +533,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <li><p><strong>dataframe</strong> (<em>pandas dataframe</em>) – resulting from parsing a file</p></li>
 <li><p><strong>metadata</strong> (<em>dictionary</em>) – resulting from parsing a file</p></li>
 <li><p><strong>formats_as_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to True. If True the variables having formats will be transformed into pandas categories.</p></li>
+<li><p><strong>formats_as_ordered_category</strong> (<em>bool</em><em>, </em><em>optional</em>) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories.
+it has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of
+the value of formats_as_category.</p></li>
 </ul>
 </dd>
 <dt class="field-even">Returns</dt>
@@ -549,6 +567,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <li><p><strong>missing_user_values</strong> (<em>dict</em><em>, </em><em>optional</em>) – user defined missing values for numeric variables. Must be a dictionary with keys being variable
 names and values being a list of missing values. Missing values must be a single character
 between a and z.</p></li>
+<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
+values being strings defining the format. See README, setting variable formats section,
+for more information.</p></li>
 </ul>
 </dd>
 </dl>
@@ -566,6 +587,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <li><p><strong>file_label</strong> (<em>str</em><em>, </em><em>optional</em>) – a label for the file</p></li>
 <li><p><strong>column_labels</strong> (<em>list</em><em>, </em><em>optional</em>) – list of labels for columns (variables), must be the same length as the number of columns. Variables with no
 labels must be represented by None.</p></li>
+<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
+values being strings defining the format. See README, setting variable formats section,
+for more information.</p></li>
 </ul>
 </dd>
 </dl>
@@ -599,6 +623,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 values being integers.</p></li>
 <li><p><strong>variable_measure</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the measure type for a variable. Must be a dictionary with keys being variable names and
 values being strings one of “nominal”, “ordinal”, “scale” or “unknown” (default).</p></li>
+<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
+values being strings defining the format. See README, setting variable formats section,
+for more information.</p></li>
 </ul>
 </dd>
 </dl>
@@ -621,6 +648,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 labels must be represented by None.</p></li>
 <li><p><strong>table_name</strong> (<em>str</em><em>, </em><em>optional</em>) – name of the dataset, by default DATASET</p></li>
 <li><p><strong>file_format_version</strong> (<em>int</em><em>, </em><em>optional</em>) – XPORT file version, either 8 or 5, default is 8</p></li>
+<li><p><strong>variable_format</strong> (<em>dict</em><em>, </em><em>optional</em>) – sets the format of a variable. Must be a dictionary with keys being the variable names and
+values being strings defining the format. See README, setting variable formats section,
+for more information.</p></li>
 </ul>
 </dd>
 </dl>

diff --git a/docs/_build/html/py-modindex.html b/docs/_build/html/py-modindex.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Python Module Index &mdash; pyreadstat 1.0.6 documentation</title>
+  <title>Python Module Index &mdash; pyreadstat 1.0.7 documentation</title>
 
 
 

diff --git a/docs/_build/html/search.html b/docs/_build/html/search.html
@@ -7,7 +7,7 @@
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
-  <title>Search &mdash; pyreadstat 1.0.6 documentation</title>
+  <title>Search &mdash; pyreadstat 1.0.7 documentation</title>