
Commit

Merge pull request #88 from Roche/dev
v 1.0.4
ofajardo authored Nov 12, 2020
2 parents 2a0d9ca + be56154 commit 5a1b9bb
Showing 21 changed files with 2,502 additions and 1,307 deletions.
57 changes: 48 additions & 9 deletions README.md
@@ -39,8 +39,8 @@ the original applications in this regard.**
+ [More reading options](#more-reading-options)
- [Reading only the headers](#reading-only-the-headers)
- [Reading selected columns](#reading-selected-columns)
- [Reading rows in chunks](#reading-rows-in-chunks)
- [Reading files in parallel processes](#reading-files-in-parallel-processes)
- [Reading rows in chunks](#reading-rows-in-chunks)
- [Reading value labels](#reading-value-labels)
- [Missing Values](#missing-values)
+ [SPSS](#spss)
@@ -323,6 +323,44 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=["variab

```

#### Reading files in parallel processes

A challenge when reading large files is the time the operation takes. To alleviate this, pyreadstat provides the function
`read_file_multiprocessing` to read a file in parallel processes using the Python multiprocessing library. As it reads the
whole file in one go, you need enough RAM for the operation. If that is not the case, look at Reading rows in chunks (next section).

Speed-ups will depend on a number of factors, such as the number of processes available, RAM, and the
content of the file.

```python
import pyreadstat

fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=4)
```

num_processes is the number of worker processes; it defaults to 4 (or to the number of cores if fewer than 4 are available). You can
experiment with it to see where you get the best performance. You can also get the number of all available workers like this:

```python
import multiprocessing
num_processes = multiprocessing.cpu_count()
```
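
For example, to use every available core as a worker (a minimal sketch; the file path is a placeholder):

```python
import multiprocessing
import pyreadstat

fpath = "path/to/file.sav"  # placeholder path

# spawn one worker process per available core
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath,
                                                num_processes=multiprocessing.cpu_count())
```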

**Notes for Windows**

1. For this to work you must include an `if __name__ == "__main__":` section in your script. See [this issue](#85)
for more details.

```python
import pyreadstat

if __name__ == "__main__":
    df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, 'sample.sav')
```
2. If you include too many workers or you run out of RAM, you may get a message about not enough page file
size; see [this issue](#87). In that case, one option is to lower num_processes, as sketched below.
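
A minimal sketch of that workaround ('sample.sav' and the worker count are placeholders):

```python
import pyreadstat

if __name__ == "__main__":
    # fewer workers means fewer simultaneous copies of the data in memory
    df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, 'sample.sav', num_processes=2)
```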

#### Reading rows in chunks

Reading large files with hundreds of thousands of rows can be challenging due to memory restrictions. In such cases, it may be helpful
@@ -353,18 +391,19 @@ for df, meta in reader:
    # do some cool calculations here for the chunk
```

For very large files it may be convenient to speed up the process by reading the chunks in parallel. For
this purpose you can pass the argument multiprocess=True. This is a combination of read_file_in_chunks and
read_file_multiprocessing. Here you can use the arguments offset and limit to start reading the
file from an offset and to stop after offset+limit rows.

```python
import pyreadstat

fpath = "path/to/file.sav"
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000, multiprocess=True, num_processes=4)

for df, meta in reader:
    print(df) # df will contain 10000 rows except for the last one
    # do some cool calculations here for the chunk
```
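
For instance, to read only a slice of a very large file in parallel chunks, here is a sketch using the offset and limit arguments of read_file_in_chunks (the path and the numbers are arbitrary placeholders):

```python
import pyreadstat

fpath = "path/to/file.sav"  # placeholder path

# skip the first 50000 rows, then read at most 100000 rows in parallel chunks of 10000
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000,
                                        offset=50000, limit=100000,
                                        multiprocess=True, num_processes=4)

for df, meta in reader:
    # each df holds up to 10000 rows from the selected slice
    print(df.shape)
```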

#### Reading value labels
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 4c9fea8c6d7cc40754033f4e92df1160
config: e9744ea50bd658454f12dc039aea659e
tags: 645f666f9bcd5a90fca523b33c5a78b7
1 change: 1 addition & 0 deletions docs/_build/html/_static/basic.css
@@ -764,6 +764,7 @@ div.code-block-caption code {
}

table.highlighttable td.linenos,
span.linenos,
div.doctest > div.highlight span.gp { /* gp: Generic.Prompt */
user-select: none;
}
5 changes: 3 additions & 2 deletions docs/_build/html/_static/doctools.js
@@ -285,9 +285,10 @@ var Documentation = {
initOnKeyListeners: function() {
$(document).keydown(function(event) {
var activeElementType = document.activeElement.tagName;
// don't navigate when in search box or textarea
// don't navigate when in search box, textarea, dropdown or button
if (activeElementType !== 'TEXTAREA' && activeElementType !== 'INPUT' && activeElementType !== 'SELECT'
&& !event.altKey && !event.ctrlKey && !event.metaKey && !event.shiftKey) {
&& activeElementType !== 'BUTTON' && !event.altKey && !event.ctrlKey && !event.metaKey
&& !event.shiftKey) {
switch (event.keyCode) {
case 37: // left
var prevHref = $('link[rel="prev"]').prop('href');
2 changes: 1 addition & 1 deletion docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '1.0.3',
VERSION: '1.0.4',
LANGUAGE: 'None',
COLLAPSE_INDEX: false,
BUILDER: 'html',
7 changes: 6 additions & 1 deletion docs/_build/html/_static/pygments.css
@@ -1,5 +1,10 @@
pre { line-height: 125%; margin: 0; }
td.linenos pre { color: #000000; background-color: #f0f0f0; padding-left: 5px; padding-right: 5px; }
span.linenos { color: #000000; background-color: #f0f0f0; padding-left: 5px; padding-right: 5px; }
td.linenos pre.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #eeffcc; }
.highlight { background: #eeffcc; }
.highlight .c { color: #408090; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #007020; font-weight: bold } /* Keyword */
8 changes: 4 additions & 4 deletions docs/_build/html/_static/searchtools.js
@@ -59,10 +59,10 @@ var Search = {
_pulse_status : -1,

htmlToText : function(htmlString) {
var htmlElement = document.createElement('span');
htmlElement.innerHTML = htmlString;
$(htmlElement).find('.headerlink').remove();
docContent = $(htmlElement).find('[role=main]')[0];
var virtualDocument = document.implementation.createHTMLDocument('virtual');
var htmlElement = $(htmlString, virtualDocument);
htmlElement.find('.headerlink').remove();
docContent = htmlElement.find('[role=main]')[0];
if(docContent === undefined) {
console.warn("Content block not found. Sphinx search tries to obtain it " +
"via '[role=main]'. Could you check your theme or template.");
2 changes: 1 addition & 1 deletion docs/_build/html/genindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Index &mdash; pyreadstat 1.0.3 documentation</title>
<title>Index &mdash; pyreadstat 1.0.4 documentation</title>



9 changes: 5 additions & 4 deletions docs/_build/html/index.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.3 documentation</title>
<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.4 documentation</title>



@@ -252,6 +252,8 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>chunksize</strong> (<em>integer</em><em>, </em><em>optional</em>) – size of the chunks to read</p></li>
<li><p><strong>offset</strong> (<em>integer</em><em>, </em><em>optional</em>) – start reading the file after certain number of rows</p></li>
<li><p><strong>limit</strong> (<em>integer</em><em>, </em><em>optional</em>) – stop reading the file after certain number of rows, will be added to offset</p></li>
<li><p><strong>multiprocess</strong> (<em>bool</em><em>, </em><em>optional</em>) – use multiprocessing to read each chunk?</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – in case multiprocess is true, how many workers/processes to spawn?</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.</p></li>
</ul>
</dd>
@@ -275,9 +277,8 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<dd class="field-odd"><ul class="simple">
<li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
<li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the total number of cores</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.
row_limit and row_offset will be discarded if present as they are used internally.</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default 4</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
2 changes: 1 addition & 1 deletion docs/_build/html/py-modindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Python Module Index &mdash; pyreadstat 1.0.3 documentation</title>
<title>Python Module Index &mdash; pyreadstat 1.0.4 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/search.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Search &mdash; pyreadstat 1.0.3 documentation</title>
<title>Search &mdash; pyreadstat 1.0.4 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -26,7 +26,7 @@
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = '1.0.3'
release = '1.0.4'


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion pyreadstat/__init__.py
@@ -20,4 +20,4 @@
from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
from ._readstat_parser import ReadstatError, metadata_container

__version__ = "1.0.3"
__version__ = "1.0.4"