
Commit

Merge pull request #88 from Roche/dev
v 1.0.4
ofajardo authored Nov 12, 2020
2 parents 2a0d9ca + be56154 commit 5a1b9bb
Showing 21 changed files with 2,502 additions and 1,307 deletions.
57 changes: 48 additions & 9 deletions README.md
@@ -39,8 +39,8 @@ the original applications in this regard.**
+ [More reading options](#more-reading-options)
- [Reading only the headers](#reading-only-the-headers)
- [Reading selected columns](#reading-selected-columns)
- [Reading rows in chunks](#reading-rows-in-chunks)
- [Reading files in parallel processes](#reading-files-in-parallel-processes)
- [Reading rows in chunks](#reading-rows-in-chunks)
- [Reading value labels](#reading-value-labels)
- [Missing Values](#missing-values)
+ [SPSS](#spss)
@@ -323,6 +323,44 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=["variab

```

#### Reading files in parallel processes

A challenge when reading large files is the time the operation takes. To alleviate this, pyreadstat provides the function
`read_file_multiprocessing` to read a file in parallel processes using the Python multiprocessing library. As it reads the
whole file in one go, you need enough RAM for the operation. If that is not the case, look at Reading rows in chunks (next section).

Speed-ups will depend on a number of factors, such as the number of processes available, RAM, and the
content of the file.

```python
import pyreadstat

fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=4)
```

num_processes is the number of worker processes; it defaults to 4 (or to the number of cores if fewer than 4 are available). You can
experiment with it to see where you get the best performance. You can also get the number of all available workers like this:

```python
import multiprocessing
num_processes = multiprocessing.cpu_count()
```
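
For example, to use every available core as a worker (a minimal sketch; the file path is a placeholder):

```python
import multiprocessing
import pyreadstat

fpath = "path/to/file.sav"  # placeholder path

# spawn one worker process per available core
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath,
                                                num_processes=multiprocessing.cpu_count())
```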

**Notes for Windows**

1. For this to work you must include an `if __name__ == "__main__":` section in your script. See [this issue](#85)
for more details.

```python
import pyreadstat

if __name__ == "__main__":
    df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, 'sample.sav')
```
2. If you include too many workers or you run out of RAM, you may get a message about not enough page file
size; see [this issue](#87). In that case, one option is to lower num_processes, as sketched below.
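
A minimal sketch of that workaround ('sample.sav' and the worker count are placeholders):

```python
import pyreadstat

if __name__ == "__main__":
    # fewer workers means fewer simultaneous copies of the data in memory
    df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, 'sample.sav', num_processes=2)
```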

#### Reading rows in chunks

Reading large files with hundreds of thousands of rows can be challenging due to memory restrictions. In such cases, it may be helpful
@@ -353,18 +391,19 @@ for df, meta in reader:
    # do some cool calculations here for the chunk
```

For very large files it may be convenient to speed up the process by reading the chunks in parallel. For
this purpose you can pass the argument multiprocess=True. This is a combination of read_file_in_chunks and
read_file_multiprocessing. Here you can use the arguments offset and limit to start reading the
file from an offset and to stop after offset+limit rows.

```python
import pyreadstat

fpath = "path/to/file.sav"
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000, multiprocess=True, num_processes=4)

for df, meta in reader:
    print(df) # df will contain 10000 rows except for the last one
    # do some cool calculations here for the chunk
```
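
For instance, to read only a slice of a very large file in parallel chunks, here is a sketch using the offset and limit arguments of read_file_in_chunks (the path and the numbers are arbitrary placeholders):

```python
import pyreadstat

fpath = "path/to/file.sav"  # placeholder path

# skip the first 50000 rows, then read at most 100000 rows in parallel chunks of 10000
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000,
                                        offset=50000, limit=100000,
                                        multiprocess=True, num_processes=4)

for df, meta in reader:
    # each df holds up to 10000 rows from the selected slice
    print(df.shape)
```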

#### Reading value labels
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 4c9fea8c6d7cc40754033f4e92df1160
config: e9744ea50bd658454f12dc039aea659e
tags: 645f666f9bcd5a90fca523b33c5a78b7
1 change: 1 addition & 0 deletions docs/_build/html/_static/basic.css
@@ -764,6 +764,7 @@ div.code-block-caption code {
}

table.highlighttable td.linenos,
span.linenos,
div.doctest > div.highlight span.gp { /* gp: Generic.Prompt */
user-select: none;
}
5 changes: 3 additions & 2 deletions docs/_build/html/_static/doctools.js
@@ -285,9 +285,10 @@ var Documentation = {
initOnKeyListeners: function() {
$(document).keydown(function(event) {
var activeElementType = document.activeElement.tagName;
// don't navigate when in search box or textarea
// don't navigate when in search box, textarea, dropdown or button
if (activeElementType !== 'TEXTAREA' && activeElementType !== 'INPUT' && activeElementType !== 'SELECT'
&& !event.altKey && !event.ctrlKey && !event.metaKey && !event.shiftKey) {
&& activeElementType !== 'BUTTON' && !event.altKey && !event.ctrlKey && !event.metaKey
&& !event.shiftKey) {
switch (event.keyCode) {
case 37: // left
var prevHref = $('link[rel="prev"]').prop('href');
2 changes: 1 addition & 1 deletion docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '1.0.3',
VERSION: '1.0.4',
LANGUAGE: 'None',
COLLAPSE_INDEX: false,
BUILDER: 'html',
7 changes: 6 additions & 1 deletion docs/_build/html/_static/pygments.css
@@ -1,5 +1,10 @@
pre { line-height: 125%; margin: 0; }
td.linenos pre { color: #000000; background-color: #f0f0f0; padding-left: 5px; padding-right: 5px; }
span.linenos { color: #000000; background-color: #f0f0f0; padding-left: 5px; padding-right: 5px; }
td.linenos pre.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #eeffcc; }
.highlight { background: #eeffcc; }
.highlight .c { color: #408090; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #007020; font-weight: bold } /* Keyword */
8 changes: 4 additions & 4 deletions docs/_build/html/_static/searchtools.js
@@ -59,10 +59,10 @@ var Search = {
_pulse_status : -1,

htmlToText : function(htmlString) {
var htmlElement = document.createElement('span');
htmlElement.innerHTML = htmlString;
$(htmlElement).find('.headerlink').remove();
docContent = $(htmlElement).find('[role=main]')[0];
var virtualDocument = document.implementation.createHTMLDocument('virtual');
var htmlElement = $(htmlString, virtualDocument);
htmlElement.find('.headerlink').remove();
docContent = htmlElement.find('[role=main]')[0];
if(docContent === undefined) {
console.warn("Content block not found. Sphinx search tries to obtain it " +
"via '[role=main]'. Could you check your theme or template.");
2 changes: 1 addition & 1 deletion docs/_build/html/genindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Index &mdash; pyreadstat 1.0.3 documentation</title>
<title>Index &mdash; pyreadstat 1.0.4 documentation</title>



9 changes: 5 additions & 4 deletions docs/_build/html/index.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.3 documentation</title>
<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.0.4 documentation</title>



@@ -252,6 +252,8 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>chunksize</strong> (<em>integer</em><em>, </em><em>optional</em>) – size of the chunks to read</p></li>
<li><p><strong>offset</strong> (<em>integer</em><em>, </em><em>optional</em>) – start reading the file after certain number of rows</p></li>
<li><p><strong>limit</strong> (<em>integer</em><em>, </em><em>optional</em>) – stop reading the file after certain number of rows, will be added to offset</p></li>
<li><p><strong>multiprocess</strong> (<em>bool</em><em>, </em><em>optional</em>) – use multiprocessing to read each chunk?</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – in case multiprocess is true, how many workers/processes to spawn?</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.</p></li>
</ul>
</dd>
@@ -275,9 +277,8 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<dd class="field-odd"><ul class="simple">
<li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
<li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the total number of cores</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.
row_limit and row_offset will be discarded if present as they are used internally.</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default 4</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
2 changes: 1 addition & 1 deletion docs/_build/html/py-modindex.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Python Module Index &mdash; pyreadstat 1.0.3 documentation</title>
<title>Python Module Index &mdash; pyreadstat 1.0.4 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/search.html
@@ -7,7 +7,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Search &mdash; pyreadstat 1.0.3 documentation</title>
<title>Search &mdash; pyreadstat 1.0.4 documentation</title>



2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -26,7 +26,7 @@
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = '1.0.3'
release = '1.0.4'


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion pyreadstat/__init__.py
@@ -20,4 +20,4 @@
from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
from ._readstat_parser import ReadstatError, metadata_container

__version__ = "1.0.3"
__version__ = "1.0.4"