Changes
list_ods_sheets
Now has an additional parameter, include_external_data. When set to FALSE (the default), it hides linked data sheets.
Why? This is the default behaviour for spreadsheet software and, I would argue, what users expect. Linked external data is not generally accessible in this manner, and would usually be expected to be accessed through the front-facing sheet into which it was imported. This is also the default behaviour for packages reading .xls and .xlsx files; external data is presented for ods files largely because the file format treats it in the same manner as other sheets.
Why not? Many other packages (previous versions of readODS, tidyODS, pandas) do not exhibit this behaviour
by default, and so this would represent a change of functionality.
Changed behaviour when accessing small amounts of data
Setting true for row names or column names now raises a warning and returns the data without names when naming would leave the dataframe empty. Previously this would name the columns, add a row above or below the column, and also return the data, e.g. for a sheet with just A in A1. It now raises
Cannot make column/row names if this would cause the dataframe to be empty.
and returns the data unnamed.
Range now accepts sheets as part of text
e.g. "Sheet1!A2:B3" will now access Sheet1 if it exists (or throw an error if it does not). Sheets defined in ranges overrule sheets defined using sheets.
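As an illustration of the rule above, here is a minimal C++ sketch of splitting a range such as "Sheet1!A2:B3" into its sheet and cell parts and letting the sheet in the range take precedence. RangeRequest, split_range and resolve_sheet are hypothetical names for this example, not functions from the package, and splitting on the last '!' is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type, not the package's actual code: the sheet and
// cell parts of a range such as "Sheet1!A2:B3".
struct RangeRequest {
    std::string sheet;  // empty when the range names no sheet
    std::string cells;  // e.g. "A2:B3"
};

// Split on the last '!' so that everything before it is the sheet name.
// Assumption: callers pass a non-empty range.
RangeRequest split_range(const std::string& range) {
    assert(!range.empty());
    RangeRequest out;
    const std::string::size_type bang = range.rfind('!');
    if (bang == std::string::npos) {
        out.cells = range;  // no sheet given in the range
    } else {
        out.sheet = range.substr(0, bang);
        out.cells = range.substr(bang + 1);
    }
    return out;
}

// A sheet named in the range overrules the separately supplied sheet.
std::string resolve_sheet(const RangeRequest& request,
                          const std::string& sheet_arg) {
    return request.sheet.empty() ? sheet_arg : request.sheet;
}
```

Checking that the resolved sheet actually exists (and raising an error if it does not) would happen after this step, against the list of sheets in the file.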
Merged cells now work correctly (Issue #2)
The value for the merged cell is placed in the top-left cell. All other cells that were part of the merge are NA.
Note: This isn't necessarily true. The OO spec allows cells covered by merges to contain values, although these are not displayed (by Excel anyway). This content is parsed in the same way as a normal cell. This poses a small problem, as the output in this case is not what the user might expect; however, it faithfully records the information held in the sheet. It's not possible to add data to cells this way using spreadsheet software (that I know of), so this is unlikely to be an issue.
The problem mentioned by DataStategist in Issue #2 still persists, in that moving the data to the top-left of the merged cell does not always line up visually with the intention of the author. However, without a lot of second-guessing it is not possible to know which cell is the most appropriate for the contents to lie in.
Speed (Issue #37)
Parsing is now done in C++, which should significantly improve speed. Performance is similar for small sheets, and significantly faster for larger workbooks. The memory footprint for larger sheets is also reduced, which goes some way towards solving Issue #71, although the whole of content.xml still needs to be loaded into memory at once, and so there is still a limit. The file referenced in the issue now takes ~10GB of RAM and loads in ~2 mins on my machine, which is still slow, but an improvement.
Speed is also improved because the package now parses only the requested range of a sheet when a range is given, which significantly speeds up gathering a small amount of data from a large sheet.
read_fods
Now reads flat ods files. This checks that the file is a correct single-document ODS file, and uses a common .read_ods function internally to read either flat or packaged ODS files.
Why no objects and classes? That's what readxl does!
I think they make things hard to read (and I am bad at them). We could create a worksheet object and use internal methods to read the data,
but it would still need to be instantiated every time we read the sheet.
How do read_ods_ (and read_flat_ods_) work now?
.standardise_limits() converts the range request into limits in the form c(min_row, max_row, min_col, max_col). If max_row or max_col is -1, this reads to the edge of the spreadsheet in the y or x direction respectively.
The list of sheets in the file is then parsed to check if the sheet name is valid. (This is not optimal, as this parses the sheet twice when it could be done during the main read. It's not actually that slow though, so I've left it for now.)
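The limits convention described above can be sketched as follows. This is a C++ illustration only (the real .standardise_limits() is an R helper), and it assumes 1-based rows and columns, that only ranges of the form "A2:B3" are passed, and that an empty range means "read everything"; none of those details are confirmed by the text above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cctype>
#include <string>

// Limits in the order described above: min_row, max_row, min_col, max_col.
typedef std::array<int, 4> Limits;

// Convert column letters to a 1-based index: "A" -> 1, "Z" -> 26, "AA" -> 27.
int column_index(const std::string& letters) {
    int index = 0;
    for (std::string::size_type i = 0; i < letters.size(); ++i) {
        const int upper = std::toupper(static_cast<unsigned char>(letters[i]));
        index = index * 26 + (upper - 'A' + 1);
    }
    return index;
}

// Parse one cell reference such as "B3" into a 1-based row and column.
void parse_cell(const std::string& cell, int& row, int& col) {
    std::string::size_type i = 0;
    while (i < cell.size() &&
           std::isalpha(static_cast<unsigned char>(cell[i]))) {
        ++i;
    }
    assert(i > 0 && i < cell.size());  // need letters followed by digits
    col = column_index(cell.substr(0, i));
    row = std::stoi(cell.substr(i));
}

// Hypothetical stand-in for .standardise_limits(). Assumption: an empty
// range means "read everything", encoded with -1 for both maxima.
Limits standardise_limits(const std::string& range) {
    Limits limits = {{1, -1, 1, -1}};
    if (range.empty()) return limits;
    const std::string::size_type colon = range.find(':');
    assert(colon != std::string::npos);  // this sketch handles "A2:B3" only
    int r1, c1, r2, c2;
    parse_cell(range.substr(0, colon), r1, c1);
    parse_cell(range.substr(colon + 1), r2, c2);
    limits[0] = std::min(r1, r2);
    limits[1] = std::max(r1, r2);
    limits[2] = std::min(c1, c2);
    limits[3] = std::max(c1, c2);
    return limits;
}

// True when a 1-based (row, col) falls inside the limits; -1 is unbounded.
bool in_limits(const Limits& l, int row, int col) {
    return row >= l[0] && (l[1] == -1 || row <= l[1]) &&
           col >= l[2] && (l[3] == -1 || col <= l[3]);
}
```

Treating -1 as "no upper bound" in a single predicate like in_limits keeps the parser from needing to know the sheet's true extent before it starts reading.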
The XML content file is then read into memory using rapidxml.
read_ods_() or read_flat_ods_() then attempts to parse the sheet, running find_rows() to build an array of pointers to the nodes containing the cells of the spreadsheet (although only those requested in the range). The maximal width of this array is then the width of the resulting array.
We then loop through the array to build a vector of strings, beginning with the width and length of the output as strings. We then parse the cells into this vector in order, padding any rows which are not of sufficient length. This vector of strings is passed back to R.
The remainder is similar to the previous version: this list of strings is turned into a matrix of appropriate dimensions, which becomes a dataframe, is given headers and has the column types assigned by readr.
writeODS
I haven't significantly updated writeODS, however I think I have gone some way towards fixing Issue #107, which turns out to be a character encoding issue specific to old versions of R on Windows (versions of R since 4.2 use UTF-8 by default and shouldn't have this issue).
cat will write whatever you ask it to into the file, even if the encoding does not match the rest of the file. However, we can try to force characters to be written in UTF-8 by setting the file-connection encoding to UTF-8, which fixes some of the issue although not all.
On older versions of R (tested on 3.6.1 and 4.1.2) this doesn't quite fix the issue, as R seems to print characters outside the current locale as escaped Unicode code points (e.g. ぅ is printed as <U+3045>). It will write this code in place of the letter whenever it prints it. This means that things that look a lot like invalid XML nodes appear in the output. (This isn't picked up by .escapexml() as it still views the letter as a single character.)
Practical solutions to this for old versions of R are either to remove characters outside the current locale from printing, or to write a new writeODS function in C++, as this will not care about R's locale. The real solution however is to update R past 4.2. Aside from this, it still works if
Running R in Radian can, however, cause the same issue even with a modern version of R (>=4.2) (see this issue), due to Python's handling of UTF-8. Using the .Rprofile settings from the comment fixes the issue.
I've added a note to the readme to address this.
Removed read.ods and ods_sheets
They have been deprecated for several years, and are only used in two CRAN packages, ElementR and IgoRRR. Both are easy fixes.