Stream database file line by line to avoid string length limit in Node.js #5
Conversation
…ase file content; Passes all browser tests
@tex0l Just in case you might have missed this PR. The original pull request mentions that streaming files line by line allows loading large databases "without running into the string length limit for node", which is reportedly 256MB. So, it seems like a useful feature to support larger file sizes. (Ideally it'd be best to generate a huge file for testing, but I have not done this.)
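For context, the technique in question can be illustrated with Node's built-in `readline` module — a minimal sketch of the idea, not the PR's actual code (the PR uses a `byline`-style stream in `lib/storage.js`):

```js
const fs = require('fs')
const readline = require('readline')

// Read a datafile one line at a time, so the whole file never has to be
// held as a single string (which is what hits V8's string length limit).
function readFileByLine (filename, onLine, onDone) {
  const input = fs.createReadStream(filename, { encoding: 'utf8' })
  const rl = readline.createInterface({ input, crlfDelay: Infinity })
  rl.on('line', onLine) // one call per newline-terminated document
  rl.on('close', () => onDone(null))
  input.on('error', onDone)
}

// Each non-empty line of a NeDB datafile is one serialized document
// (plain JSON.parse here; NeDB's own deserializer handles more cases).
readFileByLine('datafile.db', line => {
  if (line.length > 0) console.log(JSON.parse(line)._id)
}, err => { if (err) throw err })
```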
Hi @eliot-akira,
First of all, thank you for porting the PR to @seald-io/nedb, and sorry for the delay in reviewing this PR.
I've reviewed it thoroughly and explained everything line by line in the files changed, but the TL;DR is:
- The PR is not properly linted; you can run `npm run lint` to check the linting status, and if you want to auto-fix small issues, you can run `npx standard --fix` (it will fix anything that doesn't change runtime);
- I'd rather not import `byline`: the package is 150 lines and hasn't been updated in 5 years (https://github.com/jahewson/node-byline/blob/master/lib/byline.js). I think we should either find a maintained replacement, or include the code and its tests in nedb and reformat them to this package's standards;
- I'm not comfortable with the behavior around `corruptAlertThreshold`: I'd rather strictly adhere to the Node convention of having at least `err` or `res` be `null`. What I recommend is, when the `corruptAlertThreshold` is reached, to stop reading from the stream (and clean it up properly without leaving it hanging in the process) and throw an error; see the sketch after this list. We can open a new issue to return the `corruptItems` count if this information is useful to anyone.
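Concretely, the recommended behavior could look something like this rough sketch — apart from the real `corruptAlertThreshold` option, the function name, error message, and `JSON.parse` stand-in are illustrative:

```js
// On failure, cb is called with only an error (res stays undefined),
// per Node convention; on success, with (null, docs).
function treatLineStream (lineStream, corruptAlertThreshold, cb) {
  const docs = []
  let corruptItems = 0
  let total = 0
  let failed = false

  lineStream.on('data', line => {
    if (failed) return
    total += 1
    try {
      docs.push(JSON.parse(line)) // stand-in for nedb's deserialization
    } catch (e) {
      corruptItems += 1
    }
    if (corruptItems / total > corruptAlertThreshold) {
      failed = true
      lineStream.destroy() // stop reading; don't leave the stream hanging
      cb(new Error('More than ' + Math.floor(100 * corruptAlertThreshold) + '% of the data file is corrupt'))
    }
  })
  lineStream.on('end', () => { if (!failed) cb(null, docs) })
}
```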
@arantes555 do you share the same opinion?
I won't have time to implement the modifications though; can you make them, @eliot-akira, please? I'll try to be quicker in my next review.
…or this; Buffer.from instead of new Buffer
Thank you for a detailed review of the PR. I ran the linter, made the suggested improvements, and resolved the associated comments. There's still a question about …
Thank you for implementing the fixes so quickly! For the behavior of `treatRawStream` when there are `corruptedItems`, I made a suggestion in the comments; tell me what you think. Once both items are fixed, the next step will be publishing a pre-version we'll test internally at Seald to check it doesn't break everything.
Sounds good. If I find the time, I'd like to actually create a database larger than 256MB, and test the "string length limit for node" that they were talking about. It would be nice to confirm that this PR really solves it.
Done. I've improved the function `treatRawStream` and its test; then moved over and adapted the tests from `byline`.
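For reference, the core of what `byline` does is small enough to inline; a minimal line-splitting `Transform` might look like this (simplified — the real module also handles `\r\n` and other edge cases, such as multi-byte characters falling across chunk boundaries):

```js
const { Transform } = require('stream')

// Minimal line splitter in the spirit of byline: buffer any partial line
// across chunks and emit complete lines as individual objects.
class LineStream extends Transform {
  constructor (options) {
    super({ ...options, readableObjectMode: true })
    this._remainder = ''
  }

  _transform (chunk, encoding, callback) {
    const lines = (this._remainder + chunk.toString('utf8')).split('\n')
    this._remainder = lines.pop() // last element is a partial line (or '')
    for (const line of lines) this.push(line)
    callback()
  }

  _flush (callback) {
    if (this._remainder.length > 0) this.push(this._remainder)
    callback()
  }
}

// Usage: fs.createReadStream('datafile.db').pipe(new LineStream())
```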
Hi @eliot-akira, once it passes this test cycle, I'll merge the branch, release a 2.1.0, and close the PR :). By the way, I took the liberty of adding you to the contributors and naming you in the changelog, if that's OK with you.
Nice, I didn't realize it was …
Hi @eliot-akira,
I learned more about the string length limit in Node.js. In the V8 source (…)
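For what it's worth, Node exposes the limit at runtime, so it can be checked without digging through the V8 source:

```js
const { constants } = require('buffer')

// V8's maximum string length, in UTF-16 code units. The exact value
// depends on the Node/V8 version — older V8 capped strings around
// 2**28 code units (the "256MB" figure mentioned above), newer versions
// allow more.
console.log(constants.MAX_STRING_LENGTH)

// Reading a larger file as one string fails, e.g.:
// fs.readFileSync('big.db', 'utf8') // throws ERR_STRING_TOO_LONG
```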
I created a test script to generate a database filled with large strings, and was able to confirm that there is in fact such a limit. With the original …
Using this PR branch, I found that it does load the database but still ends with an error.
This is in the function …. So, it looks like the PR is not complete yet. When I find the time, I will see how to adapt these two functions.
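A generator along the lines described above might look like this (illustrative — the actual test script isn't shown in this thread):

```js
const fs = require('fs')
const { constants } = require('buffer')

// Append ~1 MB documents until the datafile is just past V8's maximum
// string length, then confirm that a whole-file read throws.
const TARGET_BYTES = constants.MAX_STRING_LENGTH + 1024 * 1024
const payload = 'a'.repeat(1024 * 1024)
const out = fs.createWriteStream('big.db')
let written = 0
let i = 0

function writeMore () {
  let ok = true
  while (ok && written < TARGET_BYTES) {
    const line = JSON.stringify({ _id: 'doc' + i++, payload }) + '\n'
    written += Buffer.byteLength(line)
    ok = out.write(line) // respect backpressure
  }
  if (written < TARGET_BYTES) out.once('drain', writeMore)
  else out.end()
}
writeMore()

out.on('finish', () => {
  try {
    fs.readFileSync('big.db', 'utf8') // reading it all at once...
  } catch (err) {
    console.log(err.code) // ...fails, e.g. with ERR_STRING_TOO_LONG
  }
})
```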
OK, in commit fc1d0e0, I changed it so the file is written line by line when persisting the cached database. Great, now I've confirmed that this PR does allow loading large databases beyond the string length limit.
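In sketch form, the write side becomes something like the following; the exact code in fc1d0e0 may differ, and backpressure handling is omitted for brevity:

```js
const fs = require('fs')

// Persist documents one line at a time through a write stream, so the
// serialized database never exists as one giant in-memory string.
function writeFileLines (filename, serializedDocs, callback) {
  const stream = fs.createWriteStream(filename)
  stream.on('error', callback)
  stream.on('finish', () => callback(null))
  for (const line of serializedDocs) stream.write(line + '\n')
  stream.end()
}
```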
lib/storage.js
Outdated
```js
stream.close(cb)
})
readable.on('error', () => cb)
stream.on('error', () => cb)
```
Shouldn't this call `cb` with an error?

```js
readable.on('error', err => cb(err))
stream.on('error', err => cb(err))
```

…or maybe simpler:

```js
readable.on('error', cb)
stream.on('error', cb)
```
Good catch, that's what I wanted to write 🤦
@tex0l just noticed a last detail left to deal with: for the … I'll try doing that now, and then we should be good to publish :)
My bad, you already took care of that ^^
It seems all good to me. We'll do some tests with this pre-version on our internal projects, and release! Excellent job.
Actually, I'll push another commit with a minor detail: I'm extracting …
Hi @arantes555 @eliot-akira,
@tex0l @arantes555
This pull request implements streaming the database file line by line, based on louischatriot#463. Solves issue #3.
It follows the original PR closely, excluding unnecessary formatting changes. I adapted the syntax to the new codebase, such as the use of `let` and `const` instead of `var`, and some arrow functions.

It passes all server-side tests, and in the second commit, I made the changes necessary to pass all browser tests. It turned out that localforage returns the entire database file content, so it was impractical to turn that into a stream. So, in the browser, it reads the file the same way as before.
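In sketch form, the resulting split between the two environments looks like this (the function and parameter names are illustrative, not the PR's actual API):

```js
// Node: the datafile arrives as a line-emitting stream.
// Browser: localforage hands back the entire content as one string, so
// the same per-line handling is fed by a simple split instead.
function forEachLine (contentOrLineStream, onLine, onDone) {
  if (typeof contentOrLineStream === 'string') {
    // browser path: the whole file is already in memory anyway
    for (const line of contentOrLineStream.split('\n')) onLine(line)
    onDone(null)
  } else {
    // Node path: lines arrive one at a time, with bounded memory use
    contentOrLineStream.on('data', onLine)
    contentOrLineStream.on('end', () => onDone(null))
    contentOrLineStream.on('error', onDone)
  }
}
```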