Commit 08abb33

Merge branch 'release/1.2'

airbreather committed Jul 20, 2019
2 parents e18a956 + cdea175
Showing 46 changed files with 3,404 additions and 379 deletions.
2 changes: 1 addition & 1 deletion Directory.Build.props
```diff
@@ -16,7 +16,7 @@
   </PropertyGroup>
 
   <ItemGroup Condition=" '$(LGTM)' != 'true' ">
-    <PackageReference Include="Microsoft.SourceLink.GitHub" Version="1.0.0-beta2-18618-05" PrivateAssets="All" />
+    <PackageReference Include="Microsoft.SourceLink.GitHub" Version="1.0.0-beta2-19351-01" PrivateAssets="All" />
   </ItemGroup>
 
 </Project>
```
21 changes: 12 additions & 9 deletions README.md
````diff
@@ -94,28 +94,31 @@ public static void ProcessCsvFile(string csvFilePath)
 
 ### Simpler
 1. Create a new instance of your visitor.
-1. Call one of the `Csv.Process*` methods, passing in whatever format your data is in along with your visitor.
+1. Use one of the `CsvSyncInput` or `CsvAsyncInput` methods to create an input object you can use to describe the data to your visitor.
 
 Examples:
 ```csharp
 public static void ProcessCsvFile(string csvFilePath)
 {
     Console.WriteLine($"Started reading '{csvFilePath}'.");
-    Csv.ProcessFile(csvFilePath, new MyVisitor(maxFieldLength: 1000));
+    CsvSyncInput.ForMemoryMappedFile(csvFilePath)
+                .Process(new MyVisitor(maxFieldLength: 1000));
     Console.WriteLine($"Finished reading '{csvFilePath}'.");
 }
 
 public static void ProcessCsvStream(Stream csvStream)
 {
-    Console.WriteLine($"Started reading '{csvFilePath}'.");
-    Csv.ProcessStream(csvStream, new MyVisitor(maxFieldLength: 1000));
-    Console.WriteLine($"Finished reading '{csvFilePath}'.");
+    Console.WriteLine("Started reading CSV file.");
+    CsvSyncInput.ForStream(csvStream)
+                .Process(new MyVisitor(maxFieldLength: 1000));
+    Console.WriteLine("Finished reading CSV file.");
 }
 
-public static async ValueTask ProcessCsvStreamAsync(Stream csvStream, IProgress<int> progress = null, CancellationToken cancellationToken = default)
+public static async Task ProcessCsvStreamAsync(Stream csvStream)
 {
-    Console.WriteLine($"Started reading '{csvFilePath}'.");
-    await Csv.ProcessStreamAsync(csvStream, new MyVisitor(maxFieldLength: 1000), progress, cancellationToken);
-    Console.WriteLine($"Finished reading '{csvFilePath}'.");
+    Console.WriteLine("Started reading CSV file.");
+    await CsvAsyncInput.ForStream(csvStream)
+                       .ProcessAsync(new MyVisitor(maxFieldLength: 1000));
+    Console.WriteLine("Finished reading CSV file.");
 }
 ```
````
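For context, every example above passes a `MyVisitor` that is defined earlier in the README. Below is a minimal sketch of what such a visitor can look like; the `CsvReaderVisitorBase` overrides are the library's actual extension points, but the UTF-8 decoding details and the (omitted) error handling are illustrative only:

```csharp
using System;
using System.Text;
using Cursively;

// Illustrative visitor: decodes each field from UTF-8 and prints the
// fields of each record on one line.
public sealed class MyVisitor : CsvReaderVisitorBase
{
    private readonly Decoder _utf8Decoder = Encoding.UTF8.GetDecoder();
    private readonly char[] _chars;
    private int _charCount;

    public MyVisitor(int maxFieldLength) => _chars = new char[maxFieldLength];

    // Called when a field's data is split across input chunks.
    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) =>
        // fields longer than maxFieldLength would throw here; real code
        // should handle that case explicitly
        _charCount += _utf8Decoder.GetChars(chunk, _chars.AsSpan(_charCount), flush: false);

    // Called with the final (often the only) slice of a field's data.
    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        _charCount += _utf8Decoder.GetChars(chunk, _chars.AsSpan(_charCount), flush: true);
        Console.Write($"[{new string(_chars, 0, _charCount)}]");
        _charCount = 0;
    }

    // Called once per record, after its last field.
    public override void VisitEndOfRecord() => Console.WriteLine();
}
```

Any of the three `Process*` calls shown above can drive this same visitor type.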
1 change: 1 addition & 0 deletions calculate-coverage.cmd
```diff
@@ -9,6 +9,7 @@ tools\OpenCover.4.7.922\tools\OpenCover.Console.exe ^
     "-target:%DotNetPath%" ^
     "-targetArgs:test -c Release --no-build" ^
     "-filter:+[Cursively]* +[Cursively.*]* -[Cursively.Tests]*" ^
+    -excludebyattribute:System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverageAttribute ^
     -output:tools\raw-coverage-results.xml ^
     -register:user ^
     -oldstyle
```
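The new `-excludebyattribute` switch tells OpenCover to omit members tagged with `[ExcludeFromCodeCoverage]` from the report. A hypothetical example of the kind of code this filters out (the `ThrowHelper` type here is invented for illustration):

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

// Throw-helpers are a common exclusion candidate: their only purpose is
// to throw, so "covering" them adds noise rather than signal.
[ExcludeFromCodeCoverage]
internal static class ThrowHelper
{
    public static void ThrowArgumentNull(string paramName) =>
        throw new ArgumentNullException(paramName);
}
```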
77 changes: 77 additions & 0 deletions doc/benchmark-1.2.0.md
This benchmark tests the simple act of counting how many records are in a CSV file. That's not the same as counting lines in the text file: a line break inside a quoted field is data, and a run of consecutive line breaks must be treated as a single record boundary, since each record must have at least one field. Assuming correct implementations, then, this benchmark tests raw CSV processing speed.

Cursively eliminates a ton of overhead found in libraries such as CsvHelper by restricting the allowed input encodings and using the visitor pattern as its only means of output. Cursively can scan through the original bytes of the input to do its work, and it can give slices of the input data directly to the consumer without having to copy or allocate.

Therefore, these benchmarks are somewhat biased in favor of Cursively, as CsvHelper relies on external code to transform the data to UTF-16. This isn't as unfair as it sounds: the overwhelming majority of input files are probably UTF-8 anyway (or a compatible single-byte character set), so practically every user will pay for that transformation.

- Input files can be found here: https://github.com/airbreather/Cursively/tree/v1.2.0/test/Cursively.Benchmark/large-csv-files.zip
- Benchmark source code is here: https://github.com/airbreather/Cursively/tree/v1.2.0/test/Cursively.Benchmark

As of version 1.2.0, these benchmarks no longer run on .NET Framework targets, because earlier benchmarks have shown comparable ratios.

Raw BenchmarkDotNet output is at the bottom, but here are some numbers derived from it, comparing the throughput of CsvHelper against five different ways of using Cursively on several kinds of files. Throughput is each file's size divided by the mean time from the raw output (taking 1 MB = 1,048,576 bytes), and each parenthesized factor is the speedup relative to CsvHelper. This summary does not indicate anything about GC pressure:

| File | Size (bytes) | CsvHelper (MB/s) | Cursively 1* (MB/s) | Cursively 2* (MB/s) | Cursively 3* (MB/s) | Cursively 4* (MB/s) | Cursively 5* (MB/s) |
|-------------------------|-------------:|-----------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|
| 100-huge-records | 2900444 | 27.68 | 482.15 (x17.42) | 528.08 (x19.08) | 443.99 (x16.04) | 448.17 (x16.19) | 408.70 (x14.77) |
| 100-huge-records-quoted | 4900444 | 29.40 | 304.64 (x10.36) | 325.31 (x11.07) | 295.21 (x10.04) | 293.08 (x09.97) | 285.03 (x09.70) |
| 10k-empty-records | 10020000 | 14.59 | 311.97 (x21.38) | 311.27 (x21.33) | 283.40 (x19.42) | 297.41 (x20.38) | 268.23 (x18.38) |
| mocked | 12731500 | 74.29 | 3871.72 (x52.11) | 3771.89 (x50.77) | 1748.01 (x23.53) | 2103.55 (x28.31) | 1240.09 (x16.69) |
| worldcitiespop | 151492068 | 39.39 | 622.85 (x15.81) | 617.22 (x15.67) | 518.00 (x13.15) | 538.54 (x13.67) | 450.63 (x11.44) |

\*Different Cursively methods are (method 1 is sketched after this list):
1. Directly using `CsvTokenizer`
1. `CsvSyncInput.ForMemory`
1. `CsvSyncInput.ForMemoryMappedFile`
1. `CsvSyncInput.ForStream` (using a `FileStream`)
1. `CsvAsyncInput.ForStream` (using a `FileStream` opened in asynchronous mode)
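As a sketch of what method 1 ("directly using `CsvTokenizer`") looks like — the row-counting visitor below is a guess at the benchmark's shape rather than its actual source, which is linked above:

```csharp
using System;
using System.IO;
using Cursively;

// Counts records by reacting only to end-of-record notifications;
// field contents are ignored entirely, just like in the benchmark.
public sealed class RowCountingVisitor : CsvReaderVisitorBase
{
    public long RowCount { get; private set; }

    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) { }
    public override void VisitEndOfField(ReadOnlySpan<byte> chunk) { }
    public override void VisitEndOfRecord() => RowCount++;
}

public static class Program
{
    public static void Main()
    {
        // The file name is illustrative.
        byte[] fileBytes = File.ReadAllBytes("worldcitiespop.csv");
        var visitor = new RowCountingVisitor();
        var tokenizer = new CsvTokenizer();

        // The tokenizer is push-based: feed it one or more chunks, then
        // signal end-of-stream so it can flush a final unterminated record.
        tokenizer.ProcessNextChunk(fileBytes, visitor);
        tokenizer.ProcessEndOfStream(visitor);

        Console.WriteLine($"{visitor.RowCount} records");
    }
}
```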

Raw BenchmarkDotNet output:

``` ini

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i7-6850K CPU 3.60GHz (Skylake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview6-012264
[Host] : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
Job-UPPUKA : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT

Server=True

```
| Method | csvFile | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------------------------------- |--------------------- |-------------:|-----------:|-----------:|------:|--------:|-----------:|---------:|------:|-------------:|
| CountRowsUsingCursivelyRaw | 100-huge-records | 5.737 ms | 0.0037 ms | 0.0033 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | 100-huge-records | 5.238 ms | 0.0066 ms | 0.0062 ms | 0.91 | 0.00 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | 100-huge-records | 6.230 ms | 0.0159 ms | 0.0141 ms | 1.09 | 0.00 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | 100-huge-records | 6.172 ms | 0.0080 ms | 0.0067 ms | 1.08 | 0.00 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | 100-huge-records | 6.768 ms | 0.1349 ms | 0.1262 ms | 1.18 | 0.02 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | 100-huge-records | 99.938 ms | 0.3319 ms | 0.3105 ms | 17.42 | 0.06 | 400.0000 | 200.0000 | - | 110256320 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | 100-h(...)uoted [23] | 15.341 ms | 0.0305 ms | 0.0255 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | 100-h(...)uoted [23] | 14.366 ms | 0.0167 ms | 0.0156 ms | 0.94 | 0.00 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | 100-h(...)uoted [23] | 15.831 ms | 0.0487 ms | 0.0455 ms | 1.03 | 0.00 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | 100-h(...)uoted [23] | 15.946 ms | 0.0383 ms | 0.0358 ms | 1.04 | 0.00 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | 100-h(...)uoted [23] | 16.396 ms | 0.2821 ms | 0.2771 ms | 1.07 | 0.02 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | 100-h(...)uoted [23] | 158.968 ms | 0.1382 ms | 0.1154 ms | 10.36 | 0.02 | 333.3333 | - | - | 153579848 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | 10k-empty-records | 30.631 ms | 0.1009 ms | 0.0894 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | 10k-empty-records | 30.699 ms | 0.0624 ms | 0.0584 ms | 1.00 | 0.00 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | 10k-empty-records | 33.718 ms | 0.0873 ms | 0.0817 ms | 1.10 | 0.00 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | 10k-empty-records | 32.130 ms | 0.0944 ms | 0.0737 ms | 1.05 | 0.00 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | 10k-empty-records | 35.625 ms | 0.7018 ms | 0.7801 ms | 1.17 | 0.03 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | 10k-empty-records | 654.743 ms | 13.0238 ms | 16.9346 ms | 21.42 | 0.51 | 2000.0000 | - | - | 420832856 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | mocked | 3.136 ms | 0.0038 ms | 0.0034 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | mocked | 3.219 ms | 0.0623 ms | 0.0741 ms | 1.02 | 0.02 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | mocked | 6.946 ms | 0.0553 ms | 0.0490 ms | 2.21 | 0.02 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | mocked | 5.772 ms | 0.0365 ms | 0.0324 ms | 1.84 | 0.01 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | mocked | 9.791 ms | 0.1129 ms | 0.1056 ms | 3.12 | 0.04 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | mocked | 163.426 ms | 3.2351 ms | 3.1773 ms | 52.08 | 0.97 | 333.3333 | - | - | 115757736 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | worldcitiespop | 231.955 ms | 1.0755 ms | 0.9534 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | worldcitiespop | 234.071 ms | 1.0749 ms | 0.9529 ms | 1.01 | 0.01 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | worldcitiespop | 278.909 ms | 3.0866 ms | 2.8872 ms | 1.20 | 0.01 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | worldcitiespop | 268.271 ms | 3.4632 ms | 2.8920 ms | 1.16 | 0.02 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | worldcitiespop | 320.606 ms | 1.6204 ms | 1.5157 ms | 1.38 | 0.01 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | worldcitiespop | 3,667.940 ms | 60.8394 ms | 56.9092 ms | 15.82 | 0.24 | 15000.0000 | - | - | 3096694312 B |
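To spot-check the derived summary, here is the worldcitiespop "Cursively 1\*" entry worked out by hand from the raw output above (1 MB = 2^20 bytes):

```
size       = 151,492,068 bytes ÷ 2^20   ≈ 144.47 MB
mean time  = 231.955 ms                 = 0.231955 s
throughput ≈ 144.47 MB ÷ 0.231955 s     ≈ 622.85 MB/s  (the "Cursively 1*" figure)
speedup    ≈ 3,667.940 ms ÷ 231.955 ms  ≈ x15.81       (the parenthesized factor)
```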
8 changes: 8 additions & 0 deletions doc/release-notes.md
# Cursively Release Notes

## [1.2.0](https://github.com/airbreather/Cursively/milestone/4)
- Added fluent helpers to replace the `Csv.ProcessFoo` methods with something that's easier to maintain without being meaningfully less convenient to use ([#15](https://github.com/airbreather/Cursively/issues/15)).
- Deprecated the ability to ignore a leading UTF-8 byte order mark inside the header-aware visitor, per [#14](https://github.com/airbreather/Cursively/issues/14).
  - Instead, it's up to the source of the input to skip (or not skip) sending a leading UTF-8 BOM to the tokenizer in the first place.
  - By default, all the fluent helpers from the previous bullet point will ignore a leading UTF-8 BOM if present. This behavior may be disabled by chaining `.WithIgnoreUTF8ByteOrderMark(false)`, as sketched after this list.
- Improved how the header-aware visitor behaves when the creator requests very high limits ([#17](https://github.com/airbreather/Cursively/issues/17)).
- Fixed a rare off-by-one issue in the header-aware visitor that would happen when a header is exactly as long as the configured maximum **and** its last byte is exactly the last byte of the input chunk that happens to contain it ([#16](https://github.com/airbreather/Cursively/issues/16)).
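A minimal sketch of that opt-out, assuming a `Stream` input and a visitor type like the README's `MyVisitor`:

```csharp
CsvSyncInput.ForStream(csvStream)
            .WithIgnoreUTF8ByteOrderMark(false) // deliver any leading BOM bytes to the visitor as data
            .Process(new MyVisitor(maxFieldLength: 1000));
```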

## [1.1.0](https://github.com/airbreather/Cursively/milestone/1)
- Several further performance optimizations. Most significantly, inlining and tuning a critical `ReadOnlySpan<T>` extension method.
- In some cases, this increased throughput by a factor of 3.
2 changes: 1 addition & 1 deletion doc/toc.yml
```diff
@@ -3,7 +3,7 @@
 - name: API Documentation
   href: obj/api/
 - name: Benchmark
-  href: benchmark-1.1.0.md
+  href: benchmark-1.2.0.md
 - name: Release Notes
   href: release-notes.md
 - name: NuGet Package
```
2 changes: 1 addition & 1 deletion generate-docs.cmd
```diff
@@ -2,7 +2,7 @@
 REM ===========================================================================
 REM Regenerates the https://airbreather.github.io/Cursively content locally
 REM ===========================================================================
-set DOCFX_PACKAGE_VERSION=2.42.4
+set DOCFX_PACKAGE_VERSION=2.43.1
 pushd %~dp0
 REM incremental / cached builds tweak things about the output, so let's do it
 REM all fresh if we can help it...
```