Commit 08abb33

Merge branch 'release/1.2'

airbreather committed Jul 20, 2019
2 parents e18a956 + cdea175
Showing 46 changed files with 3,404 additions and 379 deletions.
2 changes: 1 addition & 1 deletion Directory.Build.props
```diff
@@ -16,7 +16,7 @@
   </PropertyGroup>
 
   <ItemGroup Condition=" '$(LGTM)' != 'true' ">
-    <PackageReference Include="Microsoft.SourceLink.GitHub" Version="1.0.0-beta2-18618-05" PrivateAssets="All" />
+    <PackageReference Include="Microsoft.SourceLink.GitHub" Version="1.0.0-beta2-19351-01" PrivateAssets="All" />
   </ItemGroup>
 
 </Project>
```
21 changes: 12 additions & 9 deletions README.md
````diff
@@ -94,28 +94,31 @@ public static void ProcessCsvFile(string csvFilePath)
 
 ### Simpler
 1. Create a new instance of your visitor.
-1. Call one of the `Csv.Process*` methods, passing in whatever format your data is in along with your visitor.
+1. Use one of the `CsvSyncInput` or `CsvAsyncInput` methods to create an input object you can use to describe the data to your visitor.
 
 Examples:
 ```csharp
 public static void ProcessCsvFile(string csvFilePath)
 {
     Console.WriteLine($"Started reading '{csvFilePath}'.");
-    Csv.ProcessFile(csvFilePath, new MyVisitor(maxFieldLength: 1000));
+    CsvSyncInput.ForMemoryMappedFile(csvFilePath)
+                .Process(new MyVisitor(maxFieldLength: 1000));
     Console.WriteLine($"Finished reading '{csvFilePath}'.");
 }
 
 public static void ProcessCsvStream(Stream csvStream)
 {
-    Console.WriteLine($"Started reading '{csvFilePath}'.");
-    Csv.ProcessStream(csvStream, new MyVisitor(maxFieldLength: 1000));
-    Console.WriteLine($"Finished reading '{csvFilePath}'.");
+    Console.WriteLine("Started reading CSV file.");
+    CsvSyncInput.ForStream(csvStream)
+                .Process(new MyVisitor(maxFieldLength: 1000));
+    Console.WriteLine("Finished reading CSV file.");
 }
 
-public static async ValueTask ProcessCsvStreamAsync(Stream csvStream, IProgress<int> progress = null, CancellationToken cancellationToken = default)
+public static async Task ProcessCsvStreamAsync(Stream csvStream)
 {
-    Console.WriteLine($"Started reading '{csvFilePath}'.");
-    await Csv.ProcessStreamAsync(csvStream, new MyVisitor(maxFieldLength: 1000), progress, cancellationToken);
-    Console.WriteLine($"Finished reading '{csvFilePath}'.");
+    Console.WriteLine("Started reading CSV file.");
+    await CsvAsyncInput.ForStream(csvStream)
+                       .ProcessAsync(new MyVisitor(maxFieldLength: 1000));
+    Console.WriteLine("Finished reading CSV file.");
 }
 ```
````
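For context, every example above passes a `MyVisitor` that is defined earlier in the README. Below is a minimal sketch of what such a visitor can look like; the `CsvReaderVisitorBase` overrides are the library's actual extension points, but the UTF-8 decoding details and the (omitted) error handling are illustrative only:

```csharp
using System;
using System.Text;
using Cursively;

// Illustrative visitor: decodes each field from UTF-8 and prints the
// fields of each record on one line.
public sealed class MyVisitor : CsvReaderVisitorBase
{
    private readonly Decoder _utf8Decoder = Encoding.UTF8.GetDecoder();
    private readonly char[] _chars;
    private int _charCount;

    public MyVisitor(int maxFieldLength) => _chars = new char[maxFieldLength];

    // Called when a field's data is split across input chunks.
    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) =>
        // fields longer than maxFieldLength would throw here; real code
        // should handle that case explicitly
        _charCount += _utf8Decoder.GetChars(chunk, _chars.AsSpan(_charCount), flush: false);

    // Called with the final (often the only) slice of a field's data.
    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        _charCount += _utf8Decoder.GetChars(chunk, _chars.AsSpan(_charCount), flush: true);
        Console.Write($"[{new string(_chars, 0, _charCount)}]");
        _charCount = 0;
    }

    // Called once per record, after its last field.
    public override void VisitEndOfRecord() => Console.WriteLine();
}
```

Any of the three `Process*` calls shown above can drive this same visitor type.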
1 change: 1 addition & 0 deletions calculate-coverage.cmd
```diff
@@ -9,6 +9,7 @@ tools\OpenCover.4.7.922\tools\OpenCover.Console.exe ^
     "-target:%DotNetPath%" ^
     "-targetArgs:test -c Release --no-build" ^
     "-filter:+[Cursively]* +[Cursively.*]* -[Cursively.Tests]*" ^
+    -excludebyattribute:System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverageAttribute ^
     -output:tools\raw-coverage-results.xml ^
     -register:user ^
     -oldstyle
```
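The new `-excludebyattribute` switch tells OpenCover to omit members tagged with `[ExcludeFromCodeCoverage]` from the report. A hypothetical example of the kind of code this filters out (the `ThrowHelper` type here is invented for illustration):

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

// Throw-helpers are a common exclusion candidate: their only purpose is
// to throw, so "covering" them adds noise rather than signal.
[ExcludeFromCodeCoverage]
internal static class ThrowHelper
{
    public static void ThrowArgumentNull(string paramName) =>
        throw new ArgumentNullException(paramName);
}
```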
77 changes: 77 additions & 0 deletions doc/benchmark-1.2.0.md
This benchmark tests the simple act of counting how many records are in a CSV file. That's not the same as counting lines in the text file: a line break inside a quoted field is data, and a run of consecutive line breaks must be treated as a single record boundary, since each record must have at least one field. Assuming correct implementations, then, this benchmark tests raw CSV processing speed.

Cursively eliminates a ton of overhead found in libraries such as CsvHelper by restricting the allowed input encodings and using the visitor pattern as its only means of output. Cursively can scan through the original bytes of the input to do its work, and it can give slices of the input data directly to the consumer without having to copy or allocate.

Therefore, these benchmarks are somewhat biased in favor of Cursively, as CsvHelper relies on external code to transform the data to UTF-16. This isn't as unfair as it sounds: the overwhelming majority of input files are probably UTF-8 anyway (or a compatible single-byte character set), so practically every user will pay for that transformation.

- Input files can be found here: https://github.com/airbreather/Cursively/tree/v1.2.0/test/Cursively.Benchmark/large-csv-files.zip
- Benchmark source code is here: https://github.com/airbreather/Cursively/tree/v1.2.0/test/Cursively.Benchmark

As of version 1.2.0, these benchmarks no longer run on .NET Framework targets, because earlier benchmarks have shown comparable ratios.

Raw BenchmarkDotNet output is at the bottom, but here are some numbers derived from it, comparing the throughput of CsvHelper against five different ways of using Cursively on several kinds of files. Throughput is each file's size divided by the mean time from the raw output (taking 1 MB = 1,048,576 bytes), and each parenthesized factor is the speedup relative to CsvHelper. This summary does not indicate anything about GC pressure:

| File | Size (bytes) | CsvHelper (MB/s) | Cursively 1* (MB/s) | Cursively 2* (MB/s) | Cursively 3* (MB/s) | Cursively 4* (MB/s) | Cursively 5* (MB/s) |
|-------------------------|-------------:|-----------------:|--------------------:|--------------------:|--------------------:|--------------------:|--------------------:|
| 100-huge-records | 2900444 | 27.68 | 482.15 (x17.42) | 528.08 (x19.08) | 443.99 (x16.04) | 448.17 (x16.19) | 408.70 (x14.77) |
| 100-huge-records-quoted | 4900444 | 29.40 | 304.64 (x10.36) | 325.31 (x11.07) | 295.21 (x10.04) | 293.08 (x09.97) | 285.03 (x09.70) |
| 10k-empty-records | 10020000 | 14.59 | 311.97 (x21.38) | 311.27 (x21.33) | 283.40 (x19.42) | 297.41 (x20.38) | 268.23 (x18.38) |
| mocked | 12731500 | 74.29 | 3871.72 (x52.11) | 3771.89 (x50.77) | 1748.01 (x23.53) | 2103.55 (x28.31) | 1240.09 (x16.69) |
| worldcitiespop | 151492068 | 39.39 | 622.85 (x15.81) | 617.22 (x15.67) | 518.00 (x13.15) | 538.54 (x13.67) | 450.63 (x11.44) |

\*Different Cursively methods are (method 1 is sketched after this list):
1. Directly using `CsvTokenizer`
1. `CsvSyncInput.ForMemory`
1. `CsvSyncInput.ForMemoryMappedFile`
1. `CsvSyncInput.ForStream` (using a `FileStream`)
1. `CsvAsyncInput.ForStream` (using a `FileStream` opened in asynchronous mode)
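As a sketch of what method 1 ("directly using `CsvTokenizer`") looks like — the row-counting visitor below is a guess at the benchmark's shape rather than its actual source, which is linked above:

```csharp
using System;
using System.IO;
using Cursively;

// Counts records by reacting only to end-of-record notifications;
// field contents are ignored entirely, just like in the benchmark.
public sealed class RowCountingVisitor : CsvReaderVisitorBase
{
    public long RowCount { get; private set; }

    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) { }
    public override void VisitEndOfField(ReadOnlySpan<byte> chunk) { }
    public override void VisitEndOfRecord() => RowCount++;
}

public static class Program
{
    public static void Main()
    {
        // The file name is illustrative.
        byte[] fileBytes = File.ReadAllBytes("worldcitiespop.csv");
        var visitor = new RowCountingVisitor();
        var tokenizer = new CsvTokenizer();

        // The tokenizer is push-based: feed it one or more chunks, then
        // signal end-of-stream so it can flush a final unterminated record.
        tokenizer.ProcessNextChunk(fileBytes, visitor);
        tokenizer.ProcessEndOfStream(visitor);

        Console.WriteLine($"{visitor.RowCount} records");
    }
}
```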

Raw BenchmarkDotNet output:

``` ini

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i7-6850K CPU 3.60GHz (Skylake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview6-012264
[Host] : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT
Job-UPPUKA : .NET Core 2.2.6 (CoreCLR 4.6.27817.03, CoreFX 4.6.27818.02), 64bit RyuJIT

Server=True

```
| Method | csvFile | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------------------------------- |--------------------- |-------------:|-----------:|-----------:|------:|--------:|-----------:|---------:|------:|-------------:|
| CountRowsUsingCursivelyRaw | 100-huge-records | 5.737 ms | 0.0037 ms | 0.0033 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | 100-huge-records | 5.238 ms | 0.0066 ms | 0.0062 ms | 0.91 | 0.00 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | 100-huge-records | 6.230 ms | 0.0159 ms | 0.0141 ms | 1.09 | 0.00 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | 100-huge-records | 6.172 ms | 0.0080 ms | 0.0067 ms | 1.08 | 0.00 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | 100-huge-records | 6.768 ms | 0.1349 ms | 0.1262 ms | 1.18 | 0.02 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | 100-huge-records | 99.938 ms | 0.3319 ms | 0.3105 ms | 17.42 | 0.06 | 400.0000 | 200.0000 | - | 110256320 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | 100-h(...)uoted [23] | 15.341 ms | 0.0305 ms | 0.0255 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | 100-h(...)uoted [23] | 14.366 ms | 0.0167 ms | 0.0156 ms | 0.94 | 0.00 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | 100-h(...)uoted [23] | 15.831 ms | 0.0487 ms | 0.0455 ms | 1.03 | 0.00 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | 100-h(...)uoted [23] | 15.946 ms | 0.0383 ms | 0.0358 ms | 1.04 | 0.00 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | 100-h(...)uoted [23] | 16.396 ms | 0.2821 ms | 0.2771 ms | 1.07 | 0.02 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | 100-h(...)uoted [23] | 158.968 ms | 0.1382 ms | 0.1154 ms | 10.36 | 0.02 | 333.3333 | - | - | 153579848 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | 10k-empty-records | 30.631 ms | 0.1009 ms | 0.0894 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | 10k-empty-records | 30.699 ms | 0.0624 ms | 0.0584 ms | 1.00 | 0.00 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | 10k-empty-records | 33.718 ms | 0.0873 ms | 0.0817 ms | 1.10 | 0.00 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | 10k-empty-records | 32.130 ms | 0.0944 ms | 0.0737 ms | 1.05 | 0.00 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | 10k-empty-records | 35.625 ms | 0.7018 ms | 0.7801 ms | 1.17 | 0.03 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | 10k-empty-records | 654.743 ms | 13.0238 ms | 16.9346 ms | 21.42 | 0.51 | 2000.0000 | - | - | 420832856 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | mocked | 3.136 ms | 0.0038 ms | 0.0034 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | mocked | 3.219 ms | 0.0623 ms | 0.0741 ms | 1.02 | 0.02 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | mocked | 6.946 ms | 0.0553 ms | 0.0490 ms | 2.21 | 0.02 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | mocked | 5.772 ms | 0.0365 ms | 0.0324 ms | 1.84 | 0.01 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | mocked | 9.791 ms | 0.1129 ms | 0.1056 ms | 3.12 | 0.04 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | mocked | 163.426 ms | 3.2351 ms | 3.1773 ms | 52.08 | 0.97 | 333.3333 | - | - | 115757736 B |
| | | | | | | | | | | |
| CountRowsUsingCursivelyRaw | worldcitiespop | 231.955 ms | 1.0755 ms | 0.9534 ms | 1.00 | 0.00 | - | - | - | 48 B |
| CountRowsUsingCursivelyArrayInput | worldcitiespop | 234.071 ms | 1.0749 ms | 0.9529 ms | 1.01 | 0.01 | - | - | - | 96 B |
| CountRowsUsingCursivelyMemoryMappedFileInput | worldcitiespop | 278.909 ms | 3.0866 ms | 2.8872 ms | 1.20 | 0.01 | - | - | - | 544 B |
| CountRowsUsingCursivelyFileStreamInput | worldcitiespop | 268.271 ms | 3.4632 ms | 2.8920 ms | 1.16 | 0.02 | - | - | - | 272 B |
| CountRowsUsingCursivelyAsyncFileStreamInput | worldcitiespop | 320.606 ms | 1.6204 ms | 1.5157 ms | 1.38 | 0.01 | - | - | - | 1360 B |
| CountRowsUsingCsvHelper | worldcitiespop | 3,667.940 ms | 60.8394 ms | 56.9092 ms | 15.82 | 0.24 | 15000.0000 | - | - | 3096694312 B |
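To spot-check the derived summary, here is the worldcitiespop "Cursively 1\*" entry worked out by hand from the raw output above (1 MB = 2^20 bytes):

```
size       = 151,492,068 bytes ÷ 2^20   ≈ 144.47 MB
mean time  = 231.955 ms                 = 0.231955 s
throughput ≈ 144.47 MB ÷ 0.231955 s     ≈ 622.85 MB/s  (the "Cursively 1*" figure)
speedup    ≈ 3,667.940 ms ÷ 231.955 ms  ≈ x15.81       (the parenthesized factor)
```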
8 changes: 8 additions & 0 deletions doc/release-notes.md
# Cursively Release Notes

## [1.2.0](https://github.com/airbreather/Cursively/milestone/4)
- Added fluent helpers to replace the `Csv.ProcessFoo` methods with something that's easier to maintain without being meaningfully less convenient to use ([#15](https://github.com/airbreather/Cursively/issues/15)).
- Deprecated the ability to ignore a leading UTF-8 byte order mark inside the header-aware visitor, per [#14](https://github.com/airbreather/Cursively/issues/14).
  - Instead, it's up to the source of the input to skip (or not skip) sending a leading UTF-8 BOM to the tokenizer in the first place.
  - By default, all the fluent helpers from the previous bullet point will ignore a leading UTF-8 BOM if present. This behavior may be disabled by chaining `.WithIgnoreUTF8ByteOrderMark(false)`, as sketched after this list.
- Improved how the header-aware visitor behaves when the creator requests very high limits ([#17](https://github.com/airbreather/Cursively/issues/17)).
- Fixed a rare off-by-one issue in the header-aware visitor that would happen when a header is exactly as long as the configured maximum **and** its last byte is exactly the last byte of the input chunk that happens to contain it ([#16](https://github.com/airbreather/Cursively/issues/16)).
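A minimal sketch of that opt-out, assuming a `Stream` input and a visitor type like the README's `MyVisitor`:

```csharp
CsvSyncInput.ForStream(csvStream)
            .WithIgnoreUTF8ByteOrderMark(false) // deliver any leading BOM bytes to the visitor as data
            .Process(new MyVisitor(maxFieldLength: 1000));
```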

## [1.1.0](https://github.com/airbreather/Cursively/milestone/1)
- Several further performance optimizations. Most significantly, inlining and tuning a critical `ReadOnlySpan<T>` extension method.
- In some cases, this increased throughput by a factor of 3.
2 changes: 1 addition & 1 deletion doc/toc.yml
```diff
@@ -3,7 +3,7 @@
 - name: API Documentation
   href: obj/api/
 - name: Benchmark
-  href: benchmark-1.1.0.md
+  href: benchmark-1.2.0.md
 - name: Release Notes
   href: release-notes.md
 - name: NuGet Package
```
2 changes: 1 addition & 1 deletion generate-docs.cmd
```diff
@@ -2,7 +2,7 @@
 REM ===========================================================================
 REM Regenerates the https://airbreather.github.io/Cursively content locally
 REM ===========================================================================
-set DOCFX_PACKAGE_VERSION=2.42.4
+set DOCFX_PACKAGE_VERSION=2.43.1
 pushd %~dp0
 REM incremental / cached builds tweak things about the output, so let's do it
 REM all fresh if we can help it...
```