Add lock-free connection pool #462

bgrainger · 2018-03-20T06:45:37Z

See example from @sebastienros here: sebastienros@ffe984f

Also implement the optimisation to stagger probes into the pool from @roji: npgsql/npgsql@19c436e#diff-33bb36ea0d45f66ae316561892d5b5a9R236

The text was updated successfully, but these errors were encountered:

roji · 2018-03-20T17:31:11Z

Wow, @sebastienros and @bgrainger, I didn't know you were working on a lock-free pool...

The more I work and think about the pooling issue, the more I'm convinced there needs to be a totally common component (https://github.com/dotnet/corefx/issues/26714) shared across providers (in fact, I'm thinking about a generic pooling API that wouldn't even be specific to ADO.NET/databases). I'm planning to spend some time on making my implementation a bit more generic to go in this direction, any reviewing/feedback would be really great (here's my pool). I think that if you guys are planning to work on the pool it may be a thing we should try to do together, and maybe even go towards a shared component.

Anyway, @sebastienros I've taken a quick look, here are some comments:

I'm sure this is just WIP, but the new implementation doesn't seem to impose Max Pool Size - if you reach the limit a new connection is allocated and returned. This means you can open an unbounded number of physical database connections, which is probably something we don't want.
If/when you do implement blocking when Max Pool Size is reached, it's pretty important for this to respect FIFO semantics, otherwise you get connection attempts timing out because later attempts got fulfilled first. This is why a simple lock/semaphore isn't very suitable and an actual queue is used in my implementation.
Not exactly pool-related but: It seems that (by default) MySqlConnector resets connection state every time a pooled connection is opened. If I'm reading the code right, this is an actual database roundtrip that's done every time. Npgsql used to work like this, until I changed it to "enqueue" the reset messages and to send them as part of the first statement after the connection is reopened. This effectively eliminates a roundtrip for each pooled open/close, which is pretty significant.
Related: I can also see that if connection reset is disabled, MySqlConnector does a ping roundtrip instead. This was also removed from Npgsql at some point, and I occasionally receive complaints from users that the pool sometimes broken connections. The reason it doesn't make sense to me to ping, is that a connection may broken at any point - 1 millisecond after the ping check happens - and user code must be capable of dealing with that. Npgsql does have an opt-in keepalive which partially takes care of it, but the problem will always remain and paying a database roundtrip to try to avoid it doesn't make sense to me.
Note that in my implementation I worked the extra mile to make sure that the fast-path invocation (i.e. an idle connection is available in the pool) isn't an async method, to avoid the state machine overhead.
When getting a session out, you can use Interlocked.Exchange() instead of Interlocked.CompareAndExchange().

bgrainger · 2018-03-20T17:42:41Z

It seems that (by default) MySqlConnector resets connection state every time a pooled connection is opened. If I'm reading the code right, this is an actual database roundtrip that's done every time.

That's correct.

Npgsql used to work like this, until I changed it to "enqueue" the reset messages and to send them as part of the first statement after the connection is reopened. This effectively eliminates a roundtrip for each pooled open/close, which is pretty significant.

My last attempt at improving this was to reset the connection in the background: #178. I think queueing the reset message is a better approach except...

I occasionally receive complaints from users that the pool sometimes broken connections

... performing the reset connection while retrieving a connection from the pool has the very nice side-effect that the connection is actually open when Open returns...

a connection may broken at any point

... unless of course there is a catastrophic network/server failure a nanosecond after Open returns.

In theory, I agree that users should expect that any database call can fail and that appropriate retry logic needs to be implemented. In practice, it seems like there's a lot of code that just assumes everything's going to be just fine if Open returns without throwing an exception; pushing that exception (for a broken pooled connection) down to the first call to ExecuteReader etc. is a performance-vs-reliability tradeoff that (currently) doesn't seem worth making by default.

(I'm all for allowing users to opt into high-performance at the potential cost of correctness or reliability if they determine they absolutely need it. However, it seems best to be "safe" by default. And FWIW MySqlConnector is already faster than Connector/NET for retrieving a pooled connection even with ConnectionReset=true. So MySQL users can get safety and better performance by switching ADO.NET providers, then opt into higher performance if they need it.)

sebastienros · 2018-03-20T17:42:59Z

It's more of a POC to see how it would benefit performance, but your comments are great to go towards a better final implementation.

bgrainger · 2018-03-20T17:46:43Z

there needs to be a totally common component shared across providers

I'd love to be able to reuse a high-performance, well-tested, already-debugged component (as opposed to having to develop and support my own implementation).

It would be nice to be able to reuse common code without introducing a new dependency. Are NuGet source-only packages still a thing? (I've not used them in the past, and I'm not sure if they made it into the new world or not.) Git submodules might be an answer? (They're not without their own problems, of course.)

roji · 2018-03-20T18:02:41Z

(I'm all for allowing users to opt into high-performance at the potential cost of correctness or reliability if they determine they absolutely need it. However, it seems best to be "safe" by default. And FWIW MySqlConnector is already faster than Connector/NET for retrieving a pooled connection even with ConnectionReset=true. So MySQL users can get safety and better performance by switching ADO.NET providers, then opt into higher performance if they need it.)

I agree that there's a bit of a tradeoff here, although it seems to be the SqlClient behavior:

If a connection exists to a server that has disappeared, this connection can be drawn from the pool even if the connection pooler has not detected the severed connection and marked it as invalid. This is the case because the overhead of checking that the connection is still valid would eliminate the benefits of having a pooler by causing another round trip to the server to occur. When this occurs, the first attempt to use the connection will detect that the connection has been severed, and an exception is thrown.

A similar note can be found in the ODBC driver developer guidelines:

When the Driver Manager is pooling connections, it needs to be able to determine if a connection is still working before handing out the connection. Otherwise, the Driver Manager keeps on handing out the dead connection to the application whenever a transient network failure occurs. A new connection attribute has been defined in ODBC 3.x: SQL_ATTR_CONNECTION_DEAD. This is a read-only connection attribute that returns either SQL_CD_TRUE or SQL_CD_FALSE. The value SQL_CD_TRUE means that the connection has been lost, while the value SQL_CD_FALSE means that the connection is still active. (Drivers conforming to earlier versions of ODBC can also support this attribute.)

A driver must implement this option efficiently or it will impair the connection pooling performance. Specifically, a call to get this connection attribute should not cause a round trip to the server. Instead, a driver should just return the last known state of the connection. The connection is dead if the last trip to the server failed, and not dead if the last trip succeeded.

As I said above, in Npgsql this can be somewhat mitigated via keepalive, an opt-in feature which does a roundtrip on connections every X seconds (even when in the pool). The advantage of this is that it does not happen on the code path of Open() or Close() - it's a totally "out-of-band" operation that checks your connection without affecting performance.

But you're right that some users may be surprised here, I'd at least recommend an opt-out from this for high-perf apps.

It would be nice to be able to reuse common code without introducing a new dependency. Are NuGet source-only packages still a thing? (I've not used them in the past, and I'm not sure if they made it into the new world or not.) Git submodules might be an answer? (They're not without their own problems, of course.)

Having this as a source-only nuget is a great idea, I think they still exist (see this thread).

In any case I'll look into making my own pool less coupled to PostgreSQL etc.

bgrainger · 2018-03-20T19:35:55Z

I agree that there's a bit of a tradeoff here, although it seems to be the SqlClient behavior

Thanks for this pointer; it's helpful.

bgrainger · 2019-02-26T04:12:39Z

@roji By the time you told me you were planning to work on abstracting a connection pool, I had already ported most of the code over to MySqlConnector (but I didn't mention it, since it was still WIP). I decided to clean it up today and start to run some performance tests: #614.

They also say you shouldn't create an abstraction until you've implemented it 3 times (or something like that) so it might be helpful to see how I needed to tweak the implementation to meet my specific needs. (Hopefully not breaking it in the process.)

bgrainger · 2019-02-26T05:26:02Z

@sebastienros I've tested these commits with TFB Docker locally, but I don't think I have sufficient system resources (particularly with the overhead of Docker for Windows) to see if the changes make a difference. (That is, I'm think already able to push my system to 100% usage and making the code faster doesn't result in any more requests per second.)

If you're able, can you test the following commits and report back the speeds achieved? I'm interested to know if any individual commit results in a significant performance increase or regression:

11f1c42 (baseline)
14ac31e (master before lock-free pool)
bgrainger@4e1861c (lock-free pool)
bgrainger@af2f1d5 (RunContinuationsAsynchronously)
bgrainger@6ed13ac (LIFO pool)
bgrainger@f168563 (lock-free busy sessions)

roji · 2019-02-26T10:41:40Z

/cc @divega @ajcvickers @bricelam

It's great to see this, and I'm really looking forward to seeing the perf impact here.

FWIW we don't have to do an actual reusable component with an abstraction - porting the code could be a very good start, and as you say having an implement on 2 providers is a much better place to start from if we do want a reusable component. My plate is quite full at the moment anyway, let's see where your efforts lead - I may also end up taking some of your improvements back into Npgsql at some point (e.g. the LIFO behavior).

bgrainger · 2019-02-26T14:40:22Z

@sebastienros I've taken those six commits and applied them to (a fork of) FrameworkBenchmarks (using a submodule), if that makes testing easier:

bgrainger/FrameworkBenchmarks@c2beb99 (baseline)
bgrainger/FrameworkBenchmarks@2863905 (master before lock-free pool)
bgrainger/FrameworkBenchmarks@6ea700c (lock-free pool)
bgrainger/FrameworkBenchmarks@0b4ee38 (RunContinuationsAsynchronously)
bgrainger/FrameworkBenchmarks@51a85d5 (LIFO pool)
bgrainger/FrameworkBenchmarks@8881b4b (lock-free busy sessions)

sebastienros · 2019-03-08T22:01:56Z

I ran all these branches on our local physical machines, which are not the TE ones but the ratio in perf will be the same.

Only the RPS and Average lantency numbers matter in this case

| Description |     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| ----------- | ------- | ------- | ----------- | ----------------- | ------------ | ------------------ | ------------ | ----- |
|   TE master | 106,406 |      93 |         395 |              2.67 |          834 |             491.85 |         1.35 |  1.00 | 
|     c2beb99 | 105,860 |      92 |         389 |              2.68 |          796 |             517.37 |         1.15 |  0.99 |  
|     2863905 | 105,049 |      94 |         392 |              2.70 |          804 |             528.89 |         1.21 |  0.99 | 
|     6ea700c | 108,851 |      93 |         386 |              2.66 |          795 |             517.35 |         1.41 |  1.02 | 
|     0b4ee38 | 105,834 |      92 |         401 |              2.71 |          810 |              525.6 |         1.40 |  0.99 | 
|     51a85d5 | 104,385 |      91 |         391 |              2.74 |          798 |             525.96 |         1.31 |  0.98 | 
|     8881b4b | 105,104 |      98 |         386 |              2.62 |          815 |             509.18 |         1.32 |  0.99 |

sebastienros · 2019-03-08T22:03:13Z

For future reference (note to myself) I am adding the command line I used to run these:

dotnet run -- --server http://asp-perf-lin:5001 --client http://asp-perf-load:5002 `
-r https://github.com/bgrainger/FrameworkBenchmarks@master#8881b4b `
--docker-file frameworks/CSharp/aspnetcore/aspcore-ado-my.dockerfile `
--docker-image aspnetcore_ado_my `
--docker-context frameworks/CSharp/aspnetcore/ `
--ready-text "Application started." `
--clientName bombardier `
--path /fortunes --port 8080 `
--diff TechEmpower --save 8881b4b

sebastienros · 2019-03-08T22:14:17Z

If you expected more variance please provide a PR that makes things really bad to ensure I ran these correctly.

bgrainger · 2019-03-08T23:45:47Z

@sebastienros Here's a PR with Thread.Sleep(1) added; it should reduce performance by an order of magnitude:

bgrainger/FrameworkBenchmarks@f288f97

sebastienros · 2019-03-09T00:18:17Z

It works, which means the previous results are valid

|     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| ------- | ------- | ----------- | ----------------- | ------------ | ------------------ | ------------ | ----- |
| 106,406 |      93 |         395 |              2.67 |          834 |             491.85 |         1.35 |  1.00 |
|  16,731 |      28 |         402 |             24.92 |          791 |             520.86 |         2.37 |  0.16 |

mkromkamp · 2019-06-09T09:50:07Z

Is this still actively discussed and/or developed? I'm having a certain workload that benefits from higher throughput/rps even it would mean a slight payoff in latency.

If this is still on-going I'm willing to do some testing to see if it could benefit such use-cases in real life. 👍

bgrainger · 2019-06-09T14:31:09Z

The code is at https://github.com/bgrainger/MySqlConnector/commits/lock-free-pool (except for the last commit on that branch).

It wasn't clear to me from the results above that this actually improved performance, so given the added complexity I decided not to integrate it yet. If you have a scenario where you can build MySqlConnector from source and test it, I'm very interested to hear your results.

Note that MySqlConnector logs over 200k RPS on the TFB Benchmarks (see aspcore-ado-my) so it's already capable of sustaining high throughput. What kind of numbers are you seeing and what are you hoping to achieve?

mkromkamp · 2019-06-11T07:41:41Z

The code is at https://github.com/bgrainger/MySqlConnector/commits/lock-free-pool (except for the last commit on that branch).

It wasn't clear to me from the results above that this actually improved performance, so given the added complexity I decided not to integrate it yet. If you have a scenario where you can build MySqlConnector from source and test it, I'm very interested to hear your results.

Thanks for pointing this out. It is also not clear for me if the there are real benifits in the implementation. Just wanted to reach out to see if the conversation would benefit from some additional testing. I will take a look at building the library from source and hook it into my application. If I have any meaningfull information to share I will drop you a message over here.

Note that MySqlConnector logs over 200k RPS on the TFB Benchmarks (see aspcore-ado-my) so it's already capable of sustaining high throughput. What kind of numbers are you seeing and what are you hoping to achieve?

That is impressive, and from looking at the benchmark, it is not clear MySqlConnector is used over here. Might be worth adding this somewhere on the website and/or readme on GitHub.
Do you know anything about the setup they are using(MySql config, how many webservers, connection configuration, etc)? Might give some great information how to squeeze every bit of performance out of such a situation 👍

bgrainger · 2019-06-11T13:34:25Z

TechEmpower server hardware is at https://www.techempower.com/benchmarks/#section=environment:

Citrine (rounds 16 onward)
Three homogeneous Dell R440 servers each equipped with an Intel Xeon Gold 5120 CPU, 32 GB of memory, and an enterprise SSD. Dedicated Cisco 10-gigabit Ethernet switch. Provided by Microsoft.

I believe they use one for the benchmarking client, one for the web framework being tested, and one for the database server. I have not found any information on how they're tuning MySQL for performance in this environment.

bgrainger · 2019-06-11T13:37:40Z

I will take a look at building the library from source and hook it into my application.

Note that the lock-free code branched from the main library somewhere between 0.49.3 and 0.50.0; there shouldn't be any major performance changes between those two versions but obviously some of the newer features won't be present. It looks like it won't currently merge cleanly.

bgrainger added the enhancement label Mar 20, 2018

lauxjpn mentioned this issue Dec 7, 2019

Consider forcing upper limit for connection pool by blocking #747

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lock-free connection pool #462

Add lock-free connection pool #462

bgrainger commented Mar 20, 2018

roji commented Mar 20, 2018

bgrainger commented Mar 20, 2018

sebastienros commented Mar 20, 2018

bgrainger commented Mar 20, 2018

roji commented Mar 20, 2018

bgrainger commented Mar 20, 2018

bgrainger commented Feb 26, 2019

bgrainger commented Feb 26, 2019

roji commented Feb 26, 2019

bgrainger commented Feb 26, 2019

sebastienros commented Mar 8, 2019

sebastienros commented Mar 8, 2019 •

edited

Loading

sebastienros commented Mar 8, 2019

bgrainger commented Mar 8, 2019 •

edited

Loading

sebastienros commented Mar 9, 2019

mkromkamp commented Jun 9, 2019

bgrainger commented Jun 9, 2019

mkromkamp commented Jun 11, 2019

bgrainger commented Jun 11, 2019

bgrainger commented Jun 11, 2019

Add lock-free connection pool #462

Add lock-free connection pool #462

Comments

bgrainger commented Mar 20, 2018

roji commented Mar 20, 2018

bgrainger commented Mar 20, 2018

sebastienros commented Mar 20, 2018

bgrainger commented Mar 20, 2018

roji commented Mar 20, 2018

bgrainger commented Mar 20, 2018

bgrainger commented Feb 26, 2019

bgrainger commented Feb 26, 2019

roji commented Feb 26, 2019

bgrainger commented Feb 26, 2019

sebastienros commented Mar 8, 2019

sebastienros commented Mar 8, 2019 • edited Loading

sebastienros commented Mar 8, 2019

bgrainger commented Mar 8, 2019 • edited Loading

sebastienros commented Mar 9, 2019

mkromkamp commented Jun 9, 2019

bgrainger commented Jun 9, 2019

mkromkamp commented Jun 11, 2019

bgrainger commented Jun 11, 2019

bgrainger commented Jun 11, 2019

sebastienros commented Mar 8, 2019 •

edited

Loading

bgrainger commented Mar 8, 2019 •

edited

Loading