Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lock-free connection pool #462

Open
bgrainger opened this issue Mar 20, 2018 · 20 comments
Open

Add lock-free connection pool #462

bgrainger opened this issue Mar 20, 2018 · 20 comments

Comments

@bgrainger
Copy link
Member

See example from @sebastienros here: sebastienros@ffe984f

Also implement the optimisation to stagger probes into the pool from @roji: npgsql/npgsql@19c436e#diff-33bb36ea0d45f66ae316561892d5b5a9R236

@roji
Copy link

roji commented Mar 20, 2018

Wow, @sebastienros and @bgrainger, I didn't know you were working on a lock-free pool...

The more I work and think about the pooling issue, the more I'm convinced there needs to be a totally common component (https://github.com/dotnet/corefx/issues/26714) shared across providers (in fact, I'm thinking about a generic pooling API that wouldn't even be specific to ADO.NET/databases). I'm planning to spend some time on making my implementation a bit more generic to go in this direction, any reviewing/feedback would be really great (here's my pool). I think that if you guys are planning to work on the pool it may be a thing we should try to do together, and maybe even go towards a shared component.

Anyway, @sebastienros I've taken a quick look, here are some comments:

  • I'm sure this is just WIP, but the new implementation doesn't seem to impose Max Pool Size - if you reach the limit a new connection is allocated and returned. This means you can open an unbounded number of physical database connections, which is probably something we don't want.
  • If/when you do implement blocking when Max Pool Size is reached, it's pretty important for this to respect FIFO semantics, otherwise you get connection attempts timing out because later attempts got fulfilled first. This is why a simple lock/semaphore isn't very suitable and an actual queue is used in my implementation.
  • Not exactly pool-related but: It seems that (by default) MySqlConnector resets connection state every time a pooled connection is opened. If I'm reading the code right, this is an actual database roundtrip that's done every time. Npgsql used to work like this, until I changed it to "enqueue" the reset messages and to send them as part of the first statement after the connection is reopened. This effectively eliminates a roundtrip for each pooled open/close, which is pretty significant.
  • Related: I can also see that if connection reset is disabled, MySqlConnector does a ping roundtrip instead. This was also removed from Npgsql at some point, and I occasionally receive complaints from users that the pool sometimes broken connections. The reason it doesn't make sense to me to ping, is that a connection may broken at any point - 1 millisecond after the ping check happens - and user code must be capable of dealing with that. Npgsql does have an opt-in keepalive which partially takes care of it, but the problem will always remain and paying a database roundtrip to try to avoid it doesn't make sense to me.
  • Note that in my implementation I worked the extra mile to make sure that the fast-path invocation (i.e. an idle connection is available in the pool) isn't an async method, to avoid the state machine overhead.
  • When getting a session out, you can use Interlocked.Exchange() instead of Interlocked.CompareAndExchange().

@bgrainger
Copy link
Member Author

It seems that (by default) MySqlConnector resets connection state every time a pooled connection is opened. If I'm reading the code right, this is an actual database roundtrip that's done every time.

That's correct.

Npgsql used to work like this, until I changed it to "enqueue" the reset messages and to send them as part of the first statement after the connection is reopened. This effectively eliminates a roundtrip for each pooled open/close, which is pretty significant.

My last attempt at improving this was to reset the connection in the background: #178. I think queueing the reset message is a better approach except...

I occasionally receive complaints from users that the pool sometimes broken connections

... performing the reset connection while retrieving a connection from the pool has the very nice side-effect that the connection is actually open when Open returns...

a connection may broken at any point

... unless of course there is a catastrophic network/server failure a nanosecond after Open returns.

In theory, I agree that users should expect that any database call can fail and that appropriate retry logic needs to be implemented. In practice, it seems like there's a lot of code that just assumes everything's going to be just fine if Open returns without throwing an exception; pushing that exception (for a broken pooled connection) down to the first call to ExecuteReader etc. is a performance-vs-reliability tradeoff that (currently) doesn't seem worth making by default.

(I'm all for allowing users to opt into high-performance at the potential cost of correctness or reliability if they determine they absolutely need it. However, it seems best to be "safe" by default. And FWIW MySqlConnector is already faster than Connector/NET for retrieving a pooled connection even with ConnectionReset=true. So MySQL users can get safety and better performance by switching ADO.NET providers, then opt into higher performance if they need it.)

@sebastienros
Copy link
Contributor

It's more of a POC to see how it would benefit performance, but your comments are great to go towards a better final implementation.

@bgrainger
Copy link
Member Author

there needs to be a totally common component shared across providers

I'd love to be able to reuse a high-performance, well-tested, already-debugged component (as opposed to having to develop and support my own implementation).

It would be nice to be able to reuse common code without introducing a new dependency. Are NuGet source-only packages still a thing? (I've not used them in the past, and I'm not sure if they made it into the new world or not.) Git submodules might be an answer? (They're not without their own problems, of course.)

@roji
Copy link

roji commented Mar 20, 2018

(I'm all for allowing users to opt into high-performance at the potential cost of correctness or reliability if they determine they absolutely need it. However, it seems best to be "safe" by default. And FWIW MySqlConnector is already faster than Connector/NET for retrieving a pooled connection even with ConnectionReset=true. So MySQL users can get safety and better performance by switching ADO.NET providers, then opt into higher performance if they need it.)

I agree that there's a bit of a tradeoff here, although it seems to be the SqlClient behavior:

If a connection exists to a server that has disappeared, this connection can be drawn from the pool even if the connection pooler has not detected the severed connection and marked it as invalid. This is the case because the overhead of checking that the connection is still valid would eliminate the benefits of having a pooler by causing another round trip to the server to occur. When this occurs, the first attempt to use the connection will detect that the connection has been severed, and an exception is thrown.

A similar note can be found in the ODBC driver developer guidelines:

When the Driver Manager is pooling connections, it needs to be able to determine if a connection is still working before handing out the connection. Otherwise, the Driver Manager keeps on handing out the dead connection to the application whenever a transient network failure occurs. A new connection attribute has been defined in ODBC 3.x: SQL_ATTR_CONNECTION_DEAD. This is a read-only connection attribute that returns either SQL_CD_TRUE or SQL_CD_FALSE. The value SQL_CD_TRUE means that the connection has been lost, while the value SQL_CD_FALSE means that the connection is still active. (Drivers conforming to earlier versions of ODBC can also support this attribute.)

A driver must implement this option efficiently or it will impair the connection pooling performance. Specifically, a call to get this connection attribute should not cause a round trip to the server. Instead, a driver should just return the last known state of the connection. The connection is dead if the last trip to the server failed, and not dead if the last trip succeeded.

As I said above, in Npgsql this can be somewhat mitigated via keepalive, an opt-in feature which does a roundtrip on connections every X seconds (even when in the pool). The advantage of this is that it does not happen on the code path of Open() or Close() - it's a totally "out-of-band" operation that checks your connection without affecting performance.

But you're right that some users may be surprised here, I'd at least recommend an opt-out from this for high-perf apps.

It would be nice to be able to reuse common code without introducing a new dependency. Are NuGet source-only packages still a thing? (I've not used them in the past, and I'm not sure if they made it into the new world or not.) Git submodules might be an answer? (They're not without their own problems, of course.)

Having this as a source-only nuget is a great idea, I think they still exist (see this thread).

In any case I'll look into making my own pool less coupled to PostgreSQL etc.

@bgrainger
Copy link
Member Author

I agree that there's a bit of a tradeoff here, although it seems to be the SqlClient behavior

Thanks for this pointer; it's helpful.

@bgrainger
Copy link
Member Author

@roji By the time you told me you were planning to work on abstracting a connection pool, I had already ported most of the code over to MySqlConnector (but I didn't mention it, since it was still WIP). I decided to clean it up today and start to run some performance tests: #614.

They also say you shouldn't create an abstraction until you've implemented it 3 times (or something like that) so it might be helpful to see how I needed to tweak the implementation to meet my specific needs. (Hopefully not breaking it in the process.)

@bgrainger
Copy link
Member Author

@sebastienros I've tested these commits with TFB Docker locally, but I don't think I have sufficient system resources (particularly with the overhead of Docker for Windows) to see if the changes make a difference. (That is, I'm think already able to push my system to 100% usage and making the code faster doesn't result in any more requests per second.)

If you're able, can you test the following commits and report back the speeds achieved? I'm interested to know if any individual commit results in a significant performance increase or regression:

@roji
Copy link

roji commented Feb 26, 2019

/cc @divega @ajcvickers @bricelam

It's great to see this, and I'm really looking forward to seeing the perf impact here.

FWIW we don't have to do an actual reusable component with an abstraction - porting the code could be a very good start, and as you say having an implement on 2 providers is a much better place to start from if we do want a reusable component. My plate is quite full at the moment anyway, let's see where your efforts lead - I may also end up taking some of your improvements back into Npgsql at some point (e.g. the LIFO behavior).

@bgrainger
Copy link
Member Author

@sebastienros I've taken those six commits and applied them to (a fork of) FrameworkBenchmarks (using a submodule), if that makes testing easier:

@sebastienros
Copy link
Contributor

I ran all these branches on our local physical machines, which are not the TE ones but the ratio in perf will be the same.

Only the RPS and Average lantency numbers matter in this case

| Description |     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| ----------- | ------- | ------- | ----------- | ----------------- | ------------ | ------------------ | ------------ | ----- |
|   TE master | 106,406 |      93 |         395 |              2.67 |          834 |             491.85 |         1.35 |  1.00 | 
|     c2beb99 | 105,860 |      92 |         389 |              2.68 |          796 |             517.37 |         1.15 |  0.99 |  
|     2863905 | 105,049 |      94 |         392 |              2.70 |          804 |             528.89 |         1.21 |  0.99 | 
|     6ea700c | 108,851 |      93 |         386 |              2.66 |          795 |             517.35 |         1.41 |  1.02 | 
|     0b4ee38 | 105,834 |      92 |         401 |              2.71 |          810 |              525.6 |         1.40 |  0.99 | 
|     51a85d5 | 104,385 |      91 |         391 |              2.74 |          798 |             525.96 |         1.31 |  0.98 | 
|     8881b4b | 105,104 |      98 |         386 |              2.62 |          815 |             509.18 |         1.32 |  0.99 |

@sebastienros
Copy link
Contributor

sebastienros commented Mar 8, 2019

For future reference (note to myself) I am adding the command line I used to run these:

dotnet run -- --server http://asp-perf-lin:5001 --client http://asp-perf-load:5002 `
-r https://github.com/bgrainger/FrameworkBenchmarks@master#8881b4b `
--docker-file frameworks/CSharp/aspnetcore/aspcore-ado-my.dockerfile `
--docker-image aspnetcore_ado_my `
--docker-context frameworks/CSharp/aspnetcore/ `
--ready-text "Application started." `
--clientName bombardier `
--path /fortunes --port 8080 `
--diff TechEmpower --save 8881b4b

@sebastienros
Copy link
Contributor

If you expected more variance please provide a PR that makes things really bad to ensure I ran these correctly.

@bgrainger
Copy link
Member Author

bgrainger commented Mar 8, 2019

@sebastienros Here's a PR with Thread.Sleep(1) added; it should reduce performance by an order of magnitude:

bgrainger/FrameworkBenchmarks@f288f97

@sebastienros
Copy link
Contributor

It works, which means the previous results are valid

|     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| ------- | ------- | ----------- | ----------------- | ------------ | ------------------ | ------------ | ----- |
| 106,406 |      93 |         395 |              2.67 |          834 |             491.85 |         1.35 |  1.00 |
|  16,731 |      28 |         402 |             24.92 |          791 |             520.86 |         2.37 |  0.16 |

@mkromkamp
Copy link

Is this still actively discussed and/or developed? I'm having a certain workload that benefits from higher throughput/rps even it would mean a slight payoff in latency.

If this is still on-going I'm willing to do some testing to see if it could benefit such use-cases in real life. 👍

@bgrainger
Copy link
Member Author

The code is at https://github.com/bgrainger/MySqlConnector/commits/lock-free-pool (except for the last commit on that branch).

It wasn't clear to me from the results above that this actually improved performance, so given the added complexity I decided not to integrate it yet. If you have a scenario where you can build MySqlConnector from source and test it, I'm very interested to hear your results.

Note that MySqlConnector logs over 200k RPS on the TFB Benchmarks (see aspcore-ado-my) so it's already capable of sustaining high throughput. What kind of numbers are you seeing and what are you hoping to achieve?

@mkromkamp
Copy link

The code is at https://github.com/bgrainger/MySqlConnector/commits/lock-free-pool (except for the last commit on that branch).

It wasn't clear to me from the results above that this actually improved performance, so given the added complexity I decided not to integrate it yet. If you have a scenario where you can build MySqlConnector from source and test it, I'm very interested to hear your results.

Thanks for pointing this out. It is also not clear for me if the there are real benifits in the implementation. Just wanted to reach out to see if the conversation would benefit from some additional testing. I will take a look at building the library from source and hook it into my application. If I have any meaningfull information to share I will drop you a message over here.

Note that MySqlConnector logs over 200k RPS on the TFB Benchmarks (see aspcore-ado-my) so it's already capable of sustaining high throughput. What kind of numbers are you seeing and what are you hoping to achieve?

That is impressive, and from looking at the benchmark, it is not clear MySqlConnector is used over here. Might be worth adding this somewhere on the website and/or readme on GitHub.
Do you know anything about the setup they are using(MySql config, how many webservers, connection configuration, etc)? Might give some great information how to squeeze every bit of performance out of such a situation 👍

@bgrainger
Copy link
Member Author

TechEmpower server hardware is at https://www.techempower.com/benchmarks/#section=environment:

Citrine (rounds 16 onward)
Three homogeneous Dell R440 servers each equipped with an Intel Xeon Gold 5120 CPU, 32 GB of memory, and an enterprise SSD. Dedicated Cisco 10-gigabit Ethernet switch. Provided by Microsoft.

I believe they use one for the benchmarking client, one for the web framework being tested, and one for the database server. I have not found any information on how they're tuning MySQL for performance in this environment.

@bgrainger
Copy link
Member Author

I will take a look at building the library from source and hook it into my application.

Note that the lock-free code branched from the main library somewhere between 0.49.3 and 0.50.0; there shouldn't be any major performance changes between those two versions but obviously some of the newer features won't be present. It looks like it won't currently merge cleanly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants