Postgres Connection Timeout Crash #1606

omad · 2024-06-27T03:15:50Z

I'm hunting a crash for some users running ODC code on the NCI, somewhere in database connection handling. A unique quirk of the NCI environment is that idle TCP Connections are killed after 5 minutes, whereas in most environments it's 2 hours. This is probably relevant, but I'm not sure.

An example stack trace

  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 60, in decorated
    raise ex
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 52, in decorated
    result = fn(*args, **kwargs)
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 851, in scene_select
    l1_count, uuids2archive = l1_scenes_to_process(
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 588, in l1_scenes_to_process
    files2process, uuids2archive, duplicates = l1_filter(
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 446, in l1_filter
    for l1_dataset in dc.index.datasets.search(product=l1_product, time=month_as_range(year, month)):
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search
    for product, datasets in self._do_search_by_product(query,
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product
    with self._db_connection() as connection:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection
    with self._index._active_connection(transaction=transaction) as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection
    with self._db._connect() as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect
    connection.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close
    conn.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close
    self._checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin
    _finalize_fairy(
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin
    finalizer(connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics
    characteristic.reset_characteristic(self, dbapi_connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic
    dialect.reset_isolation_level(dbapi_conn)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level
    self.set_isolation_level(dbapi_conn, self.default_isolation_level)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level
    connection.set_isolation_level(level)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

I can reliably cause exceptions using plain psycopg2, by attempting to use a connection that's been idle for more than 5 minutes.

I haven't been able to reproduce the exception using the ODC Index API. I've been inserting time.sleep(330) between different calls, which is what I could do with plain psycopg2.

I suspect that ODC is more resilient due to setting pool_recycle=60 on the SQLAlchemy connection pool.

datacube-core/datacube/drivers/postgres/_connections.py

Lines 131 to 134 in 4f1e636

    
           # If a connection is idle for this many seconds, SQLAlchemy will renew it rather 
        
           # than assuming it's still open. Allows servers to close idle connections without clients 
        
           # getting errors. 
        
           pool_recycle=pool_timeout,

One line in the traceback that I'm suspicious of is this line 231, since I think exiting the context would close the connection. Manually calling .close() seems unnecessary.

datacube-core/datacube/drivers/postgres/_connections.py

Lines 226 to 231 in 4f1e636

    
           with self._engine.connect() as connection: 
        
               try: 
        
                   connection.execution_options(isolation_level="AUTOCOMMIT") 
        
                   yield _api.PostgresDbAPI(connection) 
        
               finally: 
        
                   connection.close()

Summary

This script doesn't always crash.
This isn't the only time we've seen crashes that look like this on the NCI.
I've run out of steam investigating this, since I can't get a small reproducible example. For now at least.
I really want something reproducible before attempting a fix.
The connection handling code in ODC is far more complicated than I would have expected, and I'm not convinced it needs to be that way. However, maybe it does, I'm just not sure. :)

The text was updated successfully, but these errors were encountered:

Ariana-B · 2024-07-11T02:03:14Z

Here's a test script that causes a connection timeout error for some users:

#!/usr/bin/env python3
import datacube
import logging

def main(product="ga_ls8c_ard_3"):
    # This is ok
    # logging.basicConfig(level=logging.INFO)

    # setting up a file handler causes the new dea module to error out
    logging.basicConfig(filename = 'file.log',
                    level = logging.DEBUG,
                    format = '%(asctime)s:%(levelname)s:%(name)s:%(message)s')
    dc = datacube.Datacube(app="gen-list")
    for l1_dataset in dc.index.datasets.search(product=product):
        logging.info("This is a test")
        
if __name__ == "__main__":
    main(product="ga_s2am_ard_3")

jeremyh · 2024-07-11T03:19:12Z

The suspicious part, I think, is:

    for l1_dataset in dc.index.datasets.search(product=product):

It's a huge query — it's returning all datasets for a product. This will take a while as it's a lot of data over the wire.
It's returning an iterator, and (probably) streaming the results back with an open db cursor.

While the cursor is active, the pool_recycle time wouldn't help us — since it can't recycle a connection half way through getting its results back.

Perhaps each chunk of the results that the iterator is returning take long enough to "use" that they go beyond the timeout?

In this particular example we're only logging, so I wouldn't think that we're being too slow consuming our chunk of the iterator. But if we're logging to the filesystem we do know the lustre filesystem sometimes has very large pauses — a handful of "large pauses" within one chunk of the iterator could push it over the connection timeout? (ie, nothing has been read from the cursor/connection within that timeout).

(Only a guess at this point.)

SpacemanPaul · 2024-07-11T04:26:39Z

I agree that is almost certainly the root cause.

What happens if we greedily suck up the results before logging? e.g.

for l1_dataset in list(dc.index.datasets.search(product=product)):
        logging.info("This is a test")

(Obviously it hangs with no feedback to the user for a long time while it sucks down the whole product worth of datasets into memory - but does it timeout?)

SpacemanPaul · 2024-07-11T06:05:08Z

Also try:

batch_size=1000
i = 0
for prod, doc, uris  in dc.index.datasets.get_all_docs_for_product(
        dc.index.products.get(product)
        batch_size=batch_size
):
        logging.info(f"Dataset {doc['id']} is test {i % batch_size} of batch {i // batch_size}")
        i += 1

Ariana-B · 2024-07-11T07:25:19Z

@SpacemanPaul your first example also fails on my end, although with no stack trace - it simply outputs Killed.
The second example on the other hand seems to work.

jeremyh · 2024-07-11T21:57:43Z

@SpacemanPaul your first example also fails on my end, although with no stack trace - it simply outputs Killed. The second example on the other hand seems to work.

Killed means the script hit a limit on the login nodes (exceeding the memory, cpu or time limit) — they want people to qsub jobs to the batch system to run anything that needs more resources.

SpacemanPaul · 2024-07-11T23:26:23Z

That makes sense. I think the batch-get API above is probably the best approach to this class of problems.

omad · 2024-07-12T00:57:54Z

I've been digging a bit deeper, and now suspect it's caused by the (mis?)-configuration of PGBouncer on the NCI.

I ran the test script, and traced the network calls it made before crashing.

strace -e trace=%net --relative-timestamps -o ~/utils/stracenetlog.txt python ~/odc1606test.py

     0.000093 recvfrom(7, "\": 6390220.0}, \"lr\": {\"x\": 60978"..., 62212, 0, NULL, NULL) = 4096
     0.000090 recvfrom(7, "a06c8818b3\", \"crs\": \"epsg:32751\""..., 65438, 0, NULL, NULL) = 12288
     0.000099 recvfrom(7, "a_s2am_nbar_3-2-1_54HUD_2024-01-"..., 60873, 0, NULL, NULL) = 12288
     0.000159 recvfrom(7, "\", \"path\": \"ga_s2am_nbart_3-2-1_"..., 56854, 0, NULL, NULL) = 8192
     0.001896 recvfrom(7, "01/17/20240117T015348/ga_s2am_ar"..., 57805, 0, NULL, NULL) = 106   <-- This is where we finish receiving the results from the network connection.

*** LONG PAUSE HERE ***

   100.631082 sendto(7, "Q\0\0\0001SET default_transaction_iso"..., 50, MSG_NOSIGNAL, NULL, 0) = 50   <-- Then we try to set transaction isolation on the connection... for some reason.
     0.000276 recvfrom(7, "E\0\0\0(SFATAL\0C08P01\0Mclient_idle_"..., 65536, 0, NULL, NULL) = 41  <-- And receive an error reply
     0.000358 recvfrom(7, "", 65536, 0, NULL, NULL) = 0
     0.000102 recvfrom(7, "", 65536, 0, NULL, NULL) = 0
     2.241205 +++ exited with 1 +++

Searching for FATAL client_idle_ points towards PGBouncer configuration, for client_idle_timeout.

On our NCI PGBouncer server, the value has been changed from the default 0, to 90.

So it looks like PGBouncer has closed the connection, but the ODC Client hasn't realised, and continues trying to change transaction isolation settings.

I'm not sure exactly where we want to change things, I suspect there's a couple of things that aren't quite right, including the pgbouncer settings, and why ODC doesn't know the connection has been closed. I also don't recall why that pgbouncer setting is even set. It seems unnecessary to me, although it may at some point have helped work around the 300 second TCP idle closures on the NCI.

SpacemanPaul · 2024-07-12T01:19:40Z

Interesting. So we have:

a viable workaround through batched reads; and
a potential infrastructure configuration issue that may fix the issue directly.

omad assigned jeremyh and SpacemanPaul Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Postgres Connection Timeout Crash #1606

Postgres Connection Timeout Crash #1606

omad commented Jun 27, 2024

Ariana-B commented Jul 11, 2024

jeremyh commented Jul 11, 2024 •

edited

Loading

SpacemanPaul commented Jul 11, 2024 •

edited

Loading

SpacemanPaul commented Jul 11, 2024

Ariana-B commented Jul 11, 2024

jeremyh commented Jul 11, 2024 •

edited

Loading

SpacemanPaul commented Jul 11, 2024

omad commented Jul 12, 2024

SpacemanPaul commented Jul 12, 2024

Postgres Connection Timeout Crash #1606

Postgres Connection Timeout Crash #1606

Comments

omad commented Jun 27, 2024

Summary

Ariana-B commented Jul 11, 2024

jeremyh commented Jul 11, 2024 • edited Loading

SpacemanPaul commented Jul 11, 2024 • edited Loading

SpacemanPaul commented Jul 11, 2024

Ariana-B commented Jul 11, 2024

jeremyh commented Jul 11, 2024 • edited Loading

SpacemanPaul commented Jul 11, 2024

omad commented Jul 12, 2024

SpacemanPaul commented Jul 12, 2024

jeremyh commented Jul 11, 2024 •

edited

Loading

SpacemanPaul commented Jul 11, 2024 •

edited

Loading

jeremyh commented Jul 11, 2024 •

edited

Loading