
test_publish_and_manipulate and test_get_dandiset_published fail locally #1488

Closed
yarikoptic opened this issue Aug 15, 2024 · 16 comments · Fixed by #1533
Labels
tests — Add or improve existing tests


@yarikoptic
Member

This is with

============================================================================= short test summary info ==============================================================================
FAILED dandi/tests/test_dandiapi.py::test_publish_and_manipulate - ValueError: Dandiset 000008 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published[True] - ValueError: Dandiset 000021 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published[False] - ValueError: Dandiset 000022 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published_no_version_id[True] - ValueError: Dandiset 000023 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published_no_version_id[False] - ValueError: Dandiset 000024 is Pending: {
=================================================== 5 failed, 707 passed, 1 skipped, 2 xfailed, 8 xpassed in 1119.58s (0:18:39) ====================================================
DANDI_TESTS_AUDIT_CSV=/tmp/audit-full.csv python -m pytest -v dandi/  153.27s user 13.15s system 14% cpu 18:41.56 total

@jwodder - any immediate ideas?

I have

❯ docker images | grep dandi
dandiarchive/dandiarchive-api                        latest                         5bc326d92640   19 hours ago    749MB
dandiarchive/dandiarchive-api                        <none>                         cff5f21b9a7c   2 months ago    734MB
@yarikoptic added the `tests` label on Aug 15, 2024
@jwodder
Member

jwodder commented Aug 15, 2024

@yarikoptic Those tests involve publishing a Dandiset, and it seems that the publication did not complete in time. What happens if you rerun the tests?

@yarikoptic
Member Author

They were, and still are (on master), failing consistently across my attempts. Full log: http://www.oneukrainian.com/tmp/pytest-fail-20240815.txt

I also note that the celery process is pegged at 100% CPU (not sure what could keep it that busy). Do you (or maybe @jjnesbitt or @mvandenburgh) have an idea what it could be doing or how to figure it out?

The celery container logs show that it completes its jobs in milliseconds:
[2024-08-15 15:51:30,720: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[47f0c15a-7d49-4208-aeda-9c8e01c18980] received: ((UUID('a33573ab-acc0-4d14-b823-91afa51b82b2'),), {})
[2024-08-15 15:51:30,736: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[47f0c15a-7d49-4208-aeda-9c8e01c18980]: Calculating sha256 checksum for asset blob a33573ab-acc0-4d14-b823-91afa51b82b2
[2024-08-15 15:51:30,742: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[47f0c15a-7d49-4208-aeda-9c8e01c18980] succeeded in 0.02144717297051102s: None
[2024-08-15 15:51:30,909: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[b4f586df-dadf-4841-ae26-a7e733c4dd60] received: ((UUID('5bdd712a-04cc-40ac-b73f-98f614743b27'),), {})
[2024-08-15 15:51:30,910: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[b4f586df-dadf-4841-ae26-a7e733c4dd60]: Calculating sha256 checksum for asset blob 5bdd712a-04cc-40ac-b73f-98f614743b27
[2024-08-15 15:51:30,914: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[b4f586df-dadf-4841-ae26-a7e733c4dd60] succeeded in 0.005075494991615415s: None
[2024-08-15 15:53:31,583: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[11a11275-94c1-40ff-adec-e87f012e4baa] received: ((UUID('091711e2-af2e-44f5-8fec-85eb49ce1ff9'),), {})
[2024-08-15 15:53:31,585: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[11a11275-94c1-40ff-adec-e87f012e4baa]: Calculating sha256 checksum for asset blob 091711e2-af2e-44f5-8fec-85eb49ce1ff9
[2024-08-15 15:53:31,588: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[11a11275-94c1-40ff-adec-e87f012e4baa] succeeded in 0.004132542992010713s: None
[2024-08-15 15:53:31,758: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[9170e368-2588-4999-a1c4-4f2c6ce910ef] received: ((UUID('3b7be16d-38e1-4f8f-8046-36936890eac8'),), {})
[2024-08-15 15:53:31,759: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[9170e368-2588-4999-a1c4-4f2c6ce910ef]: Calculating sha256 checksum for asset blob 3b7be16d-38e1-4f8f-8046-36936890eac8
[2024-08-15 15:53:31,763: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[9170e368-2588-4999-a1c4-4f2c6ce910ef] succeeded in 0.004479025024920702s: None
[2024-08-15 15:53:31,938: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[2da50b13-a501-4881-ad8d-0ccd1c6e1f8c] received: ((UUID('1e95e5b4-8fe2-4f6a-8d99-9fbf86646a72'),), {})
[2024-08-15 15:53:31,939: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[2da50b13-a501-4881-ad8d-0ccd1c6e1f8c]: Calculating sha256 checksum for asset blob 1e95e5b4-8fe2-4f6a-8d99-9fbf86646a72
[2024-08-15 15:53:31,943: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[2da50b13-a501-4881-ad8d-0ccd1c6e1f8c] succeeded in 0.004295543010812253s: None

@asmacdo
Member

asmacdo commented Nov 15, 2024

I haven't gotten this to pass locally either. I can provide full logs as well if helpful. (FWIW, other tests are breaking as well; this is just the first one.)

I tried more than doubling the wait time, but it still doesn't pass. I guess something in the environment is broken?

diff --git a/dandi/dandiapi.py b/dandi/dandiapi.py
index 5ea504de..8b957c4f 100644
--- a/dandi/dandiapi.py
+++ b/dandi/dandiapi.py
@@ -1086,7 +1086,7 @@ class RemoteDandiset:
             json={"metadata": metadata, "name": metadata.get("name", "")},
         )
 
-    def wait_until_valid(self, max_time: float = 120) -> None:
+    def wait_until_valid(self, max_time: float = 250) -> None:
         """
         Wait at most ``max_time`` seconds for the Dandiset to be valid for
         publication.  If the Dandiset does not become valid in time, a
duct tox -e py3 -- -s -v dandi/tests/test_dandiapi.py::test_publish_and_manipulate

...<snip>...
DEBUG    dandi:dandiapi.py:262 Response: 200
DEBUG    dandi:dandiapi.py:198 GET http://localhost:8000/api/dandisets/000001/versions/draft/info/
DEBUG    dandi:dandiapi.py:262 Response: 200
DEBUG    dandi:dandiapi.py:198 GET http://localhost:8000/api/dandisets/000001/versions/draft/info/
DEBUG    dandi:dandiapi.py:262 Response: 200
============================= slowest 10 durations =============================
251.62s call     dandi/tests/test_dandiapi.py::test_publish_and_manipulate
45.01s setup    dandi/tests/test_dandiapi.py::test_publish_and_manipulate
8.46s teardown dandi/tests/test_dandiapi.py::test_publish_and_manipulate
=========================== short test summary info ============================
FAILED dandi/tests/test_dandiapi.py::test_publish_and_manipulate - ValueError: Dandiset 000001 is Pending: {
    "asset_validation_errors": [
        {
            "field": "",
            "message": "asset is currently being validated, please wait.",
            "path": "subdir/doomed.txt"
        },
        {
            "field": "",
            "message": "asset is currently being validated, please wait.",
            "path": "subdir/file.txt"
        }
    ],
    "version_validation_errors": []
}
======================== 1 failed in 305.86s (0:05:05) =========================
py3: exit 1 (313.54 seconds) /home/austin/devel/dandi-cli> coverage run -m pytest -v -s -v dandi/tests/test_dandiapi.py::test_publish_and_manipulate dandi pid=424363
  py3: FAIL code 1 (320.60=setup[6.91]+cmd[0.15,313.54] seconds)
  evaluation failed :( (320.68 seconds)
2024-11-14T18:08:33-0600 [INFO    ] con-duct: Summary:
Exit Code: 1
Command: tox -e py3 -- -s -v dandi/tests/test_dandiapi.py::test_publish_and_manipulate
Log files location: .duct/logs/2024.11.14T18.03.12-424275_
Wall Clock Time: 320.948 sec
Memory Peak Usage (RSS): 311.4 MB
Memory Average Usage (RSS): 234.0 MB
Virtual Memory Peak Usage (VSZ): 7.7 GB
Virtual Memory Average Usage (VSZ): 3.5 GB
Memory Peak Percentage: 0.7%
Memory Average Percentage: 0.599035369774919%
CPU Peak Usage: 383.5%
Average CPU Usage: 26.813504823151113%

These others fail for me too, in case that's related:

========================================================= short test summary info =========================================================
FAILED dandi/tests/test_dandiapi.py::test_publish_and_manipulate - ValueError: Dandiset 000008 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published[True] - ValueError: Dandiset 000021 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published[False] - ValueError: Dandiset 000022 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published_no_version_id[True] - ValueError: Dandiset 000023 is Pending: {
FAILED dandi/tests/test_dandiapi.py::test_get_dandiset_published_no_version_id[False] - ValueError: Dandiset 000024 is Pending: {
=============================== 5 failed, 707 passed, 1 skipped, 2 xfailed, 8 xpassed in 2413.00s (0:40:12) ===============================
py3: exit 1 (2422.98 seconds) /home/austin/devel/dandi-cli> coverage run -m pytest -v dandi pid=430147
  py3: FAIL code 1 (2431.11=setup[7.95]+cmd[0.18,2422.98] seconds)
  evaluation failed :( (2431.21 seconds)
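The repeated `GET .../versions/draft/info/` requests in the log above suggest the client is just polling the draft version's status until it leaves "Pending". A minimal sketch of such a poll loop (helper names are hypothetical, not dandi-cli's actual implementation):

```python
import time

def wait_until_valid(get_status, max_time: float = 120.0, interval: float = 1.0) -> None:
    """Poll get_status() until it reports "Valid" or max_time elapses.

    get_status stands in for the real API call, i.e. a GET to
    /api/dandisets/<id>/versions/draft/info/ followed by reading the
    version's status field.
    """
    deadline = time.monotonic() + max_time
    status = get_status()
    while status != "Valid":
        if time.monotonic() >= deadline:
            # Mirrors the failure mode seen above: still "Pending" at timeout
            raise ValueError(f"Dandiset is {status} after waiting {max_time}s")
        time.sleep(interval)
        status = get_status()
```

With such a loop, raising `max_time` only helps if the server eventually makes progress; if validation never runs at all (as it turned out here), no timeout is long enough.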

@yarikoptic
Member Author

yarikoptic commented Nov 20, 2024

Digging through the history of issues, it seems we had a similar issue before.

Since then, a good number of changes have been made to that compose file:
diff --git a/dandi/tests/data/dandiarchive-docker/docker-compose.yml b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
index 9ce5e464..3f132d3e 100644
--- a/dandi/tests/data/dandiarchive-docker/docker-compose.yml
+++ b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
@@ -4,20 +4,7 @@
 # <https://github.com/dandi/dandi-api/blob/master/docker-compose.override.yml>,
 # but using images uploaded to Docker Hub instead of building them locally.
 
-version: '2.1'
-
 services:
-  redirector:
-    image: dandiarchive/dandiarchive-redirector
-    depends_on:
-      - django
-    ports:
-      - "8079:8080"
-    environment:
-      #GUI_URL: http://localhost:8086
-      ABOUT_URL: http://www.dandiarchive.org
-      API_URL: http://localhost:8000/api
-
   django:
     image: dandiarchive/dandiarchive-api
     command: ["./manage.py", "runserver", "--nothreading", "0.0.0.0:8000"]
@@ -30,23 +17,31 @@ services:
         condition: service_healthy
       rabbitmq:
         condition: service_started
-    environment:
+    environment: &django_env
       DJANGO_CELERY_BROKER_URL: amqp://rabbitmq:5672/
       DJANGO_CONFIGURATION: DevelopmentConfiguration
       DJANGO_DANDI_DANDISETS_BUCKET_NAME: dandi-dandisets
+      DJANGO_DANDI_DANDISETS_LOG_BUCKET_NAME: dandiapi-dandisets-logs
       DJANGO_DANDI_DANDISETS_EMBARGO_BUCKET_NAME: dandi-embargoed-dandisets
+      DJANGO_DANDI_DANDISETS_EMBARGO_LOG_BUCKET_NAME: dandiapi-embargo-dandisets-logs
       DJANGO_DATABASE_URL: postgres://postgres:postgres@postgres:5432/django
       DJANGO_MINIO_STORAGE_ACCESS_KEY: minioAccessKey
       DJANGO_MINIO_STORAGE_ENDPOINT: minio:9000
       DJANGO_MINIO_STORAGE_SECRET_KEY: minioSecretKey
       DJANGO_STORAGE_BUCKET_NAME: django-storage
-      DJANGO_MINIO_STORAGE_MEDIA_URL: http://localhost:9000/django-storage
-      DJANGO_DANDI_SCHEMA_VERSION:
+      # The Minio URL needs to use 127.0.0.1 instead of localhost so that blob
+      # assets' "S3 URLs" will use 127.0.0.1, and thus tests that try to open
+      # these URLs via fsspec will not fail on systems where localhost is both
+      # 127.0.0.1 and ::1.
+      DJANGO_MINIO_STORAGE_MEDIA_URL: http://127.0.0.1:9000/django-storage
+      DJANGO_DANDI_SCHEMA_VERSION: ~
       DJANGO_DANDI_WEB_APP_URL: http://localhost:8085
       DJANGO_DANDI_API_URL: http://localhost:8000
+      DJANGO_DANDI_JUPYTERHUB_URL: https://hub.dandiarchive.org
+      DJANGO_DANDI_DEV_EMAIL: "[email protected]"
       DANDI_ALLOW_LOCALHOST_URLS: "1"
     ports:
-      - "8000:8000"
+      - "127.0.0.1:8000:8000"
 
   celery:
     image: dandiarchive/dandiarchive-api
@@ -70,21 +65,8 @@ services:
       rabbitmq:
         condition: service_started
     environment:
-      DJANGO_CELERY_BROKER_URL: amqp://rabbitmq:5672/
-      DJANGO_CONFIGURATION: DevelopmentConfiguration
-      DJANGO_DANDI_DANDISETS_BUCKET_NAME: dandi-dandisets
-      DJANGO_DANDI_DANDISETS_EMBARGO_BUCKET_NAME: dandi-embargoed-dandisets
-      DJANGO_DATABASE_URL: postgres://postgres:postgres@postgres:5432/django
-      DJANGO_MINIO_STORAGE_ACCESS_KEY: minioAccessKey
-      DJANGO_MINIO_STORAGE_ENDPOINT: minio:9000
-      DJANGO_MINIO_STORAGE_SECRET_KEY: minioSecretKey
-      DJANGO_STORAGE_BUCKET_NAME: django-storage
-      DJANGO_MINIO_STORAGE_MEDIA_URL: http://localhost:9000/django-storage
-      DJANGO_DANDI_SCHEMA_VERSION:
+      << : *django_env
       DJANGO_DANDI_VALIDATION_JOB_INTERVAL: "5"
-      DJANGO_DANDI_WEB_APP_URL: http://localhost:8085
-      DJANGO_DANDI_API_URL: http://localhost:8000
-      DANDI_ALLOW_LOCALHOST_URLS: "1"
 
   minio:
     image: minio/minio:RELEASE.2022-04-12T06-55-35Z
@@ -92,7 +74,7 @@ services:
     tty: true
     command: ["server", "/data"]
     ports:
-      - "9000:9000"
+      - "127.0.0.1:9000:9000"
     environment:
       MINIO_ACCESS_KEY: minioAccessKey
       MINIO_SECRET_KEY: minioSecretKey
@@ -107,8 +89,8 @@ services:
       POSTGRES_DB: django
       POSTGRES_PASSWORD: postgres
     image: postgres
-    ports:
-      - "5432:5432"
+    expose:
+      - "5432"
     healthcheck:
       test: ["CMD", "pg_isready", "-U", "postgres"]
       interval: 7s
@@ -117,5 +99,5 @@ services:
 
   rabbitmq:
     image: rabbitmq:management
-    ports:
-      - "5672:5672"
+    expose:
+      - "5672"

but neither the minio version nor the `-c` concurrency setting changed...

edit 1: celery is "busy" closing an enormous descending range of (nonexistent) file descriptors, and that is what it is wasting CPU on:
root@d49aa92590f7:/opt/django# strace -f -p 8 2>&1 | head
strace: Process 8 attached
close(928972594)                        = -1 EBADF (Bad file descriptor)
close(928972593)                        = -1 EBADF (Bad file descriptor)
close(928972592)                        = -1 EBADF (Bad file descriptor)
close(928972591)                        = -1 EBADF (Bad file descriptor)
close(928972590)                        = -1 EBADF (Bad file descriptor)
close(928972589)                        = -1 EBADF (Bad file descriptor)
close(928972588)                        = -1 EBADF (Bad file descriptor)
close(928972587)                        = -1 EBADF (Bad file descriptor)
close(928972586)                        = -1 EBADF (Bad file descriptor)
root@d49aa92590f7:/opt/django# ls -l /proc/8/fd/
total 0
lrwx------ 1 root root 64 Nov 20 03:06 0 -> /dev/null
l-wx------ 1 root root 64 Nov 20 03:06 1 -> 'pipe:[5581143]'
l-wx------ 1 root root 64 Nov 20 03:06 2 -> 'pipe:[5581144]'
lrwx------ 1 root root 64 Nov 20 03:06 3 -> 'socket:[5575530]'
lrwx------ 1 root root 64 Nov 20 03:06 4 -> 'socket:[5574189]'
lrwx------ 1 root root 64 Nov 20 03:06 5 -> 'socket:[5576134]'
lrwx------ 1 root root 64 Nov 20 03:06 6 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Nov 20 03:06 7 -> /dev/null
l-wx------ 1 root root 64 Nov 20 03:06 8 -> 'pipe:[5578287]'

whereas in the logs everything seems to be OK:

celery-1  | [2024-11-20 03:04:27,590: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[281efa6f-f660-4603-a661-f11e7ebcc220] received: ((UUID('13745690-aebd-4812-a1bd-aecb374bb83b'),), {})
celery-1  | [2024-11-20 03:04:27,590: DEBUG/MainProcess] TaskPool: Apply <function fast_trace_task at 0x7fbb89a50360> (args:('dandiapi.api.tasks.calculate_sha256', '281efa6f-f660-4603-a661-f11e7ebcc220', {'argsrepr': "(UUID('13745690-aebd-4812-a1bd-aecb374bb83b'),)", 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'id': '281efa6f-f660-4603-a661-f11e7ebcc220', 'ignore_result': False, 'kwargsrepr': '{}', 'lang': 'py', 'origin': 'gen7@e29931eba4e7', 'parent_id': None, 'replaced_task_nesting': 0, 'retries': 0, 'root_id': '281efa6f-f660-4603-a661-f11e7ebcc220', 'shadow': None, 'stamped_headers': None, 'stamps': {}, 'task': 'dandiapi.api.tasks.calculate_sha256', 'timelimit': [None, 86400], 'properties': {'content_type': 'application/json', 'content_encoding': 'utf-8', 'application_headers': {'argsrepr': "(UUID('13745690-aebd-4812-a1bd-aecb374bb83b'),)", 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'id': '281efa6f-f660-4603-a661-f11e7ebcc220', 'ignore_result': False, 'kwargsrepr': '{}', 'lang': 'py', 'origin': 'gen7@e29931eba4e7', 'parent_id': None, 'replaced_task_nesting': 0, 'retries': 0,... kwargs:{})
celery-1  | [2024-11-20 03:04:27,606: DEBUG/ForkPoolWorker-2] (0.001) SELECT t.oid, typarray FROM pg_type t JOIN pg_namespace ns ON typnamespace = ns.oid WHERE typname = 'hstore'; args=None; alias=default
celery-1  | [2024-11-20 03:04:27,606: DEBUG/ForkPoolWorker-2] (0.000) SELECT typarray FROM pg_type WHERE typname = 'citext'; args=None; alias=default
celery-1  | [2024-11-20 03:04:27,607: DEBUG/ForkPoolWorker-2] (0.000) SELECT "api_assetblob"."id", "api_assetblob"."created", "api_assetblob"."modified", "api_assetblob"."embargoed", "api_assetblob"."blob", "api_assetblob"."blob_id", "api_assetblob"."sha256", "api_assetblob"."etag", "api_assetblob"."size", "api_assetblob"."download_count" FROM "api_assetblob" WHERE "api_assetblob"."blob_id" = '13745690-aebd-4812-a1bd-aecb374bb83b'::uuid LIMIT 21; args=(UUID('13745690-aebd-4812-a1bd-aecb374bb83b'),); alias=default
celery-1  | [2024-11-20 03:04:27,607: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[281efa6f-f660-4603-a661-f11e7ebcc220]: Calculating sha256 checksum for asset blob 13745690-aebd-4812-a1bd-aecb374bb83b
celery-1  | [2024-11-20 03:04:27,610: DEBUG/ForkPoolWorker-2] http://minio:9000 "GET /dandi-dandisets/blobs/137/456/13745690-aebd-4812-a1bd-aecb374bb83b HTTP/11" 200 22
celery-1  | [2024-11-20 03:04:27,611: DEBUG/ForkPoolWorker-2] (0.001) UPDATE "api_assetblob" SET "sha256" = '458eeae191657906bb1fefdca3619f35820263fd9d429e35688de62cf7a8c10b' WHERE "api_assetblob"."blob_id" = '13745690-aebd-4812-a1bd-aecb374bb83b'::uuid; args=('458eeae191657906bb1fefdca3619f35820263fd9d429e35688de62cf7a8c10b', UUID('13745690-aebd-4812-a1bd-aecb374bb83b')); alias=default
celery-1  | [2024-11-20 03:04:27,612: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[281efa6f-f660-4603-a661-f11e7ebcc220] succeeded in 0.02052633202401921s: None
celery-1  | [2024-11-20 03:04:27,817: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[77c1f8a3-93e7-4b46-9e3e-73e3db11409e] received: ((UUID('7a7988d0-01ed-4bfb-8fa3-28741ced67ed'),), {})
celery-1  | [2024-11-20 03:04:27,817: DEBUG/MainProcess] TaskPool: Apply <function fast_trace_task at 0x7fbb89a50360> (args:('dandiapi.api.tasks.calculate_sha256', '77c1f8a3-93e7-4b46-9e3e-73e3db11409e', {'argsrepr': "(UUID('7a7988d0-01ed-4bfb-8fa3-28741ced67ed'),)", 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'id': '77c1f8a3-93e7-4b46-9e3e-73e3db11409e', 'ignore_result': False, 'kwargsrepr': '{}', 'lang': 'py', 'origin': 'gen7@e29931eba4e7', 'parent_id': None, 'replaced_task_nesting': 0, 'retries': 0, 'root_id': '77c1f8a3-93e7-4b46-9e3e-73e3db11409e', 'shadow': None, 'stamped_headers': None, 'stamps': {}, 'task': 'dandiapi.api.tasks.calculate_sha256', 'timelimit': [None, 86400], 'properties': {'content_type': 'application/json', 'content_encoding': 'utf-8', 'application_headers': {'argsrepr': "(UUID('7a7988d0-01ed-4bfb-8fa3-28741ced67ed'),)", 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'id': '77c1f8a3-93e7-4b46-9e3e-73e3db11409e', 'ignore_result': False, 'kwargsrepr': '{}', 'lang': 'py', 'origin': 'gen7@e29931eba4e7', 'parent_id': None, 'replaced_task_nesting': 0, 'retries': 0,... kwargs:{})
celery-1  | [2024-11-20 03:04:27,818: DEBUG/ForkPoolWorker-2] (0.000) SELECT "api_assetblob"."id", "api_assetblob"."created", "api_assetblob"."modified", "api_assetblob"."embargoed", "api_assetblob"."blob", "api_assetblob"."blob_id", "api_assetblob"."sha256", "api_assetblob"."etag", "api_assetblob"."size", "api_assetblob"."download_count" FROM "api_assetblob" WHERE "api_assetblob"."blob_id" = '7a7988d0-01ed-4bfb-8fa3-28741ced67ed'::uuid LIMIT 21; args=(UUID('7a7988d0-01ed-4bfb-8fa3-28741ced67ed'),); alias=default
celery-1  | [2024-11-20 03:04:27,818: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[77c1f8a3-93e7-4b46-9e3e-73e3db11409e]: Calculating sha256 checksum for asset blob 7a7988d0-01ed-4bfb-8fa3-28741ced67ed
celery-1  | [2024-11-20 03:04:27,820: DEBUG/ForkPoolWorker-2] http://minio:9000 "GET /dandi-dandisets/blobs/7a7/988/7a7988d0-01ed-4bfb-8fa3-28741ced67ed HTTP/11" 200 19
celery-1  | [2024-11-20 03:04:27,821: DEBUG/ForkPoolWorker-2] (0.001) UPDATE "api_assetblob" SET "sha256" = '6fef386efa7208eaf1c596b6ab2f8a5a3583696ef8649be0552ab3effad1e191' WHERE "api_assetblob"."blob_id" = '7a7988d0-01ed-4bfb-8fa3-28741ced67ed'::uuid; args=('6fef386efa7208eaf1c596b6ab2f8a5a3583696ef8649be0552ab3effad1e191', UUID('7a7988d0-01ed-4bfb-8fa3-28741ced67ed')); alias=default
celery-1  | [2024-11-20 03:04:27,821: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[77c1f8a3-93e7-4b46-9e3e-73e3db11409e] succeeded in 0.00400476303184405s: None




celery-1  | [2024-11-20 03:04:33,334: DEBUG/MainProcess] heartbeat_tick : for connection 9884df1b2b3c481899da38f31ce36670
celery-1  | [2024-11-20 03:04:33,334: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: None/None, now - 42/43, monotonic - 330934.476212158, last_heartbeat_sent - 330934.47621111, heartbeat int. - 20.0 for connection 9884df1b2b3c481899da38f31ce36670


celery-1  | [2024-11-20 03:04:40,002: DEBUG/MainProcess] heartbeat_tick : for connection 9884df1b2b3c481899da38f31ce36670
celery-1  | [2024-11-20 03:04:40,002: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 42/43, now - 42/44, monotonic - 330941.144286589, last_heartbeat_sent - 330934.47621111, heartbeat int. - 20.0 for connection 9884df1b2b3c481899da38f31ce36670
celery-1  | [2024-11-20 03:04:40,003: DEBUG/MainProcess] heartbeat_tick: sending heartbeat for connection 9884df1b2b3c481899da38f31ce36670
celery-1  | [2024-11-20 03:04:46,670: DEBUG/MainProcess] heartbeat_tick : for connection 9884df1b2b3c481899da38f31ce36670
celery-1  | [2024-11-20 03:04:46,670: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 42/44, now - 43/45, monotonic - 330947.812269588, last_heartbeat_sent - 330947.812268985, heartbeat int. - 20.0 for connection 9884df1b2b3c481899da38f31ce36670

edit2: I verified that we do run this test in dandi-api tests: https://github.com/dandi/dandi-archive/actions/runs/11925359257/job/33237382168?pr=2076#step:8:30

@yarikoptic
Member Author

Just now spotted in the django output:

django-1  |            INFO     "GET /api/auth/token/ HTTP/1.1" 200 42       basehttp.py:187
django-1  | /usr/local/lib/python3.11/site-packages/pydantic/main.py:390: UserWarning: Pydantic serializer warnings:
django-1  |   Failed to get discriminator value for tagged union serialization with value `{'schemaKey': 'Person', '... 'dcite:ContactPerson']}` - defaulting to left to right union serialization.
django-1  |   PydanticSerializationUnexpectedValue: Expected `Person` but got `dict` with value `{'schemaKey': 'Person', '... 'dcite:ContactPerson']}` - serialized value may not be as expected
django-1  |   PydanticSerializationUnexpectedValue: Expected `Organization` but got `dict` with value `{'schemaKey': 'Person', '... 'dcite:ContactPerson']}` - serialized value may not be as expected
django-1  |   Expected `enum` but got `str` with value `'spdx:CC0-1.0'` - serialized value may not be as expected
django-1  |   return self.__pydantic_serializer__.to_python(
django-1  |            INFO     "POST /api/dandisets/?embargo=false          basehttp.py:187
django-1  |                     HTTP/1.1" 200 614                                           

so maybe that is what undermines validation somewhere?

yarikoptic added a commit to dandi/dandi-archive that referenced this issue Nov 20, 2024
…rformed

While troubleshooting
- dandi/dandi-cli#1488

I would like to know if any of those tasks was run at all.  That is conditioned
on a prior check passing, so it is valuable to add an explicit debug-level
log entry if the condition was not met
@yarikoptic
Member Author

yarikoptic commented Nov 20, 2024

OK, I think that celery in our docker compose instance is not properly initialized and never registers its periodic tasks.
I base that on the fact that the additional logging added in

does not show up anywhere in the `docker compose logs` output, even after I set the django and celery log levels to DEBUG. @jjnesbitt @mvandenburgh @jwodder @candleindark -- any ideas on how that could happen and why https://github.com/dandi/dandi-archive/blob/HEAD/dandiapi/celery.py#L26 might not be in effect?

FTR: here is the patch to the docker compose file in our tests to enable more logging and to bind-mount dandi-archive/dandiapi so I could use the "development" version:
diff --git a/dandi/tests/data/dandiarchive-docker/docker-compose.yml b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
index cc37b5c5..9108cd5c 100644
--- a/dandi/tests/data/dandiarchive-docker/docker-compose.yml
+++ b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
@@ -24,6 +24,8 @@ services:
       DJANGO_DANDI_DANDISETS_LOG_BUCKET_NAME: dandiapi-dandisets-logs
       DJANGO_DANDI_DANDISETS_EMBARGO_BUCKET_NAME: dandi-embargoed-dandisets
       DJANGO_DANDI_DANDISETS_EMBARGO_LOG_BUCKET_NAME: dandiapi-embargo-dandisets-logs
+      # Pending https://github.com/dandi/dandi-archive/pull/2078
+      DJANGO_DANDI_LOG_LEVEL: DEBUG
       DJANGO_DATABASE_URL: postgres://postgres:postgres@postgres:5432/django
       DJANGO_MINIO_STORAGE_ACCESS_KEY: minioAccessKey
       DJANGO_MINIO_STORAGE_ENDPOINT: minio:9000
@@ -42,6 +44,8 @@ services:
       DANDI_ALLOW_LOCALHOST_URLS: "1"
     ports:
       - "127.0.0.1:8000:8000"
+    volumes:
+      - /home/yoh/proj/dandi/dandi-archive/dandiapi:/opt/django/dandiapi
 
   celery:
     image: dandiarchive/dandiarchive-api
@@ -49,7 +53,7 @@ services:
       "celery",
       "--app", "dandiapi.celery",
       "worker",
-      "--loglevel", "INFO",
+      "--loglevel", "DEBUG",
       "--without-heartbeat",
       "-Q","celery,calculate_sha256,ingest_zarr_archive,manifest-worker",
       "-c","1",

@jjnesbitt
Member

jjnesbitt commented Nov 20, 2024

Ahh, @yarikoptic, I think I see the issue. The way you tested your branch with the CLI docker compose file was to mount the dandiapi directory from your checked-out git branch into the container. That works just fine; however, that log statement is run from the celery container, not the django one, and it seems you only mounted the volume for the django container. Mounting it for both locally allows me to see that new log statement:

celery-1    | [18:08:31] INFO     Registering scheduled tasks for <Celery     scheduled.py:132
celery-1    |                     __main__ at 0x71ad933d2090>.
celery-1    |                     DANDI_VALIDATION_JOB_INTERVAL is 5 seconds.

Seeing that log with the scheduled tasks in the registered task printout, as well as seeing scheduled tasks running in the logs, seems to prove that scheduled tasks are being picked up just fine.

celery-1    | [tasks]
...
celery-1    |   . dandiapi.api.tasks.scheduled.aggregate_assets_summary_task
celery-1    |   . dandiapi.api.tasks.scheduled.refresh_materialized_view_search
celery-1    |   . dandiapi.api.tasks.scheduled.send_pending_users_email
celery-1    |   . dandiapi.api.tasks.scheduled.validate_draft_version_metadata
celery-1    |   . dandiapi.api.tasks.scheduled.validate_pending_asset_metadata
...

@yarikoptic
Member Author

Thank you @jjnesbitt, that makes total sense. I will retry with the modified source tree bind-mounted there too to get the logs.

@yarikoptic
Member Author

Indeed, with the bind mount also in place for celery I got the expected

celery-1    |            INFO     Registering scheduled tasks for <Celery     scheduled.py:132
celery-1    |                     __main__ at 0x7f0c08551d50>.                                
celery-1    |                     DANDI_VALIDATION_JOB_INTERVAL is 5 seconds.  

but no other logs showing e.g. validate_draft_version_metadata ever running.

@yarikoptic
Member Author

I have docker 26.1.5+dfsg1-4 from Debian

@jjnesbitt has docker 27.3.1 (likely from upstream)

@asmacdo what is your version of docker?

@asmacdo
Member

asmacdo commented Nov 20, 2024

Docker version 27.3.1, build ce12230 from fedora 40
Docker Compose version v2.29.7

@yarikoptic
Member Author

yarikoptic commented Nov 20, 2024

@asmacdo do you also observe 100% CPU of celery process while tests are waiting for dandiset to get validated?

@yarikoptic
Member Author

Also, both @asmacdo and @jjnesbitt: what is the output of

❯ docker exec -it dandiarchive-docker-celery-1 bash -c 'ulimit -n'
1073741816
❯ docker exec -it dandiarchive-docker-celery-1 python -c "import os;print(os.sysconf('SC_OPEN_MAX'))"
1073741816

for you while docker compose is running? That at least explains why celery (billiard) is busy for me, due to this loop: https://github.com/celery/billiard/blob/main/billiard/compat.py#L147
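For context, when it cannot enumerate the actually-open descriptors, a close-all-fds fallback of the kind linked above degenerates into a brute-force loop over every possible descriptor up to `SC_OPEN_MAX`. A rough sketch of the two strategies (not billiard's actual code) shows why a limit of ~2**30 is pathological — on the order of a billion `close()` calls per forked worker, each failing with EBADF:

```python
import errno
import os

def close_fds_naive(keep=(), limit=None):
    """Fallback strategy: try to close every possible fd below the limit.

    With nofile = 1073741816 (as shown in the containers above), this is
    ~10**9 close() syscalls per fork -- hence the 100% CPU and the endless
    EBADF stream in the strace output.
    """
    if limit is None:
        limit = os.sysconf("SC_OPEN_MAX")
    for fd in range(int(limit) - 1, 2, -1):  # descending, as in the strace above
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError as e:
            if e.errno != errno.EBADF:  # EBADF: fd was never open; ignore
                raise

def close_fds_proc(keep=()):
    """Fast path: close only the fds that are actually open, per /proc/self/fd."""
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd > 2 and fd not in keep:
            try:
                os.close(fd)
            except OSError:
                pass  # e.g. the fd listdir itself used, already gone
```

As the `/proc/8/fd` listing above shows, only about nine descriptors were actually open, so the `/proc`-based path finishes instantly regardless of the ulimit.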

@asmacdo
Member

asmacdo commented Nov 20, 2024

@yarikoptic yep, during dandi/tests/test_dandiapi.py::test_publish_and_manipulate celery was at or near 100% CPU (it also looks like it stays near 100% even after that test has timed out and the run has moved on to other tests).

[screenshot: process monitor showing the celery worker near 100% CPU]

To answer your other question:

$ docker exec -it dandiarchive-docker-celery-1 bash -c 'ulimit -n'
1073741816
$ docker exec -it dandiarchive-docker-celery-1 python -c "import os;print(os.sysconf('SC_OPEN_MAX'))"
1073741816

yarikoptic added a commit to yarikoptic/billiard that referenced this issue Nov 20, 2024
In my case I kept finding celery running at 100% and doing nothing.  py-spy
pointed to the close_open_fds and then ulimit inside the container showed gory
detail of

    ❯ docker run -it --rm --entrypoint bash dandiarchive/dandiarchive-api -c "ulimit -n"
    1073741816

The situation is not unique to me. See more at

dandi/dandi-cli#1488
@yarikoptic
Member Author

FWIW, submitted

With that patched billiard (kudos to the py-spy project, which allowed me to pinpoint the culprit) I proceed ... and the test completes green as expected!

The solution should be limiting the number of open files, ideally at the docker compose level, but for me that didn't work out yet.
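For reference, per-service limits in the compose file would look roughly like this (a sketch of the compose-level approach mentioned above, which hasn't worked out for me yet; behavior may depend on docker/compose versions):

```yaml
services:
  celery:
    ulimits:
      nofile:
        soft: 1024
        hard: 2048
```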

@yarikoptic
Member Author

I also addressed it at the system configuration level (ignore the storage-driver setting):

bilena# cat /etc/docker/daemon.json 
{
  "storage-driver": "btrfs",
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 2048,
      "Soft": 1024
    }
  }
}
