test_publish_and_manipulate and test_get_dandiset_published fail locally #1488
Comments
@yarikoptic Those tests involve publishing a Dandiset, and it seems that the publication did not complete in time. What happens if you rerun the tests?
they were and still are (on master) failing consistently on my attempts. Full log: http://www.oneukrainian.com/tmp/pytest-fail-20240815.txt. I also note that celery is at 100% CPU (not sure what could keep it that busy) - do you (or maybe @jjnesbitt or @mvandenburgh) have an idea what it could be doing or how to figure it out? The celery container logs show that it completes its jobs in milliseconds:

[2024-08-15 15:51:30,720: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[47f0c15a-7d49-4208-aeda-9c8e01c18980] received: ((UUID('a33573ab-acc0-4d14-b823-91afa51b82b2'),), {})
[2024-08-15 15:51:30,736: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[47f0c15a-7d49-4208-aeda-9c8e01c18980]: Calculating sha256 checksum for asset blob a33573ab-acc0-4d14-b823-91afa51b82b2
[2024-08-15 15:51:30,742: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[47f0c15a-7d49-4208-aeda-9c8e01c18980] succeeded in 0.02144717297051102s: None
[2024-08-15 15:51:30,909: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[b4f586df-dadf-4841-ae26-a7e733c4dd60] received: ((UUID('5bdd712a-04cc-40ac-b73f-98f614743b27'),), {})
[2024-08-15 15:51:30,910: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[b4f586df-dadf-4841-ae26-a7e733c4dd60]: Calculating sha256 checksum for asset blob 5bdd712a-04cc-40ac-b73f-98f614743b27
[2024-08-15 15:51:30,914: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[b4f586df-dadf-4841-ae26-a7e733c4dd60] succeeded in 0.005075494991615415s: None
[2024-08-15 15:53:31,583: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[11a11275-94c1-40ff-adec-e87f012e4baa] received: ((UUID('091711e2-af2e-44f5-8fec-85eb49ce1ff9'),), {})
[2024-08-15 15:53:31,585: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[11a11275-94c1-40ff-adec-e87f012e4baa]: Calculating sha256 checksum for asset blob 091711e2-af2e-44f5-8fec-85eb49ce1ff9
[2024-08-15 15:53:31,588: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[11a11275-94c1-40ff-adec-e87f012e4baa] succeeded in 0.004132542992010713s: None
[2024-08-15 15:53:31,758: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[9170e368-2588-4999-a1c4-4f2c6ce910ef] received: ((UUID('3b7be16d-38e1-4f8f-8046-36936890eac8'),), {})
[2024-08-15 15:53:31,759: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[9170e368-2588-4999-a1c4-4f2c6ce910ef]: Calculating sha256 checksum for asset blob 3b7be16d-38e1-4f8f-8046-36936890eac8
[2024-08-15 15:53:31,763: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[9170e368-2588-4999-a1c4-4f2c6ce910ef] succeeded in 0.004479025024920702s: None
[2024-08-15 15:53:31,938: INFO/MainProcess] Task dandiapi.api.tasks.calculate_sha256[2da50b13-a501-4881-ad8d-0ccd1c6e1f8c] received: ((UUID('1e95e5b4-8fe2-4f6a-8d99-9fbf86646a72'),), {})
[2024-08-15 15:53:31,939: INFO/ForkPoolWorker-2] dandiapi.api.tasks.calculate_sha256[2da50b13-a501-4881-ad8d-0ccd1c6e1f8c]: Calculating sha256 checksum for asset blob 1e95e5b4-8fe2-4f6a-8d99-9fbf86646a72
[2024-08-15 15:53:31,943: INFO/ForkPoolWorker-2] Task dandiapi.api.tasks.calculate_sha256[2da50b13-a501-4881-ad8d-0ccd1c6e1f8c] succeeded in 0.004295543010812253s: None
I haven't gotten this to pass locally either. I can provide full logs as well if helpful. (FWIW, there are other tests breaking as well; this is just the first one.) I tried more than doubling the wait time, still not passing. I guess something in the environment is broken?

diff --git a/dandi/dandiapi.py b/dandi/dandiapi.py
index 5ea504de..8b957c4f 100644
--- a/dandi/dandiapi.py
+++ b/dandi/dandiapi.py
@@ -1086,7 +1086,7 @@ class RemoteDandiset:
json={"metadata": metadata, "name": metadata.get("name", "")},
)
- def wait_until_valid(self, max_time: float = 120) -> None:
+ def wait_until_valid(self, max_time: float = 250) -> None:
"""
Wait at most ``max_time`` seconds for the Dandiset to be valid for
publication. If the Dandiset does not become valid in time, a
These others fail for me too, in case that's related.
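For context, here is a minimal sketch (not the actual test code; the instance URL, token, and dandiset ID are placeholders) of the client-side flow that the wait_until_valid timeout patched above gates: the draft has to become valid before publish() is attempted.

from dandi.dandiapi import DandiAPIClient

# Hypothetical local docker-compose instance and API key; adjust as needed.
with DandiAPIClient(api_url="http://localhost:8000/api", token="<local-api-key>") as client:
    dandiset = client.get_dandiset("000001", "draft")  # placeholder dandiset ID
    # Poll the draft's validation status; raises if it does not become valid
    # within max_time seconds (120 by default, 250 with the patch above).
    dandiset.wait_until_valid(max_time=250)
    published = dandiset.publish()  # RemoteDandiset for the newly published version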
digging through the history of issues, it seems we had a similar issue before.
Since then there have been a good number of changes to that compose file:

diff --git a/dandi/tests/data/dandiarchive-docker/docker-compose.yml b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
index 9ce5e464..3f132d3e 100644
--- a/dandi/tests/data/dandiarchive-docker/docker-compose.yml
+++ b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
@@ -4,20 +4,7 @@
# <https://github.com/dandi/dandi-api/blob/master/docker-compose.override.yml>,
# but using images uploaded to Docker Hub instead of building them locally.
-version: '2.1'
-
services:
- redirector:
- image: dandiarchive/dandiarchive-redirector
- depends_on:
- - django
- ports:
- - "8079:8080"
- environment:
- #GUI_URL: http://localhost:8086
- ABOUT_URL: http://www.dandiarchive.org
- API_URL: http://localhost:8000/api
-
django:
image: dandiarchive/dandiarchive-api
command: ["./manage.py", "runserver", "--nothreading", "0.0.0.0:8000"]
@@ -30,23 +17,31 @@ services:
condition: service_healthy
rabbitmq:
condition: service_started
- environment:
+ environment: &django_env
DJANGO_CELERY_BROKER_URL: amqp://rabbitmq:5672/
DJANGO_CONFIGURATION: DevelopmentConfiguration
DJANGO_DANDI_DANDISETS_BUCKET_NAME: dandi-dandisets
+ DJANGO_DANDI_DANDISETS_LOG_BUCKET_NAME: dandiapi-dandisets-logs
DJANGO_DANDI_DANDISETS_EMBARGO_BUCKET_NAME: dandi-embargoed-dandisets
+ DJANGO_DANDI_DANDISETS_EMBARGO_LOG_BUCKET_NAME: dandiapi-embargo-dandisets-logs
DJANGO_DATABASE_URL: postgres://postgres:postgres@postgres:5432/django
DJANGO_MINIO_STORAGE_ACCESS_KEY: minioAccessKey
DJANGO_MINIO_STORAGE_ENDPOINT: minio:9000
DJANGO_MINIO_STORAGE_SECRET_KEY: minioSecretKey
DJANGO_STORAGE_BUCKET_NAME: django-storage
- DJANGO_MINIO_STORAGE_MEDIA_URL: http://localhost:9000/django-storage
- DJANGO_DANDI_SCHEMA_VERSION:
+ # The Minio URL needs to use 127.0.0.1 instead of localhost so that blob
+ # assets' "S3 URLs" will use 127.0.0.1, and thus tests that try to open
+ # these URLs via fsspec will not fail on systems where localhost is both
+ # 127.0.0.1 and ::1.
+ DJANGO_MINIO_STORAGE_MEDIA_URL: http://127.0.0.1:9000/django-storage
+ DJANGO_DANDI_SCHEMA_VERSION: ~
DJANGO_DANDI_WEB_APP_URL: http://localhost:8085
DJANGO_DANDI_API_URL: http://localhost:8000
+ DJANGO_DANDI_JUPYTERHUB_URL: https://hub.dandiarchive.org
+ DJANGO_DANDI_DEV_EMAIL: "[email protected]"
DANDI_ALLOW_LOCALHOST_URLS: "1"
ports:
- - "8000:8000"
+ - "127.0.0.1:8000:8000"
celery:
image: dandiarchive/dandiarchive-api
@@ -70,21 +65,8 @@ services:
rabbitmq:
condition: service_started
environment:
- DJANGO_CELERY_BROKER_URL: amqp://rabbitmq:5672/
- DJANGO_CONFIGURATION: DevelopmentConfiguration
- DJANGO_DANDI_DANDISETS_BUCKET_NAME: dandi-dandisets
- DJANGO_DANDI_DANDISETS_EMBARGO_BUCKET_NAME: dandi-embargoed-dandisets
- DJANGO_DATABASE_URL: postgres://postgres:postgres@postgres:5432/django
- DJANGO_MINIO_STORAGE_ACCESS_KEY: minioAccessKey
- DJANGO_MINIO_STORAGE_ENDPOINT: minio:9000
- DJANGO_MINIO_STORAGE_SECRET_KEY: minioSecretKey
- DJANGO_STORAGE_BUCKET_NAME: django-storage
- DJANGO_MINIO_STORAGE_MEDIA_URL: http://localhost:9000/django-storage
- DJANGO_DANDI_SCHEMA_VERSION:
+ << : *django_env
DJANGO_DANDI_VALIDATION_JOB_INTERVAL: "5"
- DJANGO_DANDI_WEB_APP_URL: http://localhost:8085
- DJANGO_DANDI_API_URL: http://localhost:8000
- DANDI_ALLOW_LOCALHOST_URLS: "1"
minio:
image: minio/minio:RELEASE.2022-04-12T06-55-35Z
@@ -92,7 +74,7 @@ services:
tty: true
command: ["server", "/data"]
ports:
- - "9000:9000"
+ - "127.0.0.1:9000:9000"
environment:
MINIO_ACCESS_KEY: minioAccessKey
MINIO_SECRET_KEY: minioSecretKey
@@ -107,8 +89,8 @@ services:
POSTGRES_DB: django
POSTGRES_PASSWORD: postgres
image: postgres
- ports:
- - "5432:5432"
+ expose:
+ - "5432"
healthcheck:
test: ["CMD", "pg_isready", "-U", "postgres"]
interval: 7s
@@ -117,5 +99,5 @@ services:
rabbitmq:
image: rabbitmq:management
- ports:
- - "5672:5672"
+ expose:
+ - "5672"
but neither minio version nor …

edit 1: celery is "busy" trying to close some incrementing handle, and that is what it is wasting CPU on:

root@d49aa92590f7:/opt/django# strace -f -p 8 2>&1 | head
strace: Process 8 attached
close(928972594) = -1 EBADF (Bad file descriptor)
close(928972593) = -1 EBADF (Bad file descriptor)
close(928972592) = -1 EBADF (Bad file descriptor)
close(928972591) = -1 EBADF (Bad file descriptor)
close(928972590) = -1 EBADF (Bad file descriptor)
close(928972589) = -1 EBADF (Bad file descriptor)
close(928972588) = -1 EBADF (Bad file descriptor)
close(928972587) = -1 EBADF (Bad file descriptor)
close(928972586) = -1 EBADF (Bad file descriptor)
root@d49aa92590f7:/opt/django# ls -l /proc/8/fd/
total 0
lrwx------ 1 root root 64 Nov 20 03:06 0 -> /dev/null
l-wx------ 1 root root 64 Nov 20 03:06 1 -> 'pipe:[5581143]'
l-wx------ 1 root root 64 Nov 20 03:06 2 -> 'pipe:[5581144]'
lrwx------ 1 root root 64 Nov 20 03:06 3 -> 'socket:[5575530]'
lrwx------ 1 root root 64 Nov 20 03:06 4 -> 'socket:[5574189]'
lrwx------ 1 root root 64 Nov 20 03:06 5 -> 'socket:[5576134]'
lrwx------ 1 root root 64 Nov 20 03:06 6 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Nov 20 03:06 7 -> /dev/null
l-wx------ 1 root root 64 Nov 20 03:06 8 -> 'pipe:[5578287]'
even though in the logs it seems to be ok.
edit 2: I verified that we do run this test in dandi-api tests: https://github.com/dandi/dandi-archive/actions/runs/11925359257/job/33237382168?pr=2076#step:8:30
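The enormous descriptor numbers in the strace above suggest a loop sweeping the entire file-descriptor range, whose upper bound is the process's open-files limit. A quick stdlib check of that limit from inside the container (a sketch; it assumes python is available in the celery container, e.g. via docker compose exec):

import resource

# RLIMIT_NOFILE is the (soft, hard) cap on file descriptors for this process;
# the soft value is the upper bound a "close everything" sweep iterates over.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")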
just now spotted in the django outputs …
so maybe that is what undermines validation somewhere?
…rformed While troubleshooting dandi/dandi-cli#1488, I would like to know if any of those tasks was run at all. That is conditioned on the correct condition happening first, so it is valuable to add an explicit debug-level log entry if the condition was not met.
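A rough sketch of the kind of guard being described (the names are hypothetical; the point is only that skipping the task becomes visible at DEBUG level rather than silent):

import logging

logger = logging.getLogger(__name__)

def maybe_run_validation(condition_met: bool) -> None:
    # Hypothetical wrapper: if the gating condition is not met, say so in the
    # logs instead of silently doing nothing, so troubleshooting runs like
    # dandi/dandi-cli#1488 can tell whether the task was ever attempted.
    if not condition_met:
        logger.debug("Not running validation task: gating condition not met")
        return
    ...  # run the actual task here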
ok, I think that celery in our docker compose instance is not properly initialized and never registers for periodic tasks. … does not show up anywhere in the logs for the …

FTR: here is the patch to the docker compose in our tests to enable more logging, and to bind-mount dandi-archive/dandiapi so I could use the "development" version:

diff --git a/dandi/tests/data/dandiarchive-docker/docker-compose.yml b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
index cc37b5c5..9108cd5c 100644
--- a/dandi/tests/data/dandiarchive-docker/docker-compose.yml
+++ b/dandi/tests/data/dandiarchive-docker/docker-compose.yml
@@ -24,6 +24,8 @@ services:
DJANGO_DANDI_DANDISETS_LOG_BUCKET_NAME: dandiapi-dandisets-logs
DJANGO_DANDI_DANDISETS_EMBARGO_BUCKET_NAME: dandi-embargoed-dandisets
DJANGO_DANDI_DANDISETS_EMBARGO_LOG_BUCKET_NAME: dandiapi-embargo-dandisets-logs
+ # Pending https://github.com/dandi/dandi-archive/pull/2078
+ DJANGO_DANDI_LOG_LEVEL: DEBUG
DJANGO_DATABASE_URL: postgres://postgres:postgres@postgres:5432/django
DJANGO_MINIO_STORAGE_ACCESS_KEY: minioAccessKey
DJANGO_MINIO_STORAGE_ENDPOINT: minio:9000
@@ -42,6 +44,8 @@ services:
DANDI_ALLOW_LOCALHOST_URLS: "1"
ports:
- "127.0.0.1:8000:8000"
+ volumes:
+ - /home/yoh/proj/dandi/dandi-archive/dandiapi:/opt/django/dandiapi
celery:
image: dandiarchive/dandiarchive-api
@@ -49,7 +53,7 @@ services:
"celery",
"--app", "dandiapi.celery",
"worker",
- "--loglevel", "INFO",
+    "--loglevel", "DEBUG",
"--without-heartbeat",
"-Q","celery,calculate_sha256,ingest_zarr_archive,manifest-worker",
"-c","1",
Ahh, @yarikoptic I think I see the issue. The way you tested your branch with the CLI docker compose file was to mount the …

Seeing that log with the scheduled tasks in the registered task printout, as well as seeing scheduled tasks running in the logs, seems to prove that scheduled tasks are being picked up just fine.
Thank you @jjnesbitt, makes total sense. I will retry with bind-mounting the modified source tree there too to get logs.
indeed, with the bind mount also for celery I got the expected …
but no other logs on having e.g. …
I have docker 26.1.5+dfsg1-4 from Debian; @jjnesbitt has docker 27.3.1 (likely from upstream). @asmacdo, what is your version of docker?
Docker version 27.3.1, build ce12230, from Fedora 40.
@asmacdo do you also observe 100% CPU in the celery process while the tests are waiting for the dandiset to get validated?
also, both @asmacdo and @jjnesbitt: what is the output of … for you while …?
@yarikoptic yep, during … To answer your other question: …
In my case I kept finding celery running at 100% and doing nothing. py-spy pointed to close_open_fds, and then ulimit inside the container showed the gory detail of:

❯ docker run -it --rm --entrypoint bash dandiarchive/dandiarchive-api -c "ulimit -n"
1073741816

The situation is not unique to me. See more at dandi/dandi-cli#1488
FWIW, submitted … With that patched billiard (kudos to the py-spy project, which allowed me to pinpoint the culprit) I proceeded ... and the test completes green as expected! The solution should be limiting the number of open files, ideally at the docker compose level, but for me that didn't work out yet.
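To illustrate why that limit matters (a naive sketch of the pattern, not billiard's actual implementation): the cost of a close-every-descriptor sweep is linear in the soft RLIMIT_NOFILE, so a limit of 1073741816 means on the order of a billion close() calls each time a worker process is spawned.

import os
import resource

def close_fds_naively() -> None:
    """Close every descriptor above stderr, up to the soft open-files limit.

    Do not call this in a live process; it is only here to show the cost.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    for fd in range(3, soft):
        try:
            os.close(fd)
        except OSError:
            # Almost all candidates are not open, so this is overwhelmingly
            # EBADF, matching the strace output earlier in this thread.
            pass

A more robust approach enumerates /proc/self/fd (or uses os.closerange() with a sane cap) instead of sweeping the whole range, which is also why simply lowering the ulimit makes the spin disappear.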
I also addressed it at the system configuration level (ignore the …):

bilena# cat /etc/docker/daemon.json
{
"storage-driver": "btrfs",
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 2048,
"Soft": 1024
}
}
}
This is with the "Audit table from Docker env upon completion of testing" #1484 state (0.63.0-6-g782e959b), but I don't think it should relate anyhow, since that PR is only for the finalization of the session fixture, so it shouldn't affect individual tests.
@jwodder - any immediate ideas?
I have …