Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashing on hyde.cyverse.org #2

Open
dannymandel opened this issue Nov 28, 2023 · 13 comments
Open

Crashing on hyde.cyverse.org #2

dannymandel opened this issue Nov 28, 2023 · 13 comments
Assignees

Comments

@dannymandel
Copy link
Member

There have been a couple times where the Docker container has mysteriously crashed on hyde.cyverse.org. It is unclear what is causing the issue. On log inspection, there isn't really any smoking gun.

@dannymandel dannymandel self-assigned this Nov 28, 2023
@dannymandel
Copy link
Member Author

For the moment, have added

Restart=on-failure

to the service definition on hyde.

@dannymandel
Copy link
Member Author

That seemed to cause it to get killed immediately. Had to remove it.

@dannymandel
Copy link
Member Author

Actually, it seems to still happen regardless of whether the Restart statement is there or not.

@dannymandel
Copy link
Member Author

dannymandel@hyde:~$ sudo service isamples_modelserver status
○ isamples_modelserver.service - Docker Compose iSamples Central Application Service
     Loaded: loaded (/etc/systemd/system/isamples_modelserver.service; disabled; vendor preset: enabled)
     Active: inactive (dead)

Nov 28 13:11:26 hyde.cyverse.org docker[970989]: Container isamples_modelserver-isamples_modelserver-1  Started
Nov 28 13:11:26 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Stopping
Nov 28 13:11:26 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Stopping
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Stopped
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Removing
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Container isamples_modelserver-isamples_modelserver-1  Removed
Nov 28 13:11:38 hyde.cyverse.org docker[971280]: Network isamples_modelserver_default  Removing
Nov 28 13:11:39 hyde.cyverse.org docker[971280]: Network isamples_modelserver_default  Removed
Nov 28 13:11:39 hyde.cyverse.org systemd[1]: isamples_modelserver.service: Deactivated successfully.
Nov 28 13:11:39 hyde.cyverse.org systemd[1]: Finished Docker Compose iSamples Central Application Service.

@dannymandel
Copy link
Member Author

So it is a problem with running isamples_modelserver as a linux service. Works fine when I run it manually. When I run it as a service it spontaneously stops itself.

@dannymandel
Copy link
Member Author

Going to table the investigation for now, there is an easy workaround.

@dannymandel
Copy link
Member Author

Note that the crash happened again over the weekend, I still don't know what is up with it.

@dannymandel
Copy link
Member Author

I've been experimenting with some signal handlers and shutdown events, and can see this locally with the additional debug logging in place:

/Users/mandeld/.virtualenvs/isamples_modelserver/bin/python /Users/mandeld/iSamples/isamples_modelserver/main.py 
ERROR:root:Unable to locate fasttext model at path UNSET.  All predictions will return NOT_PROVIDED.
INFO:     Will watch for changes in these directories: ['/Users/mandeld/iSamples/isamples_modelserver']
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO:     Started reloader process [15382] using StatReload
ERROR:root:Unable to locate fasttext model at path UNSET.  All predictions will return NOT_PROVIDED.
Loading OpenContext sample model…
INFO:     Started server process [15386]
INFO:     Waiting for application startup.
Loading OpenContext material model…
Loading SESAR material model…
INFO:     Application startup complete.
Loading done.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
  File "<string>", line 1, in <module>
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
    self.run_forever()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
    handle._run()
  File "/Users/mandeld/.pyenv/versions/3.11.6/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/lifespan/on.py", line 86, in main
    await app(scope, self.receive, self.send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/middleware/errors.py", line 149, in __call__
    await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 55, in __call__
    await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 707, in __call__
    await self.lifespan(scope, receive, send)
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 569, in __aexit__
    await self._router.shutdown()
  File "/Users/mandeld/.virtualenvs/isamples_modelserver/lib/python3.11/site-packages/starlette/routing.py", line 666, in shutdown
    handler()
  File "/Users/mandeld/iSamples/isamples_modelserver/main.py", line 79, in on_shutdown
    traceback.print_stack()
INFO:     Application shutdown complete.
INFO:     Finished server process [15386]
Dumping stack trace because the python interpreter was killed.
INFO:     Stopping reloader process [15382]
  File "/Users/mandeld/iSamples/isamples_modelserver/main.py", line 56, in dump_stack_trace
    traceback.print_stack()

Process finished with exit code 0

I'm tempted to see if we can get any more information using something like this deployed out to hyde.

@dannymandel
Copy link
Member Author

Unfortunately it is only the last step in the shutdown process, but it at least registers that something is happening.

@dannymandel
Copy link
Member Author

On thinking about this a bit more (and remembering that the service wasn't working to begin with), I decided to correlate the outage times with hyde restart times and see if they are happening at the same time.

Last outage reported time (check happens once an hour): Sun, Dec 10, 2:47 AM (3 days ago)

Last hyde reboot time: system boot 2023-12-10 02:01

So I think rather than a crash this just might be a symptom of the linux service not coming up after reboot.

@dannymandel
Copy link
Member Author

I manually killed the container then restarted and it came up. This was the service file that seemed to work:

dannymandel@hyde:~$ cat /etc/systemd/system/isamples_modelserver.service 
[Unit]
Description=Docker Compose iSamples Central Modelserver Application Service
Requires=docker.service
After=isamples_central.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/home/isamples/isamples_modelserver
ExecStart=/usr/bin/docker compose up -d --build 
ExecStop=/usr/bin/docker compose down 
TimeoutStartSec=90

[Install]
WantedBy=multi-user.target

@dannymandel
Copy link
Member Author

The two changes I made recently were:

(1) Updated description to include modelserver to differentiate from the core container
(2) Changed the after=isamples_central.service (used to be docker)

Will keep an eye on it, but it is working for now.

@dannymandel
Copy link
Member Author

So I think the reason there were no logs when inspecting the crashed Docker container was that it just never started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant