Failing Containers #78
Comments
@nmulyk Your CADC certificate might be expired. Could you run a |
Hi @shinybrar, I had the same thought and tried that earlier. Oddly, I can spawn some jobs but most fail. |
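For anyone following along, here is a minimal sketch of a local check that the certificate file itself loads. It repeats the `load_cert_chain` call that raises `KEY_VALUES_MISMATCH` in the traceback in this issue. The path `~/.ssl/cadcproxy.pem` and the helper name `check_cert_chain` are assumptions for illustration; point it at whatever file your session actually uses. It does not check the expiry date, only that the certificate and private key in the file parse and belong together.

```python
import ssl
from pathlib import Path
from typing import Optional

# Assumed location of the CADC proxy certificate; adjust if yours differs.
DEFAULT_CERT = Path.home() / ".ssl" / "cadcproxy.pem"


def check_cert_chain(path: Path = DEFAULT_CERT) -> Optional[str]:
    """Return None if the PEM file loads as a certificate plus matching key,
    otherwise a short description of the failure."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # With only certfile given, the private key is read from the same file.
        context.load_cert_chain(certfile=str(path))
    except OSError as exc:  # ssl.SSLError and FileNotFoundError are both OSError
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    problem = check_cert_chain()
    print("certificate and key load cleanly" if problem is None else problem)
```

If this passes every time when run on its own but session creation still fails intermittently, the file on disk is probably not the whole story.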
I'm also trying to spawn containers with the following code, and I'm getting a similar error to Nicole after having tried
|
I've had several containers fail this week. I'm spawning containers with the following code:
payload2 = {
    'name': 'nicole-baseband',
    'image': 'images.canfar.net/chimefrb-public/baseband-analysis:latest',
    'cores': 2,
    'ram': 8,
    'kind': 'headless',
    'cmd': 'workflow',
    'args': 'run --site=canfar baseband-nmulyk',
    'env': {'SITE': 'canfar',
            'CHIME_FRB_ACCESS_TOKEN': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoia3NoaW4iLCJleHAiOjE3MjY2MDAyODEsImlzcyI6ImZyYi1tYXN0ZXIiLCJpYXQiOjE3MjY1OTg0ODF9.jbKUKw3QgfKaXZdHBYER63cONnfkPgEIMxtyJigp-DU',
            'CHIME_FRB_REFRESH_TOKEN': 'beb04f3faea1e802cf4a698a2df3f290bff9a112a6328619',
            # 'PYTHONPATH': '/arc/home/user/baseband-analysis/'  # you can set this if you want to run your local branch on this headless session
            },
    'replicas': 10
}
sid2 = s.create(**payload2)
Although I see the usual logs when I request 10 or fewer replicas (but more than half still fail), when I request more (~30 replicas) I get the following errors:
2024-12-20 16:29:34,000 - skaha-client-skaha.session - INFO - Creating 30 session(s) with parameters:
2024-12-20 16:29:34,000 INFO Creating 30 session(s) with parameters:
2024-12-20 16:29:34,004 - skaha-client-skaha.session - INFO - {'name': 'nicole-baseband', 'image': 'images.canfar.net/chimefrb-public/baseband-analysis:latest', 'cores': 2, 'ram': 8, 'kind': 'headless', 'cmd': 'workflow', 'args': 'run --site=canfar baseband-nmulyk', 'env': {'SITE': 'canfar', 'CHIME_FRB_ACCESS_TOKEN': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoia3NoaW4iLCJleHAiOjE3MjY2MDAyODEsImlzcyI6ImZyYi1tYXN0ZXIiLCJpYXQiOjE3MjY1OTg0ODF9.jbKUKw3QgfKaXZdHBYER63cONnfkPgEIMxtyJigp-DU', 'CHIME_FRB_REFRESH_TOKEN': 'beb04f3faea1e802cf4a698a2df3f290bff9a112a6328619'}}
2024-12-20 16:29:34,004 INFO {'name': 'nicole-baseband', 'image': 'images.canfar.net/chimefrb-public/baseband-analysis:latest', 'cores': 2, 'ram': 8, 'kind': 'headless', 'cmd': 'workflow', 'args': 'run --site=canfar baseband-nmulyk', 'env': {'SITE': 'canfar', 'CHIME_FRB_ACCESS_TOKEN': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoia3NoaW4iLCJleHAiOjE3MjY2MDAyODEsImlzcyI6ImZyYi1tYXN0ZXIiLCJpYXQiOjE3MjY1OTg0ODF9.jbKUKw3QgfKaXZdHBYER63cONnfkPgEIMxtyJigp-DU', 'CHIME_FRB_REFRESH_TOKEN': 'beb04f3faea1e802cf4a698a2df3f290bff9a112a6328619'}}
2024-12-20 16:29:42,713 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:42,716 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:43,078 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:43,848 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:44,842 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:44,846 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:44,868 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:45,927 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:46,109 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:46,174 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:47,125 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:47,269 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:47,371 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:47,636 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:47,981 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:47,982 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:48,032 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:48,045 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:48,112 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
2024-12-20 16:29:48,147 WARNING Connection pool is full, discarding connection: ws-uv.canfar.net. Connection pool size: 10
SSLError Traceback (most recent call last)
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
704 conn,
705 method,
706 url,
707 timeout=timeout_obj,
708 body=body,
709 headers=headers,
710 chunked=chunked,
711 )
713 # If we're going to release the connection in ``finally:``, then
714 # the response doesn't need to know about the connection. Otherwise
715 # it will also try to release it and we'll have a double-release
716 # mess.
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py:386, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
385 try:
--> 386 self._validate_conn(conn)
387 except (SocketTimeout, BaseSSLError) as e:
388 # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py:1042, in HTTPSConnectionPool._validate_conn(self, conn)
1041 if not getattr(conn, "sock", None): # AppEngine might not have `.sock`
-> 1042 conn.connect()
1044 if not conn.is_verified:
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/connection.py:419, in HTTPSConnection.connect(self)
417 context.load_default_certs()
--> 419 self.sock = ssl_wrap_socket(
420 sock=conn,
421 keyfile=self.key_file,
422 certfile=self.cert_file,
423 key_password=self.key_password,
424 ca_certs=self.ca_certs,
425 ca_cert_dir=self.ca_cert_dir,
426 ca_cert_data=self.ca_cert_data,
427 server_hostname=server_hostname,
428 ssl_context=context,
429 tls_in_tls=tls_in_tls,
430 )
432 # If we're using all defaults and the connection
433 # is TLSv1 or TLSv1.1 we throw a DeprecationWarning
434 # for the host.
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/util/ssl_.py:418, in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data, tls_in_tls)
417 if key_password is None:
--> 418 context.load_cert_chain(certfile, keyfile)
419 else:
SSLError: [X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:4071)
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
File /opt/pysetup/.venv/lib/python3.8/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
485 try:
--> 486 resp = conn.urlopen(
487 method=request.method,
488 url=url,
489 body=request.body,
490 headers=request.headers,
491 redirect=False,
492 assert_same_host=False,
493 preload_content=False,
494 decode_content=False,
495 retries=self.max_retries,
496 timeout=timeout,
497 chunked=chunked,
498 )
500 except (ProtocolError, OSError) as err:
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/connectionpool.py:787, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
785 e = ProtocolError("Connection aborted.", e)
--> 787 retries = retries.increment(
788 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
789 )
790 retries.sleep()
File /opt/pysetup/.venv/lib/python3.8/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
591 if new_retry.is_exhausted():
--> 592 raise MaxRetryError(_pool, url, error or ResponseError(cause))
594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
MaxRetryError: HTTPSConnectionPool(host='ws-uv.canfar.net', port=443): Max retries exceeded with url: /skaha/v0/session?name=nicole-baseband-8&image=images.canfar.net%2Fchimefrb-public%2Fbaseband-analysis%3Alatest&cores=2&ram=8&kind=headless&cmd=workflow&args=run+--site%3Dcanfar+baseband-nmulyk&env=SITE%3Dcanfar&env=CHIME_FRB_ACCESS_TOKEN%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoia3NoaW4iLCJleHAiOjE3MjY2MDAyODEsImlzcyI6ImZyYi1tYXN0ZXIiLCJpYXQiOjE3MjY1OTg0ODF9.jbKUKw3QgfKaXZdHBYER63cONnfkPgEIMxtyJigp-DU&env=CHIME_FRB_REFRESH_TOKEN%3Dbeb04f3faea1e802cf4a698a2df3f290bff9a112a6328619&env=REPLICA_ID%3D8&env=REPLICA_COUNT%3D30 (Caused by SSLError(SSLError(116, '[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:4071)')))
During handling of the above exception, another exception occurred:
SSLError Traceback (most recent call last)
Cell In[23], line 18
1 ### spawning 20 containers for all the baseband pipeline stuff (beamforming, merge, analysis)
2 ### you're of course free to use your own chime tokens -- otherwise you'll be masquerading as `kshin` :-)
3 payload2 = {
4 'name': 'nicole-baseband',
5 'image': 'images.canfar.net/chimefrb-public/baseband-analysis:latest',
(...)
16 'replicas': 30
17 }
---> 18 sid2 = s.create(**payload2)
File /opt/pysetup/.venv/lib/python3.8/site-packages/skaha/session.py:236, in Session.create(self, name, image, cores, ram, kind, gpu, cmd, args, env, replicas)
234 arguments.append({"url": self.server, "params": payload})
235 loop = get_event_loop()
--> 236 results = loop.run_until_complete(scale(self.session.post, arguments))
237 responses: List[str] = []
238 for response in results:
File /opt/pysetup/.venv/lib/python3.8/site-packages/nest_asyncio.py:90, in _patch_loop.<locals>.run_until_complete(self, future)
87 if not f.done():
88 raise RuntimeError(
89 'Event loop stopped before Future completed.')
---> 90 return f.result()
File /usr/local/lib/python3.8/asyncio/futures.py:178, in Future.result(self)
176 self.__log_traceback = False
177 if self._exception is not None:
--> 178 raise self._exception
179 return self._result
File /usr/local/lib/python3.8/asyncio/tasks.py:282, in Task.__step(failed resolving arguments)
280 result = coro.send(None)
281 else:
--> 282 result = coro.throw(exc)
283 except StopIteration as exc:
284 if self._must_cancel:
285 # Task is cancelled right before coro stops.
File /opt/pysetup/.venv/lib/python3.8/site-packages/skaha/utils/threaded.py:35, in scale(function, arguments)
30 loop = asyncio.get_event_loop()
31 futures = [
32 loop.run_in_executor(executor, partial(function, **arguments[index]))
33 for index in range(workers)
34 ]
---> 35 return await asyncio.gather(*futures)
File /usr/local/lib/python3.8/asyncio/tasks.py:349, in Task.__wakeup(self, future)
347 def __wakeup(self, future):
348 try:
--> 349 future.result()
350 except BaseException as exc:
351 # This may also be a cancellation.
352 self.__step(exc)
File /usr/local/lib/python3.8/concurrent/futures/thread.py:57, in _WorkItem.run(self)
54 return
56 try:
---> 57 result = self.fn(*self.args, **self.kwargs)
58 except BaseException as exc:
59 self.future.set_exception(exc)
File /opt/pysetup/.venv/lib/python3.8/site-packages/requests/sessions.py:635, in Session.post(self, url, data, json, **kwargs)
624 def post(self, url, data=None, json=None, **kwargs):
625 r"""Sends a POST request. Returns :class:`Response` object.
626
627 :param url: URL for the new :class:`Request` object.
(...)
632 :rtype: requests.Response
633 """
--> 635 return self.request("POST", url, data=data, json=json, **kwargs)
File /opt/pysetup/.venv/lib/python3.8/site-packages/requests/sessions.py:587, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
582 send_kwargs = {
583 "timeout": timeout,
584 "allow_redirects": allow_redirects,
585 }
586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
589 return resp
File /opt/pysetup/.venv/lib/python3.8/site-packages/requests/sessions.py:701, in Session.send(self, request, **kwargs)
698 start = preferred_clock()
700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
703 # Total elapsed time of the request (approximately)
704 elapsed = preferred_clock() - start
File /opt/pysetup/.venv/lib/python3.8/site-packages/requests/adapters.py:517, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
513 raise ProxyError(e, request=request)
515 if isinstance(e.reason, _SSLError):
516 # This branch is for urllib3 v1.22 and later.
--> 517 raise SSLError(e, request=request)
519 raise ConnectionError(e, request=request)
521 except ClosedPoolError as e:
SSLError: HTTPSConnectionPool(host='ws-uv.canfar.net', port=443): Max retries exceeded with url: /skaha/v0/session?name=nicole-baseband-8&image=images.canfar.net%2Fchimefrb-public%2Fbaseband-analysis%3Alatest&cores=2&ram=8&kind=headless&cmd=workflow&args=run+--site%3Dcanfar+baseband-nmulyk&env=SITE%3Dcanfar&env=CHIME_FRB_ACCESS_TOKEN%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoia3NoaW4iLCJleHAiOjE3MjY2MDAyODEsImlzcyI6ImZyYi1tYXN0ZXIiLCJpYXQiOjE3MjY1OTg0ODF9.jbKUKw3QgfKaXZdHBYER63cONnfkPgEIMxtyJigp-DU&env=CHIME_FRB_REFRESH_TOKEN%3Dbeb04f3faea1e802cf4a698a2df3f290bff9a112a6328619&env=REPLICA_ID%3D8&env=REPLICA_COUNT%3D30 (Caused by SSLError(SSLError(116, '[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:4071)')))
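The "Connection pool is full" warnings come from urllib3 when more requests run concurrently than the pool keeps open (10 here). On their own they are warnings, not the failure. One way to keep each call within that limit is to split a large request into batches of at most 10 replicas. The sketch below only does the arithmetic. The helper name `split_replicas` is hypothetical, and it is an assumption, not something confirmed in this thread, that submitting smaller batches avoids the `KEY_VALUES_MISMATCH` error.

```python
from typing import List


def split_replicas(total: int, batch_size: int = 10) -> List[int]:
    """Split a replica count into batch sizes no larger than batch_size."""
    if total < 0 or batch_size < 1:
        raise ValueError("total must be >= 0 and batch_size must be >= 1")
    full, remainder = divmod(total, batch_size)
    # Full batches first, then one smaller batch for whatever is left over.
    return [batch_size] * full + ([remainder] if remainder else [])


if __name__ == "__main__":
    for count in split_replicas(30):
        # Each count could be passed as 'replicas' in a payload like payload2.
        print(count)
```

Note that the failing URL above shows sessions named `nicole-baseband-8` with `REPLICA_ID` and `REPLICA_COUNT` set per call, so separate batches would likely reuse names and replica IDs unless the base `name` is varied per batch.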