Occasional issues with sqlite tool cache #406

Sometimes on helm upgrades of existing setups I see the following issue on startup with the sqlite tool cache (which is shared by containers via a shared filesystem): … I wonder if this tool cache could be local to each container instead to improve performance? DBs in general don't like shared file systems.
Comments
Usually downscaling all galaxy processes to zero, moving the directory …
Could this be related to …? I have seen it happen now and again, and that also showed up like this.
Partly related to #391 (but not the same).
This workaround results in the following: …
I have seen this sometimes as well; it goes away after downscaling, deleting it, and upscaling. Agreed, it is undesirable. However, I was making so many changes that I wasn't sure what was generating it.
Thanks for getting back so quickly :) (and apologies for my super curt comment). In the end we made the issue go away by repeatedly restarting various components in semi-random orders. It does make the whole galaxy stack feel rather fragile though. From a more conceptual point of view, do you know if there is a reason for using sqlite, rather than the postgresql database galaxy has access to anyway?
actually, always downscaling before a helm upgrade avoids this. However, it only happens on some of our deployments, so I'm not sure what the original trigger is.
Hmmn, interesting. I'll dig a bit deeper on my end once the galaxy deployment is no longer in active use.
...mmm... I have hit a case where downscaling still shows the failure. I had manually installed a tool from the UI before the helm upgrade, and then on restart started to see these errors.
If I look at the actual files when this is happening, what I see is that they are all empty. Could it be that downscaling the containers doesn't wait for them to shut down processes correctly? Maybe we need some instructions on the deployment to do certain things when downscaling to zero?
You can turn `enable_tool_document_cache` off.
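If it helps, this is roughly what that looks like in galaxy.yml; a sketch only, using the option names that appear in the patch further down this thread, and how the helm chart injects these values may differ between chart versions.

```yaml
# galaxy.yml sketch: either turn the sqlite tool document cache off, or keep it
# but point it at a container-local path instead of the shared volume.
galaxy:
  enable_tool_document_cache: false
  # ...or, keeping the cache enabled, make it node-local (hypothetical path):
  # tool_cache_data_dir: /tmp/galaxy_tool_cache
```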
Thanks @mvdbeek, this really resolves a problem for us.
That sounds like a great option, I'll definitely try that!
I am re-opening this issue as I am running into the same problem running benchmarks on AnVIL-like instances, that is, Galaxy installed with the galaxykubeman-helm chart onto a GKE instance. In my case I am re-writing TPV config files with … Given the sporadic nature of the exception, it feels like a race condition between the pods that are starting. But what is with all the zero-length tmp files? Could we be seeing a race someplace? Would it be sufficient (for now) to simply always open the cache in create mode?
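For what it's worth, a zero-length file on its own is enough to trigger the failure when the cache is opened read-only. Below is a minimal sketch (not Galaxy code), assuming sqlitedict's flag semantics, where "r" refuses to open a database whose table doesn't exist and "c" would create it.

```python
# Minimal reproduction sketch, not Galaxy code: an empty cache file plus a
# read-only open fails, which is what a startup/copy race could leave behind
# on the shared volume.
import sqlite3
import tempfile
from pathlib import Path

from sqlitedict import SqliteDict

cache_file = Path(tempfile.mkdtemp()) / "cache.sqlite"
cache_file.touch()  # simulate the zero-length file another pod left behind

try:
    # flag="r" mirrors the read-only open; flag="c" would create the table instead
    cache = SqliteDict(str(cache_file), flag="r")
except (sqlite3.OperationalError, RuntimeError) as exc:
    print(f"tool document cache unavailable: {exc!r}")
```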
awesome, does that mean I can remove the cache from Galaxy as well?
I think so. If we need its functionality in the future, I like your toolbox microservice idea.
The cache made a significant improvement to performance, especially when the cache itself was coming off cvmfs. Startup times were down to a few seconds. Now, it's in the minutes again. If we can somehow streamline this functionality, I think it would make a sizeable impact on performance and day to day administrative experience. I don't know what the toolbox microservice idea is though.
There's a lot at play when measuring startup time; that it now takes minutes is something I would only attribute to the cache if you have a before/after comparison.
I can run some timing tests this week, but I think for my use case of running … If I understood Marius's comment, the toolbox microservice idea is to implement the toolbox (and cache?) as a microservice rather than having multiple processes trying to access the same Sqlite database file.
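Purely as an illustration of that idea, here is a sketch of what such a service could look like: a tiny HTTP endpoint that hands out pre-serialized tool documents from a directory, so web and job handler pods never share a sqlite file. Everything here (paths, file layout, port) is a hypothetical assumption, not an existing Galaxy component.

```python
# Hypothetical "toolbox microservice" sketch: serve cached tool documents over
# HTTP instead of sharing one sqlite file between pods. The <tool_id>.json.gz
# directory layout is an assumption for illustration only.
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import unquote

CACHE_DIR = Path("/srv/tool-document-cache")  # hypothetical location


class ToolDocumentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Look up a pre-serialized tool document by id, e.g. GET /<tool_id>
        tool_id = unquote(self.path.lstrip("/"))
        doc_path = (CACHE_DIR / f"{tool_id}.json.gz").resolve()
        if CACHE_DIR not in doc_path.parents or not doc_path.is_file():
            self.send_error(404, "unknown tool id")
            return
        payload = gzip.decompress(doc_path.read_bytes())
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ToolDocumentHandler).serve_forever()
```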
If it's generally beneficial to support, can we not just put a simple (file?) lock around the tool cache I/O?
General, as in a project that is not AnVIL? I'd say no; even booting from my instance with cold cvmfs it doesn't make a difference. Why it makes a difference on AnVIL, if it does? IDK.

I don't think locking will solve the access issue in #406 (comment); that's got something to do with an existing table (in RO) being read for which …

We should definitely try to get away from parsing and holding all tools in memory, that will benefit all deployment scenarios. A persistent service that can serve up tool objects is kind of the extension of the cache idea, where my main conclusion was that there's no easy, portable and reliable way to have a cache that can be read by multiple hosts, processes and threads. One of the previous implementations of the cache was a directory of zipped json files ... it had other issues, but we could have a simple service that can serve these up on demand.
I think it might solve this problem. My working theory is that there is a race for that Sqlite database file: one process has started creating the database when another process comes along, sees the file exists, so tries to open it in read mode, only to find the table hasn't actually been created yet.
Agreed. And we would have to make sure we don't trade a race condition for a deadlock. Having a separate service for the tool cache is likely the easiest and most robust solution (short of doing away with it entirely).
Yep, that was my understanding. I don't see a way to actually reproduce the error with an exclusive file lock in place. It shouldn't be possible.
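For concreteness, the kind of guard being discussed might look roughly like this. It is only a sketch, assuming the third-party filelock package and a hypothetical shared-volume path; it is not existing Galaxy code.

```python
# Sketch of the proposed guard, not existing Galaxy code: take one exclusive
# file lock around opening the shared sqlite cache so the create-vs-read race
# cannot happen. Assumes the third-party `filelock` package.
import sqlite3

from filelock import FileLock, Timeout
from sqlitedict import SqliteDict

CACHE_FILE = "/galaxy/data/tool_cache/cache.sqlite"  # hypothetical shared-volume path


def open_tool_cache(flag="r"):
    lock = FileLock(CACHE_FILE + ".lock")
    try:
        # A bounded wait avoids trading the race for a deadlock if another pod
        # holds the lock for too long.
        with lock.acquire(timeout=30):
            return SqliteDict(CACHE_FILE, flag=flag, autocommit=False)
    except (Timeout, sqlite3.OperationalError, RuntimeError):
        return None  # degrade gracefully, as the current code does on errors
```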
I'm puzzled by why you are not experiencing a slowdown. Could it be due to proximity to cvmfs or something? The reason is that in our observations, especially with a full toolbox, the bulk of the time was spent on fetching files from cvmfs. Especially in cold start scenarios, reading a thousand files off of cvmfs "should" be slow, since it implies 1000 sequential http requests? In contrast, when the sqlite database was distributed over cvmfs itself, it resulted in a pretty remarkable speedup. It was like swapping a horse and buggy for a Bugatti - unbelievably fast. Once the sqlite database was taken out of cvmfs and had to be rebuilt on the fly, cold start times became slow again, but restarts were fast since the database was already populated. @afgane Can you confirm that this is your experience as well, and none of this is a memory error on my part?
They are RO since they are on cvmfs, that's not it.
CVMFS is RO, but the cache itself isn't. The cache does a copy-on-write, which is where I think things go wrong. At least that …
what did you do to cvmfs to have it writable? it isn't writable for me. maybe turn that off?
(which is not to say don't try a filelock too; if that fixes it then I'm happy)
I didn't do anything, I am just trying to figure it out. Git blame says you wrote that line 😉. From what I can figure out, the cache copies the files from CVMFS to a Kubernetes persistent disk, which is why the cache is writeable. I assume the cache is writeable so tools can be added by the user/admin?
that's guarded by the file being writeable, so you should never get there.
oh, that seems like a problem, that should go to a different shed_tool_conf.xml file with a separate cache (or no cache).
Yet we do. Or at least it is the next line …
That might also explain another problem I am seeing with tools going missing when relaunching a cluster. I will install a workflow with planemo, shut down the cluster (but keep the disks), launch a new cluster and attach the existing disks. This is to simulate users on Terra that pause/resume their Galaxy cluster. However, the workflow no longer runs as tools are not available and no longer appear in the tool panel. Trying to reinstall the tools with Planemo doesn't work as it thinks the tools are present. I haven't had a chance to see if disabling the tool cache solves that problem. It is also rather moot as Terra users can't install tools, but it is odd.
They actually can as they have full admin access on Galaxy. I don't recall seeing (or hearing about) this issue so I wonder if it could be new?
Unfortunately I do not have a confident recollection on the topic so the best option would be to test a couple of launches with the cache on & off.
Whoops, you are right; I was looking in the wrong place for the Tool Install page... I only started re-using the disks for the latest benchmarking runs so I didn't notice the problem until now. Maybe because I am installing tools via the API (planemo) rather than the UI?
I don't follow, what next line do you mean?
Also, can you post the stack we're discussing again if it's not the one in the opening post?
It is the same stack trace, but I think I have not been very good at describing what I think is going on. In particular, I don't think the problem code is where the exception is being raised. The exception is raised in …
Sorry, on the line after the one I linked to above, which is in the … This is where I think the problem really happens, as there are only a few places in the code that actually manipulate/remove the database file. There may be others. Since I am running a … At least that is my theory.
Right, but then that comes back to the cache being RO, so your persist is a NO-OP. I don't think the cvmfs mount should be writable, it's not necessary for user installs (those should go to a different shed_tool_conf.xml file).
The CVMFS mount is RO, but the cache is RW.
The only place I can see that the … Unless you have another theory that can explain 1) why the …
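If the rebuild path really does delete and recreate the database in place, one generic way to keep readers from ever seeing a missing or zero-length file is to build the new database elsewhere and swap it in atomically. Below is a sketch of that pattern under those assumptions, not Galaxy's actual persist code.

```python
# Generic write-then-rename pattern, not Galaxy's persist code: build the new
# cache under a temporary name on the same filesystem and atomically replace
# the old one, so other pods see either the complete old file or the complete
# new file, never a deleted or half-written database.
import os

from sqlitedict import SqliteDict

CACHE_FILE = "/galaxy/data/tool_cache/cache.sqlite"  # hypothetical shared-volume path


def rebuild_cache(tool_documents):
    tmp_path = f"{CACHE_FILE}.{os.getpid()}.tmp"  # unique per writer process
    with SqliteDict(tmp_path, flag="n", autocommit=False) as cache:
        for tool_id, document in tool_documents.items():
            cache[tool_id] = document
        cache.commit()
    os.replace(tmp_path, CACHE_FILE)  # atomic rename on the same filesystem
```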
You should have

```diff
diff --git a/lib/galaxy/tools/__init__.py b/lib/galaxy/tools/__init__.py
index 2c76989717..b5ebed5f98 100644
--- a/lib/galaxy/tools/__init__.py
+++ b/lib/galaxy/tools/__init__.py
@@ -490,13 +490,13 @@ class ToolBox(AbstractToolBox):
         return self._tools_by_id
 
     def get_cache_region(self, tool_cache_data_dir):
-        if self.app.config.enable_tool_document_cache:
+        if self.app.config.enable_tool_document_cache and tool_cache_data_dir:
             if tool_cache_data_dir not in self.cache_regions:
                 self.cache_regions[tool_cache_data_dir] = ToolDocumentCache(cache_dir=tool_cache_data_dir)
             return self.cache_regions[tool_cache_data_dir]
 
     def create_tool(self, config_file, tool_cache_data_dir=None, **kwds):
-        cache = self.get_cache_region(tool_cache_data_dir or self.app.config.tool_cache_data_dir)
+        cache = self.get_cache_region(tool_cache_data_dir)
         if config_file.endswith(".xml") and cache and not cache.disabled:
             tool_document = cache.get(config_file)
             if tool_document:
```

That means Galaxy would only use the cache when the cache location is listed in the tool config file, so not for non-cvmfs tools.

Additionally this would also be fine:

```diff
diff --git a/lib/galaxy/tools/cache.py b/lib/galaxy/tools/cache.py
index 8846c8e154..3704172198 100644
--- a/lib/galaxy/tools/cache.py
+++ b/lib/galaxy/tools/cache.py
@@ -49,7 +49,7 @@ class ToolDocumentCache:
             else:
                 cache_file = self.writeable_cache_file.name if self.writeable_cache_file else self.cache_file
             self._cache = SqliteDict(cache_file, flag=flag, encode=encoder, decode=decoder, autocommit=False)
-        except sqlite3.OperationalError:
+        except (sqlite3.OperationalError, RuntimeError):
             log.warning("Tool document cache unavailable")
             self._cache = None
             self.disabled = True
```