Occasional issues with sqlite tool cache #406

Closed
pcm32 opened this issue Jan 5, 2023 · 40 comments · Fixed by #438

Comments

@pcm32
Member

pcm32 commented Jan 5, 2023

Sometimes, on helm upgrades of existing setups, I see the following issue on startup with the sqlite tool cache (which is shared by containers via a shared filesystem):

sqlitedict INFO 2023-01-05 16:22:26,427 [pN:workflow_scheduler0,p:8,tN:MainThread] opening Sqlite table 'unnamed' in '/galaxy/server/database/cache/tool_cache/cache.sqlite'
sqlitedict.SqliteMultithread DEBUG 2023-01-05 16:22:26,428 [pN:workflow_scheduler0,p:8,tN:Thread-1168] received: --close--, send: --no more--
Failed to initialize Galaxy application
Traceback (most recent call last):
  File "/galaxy/server/scripts/galaxy-main", line 112, in app_loop
    galaxy_app = load_galaxy_app(
  File "/galaxy/server/scripts/galaxy-main", line 91, in load_galaxy_app
    app = UniverseApplication(global_conf=config_builder.global_conf(), attach_to_pools=attach_to_pools, **kwds)
  File "/galaxy/server/lib/galaxy/app.py", line 653, in __init__
    self.datatypes_registry.load_external_metadata_tool(self.toolbox)
  File "/galaxy/server/lib/galaxy/datatypes/registry.py", line 903, in load_external_metadata_tool
    set_meta_tool = toolbox.load_hidden_lib_tool(
  File "/galaxy/server/lib/galaxy/tool_util/toolbox/base.py", line 1106, in load_hidden_lib_tool
    return self.load_hidden_tool(tool_xml)
  File "/galaxy/server/lib/galaxy/tool_util/toolbox/base.py", line 1112, in load_hidden_tool
    tool = self.load_tool(config_file, **kwds)
  File "/galaxy/server/lib/galaxy/tool_util/toolbox/base.py", line 1062, in load_tool
    tool = self.create_tool(
  File "/galaxy/server/lib/galaxy/tools/__init__.py", line 380, in create_tool
    cache = self.get_cache_region(tool_cache_data_dir or self.app.config.tool_cache_data_dir)
  File "/galaxy/server/lib/galaxy/tools/__init__.py", line 376, in get_cache_region
    self.cache_regions[tool_cache_data_dir] = ToolDocumentCache(cache_dir=tool_cache_data_dir)
  File "/galaxy/server/lib/galaxy/tools/cache.py", line 47, in __init__
    self._get_cache(create_if_necessary=True)
  File "/galaxy/server/lib/galaxy/tools/cache.py", line 61, in _get_cache
    self._cache = SqliteDict(cache_file, flag=flag, encode=encoder, decode=decoder, autocommit=False)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlitedict.py", line 164, in __init__
    raise RuntimeError(msg)
RuntimeError: Refusing to create a new table "unnamed" in read-only DB mode

I wonder if this tool cache could be local to each container instead to improve performance? DBs in general don't like shared file systems.

@pcm32
Member Author

pcm32 commented Jan 5, 2023

Usually, downscaling all Galaxy processes to zero, moving the directory /galaxy/server/database/cache/tool_cache away, and then scaling back up tends to work, with some hiccups.

@pcm32
Member Author

pcm32 commented Jan 6, 2023

Could this be related to the following?

galaxy-dev2-init-mounts-iohsy-dn4ch                               1/4     NotReady    0             2m11s

I have seen it happen now again, and that also showed up like this.

@pcm32
Member Author

pcm32 commented Jan 6, 2023

Partly related to #391 (but not the same).

@pckroon

pckroon commented Feb 16, 2023

This workaround results in the following:

Traceback (most recent call last):
  File "/galaxy/server/scripts/galaxy-main", line 259, in <module>
    main()
  File "/galaxy/server/scripts/galaxy-main", line 255, in main
    func(args, log)
  File "/galaxy/server/scripts/galaxy-main", line 112, in app_loop
    galaxy_app = load_galaxy_app(
  File "/galaxy/server/scripts/galaxy-main", line 91, in load_galaxy_app
    app = UniverseApplication(global_conf=config_builder.global_conf(), attach_to_pools=attach_to_pools, **kwds)
  File "/galaxy/server/lib/galaxy/app.py", line 659, in __init__
    self.toolbox.persist_cache(register_postfork=True)
  File "/galaxy/server/lib/galaxy/tools/__init__.py", line 339, in persist_cache
    region.persist()
  File "/galaxy/server/lib/galaxy/tools/cache.py", line 104, in persist
    self.reopen_ro()
  File "/galaxy/server/lib/galaxy/tools/cache.py", line 72, in reopen_ro
    self._get_cache(flag="r")
  File "/galaxy/server/lib/galaxy/tools/cache.py", line 61, in _get_cache
    self._cache = SqliteDict(cache_file, flag=flag, encode=encoder, decode=decoder, autocommit=False)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlitedict.py", line 162, in __init__
    if self.tablename not in SqliteDict.get_tablenames(self.filename):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlitedict.py", line 301, in get_tablenames
    raise IOError('file %s does not exist' % (filename))
OSError: file /galaxy/server/database/cache/tool_cache/tmpcbl9ombfcache.sqlite.tmp does not exist

@pcm32
Member Author

pcm32 commented Feb 16, 2023

I have seen this sometimes as well; it goes away after downscaling, deleting it, and upscaling. Agreed, it is undesirable. However, I was making so many changes that I wasn't sure what was generating it.

@pckroon

pckroon commented Feb 17, 2023

Thanks for getting back so quickly :) (and apologies for my super curt comment)

In the end we made the issue go away by repeatedly restarting varying components in semi-random orders. It does make the whole galaxy stack feel rather fragile though.

From a more conceptual point, do you know if there is a reason for using sqlite, rather than the postgresql database galaxy has access to anyway?

@pcm32
Member Author

pcm32 commented Mar 4, 2023

Actually, always downscaling before a helm upgrade avoids this. However, it only happens on some of our deployments, so I'm not sure what the original trigger is.

@pckroon

pckroon commented Mar 4, 2023

Hmmn, interesting. I'll dig a bit deeper on my end once the galaxy deployment is no longer in active use.

@pcm32
Member Author

pcm32 commented Mar 10, 2023

...mmm... I have hit a case where downscaling still shows the failure. I had manually installed a tool from the UI before the helm upgrade, and then on restart started to have these.

@pcm32
Member Author

pcm32 commented Mar 10, 2023

If I look at the actual files when this is happening, what I see is that they are all empty:

bash-5.0# ls -ltr
total 0
-rw-r--r-- 1 101 root 0 Mar  8 08:51 tmp0nfyfheicache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 08:52 tmpan0zjzs2cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 08:54 tmp_p3_waejcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:10 tmpmrsyoa2icache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:10 tmp1g5gxvcmcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:10 tmp2uva7j7mcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:10 tmpx0adws2gcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:10 tmpy7fiybnecache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:10 tmp4er1d4c_cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:11 tmprrnhvnawcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:11 tmpir8lhcj3cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:12 tmpzb4sqxn9cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:12 tmphto8k09pcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:13 tmpu1avwt0ccache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:13 tmpimiuwx0ccache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:14 tmpl4bqfrgacache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:15 tmpkqgowhmbcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:59 tmpvqljjl8vcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:59 tmptrmenn2rcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 09:59 tmpaatyrbtbcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 10:13 tmp16e4z5pacache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 10:13 tmpj0b1dyl3cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 13:21 tmp29xwuor1cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 13:21 tmpof2ki_y3cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 13:21 tmp3rgza3xocache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  8 13:21 tmpksr3rf5hcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  9 08:47 tmp9sky5257cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  9 08:47 tmp0vn8fakwcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  9 08:47 tmpq35di1q3cache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar  9 16:04 tmprr5_c1sbcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar 10 14:05 tmpucs17s5hcache.sqlite.tmp
-rw-r--r-- 1 101 root 0 Mar 10 14:05 cache.sqlite
-rw-r--r-- 1 101 root 0 Mar 10 14:05 tmpqmztd3yvcache.sqlite.tmp

Could it be that downscaling the containers doesn't wait for them to shut down their processes correctly? Maybe we need some instructions in the deployment to do certain things when downscaling to zero?
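
Zero-byte files like the ones in the listing above could be spotted with a small script rather than by eye. This is a minimal sketch using only the Python standard library; the helper name find_empty_cache_files is hypothetical and not part of Galaxy, and the demo directory merely stands in for the tool_cache directory.

```python
import os
import tempfile


def find_empty_cache_files(cache_dir):
    """Return the sorted names of zero-byte regular files in cache_dir."""
    empty = []
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if entry.is_file() and entry.stat().st_size == 0:
                empty.append(entry.name)
    return sorted(empty)


# Demo against a throwaway directory standing in for tool_cache.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "cache.sqlite"), "w").close()  # zero bytes
    with open(os.path.join(demo_dir, "populated.sqlite"), "wb") as handle:
        handle.write(b"not empty")
    print(find_empty_cache_files(demo_dir))
```

Run inside a pod, the same function could be pointed at /galaxy/server/database/cache/tool_cache to check whether cache.sqlite itself has been emptied.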

@mvdbeek
Member

mvdbeek commented Mar 13, 2023

You can turn enable_tool_document_cache off (that is the default anyway), that option basically only exists for anvil, there's no big gain if you're not loading up a tool box from cvmfs.

@pcm32
Member Author

pcm32 commented Mar 13, 2023

Thanks @mvdbeek , this really resolves a problem for us.

pcm32 closed this as completed Mar 13, 2023

@pckroon

pckroon commented Mar 14, 2023

That sounds like a great option, I'll definitely try that!

@ksuderman
Contributor

ksuderman commented Jul 31, 2023

I am re-opening this issue as I am running into the same problem running benchmarks on AnVIL-like instances, that is, Galaxy installed with the galaxykubeman-helm chart onto a GKE instance. The ToolDocumentCache is enabled here in the galaxy-helm chart so the problem does not appear to be specific to the GalaxyKubeman chart.

In my case I am re-writing TPV config files with helm upgrade. This works most of the time, but occasionally (after 6-8 times) I get the above exception, which prevents the job, web, and workflow handler pods from starting. I have found that a better workaround is to connect to one of the running pods, delete the /galaxy/server/database/cache/tool_cache/cache.sqlite file, and then delete the stuck pods. Kubernetes should then be able to start the pods.

Given the sporadic nature of the exception, it feels like a race condition between the pods that are starting (i.e. running /galaxy/server/scripts/galaxy-main and entering the app_loop) and the pods that are trying to exit app_loop and calling app.shutdown().

But what is with all the zero length tmp files? Could we be seeing a race someplace like this?

def persist(self):
    if self.writeable_cache_file:
        self._cache.commit()
        os.rename(self.writeable_cache_file.name, self.cache_file) # <--- here
        self.reopen_ro()  # <-- or here

or the _make_writable (sp?) just above? I don't see many other places that would cause the cache file to exist but be empty.

Would it be sufficient (for now) to simply always open the cache in create mode? SqliteDict will only create the table if it doesn't exist so it looks like that should be safe.
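
The "always open in create mode" idea can be illustrated with the standard library alone. This is a sketch, not Galaxy's or SqliteDict's code: it uses sqlite3 directly, the key/value table layout is an assumption made for illustration, and open_in_create_mode is a hypothetical helper. It shows that CREATE TABLE IF NOT EXISTS turns a zero-byte file into a usable database and leaves an existing table untouched.

```python
import os
import sqlite3
import tempfile

# Assumed key/value layout for illustration; the real cache schema may differ.
CREATE_TABLE = 'CREATE TABLE IF NOT EXISTS "unnamed" (key TEXT PRIMARY KEY, value BLOB)'


def open_in_create_mode(cache_file):
    """Open cache_file read/write and create the table only if it is missing."""
    conn = sqlite3.connect(cache_file)
    conn.execute(CREATE_TABLE)
    conn.commit()
    return conn


with tempfile.TemporaryDirectory() as demo_dir:
    cache_file = os.path.join(demo_dir, "cache.sqlite")
    open(cache_file, "w").close()  # simulate a zero-byte cache file

    conn = open_in_create_mode(cache_file)  # the table is created here
    conn.execute('INSERT INTO "unnamed" VALUES (?, ?)', ("tool.xml", b"doc"))
    conn.commit()
    conn.close()

    conn = open_in_create_mode(cache_file)  # a second open keeps existing rows
    print(conn.execute('SELECT COUNT(*) FROM "unnamed"').fetchone()[0])
    conn.close()
```

Opening this way would avoid the RuntimeError, but an emptied cache then simply behaves as a cache with no entries, and on its own this does nothing to stop two processes from racing to populate the same file.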

@mvdbeek
Member

mvdbeek commented Aug 7, 2023

awesome, does that mean I can remove the cache from Galaxy as well ?

@ksuderman
Contributor

I think so. If we need its functionality in the future I like your toolbox microservice idea.

@nuwang
Member

nuwang commented Aug 7, 2023

The cache made a significant improvement to performance, especially when the cache itself was coming off cvmfs. Startup times were down to a few seconds. Now, it's in the minutes again. If we can somehow streamline this functionality, I think it would make a sizeable impact on performance and day to day administrative experience. I don't know what the toolbox microservice idea is though.

@mvdbeek
Member

mvdbeek commented Aug 7, 2023

There's a lot at play when measuring startup time; I would only attribute the fact that it takes minutes now to the cache if you have a before/after comparison.

@ksuderman
Contributor

I can run some timing tests this week, but I think for my use case of running helm upgrade the speedup should be considerable, as the cache is (supposed to be) already there; that part of the startup cost should go from however long it takes to zero. However, a failure rate of ~12% makes it unusable.

If I understood Marius's comment the toolbox microservice idea is to implement the toolbox (and cache?) as a microservice rather than having multiple processes trying to access the same Sqlite database file.

@dannon
Member

dannon commented Aug 7, 2023

If it's generally beneficial to support, can we not just put a simple (file?) lock around the tool cache I/O?
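
One way the file-lock suggestion could look is sketched below. It assumes a POSIX system (fcntl is not available on Windows) and uses an advisory flock on a sidecar lock file; exclusive_lock, create_cache_locked and the ".lock" suffix are hypothetical names, not Galaxy code, and the table layout is an assumption. Whether flock behaves reliably depends on the shared file system in use, so treat this as an illustration of the idea rather than a fix.

```python
import contextlib
import fcntl  # POSIX only
import os
import sqlite3
import tempfile


@contextlib.contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def create_cache_locked(cache_dir):
    """Create the cache table while no other lock holder can touch the file."""
    cache_file = os.path.join(cache_dir, "cache.sqlite")
    lock_path = cache_file + ".lock"  # hypothetical sidecar lock file
    with exclusive_lock(lock_path):
        conn = sqlite3.connect(cache_file)
        try:
            conn.execute(
                'CREATE TABLE IF NOT EXISTS "unnamed" (key TEXT PRIMARY KEY, value BLOB)'
            )
            conn.commit()
        finally:
            conn.close()
    return cache_file


with tempfile.TemporaryDirectory() as demo_dir:
    print(os.path.basename(create_cache_locked(demo_dir)))
```

Because the lock is advisory, it only helps if every process that creates, renames or opens the cache file takes the same lock first.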

@mvdbeek
Member

mvdbeek commented Aug 7, 2023

General, as in a project that is not anvil ? I'd say no, even booting from my instance with cold cvmfs it doesn't make a difference. Why it makes a difference on anvil, if it does ? IDK

I don't think locking will solve the access issue in #406 (comment), that's got something to do with an existing table (in RO) being read for which SELECT name FROM sqlite_master WHERE type="table" returns nothing.

We should definitely try to get away from parsing and holding all tools in memory, that will benefit all deployment scenarios. A persistent service that can serve up tool objects is kind of the extension of the cache idea, where my main conclusion was that there's no easy, portable and reliable way to have a cache that can be read by multiple hosts, processes and threads.

One of the previous implementations of the cache was a directory of zipped json files ... it had other issues, but we could have a simple service that can serve these up on demand.

@ksuderman
Contributor

I don't think locking will solve the access issue in #406 (comment), that's got something to do with an existing table (in RO) being read for which SELECT name FROM sqlite_master WHERE type="table" returns nothing.

I think it might solve this problem. My working theory is there is a race for that Sqlite database file. One process has started creating the database when another process comes along, sees the file exists so tries to open in read mode only to find the table hasn't actually been created yet.

my main conclusion was that there's no easy, portable and reliable way to have a cache that can be read by multiple hosts, processes and threads.

Agreed. And we would have to make sure we don't trade a race condition for a deadlock. Having a separate service for the tool cache is likely the easiest and most robust solution (short of doing away with it entirely).

@dannon
Member

dannon commented Aug 8, 2023

I think it might solve this problem. My working theory is there is a race for that Sqlite database file. One process has started creating the database when another process comes along, sees the file exists so tries to open in read mode only to find the table hasn't actually been created yet.

Yep, that was my understanding. I don't see a way to actually reproduce the error with an exclusive file lock in place. It shouldn't be possible.

@nuwang
Member

nuwang commented Aug 8, 2023

General, as in a project that is not anvil ? I'd say no, even booting from my instance with cold cvmfs it doesn't make a difference. Why it makes a difference on anvil, if it does ? IDK

I'm puzzled by why you are not experiencing a slow down. Could it be due to proximity to cvmfs or something? The reason is that in our observations, especially with a full toolbox, the bulk of the time was spent on fetching files from cvmfs. Especially in cold start scenarios, reading a thousand files off of cvmfs "should" be slow, since it implies 1000 sequential http requests? In contrast, when the sqlite database was distributed over cvmfs itself, it resulted in a pretty remarkable speedup. It was like swapping a horse and buggy for a bugatti - unbelievably fast.

Once the sqlite database was taken out of cvmfs, and had to be rebuilt on the fly, cold start times became slow again, but restarts were fast since the database was already populated. @afgane Can you confirm that this is your experience as well, and none of this is a memory-error on my part?

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

I think it might solve this problem. My working theory is there is a race for that Sqlite database file. One process has started creating the database when another process comes along, sees the file exists so tries to open in read mode only to find the table hasn't actually been created yet.

They are RO since they are on cvmfs, that's not it.

@ksuderman
Contributor

I think it might solve this problem. My working theory is there is a race for that Sqlite database file. One process has started creating the database when another process comes along, sees the file exists so tries to open in read mode only to find the table hasn't actually been created yet.

They are RO since they are on cvmfs, that's not it.

CVMFS is RO, but the cache itself isn't. The cache does a copy-on-write, which is where I think things go wrong. At least that os.rename() without a guard looks suspicious to me, and would explain all the zero-byte cache.sqlite.tmp files we are seeing.

https://github.com/galaxyproject/galaxy/blob/58851af74b3e79b8693f1d1d6551b87ef004f7e1/lib/galaxy/tools/cache.py#L93
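
A guard around the rename could look roughly like the sketch below. It is not the code at the link above: has_table and guarded_persist are hypothetical helpers written against the standard library, and "unnamed" is used as the table name only because it appears in the traceback. The idea is to refuse to move a temporary file over cache.sqlite unless that file is a non-empty SQLite database containing the expected table.

```python
import os
import sqlite3
import tempfile


def has_table(db_path, table="unnamed"):
    """Return True if db_path is a non-empty SQLite file containing table."""
    if not os.path.isfile(db_path) or os.path.getsize(db_path) == 0:
        return False
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
    except sqlite3.DatabaseError:
        return False  # not a readable SQLite database
    finally:
        conn.close()
    return row is not None


def guarded_persist(tmp_path, cache_path, table="unnamed"):
    """Move tmp_path over cache_path only if tmp_path holds the expected table."""
    if not has_table(tmp_path, table):
        return False
    os.replace(tmp_path, cache_path)  # atomic when both paths share a file system
    return True


with tempfile.TemporaryDirectory() as demo_dir:
    tmp_path = os.path.join(demo_dir, "tmpdemo_cache.sqlite.tmp")
    cache_path = os.path.join(demo_dir, "cache.sqlite")
    open(tmp_path, "w").close()  # zero-byte temp file, as seen in the listings
    print(guarded_persist(tmp_path, cache_path))
```

With a guard like this, an empty temporary file is left where it is and the existing cache.sqlite is not replaced by it.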

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

what did you do to cvmfs to have it writable ? it isn't writable for me. maybe turn that off ?

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

(which is not to say don't try a filelock too; if that fixes it then I'm happy)

@ksuderman
Contributor

what did you do to cvmfs to have it writable ? it isn't writable for me. maybe turn that off ?

I didn't do anything, I am just trying to figure it out. Git blame says you wrote that line 😉

From what I can figure out the cache copies the files from CVMFS to a Kubernetes persistent disk, which is why the cache is writeable. I assume the cache is writeable so tools can be added by the user/admin?

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

I didn't do anything, I am just trying to figure it out. Git blame says you wrote that line 😉

that's guarded by the file being writeable, so you should never get there.

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

I assume the cache is writeable so tools can be added by the user/admin?

oh, that seems like a problem, that should go to a different shed_tool_conf.xml file with a separate cache (or no cache)

@ksuderman
Contributor

that's guarded by the file being writeable, so you should never get there.

Yet we do. Or at least it is the next line self.reopen_ro() where the exception is being raised. The race looks like it happens between the rename and the reopen_ro.

oh, that seems like a problem, that should go to a different shed_tool_conf.xml file with a separate cache (or no cache)

That might also explain another problem I am seeing with tools going missing when relaunching a cluster. I will install a workflow with planemo, shut down the cluster (but keep the disks), launch a new cluster and attach the existing disks. This is to simulate users on Terra that pause/resume their Galaxy cluster. However, the workflow no longer runs as tools are not available and no longer appear in the tool panel. Trying to reinstall the tools with Planemo doesn't work as it thinks the tools are present. I haven't had a chance to see if disabling the tool cache solves that problem. It is also rather moot as Terra users can't install tools, but it is odd.

@afgane
Contributor

afgane commented Aug 8, 2023

It is also rather moot as Terra users can't install tools, but it is odd.

They actually can, as they have full admin access on Galaxy. I don't recall seeing (or hearing about) this issue, so I wonder if it could be new?

Once the sqlite database was taken out of cvmfs, and had to be rebuilt on the fly, cold start times became slow again, but restarts were fast since the database was already populated. @afgane Can you confirm that this is your experience as well, and none of this is a memory-error on my part?

Unfortunately I do not have a confident recollection on the topic so the best option would be to test a couple of launches with the cache on&off.

@ksuderman
Contributor

They actually can, as they have full admin access on Galaxy. I don't recall seeing (or hearing about) this issue, so I wonder if it could be new?

Whoops, you are right; I was looking in the wrong place for the Tool Install page... I only started re-using the disks for the latest benchmarking runs so I didn't notice the problem until now. Maybe because I am installing tools via the API (planemo) rather than the UI?

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

Or at least it is the next line self.reopen_ro() where the exception is being raised.

I don't follow what next line you mean ?

@mvdbeek
Member

mvdbeek commented Aug 8, 2023

Also if you can post again the stack we're discussing if it's not the one in the opening post?

@ksuderman
Contributor

Or at least it is the next line self.reopen_ro() where the exception is being raised.

Also if you can post again the stack we're discussing if it's not the one in the opening post?

It is the same stack trace, but I think I have not been very good at describing what I think is going on. In particular, I don't think the problem code is where the exception is being raised.

The exception is raised in _get_cache when the sqlite database file exists but an expected table doesn't exist. Since the database file exists, the database is being opened in RO mode and SqliteDict refuses to create the missing table. In my case this occurs after 8-10 helm upgrade commands, which means that for 8-10 iterations that call sequence worked; the database file existed and the table was present. So something, somewhere, replaced the database file with an empty file tripping up _get_cache.
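
The "file exists but the table is missing" state is easy to reproduce outside Galaxy, because SQLite treats a zero-byte file as a valid, empty database. The sketch below uses only the standard library; table_names is a hypothetical helper that runs a table-listing query like the one quoted earlier in this thread, and it is not SqliteDict's own code.

```python
import os
import sqlite3
import tempfile


def table_names(db_path):
    """List the tables SQLite reports for the database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


with tempfile.TemporaryDirectory() as demo_dir:
    empty_db = os.path.join(demo_dir, "cache.sqlite")
    open(empty_db, "w").close()  # a zero-byte file, like the emptied cache
    print(os.path.isfile(empty_db), table_names(empty_db))
```

The file passes an existence check, yet no tables are reported, which matches the situation where a read-only open finds nothing to read and refuses to create the table.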

I don't follow what next line you mean ?

Sorry, I meant the line after the one I linked to above.
https://github.com/galaxyproject/galaxy/blob/58851af74b3e79b8693f1d1d6551b87ef004f7e1/lib/galaxy/tools/cache.py#L94

Which is in the persist method.

def persist(self):
    if self.writeable_cache_file:
        self._cache.commit()
        os.rename(self.writeable_cache_file.name, self.cache_file) # <--- race here
        self.reopen_ro()  # <-- and/or here

This is where I think the problem really happens, as there are only a few places in the code that actually manipulate/remove the database file. There may be others, such as the make_writable method, which is just above the persist method, but I can only find one or two places that actually remove the file. My thinking is that it must be one of these places that is the true cause of the exception.

Since I am running a helm upgrade command Kubernetes is starting new web, job, and workflow pods before shutting down the existing pods. So in the galaxy-main startup script it is likely that we have some pods entering the app_loop while other pods are still exiting the loop and calling galaxy_app.shutdown().

At least that is my theory.

@mvdbeek
Member

mvdbeek commented Aug 9, 2023

Right, but then that comes back to the cache being RO, so your persist is a NO-OP. I don't think the cvmfs mount should be writable, it's not necessary for user installs (those should go to a different shed_tool_conf.xml file).

@ksuderman
Contributor

The CVMFS mount is RO, but the cache is RW.

suderman$ kubectl exec -itn galaxy galaxy-galaxy-job-0-7c987c7b49-fs5dr -- ls -lh /galaxy/server/database/cache/tool_cache
Defaulted container "galaxy-job-0" out of: galaxy-job-0, galaxy-wait-db (init)
total 296K
-rw-r--r-- 1 galaxy nogroup 240K Aug  9 17:06 cache.sqlite
-rw-r--r-- 1 galaxy nogroup  12K Aug  9 17:06 tmplf2ziimacache.sqlite.tmp
-rw-r--r-- 1 galaxy nogroup  13K Aug  9 17:06 tmplf2ziimacache.sqlite.tmp-journal
-rw-r--r-- 1 galaxy nogroup    0 Aug  9 17:06 tmpst3rw4t9cache.sqlite.tmp
-rw-r--r-- 1 galaxy nogroup  12K Aug  9 17:06 tmpuwnyp_0vcache.sqlite.tmp
-rw-r--r-- 1 galaxy nogroup  13K Aug  9 17:06 tmpuwnyp_0vcache.sqlite.tmp-journal

The only place I can see that the cache.sqlite.tmp files are being created is in the make_writable method. Also, the cache is written to at startup when the hidden set_metadata_tool is created.

Unless you have another theory that can explain 1) why the cache.sqlite will periodically become empty causing the above exception, and 2) where those cache.sqlite.tmp files are coming from.

@mvdbeek
Member

mvdbeek commented Aug 10, 2023

You should have <toolbox tool_path="/cvmfs/main.galaxyproject.org/shed_tools" tool_cache_data_dir="/cvmfs/main.galaxyproject.org/var/tool_cache"> in the shed_tool_conf.xml file on cvmfs. We don't have a setting that would let you turn off the cache just for local tools, but can you try this?

diff --git a/lib/galaxy/tools/__init__.py b/lib/galaxy/tools/__init__.py
index 2c76989717..b5ebed5f98 100644
--- a/lib/galaxy/tools/__init__.py
+++ b/lib/galaxy/tools/__init__.py
@@ -490,13 +490,13 @@ class ToolBox(AbstractToolBox):
         return self._tools_by_id
 
     def get_cache_region(self, tool_cache_data_dir):
-        if self.app.config.enable_tool_document_cache:
+        if self.app.config.enable_tool_document_cache and tool_cache_data_dir:
             if tool_cache_data_dir not in self.cache_regions:
                 self.cache_regions[tool_cache_data_dir] = ToolDocumentCache(cache_dir=tool_cache_data_dir)
             return self.cache_regions[tool_cache_data_dir]
 
     def create_tool(self, config_file, tool_cache_data_dir=None, **kwds):
-        cache = self.get_cache_region(tool_cache_data_dir or self.app.config.tool_cache_data_dir)
+        cache = self.get_cache_region(tool_cache_data_dir)
         if config_file.endswith(".xml") and cache and not cache.disabled:
             tool_document = cache.get(config_file)
             if tool_document:

that means Galaxy would only use the cache when the cache location is listed in the tool config file, so not for non-cvmfs tools.

Additionally this would also be fine:

diff --git a/lib/galaxy/tools/cache.py b/lib/galaxy/tools/cache.py
index 8846c8e154..3704172198 100644
--- a/lib/galaxy/tools/cache.py
+++ b/lib/galaxy/tools/cache.py
@@ -49,7 +49,7 @@ class ToolDocumentCache:
             else:
                 cache_file = self.writeable_cache_file.name if self.writeable_cache_file else self.cache_file
                 self._cache = SqliteDict(cache_file, flag=flag, encode=encoder, decode=decoder, autocommit=False)
-        except sqlite3.OperationalError:
+        except (sqlite3.OperationalError, RuntimeError):
             log.warning("Tool document cache unavailable")
             self._cache = None
             self.disabled = True
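
The second patch amounts to a "disable the cache and carry on" fallback. The pattern can be sketched in a self-contained way as follows; FallbackCache and the opener callable are hypothetical stand-ins for ToolDocumentCache and the SqliteDict call, so this illustrates the behaviour rather than reproducing Galaxy's code.

```python
import logging
import sqlite3

log = logging.getLogger(__name__)


class FallbackCache:
    """Hypothetical cache wrapper that disables itself if it cannot be opened."""

    def __init__(self, opener):
        self.disabled = False
        try:
            # opener stands in for the call that opens the SQLite-backed store.
            self._cache = opener()
        except (sqlite3.OperationalError, RuntimeError):
            log.warning("Tool document cache unavailable")
            self._cache = None
            self.disabled = True

    def get(self, key):
        """Return the cached value, or None when the cache is disabled or misses."""
        if self.disabled:
            return None
        return self._cache.get(key)


def broken_opener():
    # Simulates the failure from the traceback in the opening post.
    raise RuntimeError('Refusing to create a new table "unnamed" in read-only DB mode')


cache = FallbackCache(broken_opener)
print(cache.disabled, cache.get("tool.xml"))
```

With this approach startup continues with the cache disabled and the tool is parsed from its source instead, so the race costs speed but no longer stops the application from starting.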
