created | last updated | status | title | authors | |
---|---|---|---|---|---|
2019-04-29 |
2019-04-29 |
implemented |
Forcing non-cache-hits in the repository cache |
|
This document describes an extension of the repository build API that can be used to artifically avoid cache hits in the repository cache.
Bazel requires a view of the whole dependency graph. But not everybody organizes their source code in a single repository. Therefore, bazel has the concept of external repositories to specify additional parts of the source tree to be fetched from some other location.
Several projects, especially related ones, often depend on the same external resources. To avoid re-downloading them for every project, bazel has a cache, shared between all workspaces, of downloaded files, indexed by their hash.
It is good practice, for reasons of security and hermeticity, to specify a hash of any external file needed. Therefore, those hashes are available anyway for most downloads, making the cache work efficiently.
Upon updating external dependencies, sometimes it is forgotten to update the
hash as well, leading to cache hits not desired by the user
(#5045,
#6089).
This is especially confusing, if the download is only indirectly through
a different rule (e.g., maven_jar
) where a canonical identifier is available,
and the hash is considered only a security measure
(#5144,
#5250).
To avoid such unintended cache hits, it was requested to artifically split the cache. We propose to do it in the following way.
The repository-context functions download
and download_and_extract
are
extended by an additional optional parameter canonicial_id
. If this parameter
is set to a non-empty string, a file is taken only from cache, if the a
file for the logical pair consisting of hash and canonical_id
is already in the logical cache; such a pair enters the logical cache, if a
file with that hash is downloaded via a download
or download_and_extract
request specifying that canonical_id
.
We do not want to store the same file several times on disk. Moreover, we
want a file downloaded with a canonical_id
also be available in cache for
requests asking for a hash only. Therefore, we still store the files by
hash, as we currently do, and only add the additional information by which
canonical_id
s a file is known by. As the on-disk format uses one directory per
hashed file, we can directly add the information there. To avoid having
conflicts when updating the list, we use one file per canonical_id
. More
precisely, we add a file where the name id-
followed by the hash of the
canonical_id
; the content of the file will be the canonical_id
itself.
As the hash is assummed to be sufficiently secure, there won't be any
name clashes, also not with the payload entry called file
. Moreover, using
(essentially) the hash rather than the id avoids problems with very long
canonical_id
s and with difficult characters in the canonical_id
,
including /
.
The approach of artificially splitting the cache has several limitations users should be aware of.
-
Unintended Fragmentation: if several rules (or variants thereof used by different projects) refer to the same file via a different
canonical_id
, it will be downloaded once for every such identifier. In particular, URLs don't work well as canonical identifieres for archives distributed via different organisations. -
Bazel has no means of verifying the canonical id; it is purely declarative from bazel's point of view. So, if a an incorrect use of a rule specifies an unintended pair of
canonical_id
and URL, bazel will add whatever file downloaded from there to the logical cache. That will be available for future cache hits.
The change adds a new optional argument to ctx.download
and
ctx.download_and_extract
. If the argument is not provided, the behavior
does not change. So the proposed change is backwards compatible.