Chromium bundles its own pre-built version of Clang. This is done so that Chromium developers have access to the latest and greatest developer tools provided by Clang and LLVM (ASan, CFI, coverage, etc.). Before the compiler can be updated (a Clang roll), it has to be tested so that we can be confident that it works in the configurations that Chromium cares about.
We maintain a waterfall of builders that continuously build fresh versions of Clang and use them to build and test Chromium. "Clang gardening" is the process of monitoring that waterfall, determining if any compile or test failures are due to an upstream compiler change, filing bugs upstream, and often reverting bad changes in LLVM. This document describes some of the processes and techniques for doing that.
Some may find the sheriff-o-matic view of the waterfall easier to work with.
To keep others informed, file a bug earlier rather than later for build breaks likely caused by changes in clang or the rest of the toolchain. Make sure to set the component field to Tools > LLVM, which will include the entire Chrome toolchain (Lexan) team.
At the beginning of your gardener rotation, it may be useful to search for recent bot breaks. We prefer searching like this to having gardeners compose a status email at the end of their week.
In addition to the waterfall, make sure dry run attempts at updating clang are green. As part of the Clang release process we run upstream LLVM tests. Ideally these tests are covered by upstream LLVM bots and breakages are quickly noticed and fixed by the original author of a breaking commit, but that is sadly not always the case.
Each gardener should attempt to update the compiler by performing a Clang roll during their week, assuming the bots are green enough.
The gardener is also responsible for taking notes during the weekly Chrome toolchain (Lexan) status sync-up meeting.
[TOC]
If there are any issues with disk running out of space, file a go/bug-a-trooper bug, for example https://crbug.com/1105134.
Chromium does not always build and pass tests in all configurations that everyone cares about. Some configurations simply take too long to build (ThinLTO) or be tested (dbg) on the CQ before committing. And, some tests are flaky. So, our console is often filled with red boxes, and the boxes don't always need to be green to roll clang.
Oftentimes, if a bot is red with a test failure, it's not a bug in the compiler. To check this, the easiest and best thing to do is to try to find a corresponding builder that doesn't use ToT clang. For standard configurations, start on the waterfall that corresponds to the OS of the red bot, and search from there. If the failing bot is Google Chrome branded, go to the (Google internal) official builder list and start searching from there.
If you are feeling charitable, you can try to see when the test failure was introduced by looking at the history in the bot. One way to do this is to add ?numbuilds=200 to the builder URL to see more history. If that isn't enough history, you can manually binary search build numbers by editing the URL until you find where the regression was introduced. If it's immediately clear what CL introduced the regression (i.e. caused tests to fail reliably in the official build configuration), you can simply load the change in Gerrit and revert it, linking to the first failing build that implicates the change being reverted.
If the failure looks like a compiler bug, these are the common failures we see and what to do about them:
- compiler crash
- compiler warning change
- compiler error
- miscompile
- linker errors
This is probably the most common bug. The standard procedure is to do these things:
-
Open the gclient runhooks stdout log from the first red build. Near the top of that log you can find the range of upstream LLVM revisions. For example:
    From https://github.com/llvm/llvm-project
       f917356f9ce..292e898c16d  master     -> origin/master
-
File a crbug documenting the crash. Include the range, and any other bots displaying the same symptoms.
-
All clang crashes on the Chromium bots are automatically uploaded to Cloud Storage. On the failing build, click the "stdout" link of the "process clang crashes" step right after the red compile step. It will print something like
    processing heap_page-65b34d... compressing... uploading... done
        gs://chrome-clang-crash-reports/v1/2019/08/27/chromium.clang-ToTMac-20955-heap_page-65b34d.tgz
    removing heap_page-65b34d.sh
    removing heap_page-65b34d.cpp
Copy it to your local machine with
    gsutil.py cp gs://chrome-clang-crash-reports/v1/2019/08/27/chromium.clang-ToTMac-20955-heap_page-65b34d.tgz .
Untar it with
    tar xzf chromium.clang-ToTMac-20955-heap_page-65b34d.tgz
and change the included shell script to point to a locally-built clang. Remove the -Xclang -plugin flags. If you re-run the shell script, it should reproduce the crash.
-
Identify the revision that introduced the crash. First, look at the commit messages in the LLVM revision range to see if one modifies the code near the point of the crash. If so, try reverting it locally, rebuild, and run the reproducer to see if the crash goes away.
If that doesn't work, use git bisect. Use this as a template for the bisect run script:
    #!/bin/bash
    cd $(dirname $0)  # get into llvm build dir
    ninja -j900 clang || exit 125  # skip revisions that don't compile
    ./t-8f292b.sh || exit 1  # exit 0 if good, 1 if bad
-
File an upstream bug like http://llvm.org/PR43016. Usually the unminimized repro is too large for LLVM's bugzilla, so attach it to a (public) crbug and link to that from the LLVM bug. Then revert with a commit message like "Revert r368987, it caused PR43016."
-
If you want, make a reduced repro using CReduce. Clang contains a handy wrapper around CReduce that you can invoke like so:
    clang/utils/creduce-clang-crash.py --llvm-bin bin \
        angle_deqp_gtest-d421b0.sh angle_deqp_gtest-d421b0.cpp
Attach the reproducer to the llvm bug you filed in the previous step.
If you need to do something the wrapper doesn't support, follow the official CReduce docs for writing an interestingness test and use creduce directly.
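As a rough sketch of the first and fourth steps above, the revision range can be pulled out of a saved copy of the runhooks log and split into its endpoints for git log or git bisect. The function names and the log file name are hypothetical and not part of any Chromium tooling, and the parsing assumes the log contains git's usual fetch output for llvm-project.

```shell
# Hypothetical helper: read a gclient runhooks log on stdin and print the
# first "<old>..<new>" range that appears at or after the llvm-project
# "From" line.
extract_llvm_range() {
  awk '
    /From https:\/\/github.com\/llvm\/llvm-project/ { seen = 1 }
    seen && match($0, /[0-9a-f]+\.\.[0-9a-f]+/) {
      print substr($0, RSTART, RLENGTH)
      exit
    }'
}

# Split "<old>..<new>" into its endpoints. The old end was still good on the
# previous build; the new end is the first known-bad revision.
range_good() { printf '%s\n' "${1%%..*}"; }
range_bad()  { printf '%s\n' "${1##*..}"; }

# Example use inside an llvm-project checkout, assuming runhooks.log is a
# saved copy of the log and bisect.sh is the bisect run script:
#   range=$(extract_llvm_range < runhooks.log)
#   git log --oneline "$range"
#   git bisect start "$(range_bad "$range")" "$(range_good "$range")"
#   git bisect run path/to/build/bisect.sh
```

Restricting the match to lines after the llvm-project "From" line is a guess at avoiding ranges printed for other repositories; adjust it if the log is laid out differently.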
New Clang versions often find new bad code patterns to warn on. Chromium builds with -Werror, so improvements to warnings often turn into build failures in Chromium. Once you understand the code pattern Clang is complaining about, file a bug to either fix or silence the new warning.
If this is a completely new warning, disable it by adding -Wno-NEW-WARNING to this list of disabled warnings if llvm_force_head_revision is true. Here is an example. This will keep the ToT bots green while you decide what to do.
Sometimes, behavior changes and a pre-existing warning changes to warn on new code. In this case, fixing Chromium may be the easiest and quickest fix. If there are many sites, you may consider changing clang to put the new diagnostic into a new warning group so you can handle it as a new warning as described above.
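When deciding between fixing the sites and splitting out a new warning group, it can help to count how many diagnostics each flag produces in a ToT build log. The following is a minimal sketch, assuming each diagnostic line ends with clang's usual [-Werror,-Wflag] suffix; count_werror_flags is a hypothetical helper name.

```shell
# Hypothetical helper: read a build log on stdin and print
# "<count> <warning flag>" pairs, most frequent first.
count_werror_flags() {
  # Keep only the flag name from lines ending in [-Werror,-Wsomething].
  sed -n 's/.*\[-Werror,\(-W[^]]*\)].*/\1/p' |
    awk '{ count[$0]++ } END { for (w in count) print count[w], w }' |
    sort -rn
}

# Example use, assuming compile.log is a saved copy of the compile step log:
#   count_werror_flags < compile.log
```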
If the warning is high value, then eventually our team or other contributors will end up fixing the crbug and there is nothing more to do. If the warning seems low value, pass that feedback along to the author of the new warning upstream. It's unlikely that it should be on by default or enabled by -Wall if users don't find it valuable. If the warning is particularly noisy and can't be easily disabled without disabling other high value warnings, you should consider reverting the change upstream and asking for more discussion.
This rarely happens, but sometimes clang becomes more strict and no longer accepts code that it previously did. The standard procedure for a new warning may apply, but it's more likely that the upstream Clang change should be reverted, if the C++ code in question in Chromium looks valid.
Miscompiles tend to result in crashes, so if you see a test with the CRASHED status, this is probably what you want to do.
- Bisect object files to find the object with the code that changed. LLVM contains llvm/utils/rsp_bisect.py, which may be useful for bisecting object files using an rsp file.
- Debug it with a traditional debugger.
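The idea behind object file bisection can also be tried by hand: link with some objects taken from a build made with the bad compiler and the rest from a build made with the good compiler, then halve the suspect set depending on whether the failure reproduces. The sketch below only generates such a mixed object list. mix_objects and the out.good and out.bad directory names are hypothetical, and this is not a description of how rsp_bisect.py itself works.

```shell
# Hypothetical helper: read relative object paths on stdin and print a list
# where the first N objects come from the bad build directory and the
# remaining ones from the good build directory.
#   $1 = good build dir, $2 = bad build dir, $3 = N
mix_objects() {
  awk -v good="$1" -v bad="$2" -v n="$3" '{
    dir = (NR <= n) ? bad : good
    print dir "/" $0
  }'
}

# Example use, assuming objects.txt lists one object path per line:
#   mix_objects out.good out.bad 500 < objects.txt > mixed.rsp
# Link with @mixed.rsp and run the failing test. If it still fails, the
# culprit is among the first 500 objects; otherwise it is among the rest.
```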
ld.lld's --reproduce flag makes LLD write a tar archive of all its inputs and a file response.txt that contains the link command. This allows people to work on linker bugs without having to have a Chromium build environment.
To use ld.lld's --reproduce flag, follow these steps:
-
Locally [build Chromium with a locally-built clang](https://chromium.googlesource.com/chromium/src.git/+/main/docs/clang.md#Using-a-custom-clang-binary).
-
After reproducing the link error, build just the failing target with ninja's -v -d keeprsp flags added: ninja -C out/gn base_unittests -v -d keeprsp
-
Copy the link command that ninja prints, cd out/gn, paste it, and manually append -Wl,--reproduce,repro.tar. With lld-link, instead append /reproduce:repro.tar. (ld.lld is invoked through the clang driver, so it needs -Wl to pass the flag through to the linker. lld-link is called directly, so the flag needs no prefix.)
-
Zip up the tar file: gzip repro.tar. This will take a few minutes and produce a .tar.gz file that's 0.5-1 GB.
-
Upload the .tar.gz to Google Drive. If you're signed in with your @google address, you won't be able to make a world-shareable link to it, so upload it in a window where you're signed in with your @chromium account.
-
File an LLVM bug linking to the file. Example: http://llvm.org/PR43241
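The choice of flag in the third step can be sketched as a small helper that takes the copied link command and appends the matching spelling. add_reproduce_flag is a hypothetical name, and detecting lld-link by looking for its name in the command is an assumption about how the command is written.

```shell
# Hypothetical helper: print the given link command with the appropriate
# reproduce flag appended.
#   $1 = the link command copied from ninja's output, as one string
add_reproduce_flag() {
  case "$1" in
    # lld-link is called directly, so the flag needs no -Wl, prefix.
    *lld-link*) printf '%s /reproduce:repro.tar\n' "$1" ;;
    # ld.lld is reached through the clang driver, so pass the flag via -Wl.
    *)          printf '%s -Wl,--reproduce,repro.tar\n' "$1" ;;
  esac
}

# Example use from the build directory:
#   eval "$(add_reproduce_flag "$link_command")" && gzip repro.tar
```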
TODO: Describe object file bisection, identify obj with symbol that no longer has the section.
Sometimes, problems occur in ThinLTO builds that do not occur in non-LTO builds. These steps can be used to debug such problems.
Notes:
-
All steps assume they are run from the output directory (the same directory args.gn is in).
-
Commands have been shortened for clarity. In particular, Chromium build commands are generally long, with many parts that you just copy-paste when debugging. These have largely been omitted.
-
The commands below use "clang++", where in practice there would be some path prefix in front of this. Make sure you are invoking the right clang++. In particular, there may be one in the PATH which behaves very differently.
To get the command that is used to link base_unittests:
$ rm base_unittests
$ ninja -n -d keeprsp -v base_unittests
This will print a command line. It will also write a file called base_unittests.rsp, which contains additional parameters to be passed.
ThinLTO uses a cache to avoid compilation in some cases. This can be confusing when debugging, so make sure to remove the various cache flags like -Wl,--thinlto-cache-dir.
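One way to drop the cache flags from a copied link command is a small filter like the one below. It is a sketch that assumes the flags are passed through the clang driver in the -Wl,--thinlto-cache-... form and that no argument contains spaces; other linkers may spell their cache flags differently, and strip_thinlto_cache_flags is a hypothetical name.

```shell
# Hypothetical helper: read a link command on stdin and print it with every
# argument that starts with -Wl,--thinlto-cache- removed.
strip_thinlto_cache_flags() {
  sed 's/ -Wl,--thinlto-cache-[^ ]*//g'
}

# Example use, assuming link_cmd.txt holds the command printed by ninja:
#   strip_thinlto_cache_flags < link_cmd.txt > link_cmd.nocache.txt
```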
Expand thin archives mentioned in the command line to their individual object files.
The script tools/clang/scripts/expand_thin_archives.py can be used for this purpose.
For example:
$ ../../tools/clang/scripts/expand_thin_archives.py -p=-Wl, -- @base_unittests.rsp > base_unittests.expanded.rsp
The -p parameter here specifies the prefix for parameters to be passed to the linker. If you are invoking the linker directly (as opposed to through clang++), the prefix should be empty.
$ ../../tools/clang/scripts/expand_thin_archives.py -p='' -- @base_unittests.rsp > base_unittests.expanded.rsp
Edit the link command to use the expanded command line, and remove any mention of -Wl,--start-group and -Wl,--end-group that surround the expanded command line. For example, if the original command was:
clang++ -fuse-ld=lld -o ./base_unittests -Wl,--start-group @base_unittests.rsp -Wl,--end-group
the new command should be:
clang++ -fuse-ld=lld -o ./base_unittests @base_unittests.expanded.rsp
The reason for this is that the -start-lib and -end-lib flags that expanding the command line produces cannot be nested inside --start-group and --end-group.
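The edit described above can be sketched as a filter over the copied link command. It assumes a single response file argument of the form @name.rsp and one pair of group flags, as in the example; use_expanded_rsp is a hypothetical name.

```shell
# Hypothetical helper: read a link command on stdin, drop the first
# -Wl,--start-group and -Wl,--end-group, and point the @name.rsp argument
# at name.expanded.rsp instead.
use_expanded_rsp() {
  sed -e 's/ -Wl,--start-group//' \
      -e 's/ -Wl,--end-group//' \
      -e 's/@\([^ ]*\)\.rsp/@\1.expanded.rsp/'
}

# Example use, assuming link_cmd.txt holds the original link command:
#   use_expanded_rsp < link_cmd.txt
```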
In a ThinLTO build, what is normally the compile step that produces native object files instead produces LLVM bitcode files. A simple example would be:
$ clang++ -c -flto=thin foo.cpp -o foo.o
In a Chromium build, these files reside under obj/, and you can generate them using ninja.
For example:
$ ninja obj/base/base/lock.o
These can be fed to llvm-dis to produce textual LLVM IR:
$ llvm-dis -o - obj/base/base/lock.o | less
When using split LTO unit (-fsplit-lto-unit, which is required for some features, CFI among them), this may produce a message like:
llvm-dis: error: Expected a single module
In that case, you can use llvm-modextract:
$ llvm-modextract -n 0 -o - obj/base/base/lock.o | llvm-dis -o - | less
The ThinLTO linking process proceeds in a number of stages. The bitcode that is generated during these stages can be saved by passing -save-temps to the linker:
$ clang++ -fuse-ld=lld -Wl,-save-temps -o ./base_unittests @base_unittests.expanded.rsp
This generates files such as:
- lock.o.0.preopt.bc
- lock.o.3.import.bc
- lock.o.5.precodegen.bc
in the directory where lock.o is (obj/base/base).
These can be fed to llvm-dis to produce textual LLVM IR. They show how the code is transformed as it progresses through ThinLTO stages.
Of particular interest are:
- .3.import.bc, which shows the IR after definitions have been imported from other modules, but before optimizations. Running this through LLVM's opt tool with the right optimization level can often reproduce issues.
- .5.precodegen.bc, which shows the IR just before it is transformed to native code. Running this through LLVM's llc tool with the right optimization level can often reproduce issues.
The same -save-temps command also produces base_unittests.resolution.txt, which shows symbol resolutions. These look like:
-r=obj/base/test/run_all_base_unittests/run_all_base_unittests.o,main,plx
In this example, run_all_base_unittests.o contains a symbol named main, with flags plx.
The possible flags are:
- p: prevailing: of symbols with this name, this one has been chosen.
- l: final definition in this linkage unit.
- r: redefined by the linker.
- x: visible to regular (that is, non-LTO) objects.
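To make a long resolution file easier to skim, the flag letters can be spelled out. The sketch below assumes each resolution line has the -r=file,symbol,flags shape shown above and that neither the file name nor the symbol contains a comma; explain_resolution and the wording of its output are hypothetical.

```shell
# Hypothetical helper: read a resolution file on stdin and print one line per
# -r= entry with the flag letters expanded. Other lines are ignored.
explain_resolution() {
  awk -F, '
    /^-r=/ {
      file = substr($1, 4)   # drop the leading "-r="
      sym = $2
      flags = $3
      out = ""
      if (index(flags, "p")) out = out " prevailing"
      if (index(flags, "l")) out = out " final-definition-in-linkage-unit"
      if (index(flags, "r")) out = out " redefined-by-linker"
      if (index(flags, "x")) out = out " visible-to-regular-objects"
      print sym " in " file ":" out
    }'
}

# Example use:
#   explain_resolution < base_unittests.resolution.txt | grep '^main in '
```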
To speed up debugging, it may be helpful to limit code generation to a single module if you know the name of the module (e.g. the module name is in a crash dump).
-Wl,--thinlto-single-module=foo tells ThinLTO to only run optimizations/codegen on files matching the pattern and skip linking. This is especially helpful in combination with -Wl,-save-temps.
$ clang++ -fuse-ld=lld -Wl,--thinlto-single-module=obj/base/base/lock.o -o ./base_unittests @base_unittests.expanded.rsp
You should see this being printed:
[ThinLTO] Selecting obj/base/base/lock.o to compile
Finding what object files differ between two directories:
$ diff -u <(cd out.good && find . -name "*.o" -exec sha1sum {} \; | sort -k2) \
<(cd out.bad && find . -name "*.o" -exec sha1sum {} \; | sort -k2)
Or with cmp:
$ find good -name "*.o" -exec bash -c 'cmp -s "$0" "${0/good/bad}" || echo "$0"' {} \;