
⬆️ [email protected] #722

Merged · 42 commits · Apr 19, 2022
Conversation

@CNugteren (Contributor) commented Apr 1, 2022

What do these changes do?

This PR updates LCE to TensorFlow 2.8.0. Quite a few changes have been made, and most likely more are needed. I'll highlight some of the more controversial changes below in the form of a comment.

How Has This Been Tested?

All unit tests have been run locally. Fixes have been made as compilation, linking, or testing errors showed up. However, some tests do not succeed yet, most notably the MLIR tests. This PR is therefore still a draft.

Benchmark Results

Latency has not been verified yet.

Related issue number

N/A.

Tasks

Open tasks before being able to turn the draft PR into an actual PR:

  • Resolve the failing MLIR tests
  • Deal with the deprecated make build system, translate to CMake?

@CNugteren added the "feature" (New feature or request) label on Apr 1, 2022
Comment on lines +81 to +86
Operation* LarqDialect::materializeConstant(OpBuilder& builder, Attribute value,
Type type, Location loc) {
if (arith::ConstantOp::isBuildableWith(value, type))
return builder.create<mlir::arith::ConstantOp>(loc, type, value);
return nullptr;
}
Member

This is required to enable constant folding with the switch to arith::ConstantOp. TensorFlow does something similar; see here for more information.

Collaborator

In the tensorflow example they first try to make it into a TFL::ConstOp before they try the arith::ConstOp. I guess we don't need that here because the larq dialect is used for TF ops and not TFL ops?

Member

If I understand their code correctly, it will create a tfl.pseudo_const if the value is opaque, which isn't supported by arith::ConstantOp. I don't think this is a case we will encounter. In general, TFL::ConstOp will always be folded to arith::ConstantOp if possible, so I think it is fine to keep it like this for now.

lgeiger and others added 2 commits April 8, 2022 19:33
* Update converter to sync with upstream tensorflow

* Update larq_compute_engine/mlir/tf_to_tfl_flatbuffer.cc

Co-authored-by: Tom Bannink <[email protected]>

Co-authored-by: Tom Bannink <[email protected]>
@CNugteren (Contributor, Author)

I've removed the old make builds and replaced them with the new CMake build system. I've only done something quite basic: building the library and the example and benchmark binaries. I tested the benchmark binary locally and that seems to work, so I think we are ready for a last round of reviews now.
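
For reference, a minimal CMake setup along these lines might look like the sketch below. This is purely illustrative: the target names and source paths are placeholders, not the actual files in this PR.

```cmake
# Hypothetical minimal CMake build: the library plus example and benchmark
# binaries. All target and file names here are placeholders.
cmake_minimum_required(VERSION 3.16)
project(larq_compute_engine CXX)

add_library(larq_compute_engine STATIC
  src/core.cc)  # placeholder source list

add_executable(lce_example examples/example.cc)
target_link_libraries(lce_example PRIVATE larq_compute_engine)

add_executable(lce_benchmark benchmarks/benchmark.cc)
target_link_libraries(lce_benchmark PRIVATE larq_compute_engine)
```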

@CNugteren marked this pull request as ready for review on April 11, 2022 16:00
@lgeiger (Member) commented Apr 11, 2022

I've removed the old make builds and replaced them with the new CMake build system. I've only done something quite basic: building the library and the example and benchmark binaries. I tested the benchmark binary locally and that seems to work, so I think we are ready for a last round of reviews now.

Awesome! 🚀

Just for reference, so we don't forget. Once this is merged and we've released a new version we will also need to update build docs.

@lgeiger mentioned this pull request on Apr 12, 2022
@lgeiger (Member) commented Apr 13, 2022

I triggered a test release, let's see how this goes.

@Tombana (Collaborator) commented Apr 13, 2022

I triggered a test release, let's see how this goes.

The Windows build looks like it succeeded within 12 minutes, but actually it failed with

The target you are compiling requires Visual C++ build tools. 
Bazel couldn't find a valid Visual C++ build tools installation on your machine.

The Linux builds fail because the manylinux toolchain Bazel files can't be found; they've probably been updated to newer versions.

@lgeiger (Member) commented Apr 13, 2022

The Windows build looks like it succeeded within 12 minutes, but actually it failed with

I pinned the Windows VM version to 2019 in 5bf5748, let's see if this fixes the issue. Alternatively we could also change the BAZEL_VC path to make it work with Windows 2022, but not sure if we would run into other issues then.
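
If we do go the BAZEL_VC route at some point, it would presumably be a workflow-level environment variable along these lines. The exact VS 2022 install path below is an assumption and would need checking against the runner image:

```yaml
# Hypothetical GitHub Actions snippet: point Bazel at the VS 2022 toolchain.
# The install path is a guess for the hosted Windows 2022 image.
env:
  BAZEL_VC: 'C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC'
```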

The Linux builds fail because the manylinux toolchain Bazel files can't be found; they've probably been updated to newer versions.

I pinned the docker image to 2.8 in 3923692 which didn't seem to fix things.

It seems like the TensorFlow repo removed their toolchains. I copied the toolchain from the TF Addons repo in bdb6f20. They use the same Docker container to build the wheels as we do, so I expect this to work.

I triggered a new build to test these changes.

@CNugteren (Contributor, Author) commented Apr 15, 2022

Two builds failed:

Update: the Linux build fails with:

ERROR: /root/.cache/bazel/_bazel_root/db801a7aca49c33ba961fa7cde3e7bf7/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/BUILD:573:11: Compiling tensorflow/compiler/mlir/tensorflow/ir/tf_ops.cc failed: (Exit 4): crosstool_wrapper_driver_is_not_gcc failed: error executing command 
(...)
gcc: internal compiler error: Killed (program cc1plus)

But in the latest build it is the Python 3.8 variant of the Linux build that fails while the 3.9 one passes, so the failures seem random:
https://github.com/larq/compute-engine/runs/6036809834?check_suite_focus=true

@CNugteren (Contributor, Author)

The timeout increase did not work; it still times out after 6 hours: https://github.com/larq/compute-engine/runs/6036809795?check_suite_focus=true. Perhaps that is some global setting or a GitHub Actions maximum? In any case, this time two out of the three Windows builds passed since they finished in five and a half hours; only one timed out. This behaviour was also seen earlier without TF 2.8: it was always close to this 6-hour maximum. So I do not consider this related to this branch.

My proposal is to merge this and then investigate making the releases more stable in a separate PR (random Linux build failures and Windows build timeouts). I can open an issue for that.

@Tombana (Collaborator) commented Apr 19, 2022

I agree, let's merge it.

gcc: internal compiler error: Killed (program cc1plus)

I think I've seen this before, and in that particular case the system had run out of memory. That could be fixed by adding swap space or reducing the number of jobs that Bazel runs in parallel. But maybe this problem is different.
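
If memory pressure is indeed the cause, one option would be to cap Bazel's parallelism and assumed RAM with a .bazelrc fragment like this (illustrative values, not tuned for our runners):

```
# Illustrative .bazelrc fragment: limit parallel compile jobs and tell Bazel
# to assume only half of the host RAM is available for local actions.
build --jobs=4
build --local_ram_resources=HOST_RAM*.5
```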
