AArch64 Linux test failures on LLVM 12 and newer #597
Comments
Not sure how useful this is, but here's some quick tests (using 23559ea), macOS on M1. Exporting both
|
Interesting, thanks. I guess the issue identified in the title here is really a Linux problem, so that means the issue on Apple M1 is really somewhere else. Thanks for taking the time to test this. |
Here's some more (hopefully useful) information gleaned from running each failing test individually in
54321
atomicrmw.t:69: terra atomic_add_and_return(x : &int32,y : int32) : int32
atomicrmw.t:70: return atomicrmw("add", x, y, { syncscope = "", ordering = "acq_rel", align = native, isvolatile = false })
atomicrmw.t:69: end
definition {&int32,int32} -> int32
define dso_local i32 @"$atomic_add_and_return"(i32* %0, i32 %1) {
entry:
%2 = atomicrmw add i32* %0, i32 %1 acq_rel, align 4
ret i32 %2
}
assembly for function at address 0x10978c000
0x10978c000(+0): ldaxr w8, [x0]
0x10978c004(+4): add w9, w8, w1
0x10978c008(+8): stlxr w10, w9, [x0]
0x10978c00c(+12): cbnz w10, #-12
0x10978c010(+16): mov x0, x8
0x10978c014(+20): ret
22
atomicrmw.t:91: terra atomic_fadd(x : &double,y : double) : {}
atomicrmw.t:92: atomicrmw("fadd", x, y, { syncscope = "", ordering = "monotonic", align = native, isvolatile = false })
atomicrmw.t:91: end
definition {&double,double} -> {}
define dso_local void @"$atomic_fadd"(double* %0, double %1) {
entry:
%2 = atomicrmw fadd double* %0, double %1 monotonic, align 8
ret void
}
assembly for function at address 0x109794000
0x109794000(+0): ldr d1, [x0]
0x109794004(+4): b #16
0x109794008(+8): fmov d1, x10
0x10979400c(+12): cmp x10, x9
0x109794010(+16): b.eq #40
0x109794014(+20): fadd d2, d1, d0
0x109794018(+24): fmov x8, d2
0x10979401c(+28): fmov x9, d1
0x109794020(+32): ldaxr x10, [x0]
0x109794024(+36): cmp x10, x9
0x109794028(+40): b.ne #-32
0x10979402c(+44): stlxr wzr, x8, [x0]
0x109794030(+48): cbnz wzr, #-16
0x109794034(+52): b #-44
0x109794038(+56): ret
21
atomicrmw.t:69: terra atomic_add_and_return(x : &int32,y : int32) : int32
atomicrmw.t:70: return atomicrmw("add", x, y, { syncscope = "", ordering = "acq_rel", align = native, isvolatile = false })
atomicrmw.t:69: end
definition {&int32,int32} -> int32
define dso_local i32 @"$atomic_add_and_return"(i32* %0, i32 %1) {
entry:
%2 = atomicrmw add i32* %0, i32 %1 acq_rel, align 4
ret i32 %2
}
assembly for function at address 0x10978c000
0x10978c000(+0): ldaxr w8, [x0]
0x10978c004(+4): add w9, w8, w1
0x10978c008(+8): stlxr w10, w9, [x0]
0x10978c00c(+12): cbnz w10, #-12
0x10978c010(+16): mov x0, x8
0x10978c014(+20): ret
LLVM ERROR: Cannot select: intrinsic %llvm.aarch64.stlxr
Process 49454 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x00000001a314ad98 libsystem_kernel.dylib`__pthread_kill + 8
Process 49620 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3131676e69727450)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
Process 49681 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x3131676e69727450)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
Expect this ouput:
Process 49745 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x697274530b008c00)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
Expect this ouput:
Process 61876 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
running test for int8[0]
running test for int8[1]
running test for int8[2]
running test for int8[3]
running test for int8[4]
running test for int8[5]
running test for int8[6]
running test for int8[7]
running test for int8[8]
running test for int8[9]
cconv_array.t:53: NYI: cannot call this C function (yet)
stack traceback:
[C]: in function 'caller'
cconv_array.t:53: in function 'run_test_case'
cconv_array.t:64: in main chunk
Process 49838 exited with status = 1 (0x00000001)
metatype init of Animal
allocating slot 1 to method speak in type Animal
stub speak in Animal
metatype init of Dog
importing speak at slot 1 in type Dog
override slot 1 with method speak in type Dog
stub speak in Dog
metatype init of Cat
importing speak at slot 1 in type Cat
override slot 1 with method speak in type Cat
stub speak in Cat
meow! 1876945344
woof! 1876945344
metatype init of P
allocating slot 1 to method add in type P
stub add in P
metatype init of C$1
importing add at slot 1 in type C$1
allocating slot 2 to method sub in type C$1
stub add in C$1
stub sub in C$1
defining interface Add for type P
defining interface Add for type C$1
defining interface Sub for type C$1
interface xA�
Process 49941 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
metatype init of Leaf
allocating slot 1 to method print in type Leaf
stub print in Leaf
metatype init of Node
importing print at slot 1 in type Node
override slot 1 with method print in type Node
stub print in Node
defining interface Prints for type Leaf
defining interface Prints for type Node
Process 50072 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xd50)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
fakeasm.t:32: assertion failed!
stack traceback:
[C]: in function 'assert'
fakeasm.t:32: in main chunk
Process 50169 exited with status = 1 (0x00000001)
Adding a
Process 53567 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8000000060)
frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
Assertion failed: (Ty && "Invalid GetElementPtrInst indices for type!"), function checkGEPType, file Instructions.h, line 921.
Process 53786 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
frame #4: 0x0000000100038d1c terra`llvm::checkGEPType(Ty=0x0000000000000000) at Instructions.h:921:3
918 // message on bad indexes for a gep instruction.
919 //
920 inline Type *checkGEPType(Type *Ty) {
-> 921 assert(Ty && "Invalid GetElementPtrInst indices for type!");
922 return Ty;
923 }
924
Target 0: (terra) stopped. |
Based on the others, it looks like a lot of them are getting caught in This is a shot in the dark, but what happens when you change Line 834 in 0cf6be6
That should completely shut off optimizations in the JIT.
For these tests, I think what it will take to make progress is to minimize each test: i.e., delete lines from the test until the failure behavior changes. Once we've got the smallest version of each test that fails with the same behavior as the original, we can (hopefully) see what specifically it is about that test that is causing trouble, and then form a hypothesis. |
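For anyone who wants to automate the minimization described above, here is a rough sketch of the greedy one-line-at-a-time version. It is only an illustration under assumptions: `minimize_lines` and `toy_still_fails` are made-up names, and the predicate here is a toy stand-in. A real predicate would write the kept lines to a file, run terra on it, and compare the failure with the original one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Caller-supplied predicate: returns true if the candidate test (the lines
 * whose keep[] flag is set) still fails the same way as the original. */
typedef bool (*still_fails_fn)(const char *const *lines, const bool *keep, size_t n);

/* Greedy minimization: try deleting each line in turn and keep the deletion
 * only if the failure is unchanged. Passes repeat until nothing more can be
 * removed. Returns the number of lines kept. */
size_t minimize_lines(const char *const *lines, bool *keep, size_t n,
                      still_fails_fn still_fails) {
    assert(lines != NULL && keep != NULL && still_fails != NULL);
    for (size_t i = 0; i < n; i++) keep[i] = true;

    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            if (!keep[i]) continue;
            keep[i] = false;             /* tentatively delete line i */
            if (still_fails(lines, keep, n)) {
                changed = true;          /* line was irrelevant, leave it out */
            } else {
                keep[i] = true;          /* line is needed, restore it */
            }
        }
    }

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) kept += keep[i] ? 1 : 0;
    return kept;
}

/* Toy stand-in predicate (hypothetical): the "failure" persists as long as
 * every line containing the word "needed" is still present. */
bool toy_still_fails(const char *const *lines, const bool *keep, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (strstr(lines[i], "needed") != NULL && !keep[i]) return false;
    }
    return true;
}
```

One-line-at-a-time deletion is slow for large tests but simple, and the result is minimal in the sense that no single remaining line can be dropped without changing the failure.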
Replacing
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
* frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
frame #1: 0x00000001a30436a0 libsystem_c.dylib`__vfprintf + 4544
frame #2: 0x00000001a30531d8 libsystem_c.dylib`vfprintf_l + 68
frame #3: 0x00000001a3070f38 libsystem_c.dylib`printf + 80
frame #4: 0x00000001095e80f8
(lldb) image lookup -va 0x00000001095e80f8
(lldb)
Looks like the code at frame 4 might be generated from Terra's JIT. Setting a watchpoint on the address shows it's at least written to by
(lldb) w s e 0x00000001095e80f8
Watchpoint created: Watchpoint 1: addr = 0x1095e80f8 size = 8 state = enabled type = w
(lldb) c
Process 62850 resuming
...
Process 62850 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=258, subcode=0x1095e80f0)
* frame #0: 0x00000001a3197184 libsystem_platform.dylib`_platform_memmove + 180
frame #1: 0x0000000101491784 terra`llvm::RuntimeDyldImpl::emitSection(llvm::object::ObjectFile const&, llvm::object::SectionRef const&, bool) + 756
frame #2: 0x0000000101490a08 terra`llvm::RuntimeDyldImpl::findOrEmitSection(llvm::object::ObjectFile const&, llvm::object::SectionRef const&, bool, std::__1::map<llvm::object::SectionRef, unsigned int, std::__1::less<llvm::object::SectionRef>, std::__1::allocator<std::__1::pair<llvm::object::SectionRef const, unsigned int> > >&) + 120
frame #3: 0x000000010148f704 terra`llvm::RuntimeDyldImpl::loadObjectImpl(llvm::object::ObjectFile const&) + 1952
frame #4: 0x00000001014a4570 terra`llvm::RuntimeDyldMachO::loadObject(llvm::object::ObjectFile const&) + 52
frame #5: 0x0000000101493824 terra`llvm::RuntimeDyld::loadObject(llvm::object::ObjectFile const&) + 800
frame #6: 0x000000010141e43c terra`llvm::MCJIT::generateCodeForModule(llvm::Module*) + 208
frame #7: 0x000000010141f128 terra`llvm::MCJIT::findSymbol(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 544
frame #8: 0x000000010141ed98 terra`llvm::MCJIT::getSymbolAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) + 124
frame #9: 0x000000010141f3ac terra`llvm::MCJIT::getGlobalValueAddress(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 52
frame #10: 0x0000000100050c14 terra`JITGlobalValue(CU=0x0000600002c04100, gv=0x0000600002c10d88) at tcompiler.cpp:3812:24
frame #11: 0x0000000100009574 terra`terra_jit(L=0x000000010a800380) at tcompiler.cpp:3820:17
frame #12: 0x0000000103070388 terra`lj_BC_FUNCC + 44
frame #13: 0x0000000103017ddc terra`lua_pcall(L=<unavailable>, nargs=<unavailable>, nresults=<unavailable>, errfunc=<unavailable>) at lj_api.c:1145:12 [opt]
frame #14: 0x00000001000052dc terra`docall(L=0x000000010a800380, narg=0, clear=0) at main.cpp:339:14
frame #15: 0x0000000100004e84 terra`main(argc=2, argv=0x000000016fdfee90) at main.cpp:119:13
frame #16: 0x000000010936108c dyld`start + 520
(lldb) w delete 1
1 watchpoints deleted.
(lldb) c
Process 62850 resuming
...
Expect this ouput:
Process 62850 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfff9000000000000)
* frame #0: 0x00000001a3194864 libsystem_platform.dylib`_platform_strlen + 4
frame #1: 0x00000001a30436a0 libsystem_c.dylib`__vfprintf + 4544
frame #2: 0x00000001a30531d8 libsystem_c.dylib`vfprintf_l + 68
frame #3: 0x00000001a3070f38 libsystem_c.dylib`printf + 80
frame #4: 0x00000001095e80f8 |
I think Unfortunately, debug info does not seem to be functioning on ARM. (You can try with So I think the next best thing we can do is "hardhat" debugging, which in this case probably is going to involve |
Not sure if relevant, but e.g. in
C.printf("Config: 0x%0.16lx\n", [long](&(config)))
C.printf("xBCLeftInflowProfile: 0x%0.16lx\n", [long](&(config.BC.xBCLeftInflowProfile.u.File.FileDir)))
C.printf("xBCLeftHeat: 0x%0.16lx\n", [long](&(config.BC.xBCLeftHeat.u.File.FileDir)))
Gives the following output:
I'm far from even remotely familiar with Terra and its syntax, but is this expected? Shouldn't the |
Is a Better to use
|
Using |
Ok, well there is definitely something wrong because those pointers should be different addresses. E.g. here's output on x86_64:
Maybe try this test? This should show us the LLVM IR before and after optimizations, plus the assembly, so maybe we'll spot something in here.
|
Output from the above:
define dso_local void @"$main"() {
entry:
%config = alloca %struct.Config, align 8
%0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string", i64 0, i64 0), %struct.Config* nonnull %config)
%1 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 2, i32 1, i32 0, i32 1
%2 = bitcast %struct.anon.8* %1 to [256 x i8]*
%3 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.1", i64 0, i64 0), [256 x i8]* nonnull %2)
%4 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0
%5 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.2", i64 0, i64 0), [256 x i8]* nonnull %4)
ret void
}
; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"
%struct.Config = type { %struct.anon.1 }
%struct.anon.1 = type { %struct.anon.2, %struct.anon.4, %struct.anon.4 }
%struct.anon.2 = type { i32, %union.anon }
%union.anon = type { %struct.anon.3 }
%struct.anon.3 = type { [256 x i8] }
%struct.anon.4 = type { i32, %union.anon.5 }
%union.anon.5 = type { %struct.anon.7 }
%struct.anon.7 = type { double, %struct.anon.8, double, %struct.anon.8, %struct.anon.8 }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.6 = type { double, [256 x i8] }
@"$string" = private unnamed_addr constant [32 x i8] c"Config: 0x%0.16p\0A\00", align 1
@"$string.1" = private unnamed_addr constant [32 x i8] c"xBCLeftInflowProfile: 0x%0.16p\0A\00", align 1
@"$string.2" = private unnamed_addr constant [32 x i8] c"xBCLeftHeat: 0x%0.16p\0A\00", align 1
define dso_local void @main() {
entry:
%config = alloca %struct.Config, align 8
%0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
%1 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
%2 = getelementptr %struct.anon.1, %struct.anon.1* %1, i32 0, i32 2
%3 = getelementptr %struct.anon.4, %struct.anon.4* %2, i32 0, i32 1
%4 = getelementptr %union.anon.5, %union.anon.5* %3, i32 0, i32 0
%5 = bitcast %struct.anon.7* %4 to %struct.anon.6*
%6 = getelementptr %struct.anon.6, %struct.anon.6* %5, i32 0, i32 1
%7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
%8 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
%9 = getelementptr %struct.anon.1, %struct.anon.1* %8, i32 0, i32 0
%10 = getelementptr %struct.anon.2, %struct.anon.2* %9, i32 0, i32 1
%11 = getelementptr %union.anon, %union.anon* %10, i32 0, i32 0
%12 = getelementptr %struct.anon.3, %struct.anon.3* %11, i32 0, i32 0
%13 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([32 x i8], [32 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
ret void
}
declare dso_local i32 @printf(i8*, ...)
definition {} -> {}
assembly for function at address 0x10d674000
0x10d674000(+0): stp x28, x27, [sp, #-48]!
0x10d674004(+4): stp x20, x19, [sp, #16]
0x10d674008(+8): stp x29, x30, [sp, #32]
0x10d67400c(+12): sub sp, sp, #848
0x10d674010(+16): adrp x0, #0
0x10d674014(+20): ldr x0, [x0, #120]
0x10d674018(+24): adrp x19, #0
0x10d67401c(+28): ldr x19, [x19, #112]
0x10d674020(+32): add x20, sp, #8
0x10d674024(+36): add x1, sp, #8
0x10d674028(+40): blr x19
0x10d67402c(+44): add x1, x20, #568
0x10d674030(+48): adrp x0, #0
0x10d674034(+52): ldr x0, [x0, #104]
0x10d674038(+56): blr x19
0x10d67403c(+60): orr x1, x20, #0x4
0x10d674040(+64): adrp x0, #0
0x10d674044(+68): ldr x0, [x0, #96]
0x10d674048(+72): blr x19
0x10d67404c(+76): add sp, sp, #848
0x10d674050(+80): ldp x29, x30, [sp, #32]
0x10d674054(+84): ldp x20, x19, [sp, #16]
0x10d674058(+88): ldp x28, x27, [sp], #48
0x10d67405c(+92): ret
|
So running the test from #597 (comment) gives:
Config: 0x0x0000000000000000
xBCLeftInflowProfile: 0x0x0000000000000000
xBCLeftHeat: 0x0x0000000000000000
Which is strange. Even stranger, re-adding the two
Config: 0x0x0000000000007ffc
xBCLeftInflowProfile: 0x0x0000000000007ffc
xBCLeftHeat: 0x0x0000000000007ffc |
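A side note on the doubled 0x0x prefix in the output above: the format string writes a literal "0x" and then the %p conversion adds its own prefix on this platform. The C standard also leaves the 0 flag and a precision undefined for %p, so "%0.16p" is not portable. A sketch of one way to get a fixed-width address, assuming a 64-bit target (the helper name `format_address` is made up for this example):

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats an address as "0x" followed by 16 zero-padded hex digits.
 * Returns what snprintf returns: the number of characters that would be
 * written (excluding the terminator), or a negative value on error. */
int format_address(char *buf, size_t size, const void *p) {
    assert(buf != NULL && size > 0);
    /* uintptr_t plus PRIxPTR gives well-defined flags and width, unlike %p. */
    return snprintf(buf, size, "0x%016" PRIxPTR, (uintptr_t)p);
}
```

This does not change the underlying bug, since the argument is still passed through a variadic call, but it removes one source of confusion when reading the output.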
Removing the call to |
Could this somehow be an issue with vararg behaviour on aarch64 macOS being different to aarch64 Linux/general ARMv8 or aarch64 ABI? https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms |
Hm, maybe. I'd expect LLVM to handle most differences in the C ABI, at least when calling functions with scalar arguments (including, I believe, variadic functions). But let's test that! We can run Clang to see what code it generates:
#include "stdbool.h"
#include "stdint.h"
#include "stdlib.h"
#include "stdio.h"
#include "string.h"
typedef int TempProfile;
#define TempProfile_File 0
typedef int InflowProfile;
#define InflowProfile_File 0
#define InflowProfile_SuctionAndBlowing 1
struct Config {
struct {
struct {
int32_t type;
union {
struct {
int8_t FileDir[256];
} File;
} u;
} xBCLeftHeat;
struct {
int32_t type;
union {
struct {
double addedVelocity;
int8_t FileDir[256];
} File;
struct {
double sigma;
struct {
uint32_t length;
double values[10];
} beta;
double Zw;
struct {
uint32_t length;
double values[10];
} A;
struct {
uint32_t length;
double values[10];
} omega;
} SuctionAndBlowing;
} u;
} yBCLeftInflowProfile;
struct {
int32_t type;
union {
struct {
double addedVelocity;
int8_t FileDir[256];
} File;
struct {
double sigma;
struct {
uint32_t length;
double values[10];
} beta;
double Zw;
struct {
uint32_t length;
double values[10];
} A;
struct {
uint32_t length;
double values[10];
} omega;
} SuctionAndBlowing;
} u;
} xBCLeftInflowProfile;
} BC;
};
int main() {
struct Config config;
printf("Config: %p\n", &(config));
printf("xBCLeftInflowProfile: %p\n", &(config.BC.xBCLeftInflowProfile.u.File.FileDir));
printf("xBCLeftHeat: %p\n", &(config.BC.xBCLeftHeat.u.File.FileDir));
return 0;
}
Then run:
This should give you the unoptimized LLVM IR, which we can compare to what Terra produces. You can also run the program to sanity-check its behavior: (On x86:)
For best results, it's important to use the same Clang that you used to build Terra. Note in the C program I also used
We could potentially add back more of
Another possible explanation is that we're doing something bad with the JIT, but if so I don't see the mechanism for how that would be going wrong so far. |
; ModuleID = 'bug372d.c'
source_filename = "bug372d.c"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-macosx12.0.0"
%struct.Config = type { %struct.anon }
%struct.anon = type { %struct.anon.0, %struct.anon.2, %struct.anon.9 }
%struct.anon.0 = type { i32, %union.anon }
%union.anon = type { %struct.anon.1 }
%struct.anon.1 = type { [256 x i8] }
%struct.anon.2 = type { i32, %union.anon.3 }
%union.anon.3 = type { %struct.anon.5 }
%struct.anon.5 = type { double, %struct.anon.6, double, %struct.anon.7, %struct.anon.8 }
%struct.anon.6 = type { i32, [10 x double] }
%struct.anon.7 = type { i32, [10 x double] }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.9 = type { i32, %union.anon.10 }
%union.anon.10 = type { %struct.anon.12 }
%struct.anon.12 = type { double, %struct.anon.13, double, %struct.anon.14, %struct.anon.15 }
%struct.anon.13 = type { i32, [10 x double] }
%struct.anon.14 = type { i32, [10 x double] }
%struct.anon.15 = type { i32, [10 x double] }
%struct.anon.11 = type { double, [256 x i8] }
@.str = private unnamed_addr constant [26 x i8] c"Config: %p\0A\00", align 1
@.str.1 = private unnamed_addr constant [26 x i8] c"xBCLeftInflowProfile: %p\0A\00", align 1
@.str.2 = private unnamed_addr constant [26 x i8] c"xBCLeftHeat: %p\0A\00", align 1
; Function Attrs: noinline nounwind optnone ssp uwtable
define i32 @main() #0 {
%1 = alloca i32, align 4
%2 = alloca %struct.Config, align 8
store i32 0, i32* %1, align 4
%3 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str, i64 0, i64 0), %struct.Config* %2)
%4 = getelementptr inbounds %struct.Config, %struct.Config* %2, i32 0, i32 0
%5 = getelementptr inbounds %struct.anon, %struct.anon* %4, i32 0, i32 2
%6 = getelementptr inbounds %struct.anon.9, %struct.anon.9* %5, i32 0, i32 1
%7 = bitcast %union.anon.10* %6 to %struct.anon.11*
%8 = getelementptr inbounds %struct.anon.11, %struct.anon.11* %7, i32 0, i32 1
%9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str.1, i64 0, i64 0), [256 x i8]* %8)
%10 = getelementptr inbounds %struct.Config, %struct.Config* %2, i32 0, i32 0
%11 = getelementptr inbounds %struct.anon, %struct.anon* %10, i32 0, i32 0
%12 = getelementptr inbounds %struct.anon.0, %struct.anon.0* %11, i32 0, i32 1
%13 = bitcast %union.anon* %12 to %struct.anon.1*
%14 = getelementptr inbounds %struct.anon.1, %struct.anon.1* %13, i32 0, i32 0
%15 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str.2, i64 0, i64 0), [256 x i8]* %14)
ret i32 0
}
declare i32 @printf(i8*, ...) #1
attributes #0 = { noinline nounwind optnone ssp uwtable "frame-pointer"="non-leaf" "min-legal-vector-width"="0" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-m1" "target-features"="+aes,+crc,+crypto,+dotprod,+fp-armv8,+fp16fml,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+sha3,+sm4,+v8.5a,+zcm,+zcz" }
attributes #1 = { "frame-pointer"="non-leaf" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-m1" "target-features"="+aes,+crc,+crypto,+dotprod,+fp-armv8,+fp16fml,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+sha3,+sm4,+v8.5a,+zcm,+zcz" }
!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7, !8}
!llvm.ident = !{!9}
!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 12, i32 3]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 1, !"branch-target-enforcement", i32 0}
!3 = !{i32 1, !"sign-return-address", i32 0}
!4 = !{i32 1, !"sign-return-address-all", i32 0}
!5 = !{i32 1, !"sign-return-address-with-bkey", i32 0}
!6 = !{i32 7, !"PIC Level", i32 2}
!7 = !{i32 7, !"uwtable", i32 1}
!8 = !{i32 7, !"frame-pointer", i32 1}
!9 = !{!"Apple clang version 13.1.6 (clang-1316.0.21.2.5)"}
$ clang bug372d.c -o bug372 && ./bug372
Config: 0x16daf2bc0
xBCLeftInflowProfile: 0x16daf2df8
xBCLeftHeat: 0x16daf2bc4
Expect this ouput:
Config: 0xfff9000000000000
xBCLeftInflowProfile: 0xfff9000000000000
xBCLeftHeat: 0xfff9000000000000
zsh: segmentation fault ../build/bin/terra bug372d.t |
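For reference, the addresses printed by the C build above differ from the base of `config` by 0x238 (568) and 4, which matches the `add x1, x20, #568` and `orr x1, x20, #0x4` instructions in the Terra-generated assembly. So the address computation looks right and only the delivery of the argument is off. A small sketch that checks those offsets at the C level, assuming an LP64 target where double is 8-byte aligned (the helper type names `Vec10`, `HeatBC` and `InflowBC` are made up to shorten the struct):

```c
#include <stddef.h>
#include <stdint.h>

/* Condensed restatement of the Config struct from the C test program. */
typedef struct { uint32_t length; double values[10]; } Vec10;

typedef struct {
    int32_t type;
    union { struct { int8_t FileDir[256]; } File; } u;
} HeatBC;

typedef struct {
    int32_t type;
    union {
        struct { double addedVelocity; int8_t FileDir[256]; } File;
        struct { double sigma; Vec10 beta; double Zw; Vec10 A; Vec10 omega; } SuctionAndBlowing;
    } u;
} InflowBC;

struct Config {
    struct {
        HeatBC xBCLeftHeat;
        InflowBC yBCLeftInflowProfile;
        InflowBC xBCLeftInflowProfile;
    } BC;
};

/* Byte offset of config.BC.xBCLeftHeat.u.File.FileDir from &config. */
size_t heat_filedir_offset(void) {
    return offsetof(struct Config, BC.xBCLeftHeat.u.File.FileDir);
}

/* Byte offset of config.BC.xBCLeftInflowProfile.u.File.FileDir from &config. */
size_t inflow_filedir_offset(void) {
    return offsetof(struct Config, BC.xBCLeftInflowProfile.u.File.FileDir);
}
```

Under the stated assumptions these come out as 4 and 568 respectively.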
So, been doing a bit of digging, and this does seem to be a calling convention issue. Here's the disassembled
0000000100003ec4 <_main>:
100003ec4: f4 4f be a9 stp x20, x19, [sp, #-32]!
100003ec8: fd 7b 01 a9 stp x29, x30, [sp, #16]
100003ecc: fd 43 00 91 add x29, sp, #16
100003ed0: ff 83 0d d1 sub sp, sp, #864
100003ed4: 1f 20 03 d5 nop
100003ed8: 88 09 00 58 ldr x8, 0x100004008 <_printf+0x100004008>
100003edc: 08 01 40 f9 ldr x8, [x8]
100003ee0: a8 83 1e f8 stur x8, [x29, #-24]
100003ee4: f3 43 00 91 add x19, sp, #16
100003ee8: f3 03 00 f9 str x19, [sp]
100003eec: e0 03 00 10 adr x0, #124
100003ef0: 1f 20 03 d5 nop
100003ef4: 1a 00 00 94 bl 0x100003f5c <_printf+0x100003f5c>
100003ef8: 68 e2 08 91 add x8, x19, #568
100003efc: e8 03 00 f9 str x8, [sp]
100003f00: 00 04 00 50 adr x0, #130
100003f04: 1f 20 03 d5 nop
100003f08: 15 00 00 94 bl 0x100003f5c <_printf+0x100003f5c>
100003f0c: 68 02 7e b2 orr x8, x19, #0x4
100003f10: e8 03 00 f9 str x8, [sp]
100003f14: 40 04 00 10 adr x0, #136
100003f18: 1f 20 03 d5 nop
100003f1c: 10 00 00 94 bl 0x100003f5c <_printf+0x100003f5c>
100003f20: a8 83 5e f8 ldur x8, [x29, #-24]
100003f24: 1f 20 03 d5 nop
100003f28: 09 07 00 58 ldr x9, 0x100004008 <_printf+0x100004008>
100003f2c: 29 01 40 f9 ldr x9, [x9]
100003f30: 3f 01 08 eb cmp x9, x8
100003f34: c1 00 00 54 b.ne 0x100003f4c <_main+0x88>
100003f38: 00 00 80 52 mov w0, #0
100003f3c: ff 83 0d 91 add sp, sp, #864
100003f40: fd 7b 41 a9 ldp x29, x30, [sp, #16]
100003f44: f4 4f c2 a8 ldp x20, x19, [sp], #32
100003f48: c0 03 5f d6 ret
100003f4c: 01 00 00 94 bl 0x100003f50 <_printf+0x100003f50>
Note how it stores the first argument to
Anyway, disassembly of the Terra-generated
0000000100003e4c <_main>:
100003e4c: f4 4f be a9 stp x20, x19, [sp, #-32]!
100003e50: fd 7b 01 a9 stp x29, x30, [sp, #16]
100003e54: ff 43 0d d1 sub sp, sp, #848
100003e58: 00 05 00 10 adr x0, #160
100003e5c: 1f 20 03 d5 nop
100003e60: e1 23 00 91 add x1, sp, #8
100003e64: f3 23 00 91 add x19, sp, #8
100003e68: 21 00 00 94 bl 0x100003eec <_printf+0x100003eec>
100003e6c: 20 05 00 50 adr x0, #166
100003e70: 61 e2 08 91 add x1, x19, #568
100003e74: 1f 20 03 d5 nop
100003e78: 1d 00 00 94 bl 0x100003eec <_printf+0x100003eec>
100003e7c: 80 05 00 10 adr x0, #176
100003e80: 61 02 7e b2 orr x1, x19, #0x4
100003e84: 1f 20 03 d5 nop
100003e88: 19 00 00 94 bl 0x100003eec <_printf+0x100003eec>
100003e8c: ff 43 0d 91 add sp, sp, #848
100003e90: fd 7b 41 a9 ldp x29, x30, [sp, #16]
100003e94: f4 4f c2 a8 ldp x20, x19, [sp], #32
100003e98: c0 03 5f d6 ret
Note how it's using both
Running the first
And the Terra-generated version gives:
This looked strange initially as multiple runs of the C version always seemed to have the stack somewhere in the ...
(lldb)
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
frame #0: 0x0000000100003e68 bug372_terra`main + 28
bug372_terra`main:
-> 0x100003e68 <+28>: bl 0x100003eec ; symbol stub for: printf
0x100003e6c <+32>: adr x0, #0xa6 ; "xBCLeftInflowProfile: %p\n"
0x100003e70 <+36>: add x1, x19, #0x238
0x100003e74 <+40>: nop
Target 0: (bug372_terra) stopped.
(lldb) reg read x0 x1
x0 = 0x0000000100003ef8 "Config: %p\n"
x1 = 0x000000016fdfe648
...
(lldb)
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
frame #0: 0x0000000100003e78 bug372_terra`main + 44
bug372_terra`main:
-> 0x100003e78 <+44>: bl 0x100003eec ; symbol stub for: printf
0x100003e7c <+48>: adr x0, #0xb0 ; "xBCLeftHeat: %p\n"
0x100003e80 <+52>: orr x1, x19, #0x4
0x100003e84 <+56>: nop
Target 0: (bug372_terra) stopped.
(lldb) reg r x0 x1
x0 = 0x0000000100003f12 "xBCLeftInflowProfile: %p\n"
x1 = 0x000000016fdfe880
...
(lldb)
Process 29276 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
frame #0: 0x0000000100003e88 bug372_terra`main + 60
bug372_terra`main:
-> 0x100003e88 <+60>: bl 0x100003eec ; symbol stub for: printf
0x100003e8c <+64>: add sp, sp, #0x350
0x100003e90 <+68>: ldp x29, x30, [sp, #0x10]
0x100003e94 <+72>: ldp x20, x19, [sp], #0x20
Target 0: (bug372_terra) stopped.
(lldb) reg r x0 x1
x0 = 0x0000000100003f2c "xBCLeftHeat: %p\n"
x1 = 0x000000016fdfe64c
(lldb) x/2x $sp
0x16fdfe640: 0x00084310 0x00000001
So, Terra-generated IR is, for some reason, placing the
I can see in the LLVM IR that Terra for some reason is defining a different set of arguments for
%7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
%9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @.str.1, i64 0, i64 0), [256 x i8]* %8)
I'm not very familiar with AAPCS64, but the docs at https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#id82 seem to imply that variadic arguments can first be passed in registers, as Terra is doing. Apple's documentation at https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms, however, differs in that it assigns variadic arguments to 8-byte stack slots instead of registers. This could explain the issues being seen here. |
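To make the mismatch concrete at the C level, here is a minimal sketch. The names `first_vararg_pointer` and `widened_fn` are invented for illustration. The callee reads its pointer from the variable argument list, the way printf does for %p. The typedef mirrors the `i32 (i8*, [256 x i8]*, ...)` call-site type from the Terra IR, where the pointer has become a fixed parameter.

```c
#include <stdarg.h>
#include <stddef.h>

/* Variadic callee: takes one pointer out of the variable argument list and
 * returns it, so the caller can check what actually arrived. */
const void *first_vararg_pointer(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    const void *p = va_arg(ap, const void *);
    va_end(ap);
    return p;
}

/* Function pointer type with the pointer promoted to a fixed parameter.
 * Calling first_vararg_pointer through this type is undefined behavior in
 * C. On Apple arm64 the caller would put the pointer in x1 (fixed argument)
 * while the callee looks for it on the stack (variadic argument). */
typedef const void *(*widened_fn)(const char *, const void *, ...);
```

Calling `first_vararg_pointer("%p", (const void *)&x)` directly returns `&x` on any conforming platform. The cast through `widened_fn` is deliberately not exercised here. On ABIs that pass variadic arguments in registers first, such a call tends to work by accident, which would fit the x86_64 and Linux results reported earlier in this thread.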
Looks like this issue is with Terra's generation of the
; ModuleID = 'terra'
source_filename = "terra"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"
%struct.Config = type { %struct.anon.1 }
%struct.anon.1 = type { %struct.anon.2, %struct.anon.4, %struct.anon.4 }
%struct.anon.2 = type { i32, %union.anon }
%union.anon = type { %struct.anon.3 }
%struct.anon.3 = type { [256 x i8] }
%struct.anon.4 = type { i32, %union.anon.5 }
%union.anon.5 = type { %struct.anon.7 }
%struct.anon.7 = type { double, %struct.anon.8, double, %struct.anon.8, %struct.anon.8 }
%struct.anon.8 = type { i32, [10 x double] }
%struct.anon.6 = type { double, [256 x i8] }
@"$string" = private unnamed_addr constant [26 x i8] c"Config: %p\0A\00", align 1
@"$string.1" = private unnamed_addr constant [26 x i8] c"xBCLeftInflowProfile: %p\0A\00", align 1
@"$string.2" = private unnamed_addr constant [26 x i8] c"xBCLeftHeat: %p\0A\00", align 1
define dso_local void @main() {
entry:
%config = alloca %struct.Config, align 8
;%0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
%0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i32 0, i32 0), %struct.Config* %config)
%1 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
%2 = getelementptr %struct.anon.1, %struct.anon.1* %1, i32 0, i32 2
%3 = getelementptr %struct.anon.4, %struct.anon.4* %2, i32 0, i32 1
%4 = getelementptr %union.anon.5, %union.anon.5* %3, i32 0, i32 0
%5 = bitcast %struct.anon.7* %4 to %struct.anon.6*
%6 = getelementptr %struct.anon.6, %struct.anon.6* %5, i32 0, i32 1
;%7 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
%7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i32 0, i32 0), [256 x i8]* %6)
%8 = getelementptr %struct.Config, %struct.Config* %config, i32 0, i32 0
%9 = getelementptr %struct.anon.1, %struct.anon.1* %8, i32 0, i32 0
%10 = getelementptr %struct.anon.2, %struct.anon.2* %9, i32 0, i32 1
%11 = getelementptr %union.anon, %union.anon* %10, i32 0, i32 0
%12 = getelementptr %struct.anon.3, %struct.anon.3* %11, i32 0, i32 0
;%13 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
%13 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i32 0, i32 0), [256 x i8]* %12)
ret void
}
declare dso_local i32 @printf(i8*, ...)
; definition {} -> {}
define dso_local void @"$main"() {
entry:
%config = alloca %struct.Config, align 8
%0 = call i32 (i8*, %struct.Config*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, %struct.Config*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string", i64 0, i64 0), %struct.Config* nonnull %config)
%1 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 2, i32 1, i32 0, i32 1
%2 = bitcast %struct.anon.8* %1 to [256 x i8]*
%3 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.1", i64 0, i64 0), [256 x i8]* nonnull %2)
%4 = getelementptr inbounds %struct.Config, %struct.Config* %config, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0
%5 = call i32 (i8*, [256 x i8]*, ...) bitcast (i32 (i8*, ...)* @printf to i32 (i8*, [256 x i8]*, ...)*)(i8* getelementptr inbounds ([26 x i8], [26 x i8]* @"$string.2", i64 0, i64 0), [256 x i8]* nonnull %4)
ret void
}
Result:
Not sure why it's treating
I'd guess that the Linux test failures, which this issue was initially opened for, are unrelated to this, so maybe this belongs better in a separate issue? |
Wow, thanks for doing all this digging. Yes, I think this is a separate issue: why is Terra thinking that |
Just one more note: the place to look for issues is likely going to be Lines 1467 to 1470 in 0cf6be6
Hopefully it's just a matter of teaching Terra not to add arguments beyond those in the original function to the formal type it returns, but this may potentially go deeper into the compiler plumbing. |
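As a toy model of the difference described above (this is not Terra's actual code or data structures; `FnType` and both helper names are hypothetical), the two ways of forming the call-site type could be sketched like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified description of a function type. */
typedef struct {
    size_t nparams;  /* number of fixed (named) parameters */
    bool is_vararg;  /* true if the type ends in "..." */
} FnType;

/* Models the observed behavior: the actual arguments are folded into the
 * fixed parameter list, giving e.g. i32 (i8*, [256 x i8]*, ...) for a call
 * to printf with two arguments. */
FnType callsite_type_widened(FnType declared, size_t nargs) {
    FnType t = declared;
    if (declared.is_vararg && nargs > declared.nparams) {
        t.nparams = nargs;
    }
    return t;
}

/* Models the suggested behavior: the call site keeps the declared type,
 * i32 (i8*, ...), and extra arguments stay variadic. */
FnType callsite_type_declared(FnType declared, size_t nargs) {
    (void)nargs;  /* the argument count does not affect the formal type */
    return declared;
}
```

For a printf-like declaration with one fixed parameter and two actual arguments, the first helper reports two fixed parameters and the second reports one. On targets where fixed and variadic arguments are passed differently, that is the distinction that matters.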
I think tracking this down is beyond me at the moment. I'm not familiar with Terra's (or LLVM's) compilation process, but I might be able to take a look at some point in the future. It's probably worth opening a separate issue for it. |
Coming back to the issue of Linux AArch64, I see that the following tests are still failing:
From #625 |
I've seen this pop up a couple of times, so I'm documenting it here so people know about it. On ARM (AArch64), LLVM versions >= 12 seem to experience higher test failure rates. I've seen this happen on both Linux (Graviton, NVIDIA Jetson) and macOS (Apple M1), so it seems to be a feature of ARM processors and not of a specific OS.
Test pass rate on LLVM 14.0.0:
Test pass rate on LLVM 11.1.0:
This is with dcd2eff on the following machine:
My best guess at a root cause is #485, because LLVM 12 is the version where we switched back to MCJIT after ORCv1 was removed. However, it's possible something else changed in LLVM that is breaking something, and we haven't accounted for it yet.
Right now the best workaround is to stick to LLVM 11 on ARM platforms.