-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
M1/M2: Large matrix multiplications can contains NaNs #381
Comments
adding |
Adding |
I cannot reproduce at all on Metal.jl#master using an M3 Pro, but it does seem reproducible on an M1 Pro. I wonder if this is a problem with |
I can reproduce the issue on M1 master. It also looks like all the tasks run on the same queue. |
The issue was found on a M2 Max. The MWE only happens if the array is large enough. It seems to be launching the subsequent kernel before the matmul finished. Is it possible that the p.s. I'm about to board the plane to JuliaCon so I won't be able to test it soon. |
It also reproduces when comparing on the CPU, just much less likely, so this isn't a |
Looks like a bunch of NaN's in the second matrix. |
My current MWE is:
I define C out of the loop to access it afterwards. When I had |
I can: using Metal, LinearAlgebra
function main(T=Float32, N=10000)
a = Metal.rand(T, N, N)
b = Metal.rand(T, N, N)
c = a * b'
synchronize()
for i in 1:100
println("Iteration $i")
d = Metal.zeros(T, size(a))
MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
#=transpose_a=#false, #=transpose_b=#true)
@assert !any(isnan.(Array(d))) "NaN in iteration $i"
# XXX: this redundant check is needed, or the failure never occurs
@assert !any(isnan.(d))
end
end
isinteractive() || main() The need for a secondary kernel is very weird. |
It is for i in 1:10
C = Metal.zeros(Float32, size(a))
GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())
@assert C ≈ c "$i"
end |
I don't see how that's related; it's an entirely different kernel. Does it contain NaNs in similar places? |
Just wanted to confirm that its MPS rather than the synchronisation between kernel launches. |
I've been seeing the NaN issues with large arrays for a long time in #145 MPX seems fine: import mlx.core as mx
a = mx.random.normal((10000, 10000))
b = mx.random.normal((10000, 10000))
c = a @ b.T
for i in range(0,10):
C = a @ b.T
assert(mx.allclose(C,c)) |
I would love for someone to review my code because I'm not a Swift expert by any means, but I was able to reproduce this in the Swift REPL. Swift MWE
|
@christiangnrd Your Swift Code looks good to me. It turns out MPX doesn’t even use MPS. |
Haven't been able to look into this, but here's the ObjC version: #import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
void performMatrixMultiplication(NSInteger N) {
if (N == 0) {
N = 10000;
}
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> commandQueue = [device newCommandQueue];
if (!device || !commandQueue) {
NSLog(@"Metal device or command queue could not be created");
return;
}
NSLog(@"Initializing a & b");
// Generate random NxN matrices
float *a = calloc(N * N, sizeof(float));
float *b = calloc(N * N, sizeof(float));
for (NSInteger i = 0; i < N * N; i++) {
a[i] = 1.0f;
b[i] = 1.0f;
}
NSLog(@"a and b created\n");
// Metal buffers for matrices
id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
NSLog(@"Starting matmul\n");
for (NSInteger i = 1; i <= 10; i++) {
NSLog(@"%ld\n", (long)i);
// Create MPSMatrices
MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
columns:N
rowBytes:sizeof(float) * N
dataType:MPSDataTypeFloat32];
MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
columns:N
rowBytes:sizeof(float) * N
dataType:MPSDataTypeFloat32];
MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];
// Matrix multiplication using MPSMatrixMultiplication
MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
transposeLeft:NO
transposeRight:NO
resultRows:N
resultColumns:N
interiorColumns:N
alpha:1.0
beta:0.0];
id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
columns:N
rowBytes:sizeof(float) * N
dataType:MPSDataTypeFloat32];
MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];
id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
[matrixMultiplication encodeToCommandBuffer:commandBuffer
leftMatrix:aMatrix
rightMatrix:bMatrix
resultMatrix:cMatrix];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
// Check for NaNs in the result matrix
float *cPointer = cBuffer.contents;
for (NSInteger j = 0; j < N * N; j++) {
if (isnan(cPointer[j])) {
NSLog(@"NaN in iteration %ld", (long)i);
free(a);
free(b);
return;
}
}
}
free(a);
free(b);
}
int main(int argc, const char * argv[]) {
@autoreleasepool {
NSInteger N = 10000;
if (argc > 1) {
N = atoi(argv[1]);
}
performMatrixMultiplication(N);
}
return 0;
}
|
Should we just file a radar / feedback? |
I'll have a better look first and forward it to our Apple contact. |
Apparently this looks like an ARC bug. Curiously, the ObjC reproducer is "fixed" by adding an Of course, the Julia MWE is more complex, as the Still broken Julia MWEusing Metal, LinearAlgebra
using ObjectiveC, .Foundation
function main(T=Float32, N=10000)
a = Metal.rand(T, N, N)
b = Metal.rand(T, N, N)
synchronize()
for i in 1:100
@autoreleasepool begin
println("Iteration $i")
d = Metal.zeros(T, size(a))
MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
#=transpose_a=#false, #=transpose_b=#false)
@assert !any(isnan.(Array(d))) "NaN in iteration $i"
# XXX: this redundant check is needed, or the failure never occurs
@assert !any(isnan.(d))
end
end
end
isinteractive() || main() "Fixed" ObjeC MWE#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
void performMatrixMultiplication(NSInteger N) {
if (N == 0) {
N = 10000;
}
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> commandQueue = [device newCommandQueue];
if (!device || !commandQueue) {
NSLog(@"Metal device or command queue could not be created");
return;
}
NSLog(@"Initializing a & b");
// Generate random NxN matrices
float *a = calloc(N * N, sizeof(float));
float *b = calloc(N * N, sizeof(float));
for (NSInteger i = 0; i < N * N; i++) {
a[i] = 1.0f;
b[i] = 1.0f;
}
NSLog(@"a and b created\n");
// Metal buffers for matrices
id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
NSLog(@"Starting matmul\n");
for (NSInteger i = 1; i <= 100; i++) {
@autoreleasepool {
NSLog(@"Iteration %ld\n", (long)i);
// Create MPSMatrices
MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
columns:N
rowBytes:sizeof(float) * N
dataType:MPSDataTypeFloat32];
MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
columns:N
rowBytes:sizeof(float) * N
dataType:MPSDataTypeFloat32];
MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];
// Matrix multiplication using MPSMatrixMultiplication
MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
transposeLeft:NO
transposeRight:NO
resultRows:N
resultColumns:N
interiorColumns:N
alpha:1.0
beta:0.0];
id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
columns:N
rowBytes:sizeof(float) * N
dataType:MPSDataTypeFloat32];
MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];
id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
[matrixMultiplication encodeToCommandBuffer:commandBuffer
leftMatrix:aMatrix
rightMatrix:bMatrix
resultMatrix:cMatrix];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
// Check for NaNs in the result matrix
float *cPointer = cBuffer.contents;
for (NSInteger j = 0; j < N * N; j++) {
if (isnan(cPointer[j])) {
NSLog(@"NaN in iteration %ld", (long)i);
free(a);
free(b);
return;
}
}
}
}
free(a);
free(b);
}
int main(int argc, const char * argv[]) {
@autoreleasepool {
NSInteger N = 10000;
if (argc > 1) {
N = atoi(argv[1]);
}
performMatrixMultiplication(N);
}
return 0;
} |
Couldn't reproduce the ObjectiveC case today with and without |
I can reproduce the error in both Swift and ObjectiveC and it goes away when surrounded by an |
Oops. I just overlooked the second |
By "the first one" do you mean the |
I'm able to reproduce this without the second redundant check. Still broken simpler Julia MWE
|
Our
[NSAutoreleasePool showPools] |
Are we using ARC in Julia? |
We don’t use ARC, but the libraries we are using might have been compiled with ARC enabled. |
When I turned off ARC in XCode for the objc version, even with the |
AFAIU |
That's my understanding too. However, from what I understand about our implementation of the The only thing is that I don't know it this information is actually helpful. |
I could no longer reproduce after calling |
AutoreleasePools only hide the issue. It seems like when the buffer overlaps an address that's a multiple of 2 GiB, and the buffer length itself is not a power of 2, the NaNs sometimes happen. I haven't figured out the exact conditions (and I probably won't), but I can somewhat reliably predict them. Not exactly a MWE, but you should be able to copy-paste into a new xcode swift project: Swift working example:
Swift working example output:
@maleadt Is this worth forwarding to your Apple rep again? If you do, I updated my initial report (FB14293696). |
Similar issue? pytorch/pytorch#117549. Pytorch fixed it by implementing their own matmul kernel. pytorch/pytorch#117549 |
MWE:
The text was updated successfully, but these errors were encountered: