I was trying to implement a batch-matrix-matrix multiplication in Mojo, so I started by building my own tensor type, which looks like this:
struct SelfmadeDoubleTensor:
    var pointer: Pointer[Float64]
    var batch_size: Int
    var left: Int
    var right: Int

    fn __init__(inout self, batch_size: Int, left: Int, right: Int):
        # One contiguous, row-major buffer holding all batch_size * left * right elements.
        self.pointer = Pointer[Float64].alloc(batch_size * left * right)
        self.batch_size = batch_size
        self.left = left
        self.right = right
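Since the struct owns a raw allocation but never frees it, I also gave it a destructor; a minimal sketch, assuming Pointer.free() behaves as in the Mojo version I am on:

    fn __del__(owned self):
        # Release the flat buffer allocated in __init__.
        self.pointer.free()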
I then continued implementing an unoptimised bmm as follows:
fn __mul__(A: Self, B: Self) -> Self:
    var C: SelfmadeDoubleTensor = SelfmadeDoubleTensor(A.batch_size, A.left, B.right)
    for i in range(A.batch_size):
        for j in range(A.left):
            for k in range(B.right):
                for l in range(A.right):
                    # C[i, j, k] += A[i, j, l] * B[i, l, k], with all three
                    # tensors stored row-major in one flat buffer.
                    C.pointer[i * A.left * B.right + j * C.right + k] += A.pointer[i * A.left * A.right + j * A.right + l] * B.pointer[i * B.left * B.right + B.right * l + k]
    return C
The result of that one actually looked quite promising: I got runtimes of around 85.5 seconds with a batch size of 128 and the other two dimensions of both tensors at 512. That is about as fast as my C implementation.
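For reference, this is roughly how I measure it (a sketch; it assumes time.now() returning nanoseconds and memset_zero from the memory module, as in my Mojo version):

    from time import now
    from memory import memset_zero

    fn benchmark_handmade():
        var A = SelfmadeDoubleTensor(128, 512, 512)
        var B = SelfmadeDoubleTensor(128, 512, 512)
        # Zero the buffers so the multiplication reads defined values.
        memset_zero(A.pointer, 128 * 512 * 512)
        memset_zero(B.pointer, 128 * 512 * 512)
        var start = now()  # nanoseconds
        var C = A * B
        var seconds = Float64(now() - start) / 1e9
        print("handmade bmm:", seconds, "seconds")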
Afterwards I tried to use the builtin tensor to achieve the same result. I figured that the plain multiplication of builtin tensors is not an actual matrix multiplication, so not what I am trying to achieve, and instead implemented my own function on top of the builtin tensor class:
from tensor import Tensor, TensorSpec
from utils.index import Index

fn bmm_with_builtin_tensor_type(A: Tensor[DType.float64], B: Tensor[DType.float64]) -> Tensor[DType.float64]:
    var C: Tensor[DType.float64] = Tensor[DType.float64](TensorSpec(DType.float64, A.shape()[0], A.shape()[1], B.shape()[2]))
    for i in range(A.shape()[0]):
        for j in range(A.shape()[1]):
            for k in range(B.shape()[2]):
                for l in range(A.shape()[2]):
                    # Same C[i, j, k] += A[i, j, l] * B[i, l, k] as above,
                    # but going through the builtin Tensor indexing.
                    C[Index(i, j, k)] += A[Index(i, j, l)] * B[Index(i, l, k)]
    return C
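I call it like this, for the same sizes (a sketch; I am assuming the Tensor(TensorSpec) constructor zero-initializes the data, otherwise the tensors need an explicit fill):

    from tensor import Tensor, TensorSpec

    fn run_builtin():
        var spec = TensorSpec(DType.float64, 128, 512, 512)
        var A = Tensor[DType.float64](spec)
        var B = Tensor[DType.float64](spec)
        # Fill A and B with real data here as needed.
        var C = bmm_with_builtin_tensor_type(A, B)
        print(C.shape()[0], C.shape()[1], C.shape()[2])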
The problem is: I am probably doing something wrong, but I am unsure what the issue is, since this implementation runs around 10 seconds slower than my own, even though I expected it to run faster. I actually came back to this problem after some time and it has gotten better; a few updates ago the builtin method was around 10 times slower.
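One guess I have (not confirmed) is that the repeated shape() calls in the loop headers and the per-element Index accesses add overhead. Here is a variant that hoists the dimensions into locals and accumulates in a scalar before writing back, as a sketch:

    from tensor import Tensor, TensorSpec
    from utils.index import Index

    fn bmm_hoisted(A: Tensor[DType.float64], B: Tensor[DType.float64]) -> Tensor[DType.float64]:
        # Cache the dimensions once instead of querying shape() on every iteration.
        var batch = A.shape()[0]
        var left = A.shape()[1]
        var inner = A.shape()[2]
        var right = B.shape()[2]
        var C = Tensor[DType.float64](TensorSpec(DType.float64, batch, left, right))
        for i in range(batch):
            for j in range(left):
                for k in range(right):
                    var acc: Float64 = 0.0
                    for l in range(inner):
                        acc += A[Index(i, j, l)] * B[Index(i, l, k)]
                    # One write per output element instead of a read-modify-write per l.
                    C[Index(i, j, k)] = acc
        return C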
Can somebody please explain what I am doing wrong?