I was trying to implement a batch-matrix-matrix multiplication in Mojo, so I started by building my own tensor type, which looks like this:
struct SelfmadeDoubleTensor:
    var pointer: Pointer[Float64]
    var batch_size: Int
    var left: Int
    var right: Int

    fn __init__(inout self, batch_size: Int, left: Int, right: Int):
        # One contiguous, row-major buffer holding all batch_size * left * right elements.
        self.pointer = Pointer[Float64].alloc(batch_size * left * right)
        self.batch_size = batch_size
        self.left = left
        self.right = right
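Since the struct owns a raw allocation but never frees it, I also gave it a destructor; a minimal sketch, assuming Pointer.free() behaves as in the Mojo version I am on:

    fn __del__(owned self):
        # Release the flat buffer allocated in __init__.
        self.pointer.free()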
I then continued implementing an unoptimised bmm as follows:
fn __mul__(A: Self, B: Self) -> Self:
    var C: SelfmadeDoubleTensor = SelfmadeDoubleTensor(A.batch_size, A.left, B.right)
    for i in range(A.batch_size):
        for j in range(A.left):
            for k in range(B.right):
                for l in range(A.right):
                    # C[i, j, k] += A[i, j, l] * B[i, l, k], with all three
                    # tensors stored row-major in one flat buffer.
                    C.pointer[i * A.left * B.right + j * C.right + k] += A.pointer[i * A.left * A.right + j * A.right + l] * B.pointer[i * B.left * B.right + B.right * l + k]
    return C
The result of that one actually looked quite promising: I got runtimes of around 85.5 seconds with a batch size of 128 and the other two dimensions of both tensors at 512. That is about as fast as my C implementation.
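For reference, this is roughly how I measure it (a sketch; it assumes time.now() returning nanoseconds and memset_zero from the memory module, as in my Mojo version):

    from time import now
    from memory import memset_zero

    fn benchmark_handmade():
        var A = SelfmadeDoubleTensor(128, 512, 512)
        var B = SelfmadeDoubleTensor(128, 512, 512)
        # Zero the buffers so the multiplication reads defined values.
        memset_zero(A.pointer, 128 * 512 * 512)
        memset_zero(B.pointer, 128 * 512 * 512)
        var start = now()  # nanoseconds
        var C = A * B
        var seconds = Float64(now() - start) / 1e9
        print("handmade bmm:", seconds, "seconds")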
Afterwards I tried to use the builtin tensor to achieve the same result. I figured that the plain multiplication of builtin tensors is not an actual matrix multiplication, so not what I am trying to achieve, and instead implemented my own function on top of the builtin tensor class:
from tensor import Tensor, TensorSpec
from utils.index import Index

fn bmm_with_builtin_tensor_type(A: Tensor[DType.float64], B: Tensor[DType.float64]) -> Tensor[DType.float64]:
    var C: Tensor[DType.float64] = Tensor[DType.float64](TensorSpec(DType.float64, A.shape()[0], A.shape()[1], B.shape()[2]))
    for i in range(A.shape()[0]):
        for j in range(A.shape()[1]):
            for k in range(B.shape()[2]):
                for l in range(A.shape()[2]):
                    # Same C[i, j, k] += A[i, j, l] * B[i, l, k] as above,
                    # but going through the builtin Tensor indexing.
                    C[Index(i, j, k)] += A[Index(i, j, l)] * B[Index(i, l, k)]
    return C
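I call it like this, for the same sizes (a sketch; I am assuming the Tensor(TensorSpec) constructor zero-initializes the data, otherwise the tensors need an explicit fill):

    from tensor import Tensor, TensorSpec

    fn run_builtin():
        var spec = TensorSpec(DType.float64, 128, 512, 512)
        var A = Tensor[DType.float64](spec)
        var B = Tensor[DType.float64](spec)
        # Fill A and B with real data here as needed.
        var C = bmm_with_builtin_tensor_type(A, B)
        print(C.shape()[0], C.shape()[1], C.shape()[2])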
The problem is: I am probably doing something wrong, but I am unsure what the issue is, since this implementation runs around 10 seconds slower than my own, even though I expected it to run faster. I actually came back to this problem after some time and it has gotten better; a few updates ago the builtin method was around 10 times slower.
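One guess I have (not confirmed) is that the repeated shape() calls in the loop headers and the per-element Index accesses add overhead. Here is a variant that hoists the dimensions into locals and accumulates in a scalar before writing back, as a sketch:

    from tensor import Tensor, TensorSpec
    from utils.index import Index

    fn bmm_hoisted(A: Tensor[DType.float64], B: Tensor[DType.float64]) -> Tensor[DType.float64]:
        # Cache the dimensions once instead of querying shape() on every iteration.
        var batch = A.shape()[0]
        var left = A.shape()[1]
        var inner = A.shape()[2]
        var right = B.shape()[2]
        var C = Tensor[DType.float64](TensorSpec(DType.float64, batch, left, right))
        for i in range(batch):
            for j in range(left):
                for k in range(right):
                    var acc: Float64 = 0.0
                    for l in range(inner):
                        acc += A[Index(i, j, l)] * B[Index(i, l, k)]
                    # One write per output element instead of a read-modify-write per l.
                    C[Index(i, j, k)] = acc
        return C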
Can somebody please explain what I am doing wrong?