Hi @nobelchowdary, yes, I am running everything exactly as written in the blog. My machine is not identical to the AWS Graviton instance (the blog states that is the machine they ran on).
I am running this on a lab machine with 70 cores of Arm Neoverse V2 CPUs, and I get these results.
Hey, I am working on an Ubuntu machine with 70 cores (Arm Neoverse V2 CPUs). I was following the tutorial and managed to run everything, but the results I see are much slower than what this post shows:
the blog: https://learn.arm.com/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama/
The results I get:
Input tokens : 24
Generated tokens : 32
Time to first token : 5.24 s
Prefill Speed : 4.58 t/s
Generation Speed : 4.14 t/s
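For what it's worth, the numbers above are internally consistent if prefill speed is computed as input tokens divided by time to first token. A quick sanity check (these formulas are my assumption; the tutorial's benchmark script may compute the metrics differently):

```python
# Sanity-check the reported metrics (assumed formulas, not the
# tutorial script's actual implementation).
input_tokens = 24
generated_tokens = 32
time_to_first_token = 5.24  # seconds
generation_speed = 4.14     # tokens/second (reported)

# Prefill speed: input tokens processed per second before the first token.
prefill_speed = input_tokens / time_to_first_token
print(f"Prefill speed: {prefill_speed:.2f} t/s")  # ~4.58 t/s, matches

# Implied decode time for the 32 generated tokens.
decode_time = generated_tokens / generation_speed
print(f"Decode time: {decode_time:.2f} s")
```

So the slowdown is uniform across prefill and decode rather than an artifact of how one metric is measured.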
This is much slower than the results shown in the post: a generation speed of 24.6 t/s and a time to first token of 0.66 s.
Any direction on how to debug this?
Thanks!