Run a Large Language Model (LLM) chatbot on Arm servers #1447

Open
RachelShalom opened this issue Dec 16, 2024 · 3 comments

Hey, I am working on an Ubuntu machine with 70 cores (Arm Neoverse V2 CPUs) and I was following the tutorial. I managed to run everything, but the results I see are much slower than what this post shows:
the blog: https://learn.arm.com/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama/

The results I get:

Input tokens : 24
Generated tokens : 32
Time to first token : 5.24 s
Prefill Speed : 4.58 t/s
Generation Speed : 4.14 t/s

which is much slower than the results shown in the blog: a generation speed of 24.6 t/s and a time to first token of 0.66 s.

Any direction to debug this?

Thanks.

@nobelchowdary (Contributor) commented Dec 19, 2024

Hi @RachelShalom,

Make sure you follow the steps in the blog/learning path properly. Are you using the following command to run the inference?

LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat

Are you running it with 16 threads, i.e. OMP_NUM_THREADS=16?
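For what it's worth, one way to narrow this down is to sweep the thread count and pin the process to a fixed set of cores, reusing the command above. This is only a sketch: the thread values, the taskset core pinning, and dropping --chat (so each run exits after a single generation) are assumptions, not part of the original learning path.

# Sweep OMP_NUM_THREADS on the 70-core Neoverse V2 machine; throughput often
# peaks below the total core count, so compare the reported t/s across runs.
for T in 8 16 32 64; do
  echo "=== OMP_NUM_THREADS=$T ==="
  LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 \
  TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 \
  OMP_NUM_THREADS=$T \
  taskset -c 0-$((T-1)) \
  python torchchat.py generate llama3.1 \
    --dso-path exportedModels/llama3.1.so \
    --device cpu --max-new-tokens 32
done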

@RachelShalom (Author)

Hi @nobelchowdary, yes, I am running everything as written in the blog. My machine is not identical to the AWS Graviton instance (the blog states that this is the machine they ran on). I am running this on a lab machine I have with 70 cores of Neoverse V2 Arm CPUs, and I get the results above.

@nobelchowdary (Contributor)

@RachelShalom can you share the output of the lscpu command?
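For reference, a minimal set of standard Linux commands for gathering the CPU details being asked about (nothing below is specific to the learning path; numactl may need to be installed separately):

# Core count, topology, caches, and NUMA layout
lscpu

# ISA features relevant to PyTorch on Neoverse V2 (e.g. sve, bf16, i8mm)
lscpu | grep -i -E 'flags|features'

# Per-node memory layout, if numactl is installed
numactl --hardware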
