Run a Large Language Model (LLM) chatbot on Arm servers #1447

Open
RachelShalom opened this issue Dec 16, 2024 · 3 comments

Hey, I am working on an Ubuntu machine with 70 cores (Arm Neoverse V2 CPUs) and I was following the tutorial. I managed to run everything, but the results I see are much slower than what this post shows:
the blog: https://learn.arm.com/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama/

The results I get:

Input tokens : 24
Generated tokens : 32
Time to first token : 5.24 s
Prefill Speed : 4.58 t/s
Generation Speed : 4.14 t/s

which is much slower than the results shown in the blog: a generation speed of 24.6 t/s and a time to first token of 0.66 s.

Any direction to debug this?

Thanks.

@nobelchowdary (Contributor) commented Dec 19, 2024

Hi @RachelShalom,

Make sure you follow the steps in the blog/learning path properly. Are you using the following command to run the inference?

LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat

Are you running it with 16 threads, i.e. OMP_NUM_THREADS=16?
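For what it's worth, one way to narrow this down is to sweep the thread count and pin the process to a fixed set of cores, reusing the command above. This is only a sketch: the thread values, the taskset core pinning, and dropping --chat (so each run exits after a single generation) are assumptions, not part of the original learning path.

# Sweep OMP_NUM_THREADS on the 70-core Neoverse V2 machine; throughput often
# peaks below the total core count, so compare the reported t/s across runs.
for T in 8 16 32 64; do
  echo "=== OMP_NUM_THREADS=$T ==="
  LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 \
  TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 \
  OMP_NUM_THREADS=$T \
  taskset -c 0-$((T-1)) \
  python torchchat.py generate llama3.1 \
    --dso-path exportedModels/llama3.1.so \
    --device cpu --max-new-tokens 32
done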

@RachelShalom (Author)

Hi @nobelchowdary, yes, I am running everything as written in the blog. My machine is not identical to the AWS Graviton instance (the blog states that this is the machine they ran on). I am running this on a lab machine I have with 70 cores of Neoverse V2 Arm CPUs, and I get the results above.

@nobelchowdary (Contributor)

@RachelShalom can you share the output of the lscpu command?
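For reference, a minimal set of standard Linux commands for gathering the CPU details being asked about (nothing below is specific to the learning path; numactl may need to be installed separately):

# Core count, topology, caches, and NUMA layout
lscpu

# ISA features relevant to PyTorch on Neoverse V2 (e.g. sve, bf16, i8mm)
lscpu | grep -i -E 'flags|features'

# Per-node memory layout, if numactl is installed
numactl --hardware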
