Benchmarking NVIDIA TensorRT-LLM
Jan now supports NVIDIA TensorRT-LLM in addition to llama.cpp, making Jan multi-engine and ultra-fast for users with Nvidia GPUs.
We’ve been excited about TensorRT-LLM for a while, and had a lot of fun implementing it. As part of the process, we ran some benchmarks to see how TensorRT-LLM fares on the consumer hardware (e.g. 4090s, 3090s) we commonly see in Jan’s hardware community.
Give it a try! Jan’s TensorRT-LLM extension is available in Jan v0.4.9 and up (see more). We precompiled some TensorRT-LLM models for you to try:
- Mistral 7b
- TinyLlama-1.1b
- TinyJensen-1.1b 😂

Bugs or feedback? Let us know on GitHub or via Discord.
An interesting aside: Jan actually started out in June 2023 building on NVIDIA FasterTransformer, the precursor library to TensorRT-LLM. TensorRT-LLM was released in September 2023, making it a very young library. We’re excited to see its roadmap develop!
Key Findings
TensorRT-LLM was:
- 30–70% faster than llama.cpp on the same hardware (see the measurement sketch after this list)
- More efficient with memory on consecutive runs, though it used marginally more GPU VRAM than llama.cpp
- 20%+ smaller in compiled model size than llama.cpp
- Less convenient as models have to be compiled for a specific OS and GPU architecture, vs. llama.cpp’s “Compile once, run everywhere” portability
- Less accessible as it does not support older-generation NVIDIA GPUs
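
The headline speedup above is easiest to think about as generation throughput (tokens/s). As a rough illustration only (not Jan’s actual benchmarking harness), here is a minimal sketch of how tokens-per-second can be measured for any engine; the `generate` callable is a hypothetical stand-in for llama.cpp or TensorRT-LLM.

```python
import time

def measure_throughput(generate, prompt: str, max_tokens: int = 512) -> float:
    """Return decoded tokens per second for one generation call.

    `generate` is a hypothetical stand-in for the engine under test
    (llama.cpp or TensorRT-LLM); it should return the number of tokens
    it actually produced.
    """
    start = time.perf_counter()
    tokens_produced = generate(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return tokens_produced / elapsed

if __name__ == "__main__":
    # Stub engine so the sketch runs standalone.
    def fake_engine(prompt: str, max_tokens: int) -> int:
        time.sleep(0.5)      # pretend to decode for half a second
        return max_tokens    # pretend every requested token was produced

    print(f"{measure_throughput(fake_engine, 'Hello', max_tokens=128):.1f} tokens/s")
```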
Why TensorRT-LLM?
TensorRT-LLM is Nvidia’s open-source inference library that incorporates Nvidia’s proprietary optimizations beyond the open-source cuBLAS library.
Compared to llama.cpp, which today dominates Desktop AI as a cross-platform inference engine, TensorRT-LLM is highly optimized for Nvidia GPUs. While llama.cpp compiles models into a single, generalizable CUDA “backend” that can run on a wide range of Nvidia GPUs, TensorRT-LLM compiles models into a GPU-specific execution graph that is highly optimized for that specific GPU’s Tensor Cores, CUDA cores, VRAM, and memory bandwidth.
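
To make that difference concrete, here is a minimal sketch (assumptions, not Jan’s integration code) of what running the same model looks like in each engine. The llama.cpp side uses the llama-cpp-python bindings with a portable GGUF file; the TensorRT-LLM side assumes an engine has already been compiled for your specific GPU, and the exact build command and runtime API vary by TensorRT-LLM version.

```python
# Sketch only: contrasts llama.cpp's "compile once, run everywhere" GGUF files
# with TensorRT-LLM's GPU-specific compiled engines. Paths are placeholders and
# the TensorRT-LLM calls below are assumptions that may differ across versions.

# --- llama.cpp: one GGUF file runs on any supported GPU or CPU ---
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
out = llm("What is TensorRT-LLM?", max_tokens=64)
print(out["choices"][0]["text"])

# --- TensorRT-LLM: the model must first be compiled into an engine that is
#     only valid for the GPU architecture it was built on, e.g. (assumed CLI):
#       trtllm-build --checkpoint_dir ./mistral-7b-ckpt --output_dir ./mistral-7b-engine
#     and then loaded with the runtime (assumed API):
# from tensorrt_llm.runtime import ModelRunner
# runner = ModelRunner.from_dir("./mistral-7b-engine")
```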
TensorRT-LLM is typically used on datacenter-grade GPUs, where it produces a face-melting 10,000 tokens/s on NVIDIA H100 Tensor Core GPUs. We were curious how TensorRT-LLM performs on consumer-grade GPUs, and gave it a spin.