Benchmarking NVIDIA TensorRT-LLM

Nicole Zhu
2 min read · May 1, 2024

Jan now supports NVIDIA TensorRT-LLM in addition to llama.cpp, making Jan multi-engine and ultra-fast for users with NVIDIA GPUs.

We’ve been excited about TensorRT-LLM for a while, and had a lot of fun implementing it. As part of the process, we ran some benchmarks to see how TensorRT-LLM fares on the consumer hardware (e.g. 4090s, 3090s) we commonly see in Jan’s hardware community.

Give it a try! Jan’s TensorRT-LLM extension is available in Jan v0.4.9 and up (see more). We precompiled some TensorRT-LLM models for you to try: Mistral 7b, TinyLlama-1.1b, TinyJensen-1.1b 😂

Bugs or feedback? Let us know on GitHub or via Discord.

An interesting aside: Jan actually started out in June 2023 building on NVIDIA FasterTransformer, the precursor library to TensorRT-LLM. TensorRT-LLM was released in September 2023, making it a very young library. We’re excited to see its roadmap develop!

Key Findings

TensorRT-LLM was:

  • 30–70% faster than llama.cpp on the same hardware
  • Lighter on memory across consecutive runs, with marginally higher GPU VRAM utilization than llama.cpp
  • 20%+ smaller in compiled model size than llama.cpp
  • Less convenient, as models have to be compiled for a specific OS and GPU architecture, vs. llama.cpp’s “compile once, run everywhere” portability
  • Less accessible, as it does not support older-generation NVIDIA GPUs
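
For readers who want to reproduce a figure like "30–70% faster" on their own hardware, here is a minimal sketch of the arithmetic. It assumes "faster" means higher throughput in tokens per second; the helper names (`tokens_per_second`, `percent_faster`) and the numbers are hypothetical placeholders, not measurements from this post.

```python
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput of a single generation run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def percent_faster(candidate_tps: float, baseline_tps: float) -> float:
    """Relative throughput gain of the candidate over the baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


if __name__ == "__main__":
    # Placeholder timings for illustration only (assumed, not benchmark data).
    baseline = tokens_per_second(num_tokens=1000, elapsed_seconds=10.0)
    candidate = tokens_per_second(num_tokens=1000, elapsed_seconds=6.25)
    print(f"{percent_faster(candidate, baseline):.0f}% faster")
```

For a fair comparison, both engines should generate the same number of tokens from the same prompt on the same GPU, and timings are best averaged over several runs.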

Why TensorRT-LLM?

TensorRT-LLM is NVIDIA’s open-source inference library that incorporates NVIDIA’s proprietary optimizations beyond the open-source cuBLAS library.

Compared to llama.cpp, which today dominates desktop AI as a cross-platform inference engine, TensorRT-LLM is highly optimized for NVIDIA GPUs. While llama.cpp compiles models into a single, generalizable CUDA “backend” that can run on a wide range of NVIDIA GPUs, TensorRT-LLM compiles models into a GPU-specific execution graph that is tuned for that specific GPU’s Tensor Cores, CUDA cores, VRAM and memory bandwidth.

TensorRT-LLM is typically used on datacenter-grade GPUs, where it produces a face-melting 10,000 tokens/s on NVIDIA H100 Tensor Core GPUs. We were curious how TensorRT-LLM performs on consumer-grade GPUs, so we gave it a spin.
