# CI Benchmark

This is the performance benchmark for the vLLM GitHub repository. It is aimed at developers.

# Hardware

Please select a hardware platform. All plots below show only the results for the selected hardware (support for more hardware is coming).

# Smoothing

Each curve is smoothed by averaging over the last N commits (default: 10).
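As a sketch, this smoothing amounts to a trailing moving average over the last `window` commits (the function and variable names here are illustrative, not the dashboard's actual code):

```python
def smooth(values, window=10):
    """Trailing moving average: each point becomes the mean of the
    last `window` values up to and including that point."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Early points average over fewer than `window` commits, so the start of the curve is less smoothed than the rest.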

# TL;DR

- Latency of vLLM
  - Metric: median end-to-end latency (ms). We use the median because it is more stable than the mean when outliers occur.
  - Input length: 32 tokens.
  - Output length: 128 tokens.

- Throughput of vLLM
  - Metric: throughput (requests per second).
  - Input length: 200 prompts from ShareGPT.
  - Output length: the corresponding output lengths of these 200 prompts.

- Serving test of vLLM
  - Metrics: median TTFT (time to first token, unit: ms) & median ITL (inter-token latency, unit: ms). We use the median because it is more stable than the mean when outliers occur.
  - Input length: 200 prompts from ShareGPT.
  - Output length: the corresponding output lengths of these 200 prompts.
  - Average QPS: 1, 4, 16 and inf. QPS = inf means all requests arrive at once.

- We also test speculative decoding in the vLLM serving test. Concretely:
  - Metrics: median TTFT (time to first token, unit: ms) & median ITL (inter-token latency, unit: ms). We use the median because it is more stable than the mean when outliers occur.
  - Input length: 200 prompts from ShareGPT.
  - Output length: the corresponding output lengths of these 200 prompts.
  - Draft model: Qwama-0.5B.
  - Number of tokens proposed per step: 4.
  - Average QPS: 2.
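Why the median rather than the mean? A quick example with made-up numbers shows how a single outlier distorts the mean but barely moves the median:

```python
import statistics

# Four typical latencies plus one outlier (e.g. a stall or noisy neighbor).
# These numbers are illustrative, not benchmark results.
latencies_ms = [100, 102, 98, 101, 900]

print(statistics.mean(latencies_ms))    # 260.2 -- dragged up by the outlier
print(statistics.median(latencies_ms))  # 101   -- barely affected
```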

# Latency tests

### Description

This test suite aims to test vLLM's end-to-end latency under a controlled setup.

- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Evaluation metrics: end-to-end latency (mean, median, p99).
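Given a list of per-run end-to-end latencies, these three statistics can be computed with the standard library; a minimal sketch (names are illustrative):

```python
import statistics

def summarize(latencies_ms):
    """Return mean, median, and p99 of end-to-end latencies (ms)."""
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        # quantiles(n=100) yields the 1st..99th percentile cut points,
        # so index 98 is the 99th percentile.
        "p99": statistics.quantiles(latencies_ms, n=100)[98],
    }
```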

### Plot

# Throughput tests

### Description

This test suite aims to test vLLM's throughput.

- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Evaluation metrics: throughput (requests per second).
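Throughput here is simply completed requests divided by wall-clock time. A minimal sketch of that measurement (the `run_benchmark` callable is a placeholder, not the benchmark's actual code):

```python
import time

def measure_throughput(run_benchmark, num_requests):
    """Run the workload once and return requests per wall-clock second.
    `run_benchmark` is assumed to process all `num_requests` requests
    before returning."""
    start = time.perf_counter()
    run_benchmark()
    elapsed = time.perf_counter() - start
    return num_requests / elapsed
```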

### Plot

# Serving Benchmark (on ShareGPT)

### Description

This test suite aims to test vLLM's metrics in a real serving scenario.

- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and by the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
- Models: Llama-3 8B, Llama-3 70B, Mixtral 8x7B.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), ITL (inter-token latency; mean, median and p99).
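For a finite QPS, a Poisson arrival process with rate `qps` has independent, exponentially distributed inter-arrival gaps with mean `1/qps`. A sketch of generating such arrival timestamps (illustrative, not the benchmark's actual code):

```python
import random

def poisson_arrival_times(num_requests, qps, seed=0):
    """Arrival timestamps (seconds) for a Poisson process with rate `qps`.
    Inter-arrival gaps are i.i.d. Exponential(qps), i.e. mean 1/qps."""
    rng = random.Random(seed)  # fixed seed makes arrivals reproducible
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # next gap
        times.append(t)
    return times
```

With `qps=4`, gaps average 0.25 s, so 200 requests span roughly 50 s of simulated arrivals; `qps = inf` corresponds to all timestamps being 0.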