Reduce the number of GPUs and the cost required to serve generative AI models with FriendliAI PeriFlow, the fastest generative AI serving engine on the market. PeriFlow is both powerful and versatile.


High speeds at low costs


Support for various LLMs and workloads

PeriFlow builds on our cutting-edge research and our experience running generative AI models and workloads. It applies multi-level optimization, scheduling, and batching techniques. Our batching technology is protected by patents in the United States and Korea.

Generative AI models

PeriFlow supports a wide range of generative AI models. LLMs trained for generation tasks (e.g., ChatGPT, GPT-3, PaLM, OPT, BLOOM, LLaMA) have become central to many types of applications, including chatbots, translation, summarization, code generation, and caption generation. However, serving these models is costly and burdensome.

Supported models

GPT, GPT-J, GPT-NeoX, MPT, LLaMA, Dolly, OPT, BLOOM, CodeGen, T5, FLAN, UL2, etc.

Supported decoding options

greedy, top-k, top-p, beam search, stochastic beam search, etc.
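
As a rough illustration of how three of these options differ, the sketch below applies greedy, top-k, and top-p selection to a toy next-token distribution. It is a generic textbook-style example, not PeriFlow's implementation; the function names and the logit values are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Greedy decoding: always pick the most probable token id."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_candidates(probs, k):
    """Top-k: keep only the k most probable token ids."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def top_p_candidates(probs, p):
    """Top-p (nucleus): keep the smallest set of most probable
    token ids whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        mass += probs[token_id]
        if mass >= p:
            break
    return kept


def sample(probs, candidates, rng):
    """Draw one token id from the candidates, weighted by probability."""
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy next-token scores over a 5-token vocabulary (made-up numbers).
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = softmax(logits)
rng = random.Random(0)

print("greedy:", greedy(probs))
print("top-k :", sample(probs, top_k_candidates(probs, k=2), rng))
print("top-p :", sample(probs, top_p_candidates(probs, p=0.9), rng))
```

Greedy decoding is deterministic, while top-k and top-p restrict the candidate set before sampling, trading some predictability for more varied output.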

Supported data types

fp32, fp16, bf16, int8
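
The choice of data type directly affects how much GPU memory the model weights occupy. The sketch below is a back-of-the-envelope estimate only: it counts the weights alone and ignores activations, the attention cache, and any quantization metadata. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes used to store one parameter in each supported data type.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Example: a 175B-parameter model such as GPT-3 175B.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(175e9, dtype):.0f} GB")
```

For a 175B-parameter model this gives about 700 GB in fp32, 350 GB in fp16 or bf16, and 175 GB in int8, which is why lower-precision types can reduce the number of GPUs needed just to hold the model.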

High speed at low costs

PeriFlow significantly outperforms NVIDIA Triton+FasterTransformer in both latency and throughput when serving LLMs ranging from 1.3B to 341B parameters. For example, PeriFlow delivers a 10x throughput improvement for a GPT-3 175B model at the same level of latency.

PeriFlow throughput graph

Two ways to use PeriFlow

You can use either the PeriFlow container or PeriFlow cloud.

How to Use PeriFlow

There are two steps to using PeriFlow as an end user:

1) deploy your generative AI model with PeriFlow, and
2) send inference requests from downstream applications to the model.

How to send an inference request to a PeriFlow endpoint

PeriFlow Serving

Once you deploy the model, you will be provided with an HTTP endpoint, to which you can send an inference request to produce text generation results. Here’s how to send one:

curl http://<periflow-endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Say this is a test",
    "max_tokens": 5
  }'

Below is an example of a response to the above HTTP request.

{
  "index": 0,
  "text": ", say it works!",
  "tokens": [11, 910, 340, 2499, 0]
}

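
If you prefer to call the endpoint from application code, the same request can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: the function names are hypothetical, the endpoint is a placeholder you must supply, and the parser accepts the response fields either as a top-level object or nested in a list under a `choices` key, which is an assumption rather than a documented shape.

```python
import json
import urllib.request


def build_completion_request(endpoint, prompt, max_tokens):
    """Build (but do not send) a POST request mirroring the curl command."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of a JSON response body.

    Assumption: the "text" field is either at the top level or inside
    the first element of a "choices" list; both shapes are handled.
    """
    data = json.loads(response_body)
    if isinstance(data, dict) and "choices" in data:
        data = data["choices"][0]
    return data["text"]


def complete(endpoint, prompt, max_tokens=5, timeout=30):
    """Send the request and return the generated text."""
    request = build_completion_request(endpoint, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(response.read().decode("utf-8"))
```

With a deployed model, calling `complete("<periflow-endpoint>", "Say this is a test")` would send the same payload as the curl command and return the generated text from the response.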