Reduce the number of GPUs you need and cut the cost of serving generative AI models with FriendliAI PeriFlow, the fastest generative AI serving engine on the market. PeriFlow is both powerful and versatile.
High speed at low cost
Support for various LLMs and workloads
Generative AI models
PeriFlow supports a wide range of generative AI models. LLMs trained for generation tasks (e.g., ChatGPT, GPT-3, PaLM, OPT, BLOOM, LLaMA) have proven to be of paramount importance to many types of applications, including but not limited to chatbots, translation, summarization, code generation, and caption generation. However, serving these models is costly and burdensome.
GPT, GPT-J, GPT-NeoX, MPT, LLaMA, Dolly, OPT, BLOOM, CodeGen, T5, FLAN, UL2, etc.
Supported decoding options
greedy, top-k, top-p, beam search, stochastic beam search, etc.
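To illustrate what these decoding options mean, here is a minimal plain-Python sketch of top-k and top-p (nucleus) filtering over a logit vector. This is an educational sketch of the standard techniques, not PeriFlow's implementation; the function name and parameters are our own.

```python
import math
import random

def top_k_top_p_sample(logits, k=0, p=1.0, temperature=1.0, rng=random):
    """Sample a token id from logits after top-k and/or top-p filtering.

    k=0 disables top-k; p=1.0 disables top-p. Setting k=1 reduces to
    greedy decoding. A sketch of the decoding options listed above.
    """
    # Softmax at the given temperature (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if k > 0:
        ranked = ranked[:k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if p < 1.0:
        kept, cum = [], 0.0
        for i in ranked:
            kept.append(i)
            cum += probs[i]
            if cum >= p:
                break
        ranked = kept

    # Renormalize over the surviving tokens and draw one.
    mass = sum(probs[i] for i in ranked)
    r = rng.random() * mass
    for i in ranked:
        r -= probs[i]
        if r <= 0:
            return i
    return ranked[-1]
```

Beam search and stochastic beam search extend this per-step picture by keeping several candidate sequences alive at once and scoring whole sequences rather than single tokens.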
Supported data types
fp32, fp16, bf16, int8
High speed at low cost
PeriFlow significantly outperforms NVIDIA Triton+FasterTransformer in both latency and throughput when serving LLMs ranging from 1.3B to 341B parameters. For example, it delivers a 10x throughput improvement for a GPT-3 175B model at the same level of latency.
Two ways to use PeriFlow
You can use either the PeriFlow container or PeriFlow cloud.
How to Use PeriFlow
There are two steps to using PeriFlow as an end user: 1) deploying your generative AI model with PeriFlow, and 2) sending inference requests from your downstream applications to the deployed model.
How to send an inference request to a PeriFlow endpoint
Once you deploy the model, you will be provided with an HTTP endpoint to which you can send inference requests and receive text generation results. Here’s how to send one:
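The sketch below shows one way to send such a request from Python using only the standard library. The endpoint URL is a placeholder for the one you receive after deployment, and the JSON field names (`prompt`, `max_tokens`, `temperature`, `top_p`) are illustrative assumptions, not PeriFlow's documented schema.

```python
import json
import urllib.request

# Placeholder: substitute the HTTP endpoint you receive after deployment.
ENDPOINT = "https://your-periflow-endpoint.example.com/v1/completions"

def build_payload(prompt, max_tokens=64, temperature=0.8, top_p=0.9):
    """Assemble the JSON request body.

    The field names here are hypothetical; consult your deployment's
    documentation for the actual schema.
    """
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }

def generate(prompt, **options):
    """POST an inference request and return the decoded JSON response."""
    data = json.dumps(build_payload(prompt, **options)).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINT, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `generate("Say hello to PeriFlow!", max_tokens=32)` would then return the parsed JSON result.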
Below is an example of a response to the above HTTP request.
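A response to a completion request might look something like the following. The exact fields depend on your deployment; this shape is an illustrative assumption, not PeriFlow's documented response format.

```json
{
  "choices": [
    {
      "index": 0,
      "text": "Hello! PeriFlow is a fast serving engine for generative AI models."
    }
  ]
}
```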