# FriendliChat
(*dedicated.chat*)

## Overview

### Available Operations

* [complete](#complete)
* [stream](#stream)

## complete

Given a list of messages forming a conversation, the model generates a response.

### Example Usage

```python
from friendli import Friendli
import os

s = Friendli(
    token=os.getenv("FRIENDLI_TOKEN", ""),
)

res = s.dedicated.chat.complete(model="(endpoint-id):(adapter-route)", messages=[
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": "Hello!",
    },
], max_tokens=200)

if res is not None:
    # handle response
    pass
```
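The returned `models.ChatResult` follows a familiar OpenAI-style chat-completion shape. As a hedged sketch under that assumption, the snippet below checks for truncation against a hand-built stand-in payload; the field names (`choices`, `message`, `finish_reason`) are assumptions for illustration, so consult `models.ChatResult` for the actual attributes:

```python
# Stand-in payload mimicking an assumed OpenAI-style chat result.
# Field names here are assumptions, not guaranteed by the SDK.
sample_result = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",  # "length" would signal truncation at max_tokens
        }
    ],
}

def first_reply(result: dict) -> tuple:
    """Return the first assistant message and whether it was truncated."""
    choice = result["choices"][0]
    truncated = choice["finish_reason"] == "length"
    return choice["message"]["content"], truncated

text, truncated = first_reply(sample_result)
print(text)       # Hello! How can I help?
print(truncated)  # False
```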

### Parameters

| Parameter | Type | Required | Description | Example |
| --------- | ---- | -------- | ----------- | ------- |
| `model` | str | ✔️ | ID of the target endpoint. To send a request to a specific adapter, use the `"ENDPOINT_ID:ADAPTER_ROUTE"` format. | (endpoint-id):(adapter-route) |
| `messages` | List[models.Message] | ✔️ | A list of messages comprising the conversation so far. | [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}] |
| `x_friendli_team` | Optional[str] | | ID of the team to run requests as (optional parameter). | |
| `eos_token` | List[int] | | A list of end-of-sentence (EOS) token IDs. | |
| `frequency_penalty` | OptionalNullable[float] | | Number between -2.0 and 2.0. Positive values penalize tokens that have already been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim. | |
| `logit_bias` | OptionalNullable[models.DedicatedChatCompleteBodyLogitBias] | | Accepts a JSON object that maps tokens to an associated bias value. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect varies per model. | |
| `logprobs` | OptionalNullable[bool] | | Whether to return log probabilities of the output tokens. | |
| `max_tokens` | OptionalNullable[int] | | The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus `max_tokens` should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, `max_tokens` should not exceed the model's maximum output length. This is similar to Hugging Face's `max_new_tokens` argument. | 200 |
| `min_tokens` | OptionalNullable[int] | | The minimum number of tokens to generate. Defaults to 0. This is similar to Hugging Face's `min_new_tokens` argument.<br/><br/>This field is unsupported when `tools` are specified. | |
| `n` | OptionalNullable[int] | | The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's `num_return_sequences` argument. | |
| `parallel_tool_calls` | OptionalNullable[bool] | | Whether to enable parallel function calling. | |
| `presence_penalty` | OptionalNullable[float] | | Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text. | |
| `repetition_penalty` | OptionalNullable[float] | | Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0 (1.0 means no penalty). See Keskar et al., 2019 for more details. This is similar to Hugging Face's `repetition_penalty` argument. | |
| `response_format` | OptionalNullable[models.ResponseFormat] | | The enforced format of the model's output.<br/><br/>Note that the content of the output message may be truncated if it exceeds `max_tokens`. You can check this by verifying that the `finish_reason` of the output message is `length`.<br/><br/>**Important**: you must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., "You are an API generating a valid JSON as output."). Otherwise, the model may produce an unending stream of whitespace or other characters. | |
| `seed` | List[int] | | Seed to control the random procedure. If no seed is given, a random seed is used for sampling, and the seed is returned along with the generated result. When using the `n` argument, you can pass a list of seed values to control all of the independent generations. | |
| `stop` | List[str] | | When one of the stop phrases appears in the generation result, the API stops generation. The stop phrases are excluded from the result. Defaults to an empty list. | |
| `stream` | OptionalNullable[bool] | | Whether to stream the generation result. When set to true, each token is sent as a server-sent event as soon as it is generated. | |
| `stream_options` | OptionalNullable[models.DedicatedChatCompleteBodyStreamOptions] | | Options related to streaming. Can only be used when `stream: true`. | |
| `temperature` | OptionalNullable[float] | | Sampling temperature. A smaller temperature makes the generation result closer to greedy, argmax (i.e., `top_k = 1`) sampling. Defaults to 1.0. This is similar to Hugging Face's `temperature` argument. | |
| `timeout_microseconds` | OptionalNullable[int] | | Request timeout in microseconds. On timeout, the API responds with the HTTP 429 Too Many Requests status code. The default behavior is no timeout. | |
| `tool_choice` | Optional[models.DedicatedChatCompleteBodyToolChoice] | | Determines the tool calling behavior of the model.<br/>When set to `none`, the model bypasses tool execution and generates a response directly.<br/>In `auto` mode (the default), the model dynamically decides whether to call a tool or respond with a message.<br/>Setting `required` ensures that the model invokes at least one tool before responding to the user.<br/>You can also specify a particular tool with `{"type": "function", "function": {"name": "my_function"}}`. | |
| `tools` | List[models.Tool] | | A list of tools the model may call. Currently, only functions are supported as a tool. A maximum of 128 functions is supported. Use this to provide a list of functions the model may generate JSON inputs for.<br/><br/>When `tools` are specified, the `min_tokens` field is unsupported. | |
| `top_k` | OptionalNullable[int] | | The number of highest-probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means the API does not apply top-k filtering. This is similar to Hugging Face's `top_k` argument. | |
| `top_logprobs` | OptionalNullable[int] | | The number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to true if this parameter is used. | |
| `top_p` | OptionalNullable[float] | | Tokens comprising the top `top_p` probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's `top_p` argument. | |
| `retries` | Optional[utils.RetryConfig] | | Configuration to override the default retry behavior of the client. | |
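To build intuition for how `top_k` and `top_p` restrict the sampling pool, here is a small self-contained sketch in plain Python (not SDK code; the server-side implementation may differ in detail):

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep only the k highest-probability tokens (k=0 disables filtering)."""
    if k == 0:
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(top_k_filter(probs, 2))    # {'the': 0.5, 'a': 0.3}
print(top_p_filter(probs, 0.8))  # {'the': 0.5, 'a': 0.3}
```

With `top_k=0` and `top_p=1.0` (the defaults), no tokens are filtered out before sampling.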

### Response

**models.ChatResult**

### Errors

| Error Type | Status Code | Content Type |
| ---------- | ----------- | ------------ |
| models.SDKError | 4XX, 5XX | `*/*` |

## stream

Given a list of messages forming a conversation, the model generates a response.

### Example Usage

```python
from friendli import Friendli
import os

s = Friendli(
    token=os.getenv("FRIENDLI_TOKEN", ""),
)

res = s.dedicated.chat.stream(model="(endpoint-id):(adapter-route)", messages=[
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": "Hello!",
    },
], max_tokens=200)

if res is not None:
    for event in res:
        # handle event
        print(event, flush=True)
```
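Each streamed event typically carries an incremental piece of the reply. As a hedged sketch, assuming each event exposes a text delta (the `delta`/`content` attribute names below are assumptions; check `models.StreamedChatResult` for the real shape), accumulating deltas into the full message looks like this:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal stand-ins for streamed events; the real models.StreamedChatResult
# may expose different attribute names (delta/content are assumptions here).
@dataclass
class Delta:
    content: Optional[str]

@dataclass
class Event:
    delta: Delta

def accumulate(events) -> str:
    """Join the incremental text deltas into the complete assistant reply."""
    parts = []
    for event in events:
        if event.delta.content is not None:
            parts.append(event.delta.content)
    return "".join(parts)

events = [Event(Delta("Hel")), Event(Delta("lo!")), Event(Delta(None))]
print(accumulate(events))  # Hello!
```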

### Parameters

| Parameter | Type | Required | Description | Example |
| --------- | ---- | -------- | ----------- | ------- |
| `model` | str | ✔️ | ID of the target endpoint. To send a request to a specific adapter, use the `"ENDPOINT_ID:ADAPTER_ROUTE"` format. | (endpoint-id):(adapter-route) |
| `messages` | List[models.Message] | ✔️ | A list of messages comprising the conversation so far. | [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}] |
| `x_friendli_team` | Optional[str] | | ID of the team to run requests as (optional parameter). | |
| `eos_token` | List[int] | | A list of end-of-sentence (EOS) token IDs. | |
| `frequency_penalty` | OptionalNullable[float] | | Number between -2.0 and 2.0. Positive values penalize tokens that have already been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim. | |
| `logit_bias` | OptionalNullable[models.DedicatedChatStreamBodyLogitBias] | | Accepts a JSON object that maps tokens to an associated bias value. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect varies per model. | |
| `logprobs` | OptionalNullable[bool] | | Whether to return log probabilities of the output tokens. | |
| `max_tokens` | OptionalNullable[int] | | The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus `max_tokens` should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, `max_tokens` should not exceed the model's maximum output length. This is similar to Hugging Face's `max_new_tokens` argument. | 200 |
| `min_tokens` | OptionalNullable[int] | | The minimum number of tokens to generate. Defaults to 0. This is similar to Hugging Face's `min_new_tokens` argument.<br/><br/>This field is unsupported when `tools` are specified. | |
| `n` | OptionalNullable[int] | | The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's `num_return_sequences` argument. | |
| `parallel_tool_calls` | OptionalNullable[bool] | | Whether to enable parallel function calling. | |
| `presence_penalty` | OptionalNullable[float] | | Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text. | |
| `repetition_penalty` | OptionalNullable[float] | | Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0 (1.0 means no penalty). See Keskar et al., 2019 for more details. This is similar to Hugging Face's `repetition_penalty` argument. | |
| `response_format` | OptionalNullable[models.ResponseFormat] | | The enforced format of the model's output.<br/><br/>Note that the content of the output message may be truncated if it exceeds `max_tokens`. You can check this by verifying that the `finish_reason` of the output message is `length`.<br/><br/>**Important**: you must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., "You are an API generating a valid JSON as output."). Otherwise, the model may produce an unending stream of whitespace or other characters. | |
| `seed` | List[int] | | Seed to control the random procedure. If no seed is given, a random seed is used for sampling, and the seed is returned along with the generated result. When using the `n` argument, you can pass a list of seed values to control all of the independent generations. | |
| `stop` | List[str] | | When one of the stop phrases appears in the generation result, the API stops generation. The stop phrases are excluded from the result. Defaults to an empty list. | |
| `stream` | OptionalNullable[bool] | | Whether to stream the generation result. When set to true, each token is sent as a server-sent event as soon as it is generated. | |
| `stream_options` | OptionalNullable[models.DedicatedChatStreamBodyStreamOptions] | | Options related to streaming. Can only be used when `stream: true`. | |
| `temperature` | OptionalNullable[float] | | Sampling temperature. A smaller temperature makes the generation result closer to greedy, argmax (i.e., `top_k = 1`) sampling. Defaults to 1.0. This is similar to Hugging Face's `temperature` argument. | |
| `timeout_microseconds` | OptionalNullable[int] | | Request timeout in microseconds. On timeout, the API responds with the HTTP 429 Too Many Requests status code. The default behavior is no timeout. | |
| `tool_choice` | Optional[models.DedicatedChatStreamBodyToolChoice] | | Determines the tool calling behavior of the model.<br/>When set to `none`, the model bypasses tool execution and generates a response directly.<br/>In `auto` mode (the default), the model dynamically decides whether to call a tool or respond with a message.<br/>Setting `required` ensures that the model invokes at least one tool before responding to the user.<br/>You can also specify a particular tool with `{"type": "function", "function": {"name": "my_function"}}`. | |
| `tools` | List[models.Tool] | | A list of tools the model may call. Currently, only functions are supported as a tool. A maximum of 128 functions is supported. Use this to provide a list of functions the model may generate JSON inputs for.<br/><br/>When `tools` are specified, the `min_tokens` field is unsupported. | |
| `top_k` | OptionalNullable[int] | | The number of highest-probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means the API does not apply top-k filtering. This is similar to Hugging Face's `top_k` argument. | |
| `top_logprobs` | OptionalNullable[int] | | The number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to true if this parameter is used. | |
| `top_p` | OptionalNullable[float] | | Tokens comprising the top `top_p` probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's `top_p` argument. | |
| `retries` | Optional[utils.RetryConfig] | | Configuration to override the default retry behavior of the client. | |
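The `tool_choice` values described above are plain JSON shapes. A quick sketch of the four accepted forms (the function name `get_weather` is hypothetical, used only for illustration):

```python
import json

# The four accepted tool_choice forms; "get_weather" is a hypothetical name.
tool_choices = [
    "none",      # bypass tools and answer directly
    "auto",      # default: the model decides whether to call a tool
    "required",  # the model must invoke at least one tool
    {"type": "function", "function": {"name": "get_weather"}},  # force one tool
]

for choice in tool_choices:
    print(json.dumps(choice))
```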

### Response

**Union[Generator[models.StreamedChatResult, None, None], AsyncGenerator[models.StreamedChatResult, None]]**

### Errors

| Error Type | Status Code | Content Type |
| ---------- | ----------- | ------------ |
| models.SDKError | 4XX, 5XX | `*/*` |