LLM transform plugin
Leverage the power of a large language model (LLM) to process data by sending it to the LLM and receiving the generated results. Utilize the LLM's capabilities to label, clean, enrich data, perform data inference, and more.
name | type | required | default value |
---|---|---|---|
model_provider | enum | yes | |
output_data_type | enum | no | String |
output_column_name | string | no | llm_output |
prompt | string | yes | |
inference_columns | list | no | |
model | string | yes | |
api_key | string | yes | |
api_path | string | no | |
custom_config | map | no | |
custom_response_parse | string | no | |
custom_request_headers | map | no | |
custom_request_body | map | no |
The model provider to use. The available options are: OPENAI, DOUBAO, KIMIAI, MICROSOFT, CUSTOM
tips: If you use Microsoft, please make sure api_path cannot be empty
The data type of the output data. The available options are: STRING,INT,BIGINT,DOUBLE,BOOLEAN. Default value is STRING.
Custom output data field name. A custom field name that is the same as an existing field name is replaced with 'llm_output'.
The prompt to send to the LLM. This parameter defines how LLM will process and return data, eg:
The data read from source is a table like this:
name | age |
---|---|
Jia Fan | 20 |
Hailin Wang | 20 |
Eric | 20 |
Guangdong Liu | 20 |
The prompt can be:
Determine whether someone is Chinese or American by their name
The result will be:
name | age | llm_output |
---|---|---|
Jia Fan | 20 | Chinese |
Hailin Wang | 20 | Chinese |
Eric | 20 | American |
Guangdong Liu | 20 | Chinese |
The inference_columns
option allows you to specify which columns from the input data should be used as inputs for the LLM. By default, all columns will be used as inputs.
For example:
transform {
LLM {
model_provider = OPENAI
model = gpt-4o-mini
api_key = sk-xxx
inference_columns = ["name", "age"]
prompt = "Determine whether someone is Chinese or American by their name"
}
}
The model to use. Different model providers have different models. For example, the OpenAI model can be gpt-4o-mini
.
If you use OpenAI model, please refer https://platform.openai.com/docs/models/model-endpoint-compatibility
of /v1/chat/completions
endpoint.
The API key to use for the model provider. If you use OpenAI model, please refer https://platform.openai.com/docs/api-reference/api-keys of how to get the API key.
The API path to use for the model provider. In most cases, you do not need to change this configuration. If you are using an API agent's service, you may need to configure it to the agent's API address.
The custom_config
option allows you to provide additional custom configurations for the model. This is a map where you
can define various settings that might be required by the specific model you're using.
The custom_response_parse
option allows you to specify how to parse the model's response. You can use JsonPath to
extract the specific data you need from the response. For example, by using $.choices[*].message.content
, you can
extract the content
field values from the following JSON. For more details on using JsonPath, please refer to
the JsonPath Getting Started guide.
{
"id": "chatcmpl-9s4hoBNGV0d9Mudkhvgzg64DAWPnx",
"object": "chat.completion",
"created": 1722674828,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "[\"Chinese\"]"
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 107,
"completion_tokens": 3,
"total_tokens": 110
},
"system_fingerprint": "fp_0f03d4f0ee",
"code": 0,
"msg": "ok"
}
The custom_request_headers
option allows you to define custom headers that should be included in the request sent to
the model's API. This is useful if the API requires additional headers beyond the standard ones, such as authorization
tokens, content types, etc.
The custom_request_body
option supports placeholders:
${model}
: Placeholder for the model name.${input}
: Placeholder to determine input value and define request body request type based on the type of body value. Example:"${input}"
-> "input"${prompt}
:Placeholder for LLM model prompts.
Transform plugin common parameters, please refer to Transform Plugin for details
The API interface usually has a rate limit, which can be configured with Seatunnel's speed limit to ensure smooth operation of the task. For details about Seatunnel speed limit Settings, please refer to speed-limit for details.
Determine the user's country through a LLM.
env {
parallelism = 1
job.mode = "BATCH"
read_limit.rows_per_second = 10
}
source {
FakeSource {
row.num = 5
schema = {
fields {
id = "int"
name = "string"
}
}
rows = [
{fields = [1, "Jia Fan"], kind = INSERT}
{fields = [2, "Hailin Wang"], kind = INSERT}
{fields = [3, "Tomas"], kind = INSERT}
{fields = [4, "Eric"], kind = INSERT}
{fields = [5, "Guangdong Liu"], kind = INSERT}
]
}
}
transform {
LLM {
model_provider = OPENAI
model = gpt-4o-mini
api_key = sk-xxx
prompt = "Determine whether someone is Chinese or American by their name"
}
}
sink {
console {
}
}
Determine whether a person is a historical emperor of China.
env {
parallelism = 1
job.mode = "BATCH"
read_limit.rows_per_second = 10
}
source {
FakeSource {
row.num = 5
schema = {
fields {
id = "int"
name = "string"
}
}
rows = [
{fields = [1, "Zhuge Liang"], kind = INSERT}
{fields = [2, "Li Shimin"], kind = INSERT}
{fields = [3, "Sun Wukong"], kind = INSERT}
{fields = [4, "Zhu Yuanzhuang"], kind = INSERT}
{fields = [5, "George Washington"], kind = INSERT}
]
}
}
transform {
LLM {
model_provider = KIMIAI
model = moonshot-v1-8k
api_key = sk-xxx
prompt = "Determine whether a person is a historical emperor of China"
output_data_type = boolean
}
}
sink {
console {
}
}
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
id = "int"
name = "string"
}
}
rows = [
{fields = [1, "Jia Fan"], kind = INSERT}
{fields = [2, "Hailin Wang"], kind = INSERT}
{fields = [3, "Tomas"], kind = INSERT}
{fields = [4, "Eric"], kind = INSERT}
{fields = [5, "Guangdong Liu"], kind = INSERT}
]
result_table_name = "fake"
}
}
transform {
LLM {
source_table_name = "fake"
model_provider = CUSTOM
model = gpt-4o-mini
api_key = sk-xxx
prompt = "Determine whether someone is Chinese or American by their name"
openai.api_path = "http://mockserver:1080/v1/chat/completions"
custom_config={
custom_response_parse = "$.choices[*].message.content"
custom_request_headers = {
Content-Type = "application/json"
Authorization = "Bearer xxxxxxxx"
}
custom_request_body ={
model = "${model}"
messages = [
{
role = "system"
content = "${prompt}"
},
{
role = "user"
content = "${input}"
}]
}
}
result_table_name = "llm_output"
}
}
sink {
Assert {
source_table_name = "llm_output"
rules =
{
field_rules = [
{
field_name = llm_output
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}