LLM Recipe

The Large Language Model (LLM) Recipe is an excellent tool for enriching your datasets. Some examples include:

  • Sentiment Analysis: Determining sentiment in customer feedback, social media posts, or reviews, making it helpful in understanding customer satisfaction and brand perception.
  • Entity Extraction: Identifying specific entities (like names, places, products) in text data for categorization, tagging, or extracting structured information.
  • Classification and Tagging: Automatically categorizing rows, such as tagging emails by intent or support tickets by issue type, can streamline prioritization and triage.
  • Summarization: Generating concise summaries for long texts in rows (like news articles or product descriptions), enabling faster review and analysis.
  • Keyword or Theme Extraction: Identifying key topics or themes in text to surface recurring patterns in survey responses, product reviews, or support cases.
  • Translation: Translating text fields row-by-row in multilingual datasets, useful for customer feedback, chat logs, and reviews in multiple languages.
  • Scoring and Relevance Ranking: Assessing relevance or assigning scores to data based on specific criteria (e.g., ranking news relevance).
Note

This recipe is available to any customer on 2024.7.2 releases and later.

Enabling the LLM Recipe

Prerequisites

  • Incorta Premium is Enabled
  • Data Studio is Enabled
  • An LLM model or set of LLM models in a JSON array (contact an Incorta Account Executive for a default model)

Creating a JSON Array

Using the template below, create a JSON array.

[
  {
    "id": "1",
    "name": "MODEL_DISPLAY_NAME",
    "provider": "PROVIDER",
    "providerModelID": "",
    "baseUrl": "BASE_URL",
    "apiToken": "YOUR_API_TOKEN_HERE",
    "modelName": "YOUR_MODEL_NAME_HERE",
    "maxRows": "1000",
    "extraConfigs": {
      "costPerKTokens": "0.00075",
      "costUnit": "credits"
    }
  }
]
The parameters in the template are defined as follows:

  • id: A unique identifier for the model configuration.
  • name: The name of the LLM model.
  • provider: The name of the provider offering the LLM model. The currently supported providers are OpenAI, Azure OpenAI, and Vectara.
  • providerModelID: The unique identifier for the model as assigned by the provider. This ID can be used to reference or access the model from the provider's platform. This is only needed if the provider is aiXplain.
  • baseUrl: The base URL for accessing the model's API. This is only needed if the provider is OpenAI.
  • apiToken: The API token required to authenticate and access the model's API.
  • modelName: The specific name or identifier for the model as recognized within the provider's platform. This is only required if the provider is OpenAI.
  • maxRows: The maximum number of rows or records that the model can process in a single request or operation.
  • extraConfigs: An object containing additional configuration options for the model.
    ●  costPerKTokens: The cost of processing 1,000 tokens using this model.
    ●  costUnit: The unit of currency or credits used to measure the cost.
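
For illustration only, a filled-in array for a single OpenAI model might look like the following. The display name, token, and model name shown here are placeholders, and the exact provider string, base URL, and cost values expected in your environment should be confirmed with your Incorta account team:

[
  {
    "id": "1",
    "name": "Example GPT-4o mini",
    "provider": "OpenAI",
    "providerModelID": "",
    "baseUrl": "https://api.openai.com/v1",
    "apiToken": "REPLACE_WITH_YOUR_API_TOKEN",
    "modelName": "gpt-4o-mini",
    "maxRows": "1000",
    "extraConfigs": {
      "costPerKTokens": "0.00075",
      "costUnit": "credits"
    }
  }
]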

LLM Recipe Setup

The Data Studio does not display the LLM recipe by default; it appears only after a JSON array of models has been added in the CMC. To enable the LLM recipe, complete the following steps:

  • Log in to the CMC.
  • Navigate to Clusters > Cluster Configurations > Server Configurations > Incorta Data Studio.
  • Add the JSON array containing all the LLM models that should be available in the LLM recipe.
  • Restart the Analytics Service.
Note

Depending on the provider being used, you may also need to install the provider's client library. You can do so from the cloud portal under Configurations > Python Packages. For the OpenAI and Azure OpenAI providers, you need the Python package openai==1.14.x. For on-premises installations, make sure this library is installed on the machine(s) on which the Analytics, Loader, and Spark services are running.

Configuration

When you open the configuration interface for this recipe, it is recommended to enter a recipe name and a data input and then proceed to Run Sampled Trials. The sampled trials provide an experimental environment that, in turn, informs which configurations to set.

  • Recipe Name: A freeform name for the recipe.
  • Data Input: Select a previously constructed recipe to process.
  • Prompt: Describe what the LLM should do. Reference input fields using ‘@’ followed by the column name, for example, "Summarize @product_description in one sentence."
  • Output Mode:
    ●  In Standard Mode, the LLM provides immediate responses based on the input prompt. This mode is designed for quick, direct answers.
    ●  Thinking Mode allows you to instruct the LLM to consider the query more deeply before providing an answer. In this mode, the LLM thinks through the query step by step and presents the final answer(s) in a JSON object with keys named after the target output columns. The LLM recipe automatically parses the output and extracts the output column names, their values, and the rationale into separate columns.
  • Sample Size: The number of sampled rows used as input to the LLM model in the Data Studio. Keep the sampled value low to reduce costs when validating the LLM recipe.
  • LLM: Select the model that should return results to your datasets. Different models come with different performance, accuracy, and expense based on the value of the training data and underlying algorithms. Use Run Sampled Trials to test which model is best for your data.
  • Maximum Tokens: The upper limit on the total number of tokens the model can generate in a single response. This is crucial for controlling response length and ensuring model efficiency, as tokens correspond to chunks of text, such as whole words, word fragments, or punctuation marks.
  • Temperature: A parameter between 0 and 1 that controls the randomness of the model's predictions. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. For lower-creativity tasks, like categorization, select a lower temperature.
  • Maximum Calls per Minute: Limits the number of rows processed by the LLM per minute.
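
As a rough illustration of the Thinking Mode behavior described above, suppose the target output columns are named sentiment and priority (hypothetical names used only for this example). The final answer portion of the model's response would then be a JSON object keyed by those column names, similar to the sketch below, which the recipe parses into separate output columns along with the model's rationale:

{
  "sentiment": "Negative",
  "priority": "High"
}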

Run Sampled Trials

Run Sampled Trials provides an experimental environment for measuring the accuracy of model responses against values in your dataset. You can add multiple tabs at the top of the trial environment; these tabs can be used to compare experimental configurations. Once you find your optimal model, select Use Prompt. The sampled trials view closes, and the sampled trial configurations are automatically applied to the configuration dialog.

  • Prompt: Describe what the LLM should do. Reference input fields using ‘@’ followed by the column name.
  • Referenced Column: If your prompt mentions a column, that column is made available so you can choose a single value from it to test your prompt against.
  • Output Mode:
    ●  In Standard Mode, the LLM provides immediate responses based on the input prompt. This mode is designed for quick, direct answers.
    ●  Thinking Mode allows you to instruct the LLM to consider the query more deeply before providing an answer. In this mode, the LLM thinks through the query step by step and presents the final answer(s) in a JSON object with keys named after the target output columns. The LLM recipe automatically parses the output and extracts the output column names, their values, and the rationale into separate columns.
  • Model: Select the model that should return results to your datasets. Different models come with different performance, accuracy, and expense based on the value of the training data and underlying algorithms. Use the sampled trials to test which model is best for your data.
  • Temperature: A parameter between 0 and 1 that controls the randomness of the model's predictions. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. For lower-creativity tasks, like categorization, select a lower temperature.
  • Max Tokens: The upper limit on the total number of tokens the model can generate in a single response. This is crucial for controlling response length and ensuring model efficiency, as tokens correspond to chunks of text, such as whole words, word fragments, or punctuation marks.

Results Panel

Once a prompt is Run, the following information is returned:

  • Output: The value returned, taking into consideration the prompt and any referenced columns.
  • Elapsed Time: The time elapsed for the LLM to receive the prompt and return an output.
  • Tokens Used: Token usage refers to the cost associated with processing text data via the model. Each model execution consumes tokens based on the input size and the operation's complexity.
  • Cost: The cost, in the currency or credit unit defined in the JSON array, calculated from the tokens used. This is the cost of a single value and can be loosely extrapolated across the number of rows in the dataset to estimate the cost of executing the LLM on the full dataset.
Note

Cost reporting is relevant for customers who bring their own model or key. Customers deployed on a private instance have fixed pricing and are therefore less affected by cost measurement.
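
As a rough, hypothetical example of extrapolating cost: with the costPerKTokens value of 0.00075 credits from the template above, a prompt that consumes about 200 tokens per row would cost roughly 200 / 1,000 × 0.00075 = 0.00015 credits per row, or about 1.5 credits across a 10,000-row dataset. Actual token usage varies with prompt and output length, so treat such estimates as approximations.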

Deploying the LLM

When deploying a large language model (LLM), it is crucial to carefully consider your loading strategy. Since LLM usage can incur costs that depend on the deployment type, the frequency of scheduled executions and the choice between full and incremental loads can significantly affect the expense of managing the model. It is advisable to share the model cautiously so that it is not run on demand without an understanding of the associated costs.