LLMDatasetConversation#
- class council.prompt.LLMDatasetConversation(messages: Sequence[LLMMessage], labels: Mapping[str, str] | None)[source]#
Bases: object
Represents a conversation between user and assistant with optional labels.
LLMDataset#
- class council.prompt.LLMDatasetObject(kind: str, version: str, metadata: DataObjectMetadata, spec: T)[source]#
Bases: DataObject[LLMDatasetSpec]
Helper class to instantiate an LLMDataset from a YAML file.
LLMDataset represents a dataset to be used for fine-tuning, the batch API, or managing few-shot examples. It contains a list of conversations between user and assistant and an optional shared system prompt; if specified, it is used as the system prompt for every conversation in the dataset.
- property system_prompt: str | None#
Return system prompt if any.
- property conversations: List[LLMDatasetConversation]#
Return all raw conversations in the dataset.
- count_labels() DefaultDict[str, Counter] [source]#
Count occurrences of each label value grouped by label key. Returns a dictionary where keys are label names and values are Counters of label values.
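The grouping described above can be sketched with plain `collections` types (a standalone illustration of the behavior, not the library implementation):

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, List


def count_labels(conversation_labels: List[Dict[str, str]]) -> DefaultDict[str, Counter]:
    """Count occurrences of each label value, grouped by label key."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for labels in conversation_labels:
        for key, value in labels.items():
            counts[key][value] += 1
    return counts


# Labels mirroring the fine-tuning example dataset shown below
labels = [
    {"number": "dataset-one", "kind": "simple"},
    {"number": "dataset-one", "kind": "multiple"},
    {"number": "dataset-one", "kind": "simple"},
    {"number": "dataset-two", "kind": "simple", "outlier-label": "outlier"},
]
counts = count_labels(labels)
print(counts["number"])  # Counter({'dataset-one': 3, 'dataset-two': 1})
print(counts["kind"])    # Counter({'simple': 3, 'multiple': 1})
```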
- to_jsonl_messages() List[Dict[str, List[Dict[str, str]]]] [source]#
Convert the dataset to JSONL format with OpenAI messages structure. Returns a list of dictionaries containing messages.
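The conversion can be sketched as follows (a simplified standalone version; the actual method operates on LLMMessage objects and prepends the dataset's shared system prompt, if any):

```python
from typing import Dict, List, Optional


def to_jsonl_messages(
    conversations: List[List[Dict[str, str]]],
    system_prompt: Optional[str] = None,
) -> List[Dict[str, List[Dict[str, str]]]]:
    """Wrap each conversation in an OpenAI-style {"messages": [...]} record."""
    result = []
    for conversation in conversations:
        messages = []
        if system_prompt is not None:
            # Shared system prompt becomes the first message of every record
            messages.append({"role": "system", "content": system_prompt})
        messages.extend(conversation)
        result.append({"messages": messages})
    return result


records = to_jsonl_messages(
    [[{"role": "user", "content": "I fell off my bike today."},
      {"role": "assistant", "content": "It's great that you're getting exercise outdoors!"}]],
    system_prompt="You are a happy assistant.",
)
print(records[0]["messages"][0])  # {'role': 'system', 'content': 'You are a happy assistant.'}
```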
- save_jsonl_messages(path: str, random_seed: int | None = None, val_split: float | None = None) None [source]#
Save the dataset as JSONL messages file(s), optionally splitting into training and validation sets. The resulting JSONL file(s) can then be used for fine-tuning. See https://platform.openai.com/docs/guides/fine-tuning.
- Parameters:
path – Base path for saving the file(s)
random_seed – If provided, will be used to shuffle dataset before saving (default: None)
val_split – If provided, fraction of data to use for validation and create separate files for train and val. If None, saves all data to a single file (default: None)
Examples
# Save all data into a single my_dataset.jsonl file
dataset.save_jsonl_messages("my_dataset.jsonl")  # creates my_dataset.jsonl

# Split into train/val sets (80/20 split), saving into my_dataset_train.jsonl and my_dataset_val.jsonl
dataset.save_jsonl_messages("my_dataset.jsonl", random_seed=42, val_split=0.2)
- save_jsonl_requests(path: str, model: str, url: str = '/v1/chat/completions') None [source]#
Save the dataset as a JSONL requests file, which can be used for the batch API. See https://platform.openai.com/docs/guides/batch.
- Parameters:
path – Path to the output file
model – OpenAI model name
url – OpenAI API URL (default: “/v1/chat/completions”)
Examples
dataset.save_jsonl_requests("my_batch.jsonl", "gpt-4o-mini")
Fine-tuning#
Here’s an example of an LLMDataset YAML file for fine-tuning:
kind: LLMDataset
version: 0.1
metadata:
  name: "ToyFinetuningDataset"
  description: "Toy fine-tuning dataset"
  labels:
    kind: fine-tuning
spec:
  system_prompt: |
    You are a happy assistant that puts a positive spin on everything.
  conversations:
    - messages:
        - role: user
          content: |
            I fell off my bike today.
        - role: assistant
          content: |
            It's great that you're getting exercise outdoors!
      labels:
        number: dataset-one
        kind: simple
    - messages:
        - role: user
          content: |
            I lost my tennis match today.
        - role: assistant
          content: |
            It's ok, it happens to everyone.
        - role: user
          content: |
            But I trained so hard!
        - role: assistant
          content: |
            It will pay off next time.
        - role: user
          content: |
            I'm going to switch to golf.
        - role: assistant
          content: |
            Golf is fun too!
        - role: user
          content: |
            I don't even know how to play golf.
        - role: assistant
          content: |
            It's easy to learn!
      labels:
        number: dataset-one
        kind: multiple
    - messages:
        - role: user
          content: |
            I lost my book today.
        - role: assistant
          content: |
            You can read everything on ebooks these days!
      labels:
        number: dataset-one
        kind: simple
    - messages:
        - role: user
          content: |
            I'm hungry.
        - role: assistant
          content: |
            Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!
      labels:
        number: dataset-two
        kind: simple
        outlier-label: outlier
You can load this dataset from a YAML file and save it as a JSONL file to use with the OpenAI fine-tuning API:
import os
import tempfile

from council.prompt import LLMDatasetObject

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = os.path.join(tmp_dir, "dataset.jsonl")

    dataset = LLMDatasetObject.from_yaml("data/datasets/llm-dataset-fine-tuning.yaml")
    dataset.save_jsonl_messages(tmp_file)

    lines = LLMDatasetObject.read_jsonl(tmp_file)
    for line in lines:
        print(line)
This will produce the following lines:
{'messages': [{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}, {'role': 'user', 'content': 'I fell off my bike today.'}, {'role': 'assistant', 'content': "It's great that you're getting exercise outdoors!"}]}
{'messages': [{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}, {'role': 'user', 'content': 'I lost my tennis match today.'}, {'role': 'assistant', 'content': "It's ok, it happens to everyone."}, {'role': 'user', 'content': 'But I trained so hard!'}, {'role': 'assistant', 'content': 'It will pay off next time.'}, {'role': 'user', 'content': "I'm going to switch to golf."}, {'role': 'assistant', 'content': 'Golf is fun too!'}, {'role': 'user', 'content': "I don't even know how to play golf."}, {'role': 'assistant', 'content': "It's easy to learn!"}]}
{'messages': [{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}, {'role': 'user', 'content': 'I lost my book today.'}, {'role': 'assistant', 'content': 'You can read everything on ebooks these days!'}]}
{'messages': [{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}, {'role': 'user', 'content': "I'm hungry."}, {'role': 'assistant', 'content': 'Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!Eat a banana!'}]}
Few-shot examples#
You can use the same dataset to manage few-shot examples and format them by calling dataset.format_examples().
from council.prompt import LLMDatasetObject

dataset = LLMDatasetObject.from_yaml("data/datasets/llm-dataset-fine-tuning.yaml")
examples = dataset.format_examples(
    start_prefix="### Example {i} ###",
    end_prefix="### End Example {i} ###",
)

print(examples[0])
### Example 1 ###
user: I fell off my bike today.
assistant: It's great that you're getting exercise outdoors!
### End Example 1 ###
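The formatting above can be reproduced with a small standalone helper (an illustration of the output shape, not the library implementation; it assumes `{i}` is replaced with the 1-based example index):

```python
from typing import Dict, List


def format_examples(
    conversations: List[List[Dict[str, str]]],
    start_prefix: str,
    end_prefix: str,
) -> List[str]:
    """Render each conversation as 'role: content' lines between prefix markers."""
    examples = []
    for i, conversation in enumerate(conversations, start=1):
        lines = [start_prefix.format(i=i)]
        lines.extend(f"{m['role']}: {m['content']}" for m in conversation)
        lines.append(end_prefix.format(i=i))
        examples.append("\n".join(lines))
    return examples


examples = format_examples(
    [[{"role": "user", "content": "I fell off my bike today."},
      {"role": "assistant", "content": "It's great that you're getting exercise outdoors!"}]],
    start_prefix="### Example {i} ###",
    end_prefix="### End Example {i} ###",
)
print(examples[0])
```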
Batch API#
Here’s an example of an LLMDataset YAML file for the batch API:
kind: LLMDataset
version: 0.1
metadata:
  name: "ToyBatchDataset"
  description: "Dataset for sentiment prediction (positive, neutral, negative)"
  labels:
    kind: batch
spec:
  system_prompt: |
    Classify the sentiment of user inputs into one of three categories:
    positive, neutral, or negative.
    Respond with just the sentiment label.
  conversations:
    - messages:
        - role: user
          content: |
            I had a wonderful day at the park with my family.
    - messages:
        - role: user
          content: |
            The weather was okay, not too bad, not too great.
    - messages:
        - role: user
          content: |
            My car broke down on the way to work, and it ruined my entire day.
    - messages:
        - role: user
          content: |
            I received a promotion at work today, and I'm feeling ecstatic!
    - messages:
        - role: user
          content: |
            The movie was average; it wasn't what I expected.
    - messages:
        - role: user
          content: |
            I missed my flight and had to reschedule everything, which was frustrating.
You can load a dataset from a YAML file and save it as a JSONL file to use with the OpenAI batch API:
import os
import tempfile

from council.prompt import LLMDatasetObject

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = os.path.join(tmp_dir, "batch.jsonl")

    dataset = LLMDatasetObject.from_yaml("data/datasets/llm-dataset-batch.yaml")
    dataset.save_jsonl_requests(tmp_file, model="gpt-4o-mini")

    lines = LLMDatasetObject.read_jsonl(tmp_file)
    for line in lines:
        print(line)
This will produce the following lines:
{'custom_id': 'request-0', 'method': 'POST', 'url': '/v1/chat/completions', 'body': {'model': 'gpt-4o-mini', 'messages': [{'role': 'system', 'content': 'Classify the sentiment of user inputs into one of three categories: \npositive, neutral, or negative. \nRespond with just the sentiment label.'}, {'role': 'user', 'content': 'I had a wonderful day at the park with my family.'}]}}
{'custom_id': 'request-1', 'method': 'POST', 'url': '/v1/chat/completions', 'body': {'model': 'gpt-4o-mini', 'messages': [{'role': 'system', 'content': 'Classify the sentiment of user inputs into one of three categories: \npositive, neutral, or negative. \nRespond with just the sentiment label.'}, {'role': 'user', 'content': 'The weather was okay, not too bad, not too great.'}]}}
{'custom_id': 'request-2', 'method': 'POST', 'url': '/v1/chat/completions', 'body': {'model': 'gpt-4o-mini', 'messages': [{'role': 'system', 'content': 'Classify the sentiment of user inputs into one of three categories: \npositive, neutral, or negative. \nRespond with just the sentiment label.'}, {'role': 'user', 'content': 'My car broke down on the way to work, and it ruined my entire day.'}]}}
{'custom_id': 'request-3', 'method': 'POST', 'url': '/v1/chat/completions', 'body': {'model': 'gpt-4o-mini', 'messages': [{'role': 'system', 'content': 'Classify the sentiment of user inputs into one of three categories: \npositive, neutral, or negative. \nRespond with just the sentiment label.'}, {'role': 'user', 'content': "I received a promotion at work today, and I'm feeling ecstatic!"}]}}
{'custom_id': 'request-4', 'method': 'POST', 'url': '/v1/chat/completions', 'body': {'model': 'gpt-4o-mini', 'messages': [{'role': 'system', 'content': 'Classify the sentiment of user inputs into one of three categories: \npositive, neutral, or negative. \nRespond with just the sentiment label.'}, {'role': 'user', 'content': "The movie was average; it wasn't what I expected."}]}}
{'custom_id': 'request-5', 'method': 'POST', 'url': '/v1/chat/completions', 'body': {'model': 'gpt-4o-mini', 'messages': [{'role': 'system', 'content': 'Classify the sentiment of user inputs into one of three categories: \npositive, neutral, or negative. \nRespond with just the sentiment label.'}, {'role': 'user', 'content': 'I missed my flight and had to reschedule everything, which was frustrating.'}]}}
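Each line above follows the batch API request-line shape (`custom_id`, `method`, `url`, `body`). Building a single request line can be sketched as follows (a standalone illustration, not the library implementation):

```python
from typing import Any, Dict, List


def to_batch_request(
    index: int,
    messages: List[Dict[str, str]],
    model: str,
    url: str = "/v1/chat/completions",
) -> Dict[str, Any]:
    """Build one batch API request line with a unique custom_id."""
    return {
        "custom_id": f"request-{index}",
        "method": "POST",
        "url": url,
        "body": {"model": model, "messages": messages},
    }


request = to_batch_request(
    0,
    [{"role": "user", "content": "I had a wonderful day at the park with my family."}],
    model="gpt-4o-mini",
)
print(request["custom_id"])  # request-0
```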
LLMDatasetValidator#
- class council.prompt.LLMDatasetValidator[source]#
Bases:
object
Helper class to validate the content of LLMDatasetObject.
- static validate_for_batch_api(dataset: LLMDatasetObject) None [source]#
Validate dataset for batch API.
- Raises:
LLMDatasetValidationException – If dataset contains conversations that do not end with a user message.
- static validate_for_fine_tuning(dataset: LLMDatasetObject) None [source]#
Validate dataset for fine-tuning.
- Raises:
LLMDatasetValidationException – If dataset contains conversations that do not follow the pattern: user -> assistant -> user -> assistant -> …
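Both checks can be sketched over plain role lists (hypothetical helpers illustrating the rules, not the library implementation):

```python
from typing import List


def ends_with_user(roles: List[str]) -> bool:
    """Batch API rule: the last message must come from the user."""
    return bool(roles) and roles[-1] == "user"


def alternates_user_assistant(roles: List[str]) -> bool:
    """Fine-tuning rule: messages must follow user -> assistant -> user -> ..."""
    expected = ["user" if i % 2 == 0 else "assistant" for i in range(len(roles))]
    return bool(roles) and roles == expected


assert ends_with_user(["user"])
assert not ends_with_user(["user", "assistant"])
assert alternates_user_assistant(["user", "assistant", "user", "assistant"])
assert not alternates_user_assistant(["assistant", "user"])
```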