Datasets and DataLoaders#
Warning
Before running any code, ensure you are logged in to the Afnio backend (afnio login). See Logging in to Afnio Backend for details.
Code for processing data samples can quickly become complex and difficult to maintain. For better readability and modularity, it’s best to keep your dataset code decoupled from your agent training code. Afnio provides two data primitives, afnio.utils.data.Dataset and afnio.utils.data.DataLoader, which allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, while DataLoader wraps an iterable around the Dataset to enable easy access, batching and shuffling of samples. These tools help you organize, preprocess, and batch your data for agent workflows, evaluation, and optimization.
Afnio also includes several pre-loaded datasets (such as TREC), which subclass afnio.utils.data.Dataset and provide dataset-specific functionality. These built-in datasets are useful for prototyping and benchmarking your agents. Explore the available datasets here: afnio.utils.datasets.
Loading a Dataset#
Afnio provides built-in datasets for benchmarking and prototyping, such as Meta's Facility Support Analyzer dataset. The FacilitySupport dataset contains 200 examples of facility-related support messages, each annotated with:
urgency: the priority level of the request (e.g., “low”, “medium”, “high”),
sentiment: the sentiment expressed in the message (e.g., “positive”, “negative”),
categories: one or more relevant tags (e.g., “routine_maintenance_requests”, “customer_feedback_and_complaints”).
This dataset is useful for training and evaluating agents on tasks such as classification, prioritization, and semantic understanding.
To load a dataset, specify:
split: which subset to use ("train", "val", or "test"),
root: the directory where the data will be downloaded or loaded from.
Example: Loading the Facility Support dataset
from afnio.utils.datasets import FacilitySupport
training_data = FacilitySupport(split="train", root="data")
validation_data = FacilitySupport(split="val", root="data")
test_data = FacilitySupport(split="test", root="data")
print(f"Number of training samples: {len(training_data)}")
print(f"Number of validation samples: {len(validation_data)}")
print(f"Number of test samples: {len(test_data)}")
Output:
Downloading https://raw.githubusercontent.com/meta-llama/llama-prompt-ops/refs/heads/main/use-cases/facility-support-analyzer/dataset.json to data/FacilitySupport/raw/dataset.json
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 383.7/383.7 kB 686.8 kB/s 0:00:00
Using downloaded and verified file: data/FacilitySupport/raw/dataset.json
Number of training samples: 66
Number of validation samples: 66
Number of test samples: 68
Iterating through the Dataset#
You can index Afnio datasets manually like a list: training_data[index]. Each sample is typically a tuple containing the input and its labels or annotations.
Example: Iterating through the Facility Support dataset
for i in range(len(training_data)):
    message, (urgency, sentiment, categories) = training_data[i]
    print(f"Message: {message.data!r}")
    print(f"Urgency: {urgency.data}")
    print(f"Sentiment: {sentiment.data}")
    print(f"Categories: {categories.data}")
    break
Output:
Message: "Subject: Adjusting Bi-Weekly Cleaning Schedule for My Office\n\nDear ProCare Facility Solutions Support Team,\n\nI hope this message finds you well. My name is Dr. Alex Turner, and I have been utilizing your services for my office space for the past year. I must say, your team's dedication to maintaining a pristine environment has been commendable and greatly appreciated.\n\nI am reaching out to discuss the scheduling of our regular cleaning services. While I find the logistical challenges of coordinating these services intellectually stimulating, I believe we could optimize the current schedule to better suit the needs of my team and our workflow. Specifically, I would like to explore the possibility of adjusting our cleaning schedule to a bi-weekly arrangement, ideally on Tuesdays and Fridays, to ensure our workspace remains consistently clean without disrupting our research activities.\n\nPreviously, I have attempted to adjust the schedule through the online portal, but I encountered some difficulties in finalizing the changes. I would appreciate your assistance in making these adjustments or guiding me through the process if there is a more efficient way to do so.\n\nThank you for your attention to this matter. I look forward to your response and continued excellent service.\n\nBest regards,\n\nDr. Alex Turner\nCryptography Researcher"
Urgency: low
Sentiment: neutral
Categories: {"routine_maintenance_requests": false, "customer_feedback_and_complaints": false, "training_and_support_requests": false, "quality_and_safety_concerns": false, "sustainability_and_environmental_practices": false, "cleaning_services_scheduling": true, "specialized_cleaning_services": false, "emergency_repair_services": false, "facility_management_issues": false, "general_inquiries": false}
Creating a Custom Dataset#
To use your own data, subclass afnio.utils.data.Dataset and implement three methods:
__init__: Initializes your dataset object and loads or stores any data or metadata you need.
__len__: Returns the number of samples in your dataset.
__getitem__: Loads and returns a sample at a given index. This is where you retrieve and format your data.
Example: Custom text dataset
from afnio.utils.data.dataset import Dataset
class MyTextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

    def __len__(self):
        return len(self.texts)
# Usage
texts = ["Hello world!", "How are you?", "Afnio is awesome!"]
labels = ["greeting", "question", "statement"]
dataset = MyTextDataset(texts, labels)
print(dataset[1])
Output:
('How are you?', 'question')
Preparing your data for training with DataLoaders#
The Dataset retrieves your data samples one at a time. When training agents, you typically want to pass samples in “minibatches”, reshuffle the data at every epoch to avoid overfitting — that is, over-adapting to the training samples and failing to generalize to new data — and use efficient data loading strategies. The DataLoader abstracts this complexity and provides an easy API for batching and shuffling.
Example: Creating a DataLoader with reshuffled data
from afnio.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, seed=42)
Example: Creating a DataLoader with deterministic batching
BATCH_SIZE = 33
val_dataloader = DataLoader(validation_data, batch_size=BATCH_SIZE, seed=42)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, seed=42)
Iterating through the DataLoader#
You can iterate through the DataLoader to get batches of data. Each batch will contain multiple samples, batched according to the structure returned by your dataset.
Example: Iterating through batches from a shuffled DataLoader
for batch in dataloader:
    print(batch)
Output:
(['Afnio is awesome!', 'Hello world!'], ['statement', 'greeting'])
(['How are you?'], ['question'])
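Because the DataLoader above was created with shuffle=True, each full pass over it (one epoch) should yield a different sample order, as described earlier. Here is a minimal sketch that reuses the same dataloader to illustrate this; the exact batch order you see will vary:
# Two epochs over the same shuffled DataLoader; with shuffle=True
# the batch order is expected to differ between passes.
for epoch in range(2):
    print(f"Epoch {epoch}:")
    for batch in dataloader:
        print(batch)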
Example: Displaying batch contents from a deterministic DataLoader
message, (urgency, sentiment, categories) = next(iter(test_dataloader))
assert len(urgency.data) == len(sentiment.data) == len(categories.data) == BATCH_SIZE
print(f"Input batch size: {len(message.data)}")
print(f"Outputs batch size: {len(urgency.data)}")
print(f"Message: {message.data[0]!r}")
print(f"Urgency: {urgency.data[0]}")
print(f"Sentiment: {sentiment.data[0]}")
print(f"Categories: {categories.data[0]}")
Output:
Input batch size: 33
Outputs batch size: 33
Message: "Hey ProCare Support Team,\n\nHope you all are doing great! My name is Alex, and I've been using your awesome services for my apartment complex for a few months now. I must say, you guys are doing a fantastic job keeping everything spick and span.\n\nI wanted to reach out because I've been thinking a lot about how we can make our building more eco-friendly. I know you guys are big on sustainability, which is one of the reasons I chose ProCare in the first place. I was wondering if you could share some tips or maybe even offer some additional services that could help us reduce our environmental impact even more.\n\nI haven't really done much on my own yet, just some basic recycling and switching to LED bulbs, but I feel like there's so much more we could be doing. Any advice or guidance you could provide would be super helpful.\n\nThanks a ton for your help and for all the great work you do!\n\nBest,\nAlex"
Urgency: low
Sentiment: positive
Categories: {"routine_maintenance_requests": false, "customer_feedback_and_complaints": false, "training_and_support_requests": false, "quality_and_safety_concerns": false, "sustainability_and_environmental_practices": true, "cleaning_services_scheduling": false, "specialized_cleaning_services": false, "emergency_repair_services": false, "facility_management_issues": false, "general_inquiries": false}
Note
The DataLoader automatically batches your dataset outputs:
For tuples, each element is grouped into a list, so a batch of (input, label) becomes ([input1, input2], [label1, label2]).
For dictionaries, each key is grouped into a list, resulting in {'key': [value1, value2]} for each key.
For Variable objects, the .data fields are combined into a list, creating a single Variable whose .data is a list of the original values. The role and requires_grad attributes are inherited from the first item in the batch. For example, batching [Variable(data="a"), Variable(data="b")] produces Variable(data=["a", "b"]).
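The tuple case is already visible in the batches printed above. As a minimal sketch of the dictionary case, consider a hypothetical dataset (MyDictDataset, defined here only for illustration) whose samples are dicts:
from afnio.utils.data import DataLoader
from afnio.utils.data.dataset import Dataset

class MyDictDataset(Dataset):
    def __init__(self, records):
        self.records = records

    def __getitem__(self, idx):
        return self.records[idx]

    def __len__(self):
        return len(self.records)

records = [
    {"text": "Hello world!", "label": "greeting"},
    {"text": "How are you?", "label": "question"},
]
dict_loader = DataLoader(MyDictDataset(records), batch_size=2)
# Per the note above, the batch should group values by key, e.g.
# {'text': ['Hello world!', 'How are you?'], 'label': ['greeting', 'question']}
print(next(iter(dict_loader)))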
Samplers#
Afnio provides several samplers to control the order in which data is loaded:
SequentialSampler: iterates through the dataset in order.
RandomSampler: samples elements randomly, with or without replacement.
WeightedRandomSampler: samples elements according to specified probabilities, often used with imbalanced datasets to ensure fair learning across all classes.
You can pass a sampler to the DataLoader for custom sampling strategies. This is especially useful when your dataset is imbalanced and you want to give underrepresented classes a higher chance of being sampled.
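For instance, instead of shuffle=True you can pass a RandomSampler explicitly. The snippet below is a minimal sketch; it assumes RandomSampler is importable from afnio.utils.data (as WeightedRandomSampler is in the example further down) and takes the dataset as its data source:
from afnio.utils.data import DataLoader, RandomSampler

# Draw samples from the small custom dataset in random order
# (constructor signature assumed; see the weighted example below for the documented sampler usage)
sampler = RandomSampler(dataset)
random_dataloader = DataLoader(dataset, sampler=sampler, batch_size=2)
for batch in random_dataloader:
    print(batch)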
Example: Computing sample weights for imbalanced datasets
# The training set is imbalanced, so we assign weights to each sample
# to ensure fair learning across all classes
import afnio as te  # assumption: `te` aliases the afnio package, which exposes suppress_variable_notifications

def compute_sample_weights(data):
    with te.suppress_variable_notifications():
        labels = [y.data for _, (_, y, _) in data]
    counts = {label: labels.count(label) for label in set(labels)}
    total = len(data)
    return [total / counts[label] for label in labels]
weights = compute_sample_weights(training_data)
Example: Using a weighted random sampler with DataLoader
from afnio.utils.data import WeightedRandomSampler
sampler = WeightedRandomSampler(weights, num_samples=len(training_data), replacement=True)
train_dataloader = DataLoader(training_data, sampler=sampler, batch_size=BATCH_SIZE)
message, (urgency, sentiment, categories) = next(iter(train_dataloader))
print(f"Message: {message.data[0]!r}")
print(f"Urgency: {urgency.data[0]}")
print(f"Sentiment: {sentiment.data[0]}")
print(f"Categories: {categories.data[0]}")
Output:
Message: 'Subject: Immediate Assistance Required for HVAC System Failure\n\nDear ProCare Support Team,\n\nI hope this message finds you well. My name is [Sender], and I am an editor at a popular entertainment magazine. We have been utilizing ProCare Facility Solutions for our office maintenance needs for the past year, and I must say, your services have always been top-notch.\n\nHowever, we are currently facing a critical issue that requires your immediate attention. Our HVAC system has completely failed, and with the summer heat, this has created an extremely uncomfortable working environment for our staff. Given the nature of our work, a comfortable and conducive environment is essential for productivity.\n\nWe have tried basic troubleshooting steps, such as resetting the system and checking the circuit breakers, but nothing seems to be working. This issue is beyond our in-house capabilities and needs professional intervention.\n\nCould you please dispatch a technician as soon as possible to address this urgent repair? We are in dire need of a swift resolution to ensure our operations can continue smoothly.\n\nThank you for your prompt attention to this matter. I look forward to your quick response.\n\nBest regards,\n[Sender]'
Urgency: high
Sentiment: neutral
Categories: {"routine_maintenance_requests": false, "customer_feedback_and_complaints": false, "training_and_support_requests": false, "quality_and_safety_concerns": false, "sustainability_and_environmental_practices": false, "cleaning_services_scheduling": false, "specialized_cleaning_services": false, "emergency_repair_services": true, "facility_management_issues": false, "general_inquiries": false}
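To verify that the weighted sampler is evening out the class distribution, you can count the sentiment labels drawn over one full pass of the weighted DataLoader. This is a minimal sketch; the exact counts will vary between runs, since sampling is random and with replacement:
from collections import Counter

# Batched Variables expose the batch values as a list via `.data` (see the note above)
sentiment_counts = Counter()
for _, (_, sentiment, _) in train_dataloader:
    sentiment_counts.update(sentiment.data)
print(sentiment_counts)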