afnio.utils.datasets.trec
The Text REtrieval Conference (TREC) Question Classification dataset.
Classes
TREC: The Text REtrieval Conference (TREC) Question Classification dataset contains 5452 labeled questions in the training set (before removing duplicates) and 5382 unique labeled questions (after removing duplicates), along with another 500 questions for the test set.
- class afnio.utils.datasets.trec.TREC(task=None, split=None, validation_split=0.0, root=None)
Bases: Dataset
The Text REtrieval Conference (TREC) Question Classification dataset contains 5452 labeled questions in the training set (before removing duplicates) and 5382 unique labeled questions (after removing duplicates), along with another 500 questions for the test set.
The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10, and the vocabulary size is 8700.
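To make the two label granularities concrete, here is a minimal sketch of how one raw line could be split into a coarse label, a fine label, and the question text. It assumes the raw `.label` files use the `COARSE:fine question text` layout of the original distribution; the `TrecExample` type and the `parse_trec_line` helper are hypothetical names for illustration and are not part of afnio.

```python
from typing import NamedTuple


class TrecExample(NamedTuple):
    coarse: str    # coarse label, e.g. "DESC"
    fine: str      # full fine label, e.g. "DESC:manner"
    question: str  # the question text


def parse_trec_line(line: str) -> TrecExample:
    """Split one raw line (assumed layout: 'COARSE:fine question text')."""
    # The label is everything before the first space.
    label, _, question = line.strip().partition(" ")
    # The coarse class is the part of the label before the colon.
    coarse, _, _ = label.partition(":")
    return TrecExample(coarse=coarse, fine=label, question=question)


example = parse_trec_line(
    "DESC:manner How did serfdom develop in and then leave Russia ?"
)
print(example.coarse, example.fine)
```

Keeping the full `COARSE:fine` string as the fine label avoids collisions if two coarse classes were to share a sub-label name.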
Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10, which serve as the test set. These questions were manually labeled.
TREC provides a stratified train set and validation set, ensuring that both splits maintain the same class distribution proportions as in the original dataset.
- Parameters:
task (str, optional) – Defines the classes to classify between ["coarse", "fine"]. Defaults to None.
split (str, optional) – The dataset split in ["train", "val", "test"]. Defaults to None.
validation_split (Optional[float], optional) – Float between 0 and 1. Fraction of the training data to be used as validation data. Defaults to 0.0.
root (Union[str, Path], optional) – Root directory of dataset where TREC/raw/train_5500.label and TREC/raw/TREC_10.label exist. Defaults to None.
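The idea behind a stratified split governed by `validation_split` can be illustrated with a small standalone sketch. This is not afnio's implementation: the `stratified_split` function, its `seed` argument, and the per-class rounding rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_split, seed=0):
    """Return (train_idx, val_idx) with roughly equal class shares in both."""
    if not 0.0 <= validation_split <= 1.0:
        raise ValueError("validation_split must be between 0 and 1")

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class (rounded per class).
        n_val = round(len(indices) * validation_split)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["LOC"] * 10 + ["NUM"] * 20
train_idx, val_idx = stratified_split(labels, validation_split=0.2)
val_labels = [labels[i] for i in val_idx]
print(len(train_idx), len(val_idx))
```

In this toy input the 1:2 ratio between the two classes is preserved in both the training and validation indices, which is the property the class description refers to.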
- mirrors = ['https://cogcomp.seas.upenn.edu/Data/QA/QC/']
- resources = [('train_5500.label', '073462e3fcefaae31e00edb1f18d2d02'), ('TREC_10.label', '323a3554401d86e650717e2d2f942589')]
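The `mirrors` and `resources` attributes pair each file name with what appears to be an MD5 hex digest. The sketch below shows one way such a pair could be turned into a download URL and checked against a local file using only the standard library. The helpers `resource_url` and `md5_matches` are hypothetical, and afnio may handle downloading and verification differently.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.parse import urljoin

MIRROR = "https://cogcomp.seas.upenn.edu/Data/QA/QC/"
RESOURCES = [
    ("train_5500.label", "073462e3fcefaae31e00edb1f18d2d02"),
    ("TREC_10.label", "323a3554401d86e650717e2d2f942589"),
]


def resource_url(mirror: str, filename: str) -> str:
    """Join a mirror base URL (ending in '/') with a resource file name."""
    return urljoin(mirror, filename)


def md5_matches(path: Path, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare against the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5


urls = [resource_url(MIRROR, name) for name, _ in RESOURCES]
print(urls[0])

# Self-check on a throwaway file whose digest is known (MD5 of b"hello").
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.label"
    sample.write_bytes(b"hello")
    ok = md5_matches(sample, "5d41402abc4b2a76b9719d911017c592")
```

Reading in chunks keeps memory use flat regardless of file size; for the real files, `md5_matches` would be called with the digests listed in `resources`.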
- class afnio.utils.datasets.trec.TRECTmp(task=None, split=None, validation_split=0.0, root=None)
Bases: Dataset
- mirrors = ['https://cogcomp.seas.upenn.edu/Data/QA/QC/']
- resources = [('train_5500.label', '073462e3fcefaae31e00edb1f18d2d02'), ('TREC_10.label', '323a3554401d86e650717e2d2f942589')]