torch_frame.data.Dataset
- class Dataset(df: DataFrame, col_to_stype: dict[str, torch_frame.stype], target_col: str | None = None, split_col: str | None = None, col_to_sep: str | None | dict[str, str | None] = None, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None, col_to_image_embedder_cfg: dict[str, ImageEmbedderConfig] | ImageEmbedderConfig | None = None, col_to_time_format: str | None | dict[str, str | None] = None)[source]
Bases: `ABC`
A base class for creating tabular datasets.
- Parameters:
  - df (DataFrame) – The tabular data frame.
  - col_to_stype (Dict[str, torch_frame.stype]) – A dictionary that maps each column in the data frame to a semantic type.
  - target_col (str, optional) – The column used as target. (default: `None`)
  - split_col (str, optional) – The column that stores the pre-defined split information. The column should only contain `0`, `1`, or `2`. (default: `None`)
  - col_to_sep (Union[str, Dict[str, Optional[str]]]) – A dictionary or a string/`None` specifying the separator/delimiter for the multi-categorical columns. If a string or `None` is specified, the same separator is used for all multi-categorical columns. Note that if `None` is specified, each multi-category is assumed to be given as a `list` of categories. If a dictionary is given, the separator specified for each column is used. (default: `None`)
  - col_to_text_embedder_cfg (TextEmbedderConfig or dict, optional) – A text embedder configuration or a dictionary of configurations specifying `text_embedder` that embeds texts into vectors and `batch_size` that specifies the mini-batch size for `text_embedder`. (default: `None`)
  - col_to_text_tokenizer_cfg (TextTokenizerConfig or dict, optional) – A text tokenizer configuration or a dictionary of configurations specifying `text_tokenizer` that maps sentences into a list of dictionaries of tensors. Each element in the list corresponds to one sentence; keys are input arguments to the model, such as `input_ids`, and values are tensors, such as tokens. `batch_size` specifies the mini-batch size for `text_tokenizer`. (default: `None`)
  - col_to_image_embedder_cfg (ImageEmbedderConfig or dict, optional) – An image embedder configuration or a dictionary of configurations specifying `image_embedder` that embeds images into vectors and `batch_size` that specifies the mini-batch size for `image_embedder`. (default: `None`)
  - col_to_time_format (Union[str, Dict[str, Optional[str]]], optional) – A dictionary or a string specifying the format for the timestamp columns. See the strftime documentation for more information on formats. If a string is specified, the same format is used for all timestamp columns. If a dictionary is given, the format specified for each column is used. If not specified, pandas' internal `to_datetime` function will be used to auto-parse the time columns. (default: `None`)
- static download_url(url: str, root: str, filename: str | None = None, *, log: bool = True) str [source]
  Downloads the content of `url` to the specified folder `root`.
- property num_rows
The number of rows of the dataset.
- materialize(device: torch.device | None = None, path: str | None = None, col_stats: dict[str, dict[StatType, Any]] | None = None) Dataset [source]
Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.
- Parameters:
  - device (torch.device, optional) – Device to load the `TensorFrame` object on. (default: `None`)
  - path (str, optional) – If `path` is specified and a cached file exists, this will try to load the saved `TensorFrame` object and `col_stats`. If `path` is specified but a cached file does not exist, this will perform materialization and then save the `TensorFrame` object and `col_stats` to `path`. If `path` is `None`, this will materialize the dataset without caching. (default: `None`)
  - col_stats (Dict[str, Dict[StatType, Any]], optional) – Column statistics provided by the user. If not provided, the statistics are calculated from the data frame itself. (default: `None`)
- property tensor_frame: TensorFrame
  Returns the `TensorFrame` of the dataset.
- property col_stats: dict[str, dict[torch_frame.data.stats.StatType, Any]]
Returns column-wise dataset statistics.
- index_select(index: Union[int, list[int], range, slice, Tensor]) Dataset [source]
  Returns a subset of the dataset at the specified indices `index`.
- shuffle(return_perm: bool = False) Dataset | tuple[Dataset, Tensor] [source]
Randomly shuffles the rows in the dataset.
- col_select(cols: Union[str, list[str]]) Dataset [source]
  Returns a subset of the dataset containing the specified columns `cols`.
- get_split(split: str) Dataset [source]
  Returns a subset of the dataset that belongs to a given training split (as defined in `split_col`).
  - Parameters:
    - split (str) – The split name (either `"train"`, `"val"`, or `"test"`).
- split() tuple[torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset] [source]
Splits the dataset into training, validation and test splits.