torch_frame.data.Dataset

class Dataset(df: DataFrame, col_to_stype: dict[str, torch_frame.stype], target_col: str | None = None, split_col: str | None = None, col_to_sep: str | None | dict[str, str | None] = None, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None, col_to_image_embedder_cfg: dict[str, ImageEmbedderConfig] | ImageEmbedderConfig | None = None, col_to_time_format: str | None | dict[str, str | None] = None)[source]

Bases: ABC

A base class for creating tabular datasets.

Parameters:
  • df (DataFrame) – The tabular data frame.

  • col_to_stype (Dict[str, torch_frame.stype]) – A dictionary that maps each column in the data frame to a semantic type.

  • target_col (str, optional) – The column used as target. (default: None)

  • split_col (str, optional) – The column that stores the pre-defined split information. The column should only contain 0 (train), 1 (validation), or 2 (test). (default: None)

  • col_to_sep (Union[str, None, Dict[str, Optional[str]]], optional) – The separator/delimiter for the multi-categorical columns. If a string or None is specified, the same separator is used for all multi-categorical columns. Note that if None is specified, each multi-category is assumed to be given as a list of categories already. If a dictionary is given, the separator specified for each column is used. (default: None)

  • col_to_text_embedder_cfg (TextEmbedderConfig or dict, optional) – A text embedder configuration or a dictionary of configurations specifying text_embedder that embeds texts into vectors and batch_size that specifies the mini-batch size for text_embedder. (default: None)

  • col_to_text_tokenizer_cfg (TextTokenizerConfig or dict, optional) – A text tokenizer configuration or a dictionary of configurations specifying text_tokenizer that maps sentences into a list of dictionaries of tensors. Each element in the list corresponds to one sentence; keys are input arguments to the model such as input_ids, and values are tensors such as tokens. batch_size specifies the mini-batch size for text_tokenizer. (default: None)

  • col_to_time_format (Union[str, Dict[str, Optional[str]]], optional) – A dictionary or a string specifying the format for the timestamp columns. See the strftime documentation for more information on formats. If a string is specified, the same format is used for all timestamp columns. If a dictionary is given, the format specified for each column is used. If not specified, pandas's internal to_datetime function is used to auto-parse the time columns. (default: None)
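The separator and time-format parameters can be illustrated with plain Python. The sketch below builds a toy table and shows what col_to_sep and col_to_time_format imply for parsing; the column names and values are made up for illustration, and the commented-out construction at the end assumes pandas and torch_frame are installed.

```python
from datetime import datetime

# A toy table, one column per common semantic type (as plain Python lists).
df = {
    "age": [23, 41, 35],                                   # numerical
    "job": ["engineer", "teacher", "engineer"],            # categorical
    "genres": ["action|comedy", "drama", "action|drama"],  # multi-categorical
    "joined": ["2021-01-05", "2020-11-30", "2022-03-14"],  # timestamp
    "label": [0, 1, 0],                                    # target
    "split": [0, 1, 2],                                    # pre-defined split
}

# The separator declared via col_to_sep turns each cell into a list:
parsed_genres = [cell.split("|") for cell in df["genres"]]

# The format declared via col_to_time_format follows strftime codes:
joined = [datetime.strptime(cell, "%Y-%m-%d") for cell in df["joined"]]

# With pandas and torch_frame installed, construction would look roughly like:
# import pandas as pd, torch_frame
# dataset = torch_frame.data.Dataset(
#     pd.DataFrame(df),
#     col_to_stype={
#         "age": torch_frame.numerical,
#         "job": torch_frame.categorical,
#         "genres": torch_frame.multicategorical,
#         "joined": torch_frame.timestamp,
#         "label": torch_frame.categorical,
#     },
#     target_col="label",
#     split_col="split",
#     col_to_sep={"genres": "|"},
#     col_to_time_format={"joined": "%Y-%m-%d"},
# )
```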

static download_url(url: str, root: str, filename: str | None = None, *, log: bool = True) str[source]

Downloads the content of url to the specified folder root.

Parameters:
  • url (str) – The URL.

  • root (str) – The root folder.

  • filename (str, optional) – If set, will rename the downloaded file. (default: None)

  • log (bool, optional) – If False, will not print anything to the console. (default: True)
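A rough stdlib equivalent of this helper is sketched below. The skip-if-already-downloaded behavior and the fallback of deriving the filename from the last URL path segment are assumptions about a typical implementation, not guarantees about torch_frame's internals.

```python
import os
import os.path as osp
from urllib.parse import urlparse
from urllib.request import urlretrieve

def target_path(url, root, filename=None):
    # When no filename is given, fall back to the last URL path segment.
    if filename is None:
        filename = osp.basename(urlparse(url).path)
    return osp.join(root, filename)

def download_url(url, root, filename=None, *, log=True):
    path = target_path(url, root, filename)
    if osp.exists(path):  # skip files that were already downloaded
        return path
    os.makedirs(root, exist_ok=True)
    if log:
        print(f"Downloading {url}")
    urlretrieve(url, path)
    return path
```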

property feat_cols: list[str]

The input feature columns of the dataset.

property task_type: TaskType

The task type of the dataset.

property num_rows

The number of rows of the dataset.

materialize(device: torch.device | None = None, path: str | None = None, col_stats: dict[str, dict[StatType, Any]] | None = None) Dataset[source]

Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.

Parameters:
  • device (torch.device, optional) – Device to load the TensorFrame object. (default: None)

  • path (str, optional) – If path is specified and a cached file exists, this will try to load the saved TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)

  • col_stats (Dict[str, Dict[StatType, Any]], optional) – The column statistics provided by the user. If not provided, the statistics are computed from the data frame itself. (default: None)
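The load-or-compute-then-save caching behavior of path can be mimicked with a generic stdlib sketch. The function and variable names here are hypothetical stand-ins; the actual materialization computes a TensorFrame rather than a plain dictionary.

```python
import os.path as osp
import pickle
import tempfile

def materialize_with_cache(compute, path=None):
    # If a cache file exists at `path`, load it instead of recomputing.
    if path is not None and osp.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()  # stand-in for the actual materialization work
    if path is not None:
        with open(path, "wb") as f:
            pickle.dump(result, f)
    return result

# Demo: the second call hits the cache and skips recomputation.
calls = []
def compute():
    calls.append(1)
    return {"tensor_frame": [1, 2, 3], "col_stats": {"age": {"mean": 33.0}}}

cache = osp.join(tempfile.mkdtemp(), "materialized.pkl")
first = materialize_with_cache(compute, cache)
second = materialize_with_cache(compute, cache)
```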

property is_materialized: bool

Whether the dataset is already materialized.

property tensor_frame: TensorFrame

Returns the TensorFrame of the dataset.

property col_stats: dict[str, dict[torch_frame.data.stats.StatType, Any]]

Returns column-wise dataset statistics.

index_select(index: Union[int, list[int], range, slice, Tensor]) Dataset[source]

Returns a subset of the dataset from specified indices index.
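The accepted index types can all be reduced to an explicit list of row positions. This is a minimal sketch of that normalization, not torch_frame's actual implementation; tensor inputs are only noted in a comment.

```python
def normalize_index(index, num_rows):
    # Map the accepted index types onto an explicit list of row positions.
    if isinstance(index, int):
        return [index]
    if isinstance(index, slice):
        return list(range(num_rows))[index]
    if isinstance(index, range):
        return list(index)
    return list(index)  # a list of ints (or, in practice, a 1-D index Tensor)
```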

shuffle(return_perm: bool = False) Dataset | tuple[Dataset, Tensor][source]

Randomly shuffles the rows in the dataset.
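When return_perm is True, the method also returns the permutation that was applied, so each shuffled row can be traced back to its original position. A minimal stdlib sketch of that contract (the seed parameter is added here for reproducibility and is not part of the documented signature):

```python
import random

def shuffle_rows(rows, return_perm=False, seed=None):
    rng = random.Random(seed)
    perm = list(range(len(rows)))
    rng.shuffle(perm)  # random permutation of row indices
    shuffled = [rows[i] for i in perm]  # shuffled[j] == rows[perm[j]]
    return (shuffled, perm) if return_perm else shuffled
```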

col_select(cols: Union[str, list[str]]) Dataset[source]

Returns a subset of the dataset from specified columns cols.

get_split(split: str) Dataset[source]

Returns a subset of the dataset that belongs to a given split (as defined in split_col).

Parameters:
  • split (str) – The split name (either "train", "val", or "test").
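The lookup reduces to filtering rows whose split_col value matches the code for the requested split name, assuming the 0/1/2 ↔ train/val/test convention described above. A minimal sketch on plain lists:

```python
SPLIT_TO_NUM = {"train": 0, "val": 1, "test": 2}  # assumed split encoding

def get_split(rows, split_col_values, split):
    if split not in SPLIT_TO_NUM:
        raise ValueError(f"Unknown split {split!r}")
    code = SPLIT_TO_NUM[split]
    # Keep the rows whose split code matches the requested split.
    return [row for row, s in zip(rows, split_col_values) if s == code]
```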

split() tuple[torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset][source]

Splits the dataset into training, validation and test splits.