torch_frame.data.Dataset

class Dataset(df: DataFrame, col_to_stype: dict[str, torch_frame.stype], target_col: str | None = None, split_col: str | None = None, col_to_sep: str | None | dict[str, str | None] = None, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None, col_to_image_embedder_cfg: dict[str, ImageEmbedderConfig] | ImageEmbedderConfig | None = None, col_to_time_format: str | None | dict[str, str | None] = None)[source]

Bases: ABC

A base class for creating tabular datasets.

Parameters:
  • df (DataFrame) – The tabular data frame.

  • col_to_stype (Dict[str, torch_frame.stype]) – A dictionary that maps each column in the data frame to a semantic type.

  • target_col (str, optional) – The column used as target. (default: None)

  • split_col (str, optional) – The column that stores the pre-defined split information. The column should only contain 0 (train), 1 (validation), or 2 (test). (default: None)

  • col_to_sep (Union[str, None, Dict[str, Optional[str]]], optional) – The separator/delimiter for the multi-categorical columns. If a string or None is specified, the same separator is used for all multi-categorical columns. Note that if None is specified, each multi-category is assumed to be given as a list of categories already. If a dictionary is given, the separator specified for each column is used. (default: None)

  • col_to_text_embedder_cfg (TextEmbedderConfig or dict, optional) – A text embedder configuration or a dictionary of configurations specifying text_embedder that embeds texts into vectors and batch_size that specifies the mini-batch size for text_embedder. (default: None)

  • col_to_text_tokenizer_cfg (TextTokenizerConfig or dict, optional) – A text tokenizer configuration or a dictionary of configurations specifying text_tokenizer that maps sentences into a list of dictionaries of tensors. Each element in the list corresponds to one sentence; keys are input arguments to the model such as input_ids, and values are tensors such as tokens. batch_size specifies the mini-batch size for text_tokenizer. (default: None)

  • col_to_time_format (Union[str, Dict[str, Optional[str]]], optional) – A dictionary or a string specifying the format for the timestamp columns. See the strftime documentation for more information on formats. If a string is specified, the same format is used for all timestamp columns. If a dictionary is given, the format specified for each column is used. If not specified, pandas's internal to_datetime function is used to auto-parse the time columns. (default: None)
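The separator and time-format parameters can be illustrated with plain Python. The sketch below builds a toy table and shows what col_to_sep and col_to_time_format imply for parsing; the column names and values are made up for illustration, and the commented-out construction at the end assumes pandas and torch_frame are installed.

```python
from datetime import datetime

# A toy table, one column per common semantic type (as plain Python lists).
df = {
    "age": [23, 41, 35],                                   # numerical
    "job": ["engineer", "teacher", "engineer"],            # categorical
    "genres": ["action|comedy", "drama", "action|drama"],  # multi-categorical
    "joined": ["2021-01-05", "2020-11-30", "2022-03-14"],  # timestamp
    "label": [0, 1, 0],                                    # target
    "split": [0, 1, 2],                                    # pre-defined split
}

# The separator declared via col_to_sep turns each cell into a list:
parsed_genres = [cell.split("|") for cell in df["genres"]]

# The format declared via col_to_time_format follows strftime codes:
joined = [datetime.strptime(cell, "%Y-%m-%d") for cell in df["joined"]]

# With pandas and torch_frame installed, construction would look roughly like:
# import pandas as pd, torch_frame
# dataset = torch_frame.data.Dataset(
#     pd.DataFrame(df),
#     col_to_stype={
#         "age": torch_frame.numerical,
#         "job": torch_frame.categorical,
#         "genres": torch_frame.multicategorical,
#         "joined": torch_frame.timestamp,
#         "label": torch_frame.categorical,
#     },
#     target_col="label",
#     split_col="split",
#     col_to_sep={"genres": "|"},
#     col_to_time_format={"joined": "%Y-%m-%d"},
# )
```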

static download_url(url: str, root: str, filename: str | None = None, *, log: bool = True) str[source]

Downloads the content of url to the specified folder root.

Parameters:
  • url (str) – The URL.

  • root (str) – The root folder.

  • filename (str, optional) – If set, will rename the downloaded file. (default: None)

  • log (bool, optional) – If False, will not print anything to the console. (default: True)
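A rough stdlib equivalent of this helper is sketched below. The skip-if-already-downloaded behavior and the fallback of deriving the filename from the last URL path segment are assumptions about a typical implementation, not guarantees about torch_frame's internals.

```python
import os
import os.path as osp
from urllib.parse import urlparse
from urllib.request import urlretrieve

def target_path(url, root, filename=None):
    # When no filename is given, fall back to the last URL path segment.
    if filename is None:
        filename = osp.basename(urlparse(url).path)
    return osp.join(root, filename)

def download_url(url, root, filename=None, *, log=True):
    path = target_path(url, root, filename)
    if osp.exists(path):  # skip files that were already downloaded
        return path
    os.makedirs(root, exist_ok=True)
    if log:
        print(f"Downloading {url}")
    urlretrieve(url, path)
    return path
```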

property feat_cols: list[str]

The input feature columns of the dataset.

property task_type: TaskType

The task type of the dataset.

property num_rows

The number of rows of the dataset.

materialize(device: torch.device | None = None, path: str | None = None, col_stats: dict[str, dict[StatType, Any]] | None = None) Dataset[source]

Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.

Parameters:
  • device (torch.device, optional) – Device to load the TensorFrame object. (default: None)

  • path (str, optional) – If path is specified and a cached file exists, this will try to load the saved TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)

  • col_stats (Dict[str, Dict[StatType, Any]], optional) – The column statistics provided by the user. If not provided, the statistics are computed from the data frame itself. (default: None)
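The load-or-compute-then-save caching behavior of path can be mimicked with a generic stdlib sketch. The function and variable names here are hypothetical stand-ins; the actual materialization computes a TensorFrame rather than a plain dictionary.

```python
import os.path as osp
import pickle
import tempfile

def materialize_with_cache(compute, path=None):
    # If a cache file exists at `path`, load it instead of recomputing.
    if path is not None and osp.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()  # stand-in for the actual materialization work
    if path is not None:
        with open(path, "wb") as f:
            pickle.dump(result, f)
    return result

# Demo: the second call hits the cache and skips recomputation.
calls = []
def compute():
    calls.append(1)
    return {"tensor_frame": [1, 2, 3], "col_stats": {"age": {"mean": 33.0}}}

cache = osp.join(tempfile.mkdtemp(), "materialized.pkl")
first = materialize_with_cache(compute, cache)
second = materialize_with_cache(compute, cache)
```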

property is_materialized: bool

Whether the dataset is already materialized.

property tensor_frame: TensorFrame

Returns the TensorFrame of the dataset.

property col_stats: dict[str, dict[torch_frame.data.stats.StatType, Any]]

Returns column-wise dataset statistics.

index_select(index: Union[int, list[int], range, slice, Tensor]) Dataset[source]

Returns a subset of the dataset from specified indices index.
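The accepted index types can all be reduced to an explicit list of row positions. This is a minimal sketch of that normalization, not torch_frame's actual implementation; tensor inputs are only noted in a comment.

```python
def normalize_index(index, num_rows):
    # Map the accepted index types onto an explicit list of row positions.
    if isinstance(index, int):
        return [index]
    if isinstance(index, slice):
        return list(range(num_rows))[index]
    if isinstance(index, range):
        return list(index)
    return list(index)  # a list of ints (or, in practice, a 1-D index Tensor)
```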

shuffle(return_perm: bool = False) Dataset | tuple[Dataset, Tensor][source]

Randomly shuffles the rows in the dataset.
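When return_perm is True, the method also returns the permutation that was applied, so each shuffled row can be traced back to its original position. A minimal stdlib sketch of that contract (the seed parameter is added here for reproducibility and is not part of the documented signature):

```python
import random

def shuffle_rows(rows, return_perm=False, seed=None):
    rng = random.Random(seed)
    perm = list(range(len(rows)))
    rng.shuffle(perm)  # random permutation of row indices
    shuffled = [rows[i] for i in perm]  # shuffled[j] == rows[perm[j]]
    return (shuffled, perm) if return_perm else shuffled
```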

col_select(cols: Union[str, list[str]]) Dataset[source]

Returns a subset of the dataset from specified columns cols.

get_split(split: str) Dataset[source]

Returns a subset of the dataset that belongs to a given split (as defined in split_col).

Parameters:
  • split (str) – The split name (either "train", "val", or "test").
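The lookup reduces to filtering rows whose split_col value matches the code for the requested split name, assuming the 0/1/2 ↔ train/val/test convention described above. A minimal sketch on plain lists:

```python
SPLIT_TO_NUM = {"train": 0, "val": 1, "test": 2}  # assumed split encoding

def get_split(rows, split_col_values, split):
    if split not in SPLIT_TO_NUM:
        raise ValueError(f"Unknown split {split!r}")
    code = SPLIT_TO_NUM[split]
    # Keep the rows whose split code matches the requested split.
    return [row for row, s in zip(rows, split_col_values) if s == code]
```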

split() tuple[torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset][source]

Splits the dataset into training, validation and test splits.