torch_frame.data.Dataset
- class Dataset(df: DataFrame, col_to_stype: dict[str, torch_frame._stype.stype], target_col: Optional[str] = None, split_col: Optional[str] = None, col_to_sep: Union[str, None, dict[str, str | None]] = None, col_to_text_embedder_cfg: Optional[Union[dict[str, torch_frame.config.text_embedder.TextEmbedderConfig], TextEmbedderConfig]] = None, col_to_text_tokenizer_cfg: Optional[Union[dict[str, torch_frame.config.text_tokenizer.TextTokenizerConfig], TextTokenizerConfig]] = None, col_to_image_embedder_cfg: Optional[Union[dict[str, torch_frame.config.image_embedder.ImageEmbedderConfig], ImageEmbedderConfig]] = None, col_to_time_format: Union[str, None, dict[str, str | None]] = None)[source]
Bases: object
A base class for creating tabular datasets.
- Parameters:
df (DataFrame) – The tabular data frame.
col_to_stype (Dict[str, torch_frame.stype]) – A dictionary that maps each column in the data frame to a semantic type.
target_col (str, optional) – The column used as target. (default: None)
split_col (str, optional) – The column that stores the pre-defined split information. The column should only contain 0, 1, or 2. (default: None)
col_to_sep (Union[str, Dict[str, Optional[str]]]) – A dictionary or a string/None specifying the separator/delimiter for the multi-categorical columns. If a string/None is specified, the same separator is used for all multi-categorical columns. If None is specified, each multi-category value is assumed to be given as a list of categories. If a dictionary is given, the separator specified for each column is used. (default: None)
col_to_text_embedder_cfg (TextEmbedderConfig or dict, optional) – A text embedder configuration or a dictionary of configurations specifying text_embedder, which embeds texts into vectors, and batch_size, which specifies the mini-batch size for text_embedder. (default: None)
col_to_text_tokenizer_cfg (TextTokenizerConfig or dict, optional) – A text tokenizer configuration or a dictionary of configurations specifying text_tokenizer, which maps sentences into a list of dictionaries of tensors. Each element in the list corresponds to one sentence; keys are input arguments to the model, such as input_ids, and values are tensors, such as tokens. batch_size specifies the mini-batch size for text_tokenizer. (default: None)
col_to_time_format (Union[str, Dict[str, Optional[str]]], optional) – A dictionary or a string specifying the format for the timestamp columns. See the strftime documentation for more information on formats. If a string is specified, the same format is used for all timestamp columns. If a dictionary is given, the format specified for each column is used. If not specified, pandas's to_datetime function is used to parse time columns automatically. (default: None)
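The col_to_sep and col_to_time_format parameters both accept either a single value or a per-column dictionary. The following is a minimal standard-library sketch of those semantics, not torch_frame's implementation: the helpers resolve_per_column and split_multicategorical, and the column names, are hypothetical and exist only for illustration.

```python
from datetime import datetime


def resolve_per_column(setting, columns):
    """Expand a single value or a per-column dict into a per-column dict.

    Hypothetical helper: it mirrors the documented rule that one value applies
    to every column, while a dict supplies a value per column.
    """
    if isinstance(setting, dict):
        return {col: setting.get(col) for col in columns}
    return {col: setting for col in columns}


def split_multicategorical(value, sep):
    """Turn one raw cell into a list of categories (hypothetical helper)."""
    if sep is None:
        # No separator: the cell is assumed to already be a list of categories.
        return list(value)
    return value.split(sep)


# One separator shared by every multi-categorical column.
shared = resolve_per_column(",", ["tags", "genres"])

# A different separator per column; None marks a list-valued column.
per_column = resolve_per_column({"tags": "|", "genres": None}, ["tags", "genres"])
tags = split_multicategorical("red|green|blue", per_column["tags"])
genres = split_multicategorical(["rock", "jazz"], per_column["genres"])

# Timestamp formats use strftime-style directives, one per column here.
time_formats = resolve_per_column(
    {"signup": "%Y-%m-%d", "last_login": "%d/%m/%Y %H:%M"},
    ["signup", "last_login"],
)
signup = datetime.strptime("2023-07-14", time_formats["signup"])
last_login = datetime.strptime("02/01/2024 18:30", time_formats["last_login"])
```

In real use, the same str-or-dict values would be passed directly as the col_to_sep and col_to_time_format arguments of Dataset.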
- static download_url(url: str, root: str, filename: Optional[str] = None, *, log: bool = True) str[source]
Downloads the content of url to the specified folder root.
- property num_rows
The number of rows of the dataset.
- materialize(device: Optional[device] = None, path: Optional[str] = None, col_stats: Optional[dict[str, dict[torch_frame.data.stats.StatType, Any]]] = None) Dataset[source]
Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.
- Parameters:
device (torch.device, optional) – Device to load the TensorFrame object. (default: None)
path (str, optional) – If path is specified and a cached file exists, this will try to load the saved TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)
col_stats (Dict[str, Dict[StatType, Any]], optional) – Optional col_stats provided by the user. If not provided, the statistics are calculated from the dataframe itself. (default: None)
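The three path cases described for materialize follow a load-or-compute-and-save pattern. The sketch below illustrates that pattern with the standard library only. It is not how torch_frame stores its cache: pickle, the helper materialize_with_cache, and the toy payload are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def materialize_with_cache(compute, path=None):
    """Hypothetical helper mirroring the documented caching rules."""
    if path is not None and os.path.isfile(path):
        # Case 1: path given and a cached file exists, so load it.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Cases 2 and 3: no usable cache, so do the work.
    result = compute()
    if path is not None:
        # Case 2: path given but no cache yet, so save for next time.
        with open(path, "wb") as f:
            pickle.dump(result, f)
    return result


calls = []


def expensive_compute():
    """Stand-in for materialization; records how often it runs."""
    calls.append(1)
    return {"tensor_frame": [[1.0, 2.0], [3.0, 4.0]], "col_stats": {"x": {"MEAN": 2.0}}}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "data.pkl")
    first = materialize_with_cache(expensive_compute, cache_path)   # computes and saves
    second = materialize_with_cache(expensive_compute, cache_path)  # loads from cache
    uncached = materialize_with_cache(expensive_compute)            # path is None
```

With the real class, the equivalent call would pass a file location as the path argument of materialize; the second call with the same path would then reuse the saved TensorFrame and col_stats.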
- property tensor_frame: TensorFrame
Returns the TensorFrame of the dataset.
- property col_stats: dict[str, dict[torch_frame.data.stats.StatType, Any]]
Returns column-wise dataset statistics.
- index_select(index: int | list[int] | range | slice | torch.Tensor) Dataset[source]
Returns a subset of the dataset from specified indices index.
- shuffle(return_perm: bool = False) torch_frame.data.dataset.Dataset | tuple[torch_frame.data.dataset.Dataset, torch.Tensor][source]
Randomly shuffles the rows in the dataset.
- col_select(cols: str | list[str]) Dataset[source]
Returns a subset of the dataset from specified columns cols.
- get_split(split: str) Dataset[source]
Returns a subset of the dataset that belongs to a given training split (as defined in split_col).
- Parameters:
split (str) – The split name (either "train", "val", or "test").
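To make the relationship between split_col values and split names concrete, here is a small pure-Python sketch. It assumes the encoding 0 = "train", 1 = "val", 2 = "test", which this page does not state explicitly; the helper get_split_rows and the "fold" column are hypothetical.

```python
# Assumed encoding of split_col values; the page only says 0, 1, or 2.
SPLIT_TO_NUM = {"train": 0, "val": 1, "test": 2}


def get_split_rows(rows, split_col, split):
    """Return the rows whose split_col value matches the named split."""
    if split not in SPLIT_TO_NUM:
        raise ValueError(f"Unknown split {split!r}; expected one of {list(SPLIT_TO_NUM)}")
    code = SPLIT_TO_NUM[split]
    return [row for row in rows if row[split_col] == code]


# Toy table: each dict is one row, and "fold" plays the role of split_col.
rows = [
    {"x": 1.0, "fold": 0},
    {"x": 2.0, "fold": 0},
    {"x": 3.0, "fold": 1},
    {"x": 4.0, "fold": 2},
]

train_rows = get_split_rows(rows, "fold", "train")
val_rows = get_split_rows(rows, "fold", "val")
test_rows = get_split_rows(rows, "fold", "test")
```

On a Dataset constructed with split_col set, get_split("train") plays the role of get_split_rows here and returns a Dataset rather than a list of rows.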
- split() tuple[torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset, torch_frame.data.dataset.Dataset][source]
Splits the dataset into training, validation and test splits.