torch_frame.datasets.HuggingFaceDatasetDict

class HuggingFaceDatasetDict(path: str, name: Optional[str] = None, columns: Optional[list[str]] = None, col_to_stype: Optional[dict[str, torch_frame._stype.stype]] = None, target_col: Optional[str] = None, **kwargs)

Bases: Dataset

Load a Hugging Face datasets.DatasetDict into a torch_frame.data.Dataset, keeping its pre-defined split information. To use this class, first install the Hugging Face Datasets package. For all available dataset paths and configuration names, refer to Hugging Face Datasets.
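One way to picture "pre-defined split information" is a single table in which every row remembers the split it came from. The sketch below is only an illustration under stated assumptions: plain dicts and lists stand in for a datasets.DatasetDict, and the helper names (flatten_splits, get_split) and the "split" column name are hypothetical, not the library's implementation.

```python
# Hypothetical illustration: plain Python containers stand in for a
# datasets.DatasetDict. This is not how torch_frame is implemented.
SPLIT_COL = "split"  # assumed column name, chosen for this sketch only


def flatten_splits(dataset_dict):
    """Merge all splits into one list of rows, tagging each row with its split."""
    rows = []
    for split_name, split_rows in dataset_dict.items():
        for row in split_rows:
            # Copy the row and record which split it belongs to.
            rows.append({**row, SPLIT_COL: split_name})
    return rows


def get_split(rows, split_name):
    """Return only the rows that were tagged with the given split."""
    return [row for row in rows if row[SPLIT_COL] == split_name]


# A toy stand-in for a dataset that ships with train/validation/test splits.
toy = {
    "train": [{"text": "a", "label": 0}, {"text": "b", "label": 1}],
    "validation": [{"text": "c", "label": 0}],
    "test": [{"text": "d", "label": 1}],
}

rows = flatten_splits(toy)
train_rows = get_split(rows, "train")
print(len(rows), len(train_rows))
```

Because the split label travels with each row, the original partition can be recovered after the data has been merged into one table.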

Parameters:

path (str) – Path or name of the dataset.
name (str, optional) – Name of the dataset configuration. (default: None)
columns (list, optional) – List of columns to be included. (default: None)
col_to_stype (dict, optional) – Mapping from column names to semantic types (stype). When omitted, as in the example below, semantic types are determined automatically. (default: None)
target_col (str, optional) – Name of the target column. (default: None)
**kwargs – Additional keyword arguments, such as col_to_text_embedder_cfg in the example below.

Example

Load the spotify-tracks-dataset from the Hugging Face Hub into a torch_frame.data.Dataset:

>>> from torch_frame.datasets import HuggingFaceDatasetDict
>>> from torch_frame.config.text_embedder import TextEmbedderConfig
>>> from torch_frame.testing.text_embedder import HashTextEmbedder
>>> dataset = HuggingFaceDatasetDict(
...     path="maharshipandya/spotify-tracks-dataset",
...     columns=["artists", "album_name", "track_name",
...              "popularity", "duration_ms", "explicit",
...              "danceability", "energy", "key", "loudness",
...              "mode", "speechiness", "acousticness",
...              "instrumentalness", "liveness", "valence",
...              "tempo", "time_signature", "track_genre"
...     ],
...     target_col="track_genre",
...     col_to_text_embedder_cfg=TextEmbedderConfig(
...         text_embedder=HashTextEmbedder(10)),
... )
>>> dataset.materialize()
>>> dataset.tensor_frame
TensorFrame(
    num_cols=18,
    num_rows=114000,
    numerical (11): [
        'acousticness',
        'danceability',
        'duration_ms',
        'energy',
        'instrumentalness',
        'liveness',
        'loudness',
        'popularity',
        'speechiness',
        'tempo',
        'valence',
    ],
    categorical (4): [
        'explicit',
        'key',
        'mode',
        'time_signature',
    ],
    embedding (3): ['artists', 'album_name', 'track_name'],
    has_target=True,
    device='cpu',
)
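The example passes 19 column names, yet the printed TensorFrame reports num_cols=18. A likely reading, consistent with has_target=True, is that the target column is held separately from the feature columns. The short check below copies the names from the example and the printed output and confirms that the arithmetic works out under that assumption; it does not call torch_frame.

```python
# Column names copied from the example above.
columns = [
    "artists", "album_name", "track_name", "popularity", "duration_ms",
    "explicit", "danceability", "energy", "key", "loudness", "mode",
    "speechiness", "acousticness", "instrumentalness", "liveness",
    "valence", "tempo", "time_signature", "track_genre",
]
target_col = "track_genre"

# Semantic-type groups as printed in the TensorFrame output above.
numerical = [
    "acousticness", "danceability", "duration_ms", "energy",
    "instrumentalness", "liveness", "loudness", "popularity",
    "speechiness", "tempo", "valence",
]
categorical = ["explicit", "key", "mode", "time_signature"]
embedding = ["artists", "album_name", "track_name"]

# Assumption: the target is excluded from the feature column count.
feature_cols = [c for c in columns if c != target_col]
grouped = numerical + categorical + embedding

print(len(columns), len(feature_cols), len(grouped))
```

Every requested column other than track_genre appears in exactly one semantic-type group, which accounts for the 11 + 4 + 3 = 18 feature columns.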