torch_frame.datasets.HuggingFaceDatasetDict
class HuggingFaceDatasetDict(path: str, name: Optional[str] = None, columns: Optional[list[str]] = None, col_to_stype: Optional[dict[str, torch_frame._stype.stype]] = None, target_col: Optional[str] = None, **kwargs)
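Bases: Dataset

Load a Hugging Face datasets.DatasetDict dataset into a torch_frame.data.Dataset with pre-defined split information. To use this class, first install the Datasets package. For all available dataset paths and names, refer to Hugging Face Datasets.

Parameters:
- path (str)
- name (Optional[str], default: None)
- columns (Optional[list[str]], default: None)
- col_to_stype (Optional[dict[str, torch_frame._stype.stype]], default: None)
- target_col (Optional[str], default: None)
- **kwargs: additional keyword arguments; the example on this page passes col_to_text_embedder_cfg this way.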
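To make "pre-defined split information" concrete, the following is a minimal, library-free sketch of the general idea: several named splits are concatenated into one table and each row is tagged with the split it came from. The names merge_splits, get_split, SPLIT_TO_ID, and the "split" column are hypothetical and chosen for illustration; they are not confirmed to match how HuggingFaceDatasetDict stores split information internally.

```python
# Hypothetical stand-in for a DatasetDict: split name -> list of row dicts.
splits = {
    "train": [
        {"track_name": "a", "tempo": 120.0},
        {"track_name": "b", "tempo": 98.5},
    ],
    "validation": [{"track_name": "c", "tempo": 140.0}],
    "test": [{"track_name": "d", "tempo": 75.0}],
}

# Assumed integer encoding of split names (illustrative only).
SPLIT_TO_ID = {"train": 0, "validation": 1, "test": 2}


def merge_splits(splits, split_col="split"):
    """Concatenate all splits into one list of rows, tagging each row with its split id."""
    rows = []
    for split_name, split_rows in splits.items():
        for row in split_rows:
            tagged = dict(row)  # copy so the input rows are not mutated
            tagged[split_col] = SPLIT_TO_ID[split_name]
            rows.append(tagged)
    return rows


def get_split(rows, split_name, split_col="split"):
    """Return only the rows that belong to the named split."""
    split_id = SPLIT_TO_ID[split_name]
    return [row for row in rows if row[split_col] == split_id]


merged = merge_splits(splits)
train_rows = get_split(merged, "train")
```

Keeping all rows in one table with a split column means the original partition can be recovered at any time without re-downloading or re-splitting the data.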
Example
Load the spotify-tracks-dataset dataset from the Hugging Face Hub into a torch_frame.data.Dataset:

    >>> from torch_frame.datasets import HuggingFaceDatasetDict
    >>> from torch_frame.config.text_embedder import TextEmbedderConfig
    >>> from torch_frame.testing.text_embedder import HashTextEmbedder
    >>> dataset = HuggingFaceDatasetDict(
    ...     path="maharshipandya/spotify-tracks-dataset",
    ...     columns=["artists", "album_name", "track_name",
    ...              "popularity", "duration_ms", "explicit",
    ...              "danceability", "energy", "key", "loudness",
    ...              "mode", "speechiness", "acousticness",
    ...              "instrumentalness", "liveness", "valence",
    ...              "tempo", "time_signature", "track_genre"
    ...     ],
    ...     target_col="track_genre",
    ...     col_to_text_embedder_cfg=TextEmbedderConfig(
    ...         text_embedder=HashTextEmbedder(10)),
    ... )
    >>> dataset.materialize()
    >>> dataset.tensor_frame
    TensorFrame(
      num_cols=18,
      num_rows=114000,
      numerical (11): [
        'acousticness',
        'danceability',
        'duration_ms',
        'energy',
        'instrumentalness',
        'liveness',
        'loudness',
        'popularity',
        'speechiness',
        'tempo',
        'valence',
      ],
      categorical (4): [
        'explicit',
        'key',
        'mode',
        'time_signature',
      ],
      embedding (3): ['artists', 'album_name', 'track_name'],
      has_target=True,
      device='cpu',
    )
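The counts in the printed TensorFrame can be cross-checked by hand: 19 columns are requested, one of them (track_genre) is the target, and the remaining 18 are divided among the three semantic types. The snippet below is a plain-Python sanity check of that arithmetic using only the names shown in the example; it does not call torch_frame, and the variable names are illustrative.

```python
# Columns requested in the example above.
columns = [
    "artists", "album_name", "track_name",
    "popularity", "duration_ms", "explicit",
    "danceability", "energy", "key", "loudness",
    "mode", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence",
    "tempo", "time_signature", "track_genre",
]
target_col = "track_genre"

# Column groups copied from the printed TensorFrame.
stype_groups = {
    "numerical": [
        "acousticness", "danceability", "duration_ms", "energy",
        "instrumentalness", "liveness", "loudness", "popularity",
        "speechiness", "tempo", "valence",
    ],
    "categorical": ["explicit", "key", "mode", "time_signature"],
    "embedding": ["artists", "album_name", "track_name"],
}

# Feature columns are all requested columns except the target.
feature_cols = [c for c in columns if c != target_col]
grouped_cols = [c for cols in stype_groups.values() for c in cols]

num_cols = len(feature_cols)
counts = {name: len(cols) for name, cols in stype_groups.items()}
groups_cover_features = sorted(feature_cols) == sorted(grouped_cols)
```

Here num_cols is 18, matching num_cols=18 in the output, and the per-type counts are 11, 4, and 3. The target column is not part of any group, which is consistent with has_target=True being reported separately.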