torch_frame.datasets.DataFrameTextBenchmark

class DataFrameTextBenchmark(root: str, task_type: TaskType, scale: str, idx: int, text_stype: stype = stype.text_embedded, col_to_text_embedder_cfg: Optional[Union[dict[str, torch_frame.config.text_embedder.TextEmbedderConfig], TextEmbedderConfig]] = None, col_to_text_tokenizer_cfg: Optional[Union[dict[str, torch_frame.config.text_tokenizer.TextTokenizerConfig], TextTokenizerConfig]] = None, split_random_state: int = 42)

Bases: Dataset

A collection of datasets for tabular learning with text columns, covering categorical, numerical, multi-categorical and timestamp features. The datasets are categorized according to their task types and scales.

Parameters:

root (str) – Root directory.
task_type (TaskType) – The task type. One of TaskType.BINARY_CLASSIFICATION, TaskType.MULTICLASS_CLASSIFICATION, or TaskType.REGRESSION.
scale (str) – The scale of the dataset. "small" means 5K to 50K rows. "medium" means 50K to 500K rows. "large" means more than 500K rows.
text_stype (torch_frame.stype) – Text stype to use for text columns in the dataset. (default: torch_frame.text_embedded).
idx (int) – The index of the dataset within a category specified via task_type and scale.
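
To make the parameters above concrete, here is a minimal sketch of constructing one benchmark dataset. It assumes torch and torch_frame are installed and that the dataset can be downloaded into root. The helper name build_text_benchmark and the toy_embedder function are hypothetical and exist only for illustration; a real setup would plug in an actual sentence-embedding model. Imports are deferred into the function so that defining it has no side effects.

```python
def build_text_benchmark(root: str, idx: int = 0):
    """Construct a small regression benchmark with a toy text embedder.

    Hypothetical helper: not part of torch_frame.
    """
    import torch
    import torch_frame
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import DataFrameTextBenchmark
    from torch_frame.typing import TaskType

    def toy_embedder(sentences: list[str]) -> torch.Tensor:
        # Hypothetical stand-in for a real embedding model: an 8-dim vector
        # per sentence whose first entry is the sentence length.
        out = torch.zeros(len(sentences), 8)
        for i, sentence in enumerate(sentences):
            out[i, 0] = float(len(sentence))
        return out

    return DataFrameTextBenchmark(
        root=root,
        task_type=TaskType.REGRESSION,
        scale="small",
        idx=idx,
        text_stype=torch_frame.text_embedded,
        # A single config is applied to every text column; a dict keyed by
        # column name can be passed instead, per the signature above.
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=toy_embedder,
            batch_size=32,
        ),
    )
```

Calling build_text_benchmark("/tmp/text_benchmark") would select the regression / small / idx 0 entry of the STATS table.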

STATS:

Task	Scale	Idx	#rows	#cols (numerical)	#cols (categorical)	#cols (text)	#cols (other)	#classes	Class object	Missing value ratio
binary_classification	small	0	15,907	0	3	2	0	2	MultimodalTextBenchmark(name='fake_job_postings2')	23.8%
binary_classification	medium	0	125,000	29	0	1	0	2	MultimodalTextBenchmark(name='jigsaw_unintended_bias100K')	41.4%
binary_classification	medium	1	108,128	1	3	3	2	2	MultimodalTextBenchmark(name='kick_starter_funding')	0.0%
multiclass_classification	small	0	6,364	0	1	1	0	4	MultimodalTextBenchmark(name='product_sentiment_machine_hack')	0.0%
multiclass_classification	small	1	25,355	14	0	1	0	6	MultimodalTextBenchmark(name='news_channel')	0.0%
multiclass_classification	small	2	19,802	0	3	2	1	6	MultimodalTextBenchmark(name='data_scientist_salary')	12.3%
multiclass_classification	small	3	22,895	26	47	13	3	10	MultimodalTextBenchmark(name='melbourne_airbnb')	9.6%
multiclass_classification	medium	0	105,154	2	2	1	0	30	MultimodalTextBenchmark(name='wine_reviews')	1.0%
multiclass_classification	medium	1	114,000	11	5	3	0	114	HuggingFaceDatasetDict(path='maharshipandya/spotify-tracks-dataset', target_col='track_genre')	0.0%
multiclass_classification	large	0	568,454	2	3	2	0	5	AmazonFineFoodReviews()	0.0%
regression	small	0	6,079	0	1	3	0	1	MultimodalTextBenchmark(name='google_qa_answer_type_reason_explanation')	0.0%
regression	small	1	6,079	0	1	3	0	1	MultimodalTextBenchmark(name='google_qa_question_type_reason_explanation')	0.0%
regression	small	2	6,237	2	3	3	0	1	MultimodalTextBenchmark(name='bookprice_prediction')	1.7%
regression	small	3	13,575	2	1	2	0	1	MultimodalTextBenchmark(name='jc_penney_products')	13.7%
regression	small	4	23,486	1	3	2	0	1	MultimodalTextBenchmark(name='women_clothing_review')	1.8%
regression	small	5	30,009	3	0	1	0	1	MultimodalTextBenchmark(name='news_popularity2')	0.0%
regression	small	6	28,328	2	5	1	3	1	MultimodalTextBenchmark(name='ae_price_prediction')	6.1%
regression	small	7	47,439	18	8	2	11	1	MultimodalTextBenchmark(name='california_house_price')	13.8%
regression	medium	0	125,000	0	6	2	1	1	MultimodalTextBenchmark(name='mercari_price_suggestion100K')	3.4%
regression	large	0	1,482,535	1	4	2	1	1	Mercari()	0.0%
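
The scale labels in the table follow the row-count ranges given for the scale parameter. The sketch below encodes those ranges in a hypothetical helper, scale_for_rows, and checks a few rows copied from the table. How exact boundary values (50,000 or 500,000 rows) are assigned is an assumption here, since the description does not say.

```python
def scale_for_rows(num_rows: int) -> str:
    """Map a row count to the documented scale label (hypothetical helper)."""
    if num_rows > 500_000:
        return "large"   # "more than 500K rows"
    if num_rows >= 50_000:
        return "medium"  # assumption: exactly 50K counts as medium
    if num_rows >= 5_000:
        return "small"
    raise ValueError(f"{num_rows} rows is below the smallest documented scale")


# (scale, #rows) pairs copied from the STATS table.
table_rows = [
    ("small", 15_907),
    ("medium", 125_000),
    ("large", 568_454),
    ("small", 6_079),
    ("large", 1_482_535),
]
for expected_scale, num_rows in table_rows:
    assert scale_for_rows(num_rows) == expected_scale
```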

classmethod datasets_available(task_type: TaskType, scale: str) → list[tuple[str, dict[str, Any]]]

List of datasets available for a given task_type and scale.

classmethod num_datasets_available(task_type: TaskType, scale: str)

Number of datasets available for a given task_type and scale.
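
As a sketch of how the two class methods fit together, the hypothetical helper below lists the entries of one category. It takes the benchmark class as an argument, so it works with any class exposing these two class methods. It assumes that each entry is a (class name, keyword arguments) pair, as the return type suggests, and that the position in the list corresponds to idx.

```python
def summarize_category(benchmark_cls, task_type, scale: str) -> list[str]:
    """Return one 'idx: ClassName(kwargs)' line per dataset in a category."""
    entries = benchmark_cls.datasets_available(task_type, scale)
    # The count should agree with the length of the returned list.
    assert benchmark_cls.num_datasets_available(task_type, scale) == len(entries)

    lines = []
    for idx, (class_name, kwargs) in enumerate(entries):
        args = ", ".join(f"{key}={value!r}" for key, value in kwargs.items())
        lines.append(f"{idx}: {class_name}({args})")
    return lines
```

With torch_frame installed, summarize_category(DataFrameTextBenchmark, TaskType.REGRESSION, "small") would be one way to inspect a category before picking an idx.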

materialize(*args, **kwargs) → Dataset

Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.

Parameters:

device (torch.device, optional) – Device on which to load the TensorFrame object. (default: None)
path (str, optional) – If path is specified and a cached file exists, this will try to load the saved TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)
col_stats (Dict[str, Dict[StatType, Any]], optional) – Optional col_stats provided by the user. If not provided, the statistics are calculated from the dataframe itself. (default: None)
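
The caching behaviour of path can be sketched as follows. The helper materialize_with_cache and the file name data.pt are hypothetical choices for illustration. The sketch assumes the dataset exposes tensor_frame and col_stats after materialization, as torch_frame Dataset objects do.

```python
import os


def materialize_with_cache(dataset, cache_dir: str, filename: str = "data.pt"):
    """Materialize a dataset, reusing a cached file under cache_dir if present.

    Hypothetical helper: not part of torch_frame.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    # First call: materializes and saves to `path`.
    # Later calls: load the saved TensorFrame and col_stats from `path`.
    dataset.materialize(path=path)
    return dataset.tensor_frame, dataset.col_stats
```

Caching is most useful with text_embedded columns, where materialization has to run the text embedder over every row.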