torch_frame.datasets.MultimodalTextBenchmark

class MultimodalTextBenchmark(root: str, name: str, text_stype: torch_frame.stype = stype.text_embedded, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None)

Benchmark datasets of tabular data with text columns, as used in “Benchmarking Multimodal AutoML for Tabular Data with Text Fields”. For some regression datasets, the target column is transformed from log scale back to the original scale.

Parameters:

name (str) – The name of the dataset to download.
text_stype (torch_frame.stype) – Text stype to use for text columns in the dataset. (default: torch_frame.text_embedded)
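
A minimal sketch of how the parameters above might be combined, assuming the default text_stype of stype.text_embedded and a single TextEmbedderConfig shared by all text columns. The helper names hashed_bag, load_benchmark and toy_embedder are hypothetical and not part of torch_frame. The hashed bag-of-words embedder is only a lightweight stand-in for a real sentence-embedding model, and the sketch assumes that TextEmbedderConfig accepts a callable mapping a list of strings to a 2-D tensor, along with a batch_size.

```python
import zlib


def hashed_bag(sentence: str, dim: int) -> list[float]:
    """Count whitespace tokens into `dim` buckets using a stable CRC32 hash."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def load_benchmark(root: str, name: str, dim: int = 16):
    """Download dataset `name` into `root` and materialize it.

    Hypothetical helper: requires torch, torch_frame and network access.
    """
    # Imported lazily so that hashed_bag stays usable without torch installed.
    import torch
    from torch_frame import stype
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import MultimodalTextBenchmark

    def toy_embedder(sentences: list[str]) -> torch.Tensor:
        # Toy stand-in for a real text model; output shape [len(sentences), dim].
        return torch.tensor([hashed_bag(s, dim) for s in sentences])

    dataset = MultimodalTextBenchmark(
        root=root,
        name=name,
        text_stype=stype.text_embedded,
        # One config applied to every text column (assumption: a per-column
        # dict could be passed instead, as the signature suggests).
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=toy_embedder,
            batch_size=None,
        ),
    )
    dataset.materialize()  # runs the embedder over the text columns
    return dataset
```

Calling load_benchmark("data/multimodal_text", "product_sentiment_machine_hack") (the root path is an arbitrary choice) would download one of the smaller datasets listed below; the materialized data should then be available as dataset.tensor_frame. In practice the toy embedder would be replaced by a pretrained language model.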

STATS:

Name	#rows	#cols (numerical)	#cols (categorical)	#cols (text)	#cols (other)	#classes	Task	Missing value ratio
product_sentiment_machine_hack	6,364	0	1	1	0	4	multiclass_classification	0.0%
jigsaw_unintended_bias100K	125,000	29	0	1	0	2	binary_classification	41.4%
news_channel	25,355	14	0	1	0	6	multiclass_classification	0.0%
wine_reviews	105,154	2	2	1	0	30	multiclass_classification	1.0%
data_scientist_salary	19,802	0	3	2	1	6	multiclass_classification	12.3%
melbourne_airbnb	22,895	26	47	13	3	10	multiclass_classification	9.6%
imdb_genre_prediction	1,000	7	1	2	1	2	binary_classification	0.0%
kick_starter_funding	108,128	1	3	3	2	2	binary_classification	0.0%
fake_job_postings2	15,907	0	3	2	0	2	binary_classification	23.8%
google_qa_answer_type_reason_explanation	6,079	0	1	3	0	1	regression	0.0%
google_qa_question_type_reason_explanation	6,079	0	1	3	0	1	regression	0.0%
bookprice_prediction	6,237	2	3	3	0	1	regression	1.7%
jc_penney_products	13,575	2	1	2	0	1	regression	13.7%
women_clothing_review	23,486	1	3	2	0	1	regression	1.8%
news_popularity2	30,009	3	0	1	0	1	regression	0.0%
ae_price_prediction	28,328	2	5	1	3	1	regression	6.1%
california_house_price	47,439	18	8	2	11	1	regression	13.8%
mercari_price_suggestion100K	125,000	0	6	2	1	1	regression	3.4%