torch_frame.datasets.MultimodalTextBenchmark
- class MultimodalTextBenchmark(root: str, name: str, text_stype: torch_frame.stype = stype.text_embedded, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None)
Bases: Dataset
Benchmark datasets of tabular data with text columns, as used in “Benchmarking Multimodal AutoML for Tabular Data with Text Fields”. For some regression datasets, the target column is transformed from log scale back to the original scale.
- Parameters:
name (str) – The name of the dataset to download.
text_stype (torch_frame.stype) – Text stype to use for text columns in the dataset. (default: torch_frame.text_embedded)
STATS:
| Name | #rows | #cols (numerical) | #cols (categorical) | #cols (text) | #cols (other) | #classes | Task | Missing value ratio |
|---|---|---|---|---|---|---|---|---|
| product_sentiment_machine_hack | 6,364 | 0 | 1 | 1 | 0 | 4 | multiclass_classification | 0.0% |
| jigsaw_unintended_bias100K | 125,000 | 29 | 0 | 1 | 0 | 2 | binary_classification | 41.4% |
| news_channel | 25,355 | 14 | 0 | 1 | 0 | 6 | multiclass_classification | 0.0% |
| wine_reviews | 105,154 | 2 | 2 | 1 | 0 | 30 | multiclass_classification | 1.0% |
| data_scientist_salary | 19,802 | 0 | 3 | 2 | 1 | 6 | multiclass_classification | 12.3% |
| melbourne_airbnb | 22,895 | 26 | 47 | 13 | 3 | 10 | multiclass_classification | 9.6% |
| imdb_genre_prediction | 1,000 | 7 | 1 | 2 | 1 | 2 | binary_classification | 0.0% |
| kick_starter_funding | 108,128 | 1 | 3 | 3 | 2 | 2 | binary_classification | 0.0% |
| fake_job_postings2 | 15,907 | 0 | 3 | 2 | 0 | 2 | binary_classification | 23.8% |
| google_qa_answer_type_reason_explanation | 6,079 | 0 | 1 | 3 | 0 | 1 | regression | 0.0% |
| google_qa_question_type_reason_explanation | 6,079 | 0 | 1 | 3 | 0 | 1 | regression | 0.0% |
| bookprice_prediction | 6,237 | 2 | 3 | 3 | 0 | 1 | regression | 1.7% |
| jc_penney_products | 13,575 | 2 | 1 | 2 | 0 | 1 | regression | 13.7% |
| women_clothing_review | 23,486 | 1 | 3 | 2 | 0 | 1 | regression | 1.8% |
| news_popularity2 | 30,009 | 3 | 0 | 1 | 0 | 1 | regression | 0.0% |
| ae_price_prediction | 28,328 | 2 | 5 | 1 | 3 | 1 | regression | 6.1% |
| california_house_price | 47,439 | 18 | 8 | 2 | 11 | 1 | regression | 13.8% |
| mercari_price_suggestion100K | 125,000 | 0 | 6 | 2 | 1 | 1 | regression | 3.4% |
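The following is a minimal sketch of how one might pick a dataset by task type and load it with the constructor documented above. The names `TASK_BY_NAME`, `names_for_task`, and `load_benchmark` are hypothetical helpers introduced for illustration. The import path `torch_frame.config.text_embedder`, the `TextEmbedderConfig(text_embedder=..., batch_size=...)` fields, and the `materialize()` call are assumptions about the PyTorch Frame API and should be checked against the installed version. The task mapping is copied from the STATS entries.

```python
from __future__ import annotations

# Task type per dataset name, copied from the STATS entries above.
TASK_BY_NAME = {
    "product_sentiment_machine_hack": "multiclass_classification",
    "jigsaw_unintended_bias100K": "binary_classification",
    "news_channel": "multiclass_classification",
    "wine_reviews": "multiclass_classification",
    "data_scientist_salary": "multiclass_classification",
    "melbourne_airbnb": "multiclass_classification",
    "imdb_genre_prediction": "binary_classification",
    "kick_starter_funding": "binary_classification",
    "fake_job_postings2": "binary_classification",
    "google_qa_answer_type_reason_explanation": "regression",
    "google_qa_question_type_reason_explanation": "regression",
    "bookprice_prediction": "regression",
    "jc_penney_products": "regression",
    "women_clothing_review": "regression",
    "news_popularity2": "regression",
    "ae_price_prediction": "regression",
    "california_house_price": "regression",
    "mercari_price_suggestion100K": "regression",
}


def names_for_task(task: str) -> list[str]:
    """Return the dataset names whose Task entry equals `task`."""
    return [name for name, t in TASK_BY_NAME.items() if t == task]


def load_benchmark(root: str, name: str, text_embedder, batch_size: int = 32):
    """Sketch: construct and materialize one benchmark dataset.

    `text_embedder` is assumed to be a callable that maps a list of
    strings to a 2-D tensor of embeddings (an assumption, not confirmed
    by the text above).
    """
    if name not in TASK_BY_NAME:
        raise ValueError(f"Unknown dataset name: {name!r}")

    # Imported lazily so the helpers above work without torch_frame installed.
    # Both import paths are assumptions about the PyTorch Frame package layout.
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import MultimodalTextBenchmark

    dataset = MultimodalTextBenchmark(
        root=root,
        name=name,
        # text_stype is left at its documented default (text_embedded).
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=text_embedder,
            batch_size=batch_size,
        ),
    )
    # Assumed step: materialization is where text columns get embedded.
    dataset.materialize()
    return dataset
```

For example, `names_for_task("regression")` lists the nine regression datasets, and any of those names could then be passed to `load_benchmark` together with a root directory and an embedding callable of your choice.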