torch_frame.datasets.MultimodalTextBenchmark

class MultimodalTextBenchmark(root: str, name: str, text_stype: torch_frame.stype = stype.text_embedded, col_to_text_embedder_cfg: dict[str, TextEmbedderConfig] | TextEmbedderConfig | None = None, col_to_text_tokenizer_cfg: dict[str, TextTokenizerConfig] | TextTokenizerConfig | None = None)

Bases: Dataset

Benchmark datasets of tabular data with text columns, as used in “Benchmarking Multimodal AutoML for Tabular Data with Text Fields”. For some regression datasets, the target column is transformed from log scale back to the original scale.

Parameters:
  • name (str) – The name of the dataset to download.

  • text_stype (torch_frame.stype) – Text stype to use for text columns in the dataset. (default: torch_frame.text_embedded)
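A minimal usage sketch follows, assuming `torch` and `torch_frame` are installed and that `TextEmbedderConfig` is importable from `torch_frame.config.text_embedder`. The names `toy_features`, `toy_text_embedder`, and `load_benchmark` are hypothetical helpers introduced here for illustration, and the hashing embedder is only a stand-in for a real text model: it carries no semantic information.

```python
import zlib


def toy_features(sentence: str, dim: int = 8) -> list[float]:
    """Hypothetical helper: hashed bag-of-words counts (not a real embedding)."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def toy_text_embedder(sentences: list[str]):
    """Map a batch of sentences to a float tensor of shape [len(sentences), 8]."""
    import torch  # imported lazily so toy_features stays dependency-free

    return torch.tensor([toy_features(s) for s in sentences], dtype=torch.float32)


def load_benchmark(root: str, name: str):
    """Hypothetical wrapper: build and materialize one benchmark dataset."""
    # Assumed import locations; adjust if your torch_frame version differs.
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import MultimodalTextBenchmark

    dataset = MultimodalTextBenchmark(
        root=root,
        name=name,
        # text_stype is left at its default (stype.text_embedded), so an
        # embedder config is supplied for the text columns.
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=toy_text_embedder,
            batch_size=32,
        ),
    )
    dataset.materialize()  # embeds the text columns and builds the tensor frame
    return dataset
```

Calling `load_benchmark("/tmp/mmtb", "imdb_genre_prediction")` would download the smallest dataset listed in the statistics below; the path is arbitrary. In practice, `toy_text_embedder` would be replaced by a callable wrapping a pretrained sentence encoder.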

STATS:

Name                                        #rows    #cols (numerical)  #cols (categorical)  #cols (text)  #cols (other)  #classes  Task                       Missing value ratio
------------------------------------------  -------  -----------------  -------------------  ------------  -------------  --------  -------------------------  -------------------
product_sentiment_machine_hack              6,364    0                  1                    1             0              4         multiclass_classification  0.0%
jigsaw_unintended_bias100K                  125,000  29                 0                    1             0              2         binary_classification      41.4%
news_channel                                25,355   14                 0                    1             0              6         multiclass_classification  0.0%
wine_reviews                                105,154  2                  2                    1             0              30        multiclass_classification  1.0%
data_scientist_salary                       19,802   0                  3                    2             1              6         multiclass_classification  12.3%
melbourne_airbnb                            22,895   26                 47                   13            3              10        multiclass_classification  9.6%
imdb_genre_prediction                       1,000    7                  1                    2             1              2         binary_classification      0.0%
kick_starter_funding                        108,128  1                  3                    3             2              2         binary_classification      0.0%
fake_job_postings2                          15,907   0                  3                    2             0              2         binary_classification      23.8%
google_qa_answer_type_reason_explanation    6,079    0                  1                    3             0              1         regression                 0.0%
google_qa_question_type_reason_explanation  6,079    0                  1                    3             0              1         regression                 0.0%
bookprice_prediction                        6,237    2                  3                    3             0              1         regression                 1.7%
jc_penney_products                          13,575   2                  1                    2             0              1         regression                 13.7%
women_clothing_review                       23,486   1                  3                    2             0              1         regression                 1.8%
news_popularity2                            30,009   3                  0                    1             0              1         regression                 0.0%
ae_price_prediction                         28,328   2                  5                    1             3              1         regression                 6.1%
california_house_price                      47,439   18                 8                    2             11             1         regression                 13.8%
mercari_price_suggestion100K                125,000  0                  6                    2             1              1         regression                 3.4%
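When looping over the benchmark, it can help to pick dataset names by task and size. The sketch below copies the row counts and task types from the statistics above into a plain dictionary. `DatasetStats`, `STATS`, and `names_for_task` are hypothetical names introduced here and are not part of torch_frame.

```python
from __future__ import annotations

from typing import NamedTuple


class DatasetStats(NamedTuple):
    num_rows: int
    task: str


# Row counts and task types copied from the statistics above.
STATS: dict[str, DatasetStats] = {
    "product_sentiment_machine_hack": DatasetStats(6_364, "multiclass_classification"),
    "jigsaw_unintended_bias100K": DatasetStats(125_000, "binary_classification"),
    "news_channel": DatasetStats(25_355, "multiclass_classification"),
    "wine_reviews": DatasetStats(105_154, "multiclass_classification"),
    "data_scientist_salary": DatasetStats(19_802, "multiclass_classification"),
    "melbourne_airbnb": DatasetStats(22_895, "multiclass_classification"),
    "imdb_genre_prediction": DatasetStats(1_000, "binary_classification"),
    "kick_starter_funding": DatasetStats(108_128, "binary_classification"),
    "fake_job_postings2": DatasetStats(15_907, "binary_classification"),
    "google_qa_answer_type_reason_explanation": DatasetStats(6_079, "regression"),
    "google_qa_question_type_reason_explanation": DatasetStats(6_079, "regression"),
    "bookprice_prediction": DatasetStats(6_237, "regression"),
    "jc_penney_products": DatasetStats(13_575, "regression"),
    "women_clothing_review": DatasetStats(23_486, "regression"),
    "news_popularity2": DatasetStats(30_009, "regression"),
    "ae_price_prediction": DatasetStats(28_328, "regression"),
    "california_house_price": DatasetStats(47_439, "regression"),
    "mercari_price_suggestion100K": DatasetStats(125_000, "regression"),
}


def names_for_task(task: str, max_rows: int | None = None) -> list[str]:
    """Return dataset names for a task, optionally capped by row count."""
    return [
        name
        for name, stats in STATS.items()
        if stats.task == task and (max_rows is None or stats.num_rows <= max_rows)
    ]
```

For example, `names_for_task("regression", max_rows=10_000)` selects the two `google_qa_*` datasets and `bookprice_prediction`, which is convenient for a quick smoke test before running the larger datasets.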