torch_frame.datasets.DataFrameTextBenchmark

class DataFrameTextBenchmark(root: str, task_type: TaskType, scale: str, idx: int, text_stype: stype = stype.text_embedded, col_to_text_embedder_cfg: Optional[Union[dict[str, torch_frame.config.text_embedder.TextEmbedderConfig], TextEmbedderConfig]] = None, col_to_text_tokenizer_cfg: Optional[Union[dict[str, torch_frame.config.text_tokenizer.TextTokenizerConfig], TextTokenizerConfig]] = None, split_random_state: int = 42)

Bases: Dataset

A collection of datasets for tabular learning with text columns, covering categorical, numerical, multi-categorical and timestamp features. The datasets are categorized according to their task types and scales.

Parameters:
  • root (str) – Root directory.

  • task_type (TaskType) – The task type. One of TaskType.BINARY_CLASSIFICATION, TaskType.MULTICLASS_CLASSIFICATION, or TaskType.REGRESSION.

  • scale (str) – The scale of the dataset. "small" means 5K to 50K rows. "medium" means 50K to 500K rows. "large" means more than 500K rows.

  • text_stype (torch_frame.stype) – Text stype to use for text columns in the dataset. (default: torch_frame.stype.text_embedded)

  • idx (int) – The index of the dataset within a category specified via task_type and scale.
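A minimal construction sketch, assuming torch_frame is installed and that embed_fn is a user-supplied callable mapping a list of strings to a 2-D tensor of embeddings. The names load_text_benchmark, embed_fn and VALID_SCALES are hypothetical and introduced only for illustration, and the import location of TaskType is an assumption. The scale check runs before the imports so that a mistyped scale fails early.

```python
VALID_SCALES = ("small", "medium", "large")


def load_text_benchmark(root, embed_fn, scale="small", idx=0):
    """Build a binary-classification text benchmark (hypothetical helper)."""
    if scale not in VALID_SCALES:
        raise ValueError(f"scale must be one of {VALID_SCALES}, got {scale!r}")

    # Imported lazily so the scale check above works without torch_frame.
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import DataFrameTextBenchmark
    from torch_frame.typing import TaskType  # assumed import location

    return DataFrameTextBenchmark(
        root=root,
        task_type=TaskType.BINARY_CLASSIFICATION,
        scale=scale,
        idx=idx,
        # One config shared by every text column; per the signature, a dict
        # keyed by column name is also accepted.
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=embed_fn, batch_size=8),
    )
```

Passing a single TextEmbedderConfig keeps the sketch short; a per-column dict would be the choice when different text columns need different embedders.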

STATS:

| Task | Scale | Idx | #rows | #cols (numerical) | #cols (categorical) | #cols (text) | #cols (other) | #classes | Class object | Missing value ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| binary_classification | small | 0 | 15,907 | 0 | 3 | 2 | 0 | 2 | MultimodalTextBenchmark(name='fake_job_postings2') | 23.8% |
| binary_classification | medium | 0 | 125,000 | 29 | 0 | 1 | 0 | 2 | MultimodalTextBenchmark(name='jigsaw_unintended_bias100K') | 41.4% |
| binary_classification | medium | 1 | 108,128 | 1 | 3 | 3 | 2 | 2 | MultimodalTextBenchmark(name='kick_starter_funding') | 0.0% |
| multiclass_classification | small | 0 | 6,364 | 0 | 1 | 1 | 0 | 4 | MultimodalTextBenchmark(name='product_sentiment_machine_hack') | 0.0% |
| multiclass_classification | small | 1 | 25,355 | 14 | 0 | 1 | 0 | 6 | MultimodalTextBenchmark(name='news_channel') | 0.0% |
| multiclass_classification | small | 2 | 19,802 | 0 | 3 | 2 | 1 | 6 | MultimodalTextBenchmark(name='data_scientist_salary') | 12.3% |
| multiclass_classification | small | 3 | 22,895 | 26 | 47 | 13 | 3 | 10 | MultimodalTextBenchmark(name='melbourne_airbnb') | 9.6% |
| multiclass_classification | medium | 0 | 105,154 | 2 | 2 | 1 | 0 | 30 | MultimodalTextBenchmark(name='wine_reviews') | 1.0% |
| multiclass_classification | medium | 1 | 114,000 | 11 | 5 | 3 | 0 | 114 | HuggingFaceDatasetDict(path='maharshipandya/spotify-tracks-dataset', target_col='track_genre') | 0.0% |
| multiclass_classification | large | 0 | 568,454 | 2 | 3 | 2 | 0 | 5 | AmazonFineFoodReviews() | 0.0% |
| regression | small | 0 | 6,079 | 0 | 1 | 3 | 0 | 1 | MultimodalTextBenchmark(name='google_qa_answer_type_reason_explanation') | 0.0% |
| regression | small | 1 | 6,079 | 0 | 1 | 3 | 0 | 1 | MultimodalTextBenchmark(name='google_qa_question_type_reason_explanation') | 0.0% |
| regression | small | 2 | 6,237 | 2 | 3 | 3 | 0 | 1 | MultimodalTextBenchmark(name='bookprice_prediction') | 1.7% |
| regression | small | 3 | 13,575 | 2 | 1 | 2 | 0 | 1 | MultimodalTextBenchmark(name='jc_penney_products') | 13.7% |
| regression | small | 4 | 23,486 | 1 | 3 | 2 | 0 | 1 | MultimodalTextBenchmark(name='women_clothing_review') | 1.8% |
| regression | small | 5 | 30,009 | 3 | 0 | 1 | 0 | 1 | MultimodalTextBenchmark(name='news_popularity2') | 0.0% |
| regression | small | 6 | 28,328 | 2 | 5 | 1 | 3 | 1 | MultimodalTextBenchmark(name='ae_price_prediction') | 6.1% |
| regression | small | 7 | 47,439 | 18 | 8 | 2 | 11 | 1 | MultimodalTextBenchmark(name='california_house_price') | 13.8% |
| regression | medium | 0 | 125,000 | 0 | 6 | 2 | 1 | 1 | MultimodalTextBenchmark(name='mercari_price_suggestion100K') | 3.4% |
| regression | large | 0 | 1,482,535 | 1 | 4 | 2 | 1 | 1 | Mercari() | 0.0% |
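The scale labels in the STATS entries follow the row-count ranges given for the scale parameter. The sketch below spells those ranges out as a function and checks them against a few row counts copied from the table. scale_for_num_rows is a hypothetical helper, not part of torch_frame, and its handling of the exact boundaries (50K and 500K rows) is an assumption, since the description does not say which side they fall on.

```python
def scale_for_num_rows(num_rows):
    """Map a row count to the scale label described in the parameter list."""
    if num_rows < 5_000:
        raise ValueError("fewer than 5K rows: below the 'small' range")
    if num_rows < 50_000:
        return "small"
    if num_rows < 500_000:  # assumption: exactly 50K counts as "medium"
        return "medium"
    return "large"  # assumption: exactly 500K counts as "large"


# Row counts copied from the STATS table, with the scale listed next to them.
examples = {
    15_907: "small",
    108_128: "medium",
    568_454: "large",
    1_482_535: "large",
}
for rows, expected in examples.items():
    assert scale_for_num_rows(rows) == expected
```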

classmethod datasets_available(task_type: TaskType, scale: str) → list[tuple[str, dict[str, Any]]]

List of datasets available for a given task_type and scale.

classmethod num_datasets_available(task_type: TaskType, scale: str)

Number of datasets available for a given task_type and scale.
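As a rough illustration of how idx relates to these counts, the sketch below tallies the datasets per (task, scale) category by counting rows of the STATS table and uses the tally to check whether an idx is in range. NUM_AVAILABLE and is_valid_idx are hypothetical stand-ins written for this example, not the library's implementation, and the string keys are the task and scale names as printed in the table.

```python
# Number of datasets per (task, scale), counted from the STATS table rows.
# Combinations that do not appear in the table are omitted.
NUM_AVAILABLE = {
    ("binary_classification", "small"): 1,
    ("binary_classification", "medium"): 2,
    ("multiclass_classification", "small"): 4,
    ("multiclass_classification", "medium"): 2,
    ("multiclass_classification", "large"): 1,
    ("regression", "small"): 8,
    ("regression", "medium"): 1,
    ("regression", "large"): 1,
}


def is_valid_idx(task, scale, idx):
    """Return True if idx selects a dataset in the (task, scale) category."""
    return 0 <= idx < NUM_AVAILABLE.get((task, scale), 0)
```

With the library installed, the same count would come from DataFrameTextBenchmark.num_datasets_available(task_type, scale), which avoids hard-coding numbers that may change between releases.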

materialize(*args, **kwargs) → Dataset

Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.

Parameters:
  • device (torch.device, optional) – Device to load the TensorFrame object onto. (default: None)

  • path (str, optional) – If path is specified and a cached file exists, this will try to load the saved TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)

  • col_stats (Dict[str, Dict[StatType, Any]], optional) – Optional col_stats provided by the user. If not provided, the statistics are calculated from the dataframe itself. (default: None)
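The three path cases described above amount to a load-or-compute cache. The sketch below shows that pattern in plain Python, using pickle as a stand-in for whatever serialization the library uses; materialize_with_cache and expensive are hypothetical names, and this is an illustration of the behavior, not the actual implementation.

```python
import os
import pickle
import tempfile


def materialize_with_cache(compute, path=None):
    """Load-or-compute pattern mirroring the three `path` cases."""
    if path is not None and os.path.isfile(path):
        # Case 1: a cached file exists, so load it and skip the computation.
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    if path is not None:
        # Case 2: no cached file yet, so save the fresh result for next time.
        with open(path, "wb") as f:
            pickle.dump(result, f)
    # Case 3 (path is None) falls through without touching the disk.
    return result


calls = []


def expensive():
    """Stand-in for materialization; records each time it actually runs."""
    calls.append(1)
    return {"tensor_frame": [1, 2, 3], "col_stats": {"x": {"mean": 2.0}}}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "data.pkl")
    first = materialize_with_cache(expensive, cache)   # computes and saves
    second = materialize_with_cache(expensive, cache)  # loads from the cache
```

In practice the equivalent call would look like dataset.materialize(path=...) with a file path of the caller's choosing; reusing the same path across runs is what lets later runs skip the text-embedding step.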