torch_frame.datasets.DataFrameTextBenchmark
- class DataFrameTextBenchmark(root: str, task_type: TaskType, scale: str, idx: int, text_stype: stype = stype.text_embedded, col_to_text_embedder_cfg: Optional[Union[dict[str, torch_frame.config.text_embedder.TextEmbedderConfig], TextEmbedderConfig]] = None, col_to_text_tokenizer_cfg: Optional[Union[dict[str, torch_frame.config.text_tokenizer.TextTokenizerConfig], TextTokenizerConfig]] = None, split_random_state: int = 42)
Bases: Dataset
A collection of datasets for tabular learning with text columns, covering categorical, numerical, multi-categorical and timestamp features. The datasets are categorized according to their task types and scales.
- Parameters:
root (str) – Root directory.
task_type (TaskType) – The task type. Either TaskType.BINARY_CLASSIFICATION, TaskType.MULTICLASS_CLASSIFICATION, or TaskType.REGRESSION.
scale (str) – The scale of the dataset. "small" means 5K to 50K rows. "medium" means 50K to 500K rows. "large" means more than 500K rows.
text_stype (torch_frame.stype) – Text stype to use for text columns in the dataset. (default: torch_frame.text_embedded)
idx (int) – The index of the dataset within a category specified via task_type and scale.
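The parameters above can be combined as in the following minimal sketch. The helper name `load_text_benchmark` and the `VALID_SCALES` constant are hypothetical, and the sketch assumes that a text embedder config must be supplied when `text_stype` is left at its default and that `TextEmbedderConfig` accepts a `text_embedder` callable plus an optional `batch_size`. The torch_frame imports are deferred so that the scale check runs even where the library is not installed.

```python
VALID_SCALES = ("small", "medium", "large")  # the three scales documented above


def load_text_benchmark(root, scale, idx, text_embedder, batch_size=None):
    """Hypothetical helper: build one binary-classification benchmark dataset.

    `text_embedder` is assumed to be a callable that maps a list of strings
    to a 2D tensor of embeddings.
    """
    if scale not in VALID_SCALES:
        raise ValueError(f"scale must be one of {VALID_SCALES}, got {scale!r}")

    # Deferred imports: only needed once the arguments look valid.
    from torch_frame.config.text_embedder import TextEmbedderConfig
    from torch_frame.datasets import DataFrameTextBenchmark
    from torch_frame.typing import TaskType

    return DataFrameTextBenchmark(
        root=root,
        task_type=TaskType.BINARY_CLASSIFICATION,
        scale=scale,
        idx=idx,
        # Passing a single config applies it to every text column.
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=text_embedder,
            batch_size=batch_size,
        ),
    )


# Usage (my_embedder is a placeholder for your own embedding callable):
# dataset = load_text_benchmark("data", "small", 0, text_embedder=my_embedder)
```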
STATS:
| Task | Scale | Idx | #rows | #cols (numerical) | #cols (categorical) | #cols (text) | #cols (other) | #classes | Class object | Missing value ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| binary_classification | small | 0 | 15,907 | 0 | 3 | 2 | 0 | 2 | MultimodalTextBenchmark(name='fake_job_postings2') | 23.8% |
| binary_classification | medium | 0 | 125,000 | 29 | 0 | 1 | 0 | 2 | MultimodalTextBenchmark(name='jigsaw_unintended_bias100K') | 41.4% |
| binary_classification | medium | 1 | 108,128 | 1 | 3 | 3 | 2 | 2 | MultimodalTextBenchmark(name='kick_starter_funding') | 0.0% |
| multiclass_classification | small | 0 | 6,364 | 0 | 1 | 1 | 0 | 4 | MultimodalTextBenchmark(name='product_sentiment_machine_hack') | 0.0% |
| multiclass_classification | small | 1 | 25,355 | 14 | 0 | 1 | 0 | 6 | MultimodalTextBenchmark(name='news_channel') | 0.0% |
| multiclass_classification | small | 2 | 19,802 | 0 | 3 | 2 | 1 | 6 | MultimodalTextBenchmark(name='data_scientist_salary') | 12.3% |
| multiclass_classification | small | 3 | 22,895 | 26 | 47 | 13 | 3 | 10 | MultimodalTextBenchmark(name='melbourne_airbnb') | 9.6% |
| multiclass_classification | medium | 0 | 105,154 | 2 | 2 | 1 | 0 | 30 | MultimodalTextBenchmark(name='wine_reviews') | 1.0% |
| multiclass_classification | medium | 1 | 114,000 | 11 | 5 | 3 | 0 | 114 | HuggingFaceDatasetDict(path='maharshipandya/spotify-tracks-dataset', target_col='track_genre') | 0.0% |
| multiclass_classification | large | 0 | 568,454 | 2 | 3 | 2 | 0 | 5 | AmazonFineFoodReviews() | 0.0% |
| regression | small | 0 | 6,079 | 0 | 1 | 3 | 0 | 1 | MultimodalTextBenchmark(name='google_qa_answer_type_reason_explanation') | 0.0% |
| regression | small | 1 | 6,079 | 0 | 1 | 3 | 0 | 1 | MultimodalTextBenchmark(name='google_qa_question_type_reason_explanation') | 0.0% |
| regression | small | 2 | 6,237 | 2 | 3 | 3 | 0 | 1 | MultimodalTextBenchmark(name='bookprice_prediction') | 1.7% |
| regression | small | 3 | 13,575 | 2 | 1 | 2 | 0 | 1 | MultimodalTextBenchmark(name='jc_penney_products') | 13.7% |
| regression | small | 4 | 23,486 | 1 | 3 | 2 | 0 | 1 | MultimodalTextBenchmark(name='women_clothing_review') | 1.8% |
| regression | small | 5 | 30,009 | 3 | 0 | 1 | 0 | 1 | MultimodalTextBenchmark(name='news_popularity2') | 0.0% |
| regression | small | 6 | 28,328 | 2 | 5 | 1 | 3 | 1 | MultimodalTextBenchmark(name='ae_price_prediction') | 6.1% |
| regression | small | 7 | 47,439 | 18 | 8 | 2 | 11 | 1 | MultimodalTextBenchmark(name='california_house_price') | 13.8% |
| regression | medium | 0 | 125,000 | 0 | 6 | 2 | 1 | 1 | MultimodalTextBenchmark(name='mercari_price_suggestion100K') | 3.4% |
| regression | large | 0 | 1,482,535 | 1 | 4 | 2 | 1 | 1 | Mercari() | 0.0% |
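To make the relationship between #rows and Scale in the STATS rows above concrete, here is a small sketch that maps a row count to a scale label. The function name `scale_for_num_rows` is hypothetical, and the treatment of the exact boundary values (50K and 500K) is an assumption, since the parameter description does not say which bucket they belong to.

```python
def scale_for_num_rows(num_rows: int) -> str:
    """Map a row count to the documented scale label.

    Assumption: half-open buckets, i.e. 5K <= small < 50K <= medium < 500K
    <= large. The documentation only gives the approximate ranges.
    """
    if num_rows < 5_000:
        raise ValueError("fewer than 5K rows is outside the documented scales")
    if num_rows < 50_000:
        return "small"
    if num_rows < 500_000:
        return "medium"
    return "large"


# Row counts taken from the STATS rows above.
print(scale_for_num_rows(15_907))     # fake_job_postings2
print(scale_for_num_rows(125_000))    # jigsaw_unintended_bias100K
print(scale_for_num_rows(1_482_535))  # Mercari
```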
- classmethod datasets_available(task_type: TaskType, scale: str) → list[tuple[str, dict[str, Any]]]
List of datasets available for a given task_type and scale.
- classmethod num_datasets_available(task_type: TaskType, scale: str)
Number of datasets available for a given task_type and scale.
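The two classmethods above can be used together to enumerate a category before picking an `idx`. The sketch below is illustrative only: `describe` and `list_benchmarks` are hypothetical helpers, and it assumes that each returned tuple holds a dataset class name and its keyword arguments (matching the "Class object" column in STATS) and that `idx` corresponds to the position in the returned list. The class is passed in as an argument, so in practice you would pass `DataFrameTextBenchmark` itself.

```python
from typing import Any


def describe(cls_name: str, kwargs: dict[str, Any]) -> str:
    """Render an entry in the style of the 'Class object' column."""
    args = ", ".join(f"{key}={value!r}" for key, value in kwargs.items())
    return f"{cls_name}({args})"


def list_benchmarks(benchmark_cls, task_type, scale: str) -> list[str]:
    """Return one human-readable line per dataset in the category."""
    entries = benchmark_cls.datasets_available(task_type, scale)
    count = benchmark_cls.num_datasets_available(task_type, scale)
    if count != len(entries):
        raise RuntimeError("the two classmethods disagree on the count")
    # Assumption: idx is the position of the entry in the returned list.
    return [
        f"idx={idx}: {describe(name, kwargs)}"
        for idx, (name, kwargs) in enumerate(entries)
    ]


# Usage (requires torch_frame):
# from torch_frame.datasets import DataFrameTextBenchmark
# from torch_frame.typing import TaskType
# for line in list_benchmarks(DataFrameTextBenchmark, TaskType.REGRESSION, "small"):
#     print(line)
```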
- materialize(*args, **kwargs) → Dataset
Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.
- Parameters:
device (torch.device, optional) – Device to load the TensorFrame object onto. (default: None)
path (str, optional) – If path is specified and a cached file exists, this will try to load the saved TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)
col_stats (Dict[str, Dict[StatType, Any]], optional) – Optional col_stats provided by the user. If not provided, the statistics are calculated from the dataframe itself. (default: None)
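The caching behaviour of `path` described above lends itself to one cache file per benchmark configuration, which matters most when text embedding is expensive. The following is a minimal sketch under that assumption; `cache_file` and `materialize_cached` are hypothetical helpers and the file naming scheme is not something torch_frame prescribes.

```python
import os


def cache_file(cache_dir: str, task: str, scale: str, idx: int) -> str:
    """Hypothetical naming scheme: one cache file per (task, scale, idx)."""
    return os.path.join(cache_dir, f"{task}_{scale}_{idx}.pt")


def materialize_cached(dataset, cache_dir: str, task: str, scale: str, idx: int):
    """Materialize `dataset`, reusing a cached file when one exists.

    `dataset` is expected to expose `materialize(path=...)` as documented
    above: the first call computes and saves, later calls load from `path`.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_file(cache_dir, task, scale, idx)
    return dataset.materialize(path=path)


# Usage (dataset built from DataFrameTextBenchmark beforehand):
# dataset = materialize_cached(dataset, "cache", "regression", "small", 0)
```

Keying the file name on all three selectors avoids accidentally loading a cached `TensorFrame` that belongs to a different dataset in the benchmark.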