torch_frame.datasets.DataFrameBenchmark

class DataFrameBenchmark(root: str, task_type: TaskType, scale: str, idx: int, split_random_state: int = 42)[source]

Bases: Dataset

A collection of standardized datasets for tabular learning, covering categorical and numerical features. The datasets are categorized according to their task types and scales.

Parameters:

root (str) – Root directory.
task_type (TaskType) – The task type. Either TaskType.BINARY_CLASSIFICATION, TaskType.MULTICLASS_CLASSIFICATION, or TaskType.REGRESSION
scale (str) – The scale of the dataset. "small" means 5K to 50K rows. "medium" means 50K to 500K rows. "large" means more than 500K rows.
idx (int) – The index of the dataset within a category specified via task_type and scale.

STATS:

Task	Scale	Idx	#rows	#cols (numerical)	#cols (categorical)	#classes	Class object	Missing value ratio
binary_classification	small	0	32,561	4	8	2	AdultCensusIncome()	0.0%
binary_classification	small	1	8,124	0	22	2	Mushroom()	0.0%
binary_classification	small	2	45,211	7	9	2	BankMarketing()	0.0%
binary_classification	small	3	13,376	10	0	2	TabularBenchmark(name=’MagicTelescope’)	0.0%
binary_classification	small	4	10,578	7	0	2	TabularBenchmark(name=’bank-marketing’)	0.0%
binary_classification	small	5	20,634	8	0	2	TabularBenchmark(name=’california’)	0.0%
binary_classification	small	6	16,714	10	0	2	TabularBenchmark(name=’credit’)	0.0%
binary_classification	small	7	13,272	20	1	2	TabularBenchmark(name=’default-of-credit-card-clients’)	0.0%
binary_classification	small	8	38,474	7	1	2	TabularBenchmark(name=’electricity’)	0.0%
binary_classification	small	9	7,608	18	5	2	TabularBenchmark(name=’eye_movements’)	0.0%
binary_classification	small	10	10,000	22	0	2	TabularBenchmark(name=’heloc’)	0.0%
binary_classification	small	11	13,488	16	0	2	TabularBenchmark(name=’house_16H’)	0.0%
binary_classification	small	12	10,082	26	0	2	TabularBenchmark(name=’pol’)	0.0%
binary_classification	small	13	48,842	6	8	2	Yandex(name=’adult’)	0.0%
binary_classification	medium	0	92,650	0	116	2	Dota2()	0.0%
binary_classification	medium	1	199,523	7	34	2	KDDCensusIncome()	0.0%
binary_classification	medium	2	71,090	7	0	2	TabularBenchmark(name=’Diabetes130US’)	0.0%
binary_classification	medium	3	72,998	50	0	2	TabularBenchmark(name=’MiniBooNE’)	0.0%
binary_classification	medium	4	58,252	23	8	2	TabularBenchmark(name=’albert’)	0.0%
binary_classification	medium	5	423,680	10	44	2	TabularBenchmark(name=’covertype’)	0.0%
binary_classification	medium	6	57,580	54	0	2	TabularBenchmark(name=’jannis’)	0.0%
binary_classification	medium	7	111,762	24	8	2	TabularBenchmark(name=’road-safety’)	0.0%
binary_classification	medium	8	98,050	28	0	2	Yandex(name=’higgs_small’)	0.0%
binary_classification	large	0	940,160	24	0	2	TabularBenchmark(name=’Higgs’)	0.0%
multiclass_classification	medium	0	108,000	128	0	1,000	Yandex(name=’aloi’)	0.0%
multiclass_classification	medium	1	65,196	27	0	100	Yandex(name=’helena’)	0.0%
multiclass_classification	medium	2	83,733	54	0	4	Yandex(name=’jannis’)	0.0%
multiclass_classification	large	0	581,012	10	44	7	ForestCoverType()	0.0%
multiclass_classification	large	1	1,025,010	5	5	10	PokerHand()	0.0%
multiclass_classification	large	2	581,012	54	0	7	Yandex(name=’covtype’)	0.0%
regression	small	0	17,379	6	5	1	TabularBenchmark(name=’Bike_Sharing_Demand’)	0.0%
regression	small	1	10,692	7	4	1	TabularBenchmark(name=’Brazilian_houses’)	0.0%
regression	small	2	8,192	21	0	1	TabularBenchmark(name=’cpu_act’)	0.0%
regression	small	3	16,599	16	0	1	TabularBenchmark(name=’elevators’)	0.0%
regression	small	4	21,613	15	2	1	TabularBenchmark(name=’house_sales’)	0.0%
regression	small	5	20,640	8	0	1	TabularBenchmark(name=’houses’)	0.0%
regression	small	6	10,081	6	0	1	TabularBenchmark(name=’sulfur’)	0.0%
regression	small	7	21,263	79	0	1	TabularBenchmark(name=’superconduct’)	0.0%
regression	small	8	8,885	252	3	1	TabularBenchmark(name=’topo_2_1’)	0.0%
regression	small	9	8,641	3	1	1	TabularBenchmark(name=’visualizing_soil’)	0.0%
regression	small	10	6,497	11	0	1	TabularBenchmark(name=’wine_quality’)	0.0%
regression	small	11	8,885	42	0	1	TabularBenchmark(name=’yprop_4_1’)	0.0%
regression	small	12	20,640	8	0	1	Yandex(name=’california_housing’)	0.0%
regression	medium	0	188,318	25	99	1	TabularBenchmark(name=’Allstate_Claims_Severity’)	0.0%
regression	medium	1	241,600	3	6	1	TabularBenchmark(name=’SGEMM_GPU_kernel_performance’)	0.0%
regression	medium	2	53,940	6	3	1	TabularBenchmark(name=’diamonds’)	0.0%
regression	medium	3	163,065	3	0	1	TabularBenchmark(name=’medical_charges’)	0.0%
regression	medium	4	394,299	4	2	1	TabularBenchmark(name=’particulate-matter-ukair-2017’)	0.0%
regression	medium	5	52,031	3	1	1	TabularBenchmark(name=’seattlecrime6’)	0.0%
regression	large	0	1,000,000	5	0	1	TabularBenchmark(name=’Airlines_DepDelay_1M’)	0.0%
regression	large	1	5,465,575	8	0	1	TabularBenchmark(name=’delays_zurich_transport’)	0.0%
regression	large	2	581,835	9	0	1	TabularBenchmark(name=’nyc-taxi-green-dec-2016’)	0.0%
regression	large	3	1,200,192	136	0	1	Yandex(name=’microsoft’)	0.0%
regression	large	4	709,877	699	0	1	Yandex(name=’yahoo’)	0.0%
regression	large	5	515,345	90	0	1	Yandex(name=’year’)	0.0%

classmethod datasets_available(task_type: TaskType, scale: str) → list[tuple[str, dict[str, Any]]][source]: List of datasets available for a given task_type and scale.

classmethod num_datasets_available(task_type: TaskType, scale: str)[source]: Number of datasets available for a given task_type and scale.

materialize(*args, **kwargs) → Dataset[source]

Materializes the dataset into a tensor representation. From this point onwards, the dataset should be treated as read-only.

Parameters:

device (torch.device, optional) – Device to load the TensorFrame object. (default: None)
path (str, optional) – If path is specified and a cached file exists, this will try to load the saved the TensorFrame object and col_stats. If path is specified but a cached file does not exist, this will perform materialization and then save the TensorFrame object and col_stats to path. If path is None, this will materialize the dataset without caching. (default: None)
col_stats (Dict[str, Dict[StatType, Any]], optional) – optional
provided (col_stats provided by the user. If not) –
statistics (the) –
(default (is calculated from the dataframe itself.) – None)