Introduction by Example
=======================

:pyf:`PyTorch Frame` is a tabular deep learning extension library for :pytorch:`PyTorch`.
Modern data is often stored in tables with heterogeneous columns, each with its own semantic type, *e.g.*, numerical (such as age or price), categorical (such as gender or product type), time, text (such as descriptions or comments), images, etc.
The goal of :pyf:`PyTorch Frame` is to provide a deep learning framework for effective machine learning on such complex and diverse data.

Many recent tabular models follow the modular design of :obj:`~torch_frame.nn.encoder.FeatureEncoder`, :obj:`~torch_frame.nn.conv.TableConv`, and :obj:`~torch_frame.nn.decoder.Decoder`.
:pyf:`PyTorch Frame` is designed to facilitate the creation, implementation and evaluation of deep learning models for tabular data under this modular architecture.
Please refer to the :doc:`/get_started/modular_design` page for more information.

In this doc, we introduce the fundamental concepts of :pyf:`PyTorch Frame` through self-contained examples.

At its core, :pyf:`PyTorch Frame` provides the following main features:

.. contents::
    :local:

Common Benchmark Datasets
-------------------------

:pyf:`PyTorch Frame` contains a large number of common benchmark datasets.
The list of all datasets is available in :doc:`/modules/datasets`.

Initializing datasets is straightforward in :pyf:`PyTorch Frame`.
Initializing a dataset automatically downloads its raw files and processes the columns.

In the example below, we use one of the pre-loaded datasets, which contains the Titanic passengers.
If you would like to use your own dataset, refer to the example in :doc:`/handling_advanced_stypes/handle_heterogeneous_stypes`.

.. code-block:: python

    >>> from torch_frame.datasets import Titanic
    >>> dataset = Titanic(root='/tmp/titanic')
    >>> len(dataset)
    891
    >>> dataset.feat_cols
    ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    >>> dataset.materialize()
    Titanic()
    >>> dataset.df.head(5)
                 Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
    PassengerId
    1                   0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
    2                   1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
    3                   1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
    4                   1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
    5                   0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S

:pyf:`PyTorch Frame` also supports custom datasets, so that you can use :pyf:`PyTorch Frame` for your own problem.
Let's say you prepare your :class:`pandas.DataFrame` as :obj:`df` with five columns: :obj:`cat1`, :obj:`cat2`, :obj:`num1`, :obj:`num2`, and :obj:`y`.
Creating a :obj:`torch_frame.data.Dataset` object is straightforward:

.. code-block:: python

    import torch_frame
    from torch_frame.data import Dataset

    # Specify the stype of each column with a dictionary.
    col_to_stype = {
        "cat1": torch_frame.categorical,
        "cat2": torch_frame.categorical,
        "num1": torch_frame.numerical,
        "num2": torch_frame.numerical,
        "y": torch_frame.categorical,
    }

    # Set "y" as the target column.
    dataset = Dataset(df, col_to_stype=col_to_stype, target_col="y")

Data Handling of Tables
-----------------------

A table contains different columns with different data types. Each data type is described by a semantic type, which we refer to as :class:`~torch_frame.stype`.
Currently :pyf:`PyTorch Frame` supports the following :class:`stypes`:

- :obj:`stype.categorical` denotes categorical columns.
- :obj:`stype.numerical` denotes numerical columns.
- :obj:`stype.multicategorical` denotes multi-categorical columns.
- :obj:`stype.text_embedded` denotes text columns that are pre-embedded via some text encoder.

A table in :pyf:`PyTorch Frame` is described by an instance of :class:`~torch_frame.data.TensorFrame`, which holds the following attributes by default:

- :obj:`col_names_dict`: A dictionary holding the column names for each :class:`~torch_frame.stype`.
- :obj:`feat_dict`: A dictionary holding the :obj:`~torch.Tensor` of different :class:`stypes`. For :obj:`stype.numerical` and :obj:`stype.categorical`, the shape of :obj:`~torch.Tensor` is [`num_rows`, `num_cols`], while for :obj:`stype.text_embedded`, the shape is [`num_rows`, `num_cols`, `emb_dim`].
- :obj:`y` (optional): A tensor containing the target values for prediction.

.. note::
    The set of keys in :obj:`feat_dict` must exactly match the set of keys in :obj:`col_names_dict`.
    :class:`~torch_frame.data.TensorFrame` is validated at initialization time.

Creating a :class:`~torch_frame.data.TensorFrame` from a :class:`~torch_frame.data.Dataset` is referred to as materialization.
:meth:`~torch_frame.data.Dataset.materialize` converts the raw data frame in :class:`~torch_frame.data.Dataset` into :class:`Tensors` and stores them in a :class:`~torch_frame.data.TensorFrame`.
:meth:`~torch_frame.data.Dataset.materialize` also provides an optional argument `path` to cache the :class:`~torch_frame.data.TensorFrame` and `col_stats`.
If `path` is specified, :pyf:`PyTorch Frame` first tries to load a saved :class:`~torch_frame.data.TensorFrame` and `col_stats` from it during materialization.
If no saved object is found at that `path`, :pyf:`PyTorch Frame` materializes the dataset and saves the materialized :class:`~torch_frame.data.TensorFrame` and `col_stats` to the `path`.

.. note::
    Note that materialization performs only minimal processing of the original features, *e.g.*, no normalization or missing-value handling is performed.
    :pyf:`PyTorch Frame` converts missing values in categorical :class:`torch_frame.stype` to `-1` and missing values in numerical :class:`torch_frame.stype` to `NaN`.
    We expect `NaN`/missing-value handling and normalization to be handled on the model side via :class:`torch_frame.nn.encoder.StypeEncoder`.

The :class:`~torch_frame.data.TensorFrame` object has :class:`~torch.Tensor` at its core; therefore, it is well suited to training and inference with PyTorch.
In :pyf:`PyTorch Frame`, we build data loaders and models around :class:`~torch_frame.data.TensorFrame`, benefiting from the efficiency and flexibility of PyTorch.

.. code-block:: python

    >>> from torch_frame import stype
    >>> # materialize the dataset
    >>> dataset.materialize()
    >>> # materialize the dataset with caching enabled
    >>> dataset.materialize(path='/tmp/titanic/data.pt')
    >>> # next materialization will load the cache
    >>> dataset.materialize(path='/tmp/titanic/data.pt')
    >>> tensor_frame = dataset.tensor_frame
    >>> tensor_frame.feat_dict.keys()
    dict_keys([<stype.numerical: 'numerical'>, <stype.categorical: 'categorical'>])
    >>> tensor_frame.feat_dict[stype.numerical]
    tensor([[22.0000,  1.0000,  0.0000,  7.2500],
            [38.0000,  1.0000,  0.0000, 71.2833],
            [26.0000,  0.0000,  0.0000,  7.9250],
            ...,
            [    nan,  1.0000,  2.0000, 23.4500],
            [26.0000,  0.0000,  0.0000, 30.0000],
            [32.0000,  0.0000,  0.0000,  7.7500]])
    >>> tensor_frame.feat_dict[stype.categorical]
    tensor([[0, 0, 0],
            [1, 1, 1],
            [0, 1, 0],
            ...,
            [0, 1, 0],
            [1, 0, 1],
            [0, 0, 2]])
    >>> tensor_frame.col_names_dict
    {<stype.categorical: 'categorical'>: ['Pclass', 'Sex', 'Embarked'],
     <stype.numerical: 'numerical'>: ['Age', 'SibSp', 'Parch', 'Fare']}
    >>> tensor_frame.y
    tensor([0, 1, 1,  ..., 0, 1, 0])

A :class:`~torch_frame.data.TensorFrame` contains the following basic properties:

.. code-block:: python

    >>> tensor_frame.stypes
    [<stype.numerical: 'numerical'>, <stype.categorical: 'categorical'>]
    >>> tensor_frame.num_cols
    7
    >>> tensor_frame.num_rows
    891
    >>> tensor_frame.device
    device(type='cpu')

We support transferring the data in a :class:`~torch_frame.data.TensorFrame` to devices supported by :pytorch:`PyTorch`.

.. code-block:: python

    >>> tensor_frame = tensor_frame.to("cpu")
    >>> tensor_frame = tensor_frame.to("cuda")

Once a :obj:`~torch_frame.data.Dataset` is materialized, we can retrieve column statistics on the data.
For each :class:`~torch_frame.stype`, a different set of statistics is calculated.

For categorical features,

- :class:`StatType.COUNT` contains a tuple of two lists, where the first list contains the category names and the second list contains the corresponding category counts, sorted from high to low.

For numerical features,

- :class:`StatType.MEAN` denotes the mean value of the numerical feature,
- :class:`StatType.STD` denotes the standard deviation,
- :class:`StatType.QUANTILES` contains a list holding the minimum value, first quartile (25th percentile), median (50th percentile), third quartile (75th percentile) and maximum value of the column.

.. code-block:: python

    >>> dataset.col_to_stype
    {'Survived': <stype.categorical: 'categorical'>,
     'Pclass': <stype.categorical: 'categorical'>,
     'Sex': <stype.categorical: 'categorical'>,
     'Age': <stype.numerical: 'numerical'>,
     'SibSp': <stype.numerical: 'numerical'>,
     'Parch': <stype.numerical: 'numerical'>,
     'Fare': <stype.numerical: 'numerical'>,
     'Embarked': <stype.categorical: 'categorical'>}
    >>> dataset.col_stats['Sex']
    {<StatType.COUNT: 'COUNT'>: (['male', 'female'], [577, 314])}
    >>> dataset.col_stats['Age']
    {<StatType.MEAN: 'MEAN'>: 29.69911764705882,
     <StatType.STD: 'STD'>: 14.516321150817316,
     <StatType.QUANTILES: 'QUANTILES'>: [0.42, 20.125, 28.0, 38.0, 80.0]}

Now let's say you have a new :class:`pandas.DataFrame` called :obj:`new_df`, and you want to convert it to a corresponding :class:`~torch_frame.data.TensorFrame` object.
You can achieve this as follows:

.. code-block:: python

    new_tf = dataset.convert_to_tensor_frame(new_df)

Mini-batches
------------

Neural networks are usually trained in a mini-batch fashion. :pyf:`PyTorch Frame` contains its own :class:`~torch_frame.data.DataLoader`, which can load a :class:`~torch_frame.data.Dataset` or :class:`~torch_frame.data.TensorFrame` in mini-batches.

.. code-block:: python

    >>> from torch_frame.data import DataLoader
    >>> data_loader = DataLoader(tensor_frame, batch_size=32, shuffle=True)
    >>> for batch in data_loader:
    ...     batch
    ...
    TensorFrame(
      num_cols=7,
      num_rows=32,
      categorical (3): ['Pclass', 'Sex', 'Embarked'],
      numerical (4): ['Age', 'SibSp', 'Parch', 'Fare'],
      has_target=True,
      device='cpu',
    )

Learning Methods on Tabular Data
--------------------------------

After learning about data handling, datasets, and loaders in :pyf:`PyTorch Frame`, it's time to implement our first model!

Now let's implement a model called :obj:`ExampleTransformer`. It uses :class:`~torch_frame.nn.conv.TabTransformerConv` as its convolution layer.
Initializing a :class:`~torch_frame.nn.encoder.StypeWiseFeatureEncoder` requires :obj:`col_stats` and :obj:`col_names_dict`; both are available as properties of any materialized dataset.

.. code-block:: python

    from typing import Any

    import torch

    import torch_frame
    from torch_frame import TensorFrame, stype
    from torch_frame.data.stats import StatType
    from torch_frame.nn.conv import TabTransformerConv
    from torch_frame.nn.encoder import (
        EmbeddingEncoder,
        LinearEncoder,
        StypeWiseFeatureEncoder,
    )


    class ExampleTransformer(torch.nn.Module):
        def __init__(
            self,
            channels: int,
            out_channels: int,
            num_layers: int,
            num_heads: int,
            col_stats: dict[str, dict[StatType, Any]],
            col_names_dict: dict[torch_frame.stype, list[str]],
        ):
            super().__init__()
            # Encode each column into a `channels`-dimensional embedding,
            # using one encoder per stype.
            self.encoder = StypeWiseFeatureEncoder(
                out_channels=channels,
                col_stats=col_stats,
                col_names_dict=col_names_dict,
                stype_encoder_dict={
                    stype.categorical: EmbeddingEncoder(),
                    stype.numerical: LinearEncoder()
                },
            )
            self.tab_transformer_convs = torch.nn.ModuleList([
                TabTransformerConv(
                    channels=channels,
                    num_heads=num_heads,
                ) for _ in range(num_layers)
            ])
            self.decoder = torch.nn.Linear(channels, out_channels)

        def forward(self, tf: TensorFrame) -> torch.Tensor:
            x, _ = self.encoder(tf)
            for tab_transformer_conv in self.tab_transformer_convs:
                x = tab_transformer_conv(x)
            # Average the column embeddings before decoding.
            return self.decoder(x.mean(dim=1))

In the example above, :class:`~torch_frame.nn.encoder.EmbeddingEncoder` is used to encode the categorical features and :class:`~torch_frame.nn.encoder.LinearEncoder` is used to encode the numerical features.
The embeddings are then passed through layers of :class:`~torch_frame.nn.conv.TabTransformerConv`.
The outputs are then averaged over the column dimension and fed into a :obj:`torch.nn.Linear` decoder.

Let's create a train-test split and the corresponding data loaders.

.. code-block:: python

    from torch_frame.datasets import Yandex
    from torch_frame.data import DataLoader

    dataset = Yandex(root='/tmp/adult', name='adult')
    dataset.materialize()
    dataset.shuffle()
    train_dataset, test_dataset = dataset[:0.8], dataset[0.8:]
    train_loader = DataLoader(train_dataset.tensor_frame, batch_size=128,
                              shuffle=True)
    test_loader = DataLoader(test_dataset.tensor_frame, batch_size=128)

Let's train this model for 50 epochs:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = ExampleTransformer(
        channels=32,
        out_channels=dataset.num_classes,
        num_layers=2,
        num_heads=8,
        col_stats=train_dataset.col_stats,
        col_names_dict=train_dataset.tensor_frame.col_names_dict,
    ).to(device)

    optimizer = torch.optim.Adam(model.parameters())

    for epoch in range(50):
        for tf in train_loader:
            tf = tf.to(device)
            pred = model(tf)
            loss = F.cross_entropy(pred, tf.y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Finally, we can evaluate our model on the test split:

.. code-block:: python

    model.eval()
    correct = 0
    # Gradients are not needed for evaluation.
    with torch.no_grad():
        for tf in test_loader:
            tf = tf.to(device)
            pred = model(tf)
            pred_class = pred.argmax(dim=-1)
            correct += (tf.y == pred_class).sum()
    acc = int(correct) / len(test_dataset)
    print(f'Accuracy: {acc:.4f}')
    # Accuracy: 0.8447

This is all it takes to implement your first deep tabular network.
Happy hacking!
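
As a supplementary illustration of the column statistics described in the data handling section, the following minimal sketch computes comparable quantities in plain Python for two toy columns.
It is not the library's implementation: the helper names :obj:`categorical_stats` and :obj:`numerical_stats` are hypothetical, missing values are represented as :obj:`None` and simply skipped, and the use of the population standard deviation and inclusive quartiles is an assumption made for this example.

.. code-block:: python

    from collections import Counter
    from statistics import fmean, pstdev, quantiles


    def categorical_stats(values):
        """Return (categories, counts), sorted from most to least frequent.

        Hypothetical helper; `None` entries are treated as missing and skipped.
        """
        counts = Counter(v for v in values if v is not None)
        ordered = counts.most_common()
        return [cat for cat, _ in ordered], [n for _, n in ordered]


    def numerical_stats(values):
        """Return the mean, std and [min, Q1, median, Q3, max] of a column.

        Hypothetical helper; assumes population std and inclusive quartiles.
        """
        clean = [v for v in values if v is not None]
        q1, median, q3 = quantiles(clean, n=4, method="inclusive")
        return {
            "mean": fmean(clean),
            "std": pstdev(clean),
            "quantiles": [min(clean), q1, median, q3, max(clean)],
        }


    # Toy columns with one missing value each.
    sex = ["male", "female", "female", "male", "male", None]
    age = [22.0, 38.0, 26.0, 35.0, None, 54.0]

    sex_stats = categorical_stats(sex)
    age_stats = numerical_stats(age)
    print(sex_stats)
    print(age_stats["mean"], age_stats["quantiles"])

In practice, the statistics of a materialized dataset should be read from :obj:`dataset.col_stats` rather than recomputed by hand.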