torch_frame.config.TextTokenizerConfig

class TextTokenizerConfig(text_tokenizer: Callable[[list[str]], list[collections.abc.Mapping[str, torch.Tensor]] | collections.abc.Mapping[str, torch.Tensor]], batch_size: Optional[int] = None)

Bases: object

Configuration for a text tokenizer that maps a list of strings (sentences) to a dictionary of MultiNestedTensor objects.

Parameters:
  • text_tokenizer (callable) – A callable text tokenizer that takes a list of strings as input and returns either a list of dictionaries or a single dictionary, as the signature allows. Each dictionary's keys are the argument names expected by the text encoder model, and its values are the corresponding tensors, such as token IDs and attention masks.

  • batch_size (int, optional) – Batch size to use when tokenizing the sentences. If set to None, all sentences are tokenized in a single full batch. (default: None)
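
The contract described for text_tokenizer can be illustrated with a minimal sketch. It assumes PyTorch is installed; toy_tokenizer and VOCAB are hypothetical names made up for illustration, and the key names input_ids and attention_mask, as well as the one-dictionary-per-sentence layout, are assumptions borrowed from common text encoder conventions rather than anything this page specifies.

```python
import torch

# Hypothetical toy vocabulary; a real tokenizer would come from a model library.
VOCAB = {"<unk>": 0, "hello": 1, "world": 2, "tabular": 3, "data": 4}


def toy_tokenizer(sentences: list[str]) -> list[dict[str, torch.Tensor]]:
    """Return one dictionary of 1-D tensors per input sentence."""
    outputs = []
    for sentence in sentences:
        # Whitespace split; unknown words fall back to the <unk> id.
        ids = [VOCAB.get(word, VOCAB["<unk>"]) for word in sentence.lower().split()]
        input_ids = torch.tensor(ids, dtype=torch.long)
        outputs.append({
            "input_ids": input_ids,                         # assumed key name
            "attention_mask": torch.ones_like(input_ids),   # assumed key name
        })
    return outputs


tokens = toy_tokenizer(["Hello world", "tabular data rocks"])
```

Following the signature above, a callable like this could then be passed as TextTokenizerConfig(text_tokenizer=toy_tokenizer, batch_size=2), so that sentences are tokenized two at a time instead of in one full batch.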