Embedders
bocoel.Embedder
Bases: Protocol
Embedders are responsible for encoding text into vectors. Embedders in this project are considered volatile because encoding requires CPU time, unless some database that provides this functionality is found.
batch abstractmethod property
batch: int
The batch size to use when encoding.
dims abstractmethod property
dims: int
The dimensions of the embeddings.
encode
encode(text: Sequence[str]) -> NDArray
Calls the underlying _encode implementation and performs some checks.
Source code in bocoel/corpora/embedders/interfaces.py
_encode abstractmethod
_encode(texts: Sequence[str]) -> Tensor
Implements the encode function.
Parameters
texts: Sequence[str]
The texts to encode. If a string is given, it is treated as a singleton batch. If a list is given, all of its items are encoded together in a single batch.
Returns
A tensor of shape [batch, dims]. If the input is a string, the shape is [dims].
Source code in bocoel/corpora/embedders/interfaces.py
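The split between the public encode method and the abstract _encode hook can be illustrated with a minimal, self-contained sketch. It is not bocoel's source code: SketchEmbedder and CharCountEmbedder are hypothetical names, nested lists stand in for NDArray and Tensor so the example needs no third-party packages, and the shape check is only an assumed example of the "checks" that encode performs.

```python
from abc import abstractmethod
from typing import Protocol, Sequence

# Stand-in for NDArray / Tensor so the sketch runs with the standard library only.
Matrix = list[list[float]]


class SketchEmbedder(Protocol):
    """Hypothetical stand-in for the Embedder protocol (not bocoel's code)."""

    @property
    @abstractmethod
    def batch(self) -> int:
        """The batch size to use when encoding."""

    @property
    @abstractmethod
    def dims(self) -> int:
        """The dimensions of the embeddings."""

    def encode(self, text: Sequence[str]) -> Matrix:
        # Delegate the real work to the subclass hook.
        encoded = self._encode(text)
        # Assumed check: one row per input, each row of length `dims`.
        if len(encoded) != len(text):
            raise ValueError("expected one embedding per input text")
        if any(len(row) != self.dims for row in encoded):
            raise ValueError(f"expected every embedding to have {self.dims} dims")
        return encoded

    @abstractmethod
    def _encode(self, texts: Sequence[str]) -> Matrix:
        """Implements the encode function."""


class CharCountEmbedder(SketchEmbedder):
    """Toy embedder: [number of characters, number of spaces]."""

    @property
    def batch(self) -> int:
        return 64

    @property
    def dims(self) -> int:
        return 2

    def _encode(self, texts: Sequence[str]) -> Matrix:
        return [[float(len(t)), float(t.count(" "))] for t in texts]


embedder = CharCountEmbedder()
vectors = embedder.encode(["hello world", "hi"])
```

With this layout, subclasses only implement _encode and the two properties, while callers always go through encode and get the validation for free.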
bocoel.SbertEmbedder
SbertEmbedder(
model_name: str = "all-mpnet-base-v2",
device: str = "cpu",
batch_size: int = 64,
)
Bases: Embedder
Sentence-BERT embedder. Uses the sentence_transformers library.
Source code in bocoel/corpora/embedders/sberts.py
encode
encode(text: Sequence[str]) -> NDArray
Calls the underlying _encode implementation and performs some checks.
Source code in bocoel/corpora/embedders/interfaces.py
bocoel.HuggingfaceEmbedder
HuggingfaceEmbedder(
path: str,
device: str = "cpu",
batch_size: int = 64,
transform: Callable[[Any], Tensor] = lambda output: output.logits,
)
Bases: Embedder
Source code in bocoel/corpora/embedders/huggingface.py
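The transform argument is a callable that receives the model output and returns the tensor used as the embedding; the default reads the output's logits attribute. The sketch below illustrates that contract only. It does not load a model: a SimpleNamespace with made-up values stands in for a Hugging Face model output, nested lists stand in for tensors, and logits_transform and mean_pool_transform are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace
from typing import Any

Matrix = list[list[float]]  # stand-in for a [batch, dims] tensor


def logits_transform(output: Any) -> Matrix:
    # Mirrors the default transform: use the logits as the embedding.
    return output.logits


def mean_pool_transform(output: Any) -> Matrix:
    # Alternative: average the token vectors of last_hidden_state,
    # which has shape [batch, tokens, hidden], into [batch, hidden].
    return [
        [sum(column) / len(tokens) for column in zip(*tokens)]
        for tokens in output.last_hidden_state
    ]


# Fake model output for one input text with two tokens (values are made up).
fake_output = SimpleNamespace(
    logits=[[0.1, 0.9]],
    last_hidden_state=[[[1.0, 2.0], [3.0, 4.0]]],
)

from_logits = logits_transform(fake_output)
from_pooling = mean_pool_transform(fake_output)
```

Whichever transform is chosen, it should return one fixed-size vector per input text so that the result matches the [batch, dims] shape described for _encode.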
encode
encode(text: Sequence[str]) -> NDArray
Calls the underlying _encode implementation and performs some checks.
Source code in bocoel/corpora/embedders/interfaces.py
bocoel.EnsembleEmbedder
EnsembleEmbedder(embedders: Sequence[Embedder], sequential: bool = False)
Bases: Embedder
Source code in bocoel/corpora/embedders/ensemble.py
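The page gives no description of how the ensemble combines its members, so the following is a sketch under stated assumptions rather than bocoel's implementation: each member encodes the same texts, the per-text vectors are concatenated, and sequential=False runs the members in a thread pool instead of one after another. ensemble_encode, by_length and by_vowels are hypothetical names, and plain functions returning nested lists stand in for Embedder instances.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Vector = list[float]
EncodeFn = Callable[[Sequence[str]], list[Vector]]


def ensemble_encode(
    encoders: Sequence[EncodeFn],
    texts: Sequence[str],
    sequential: bool = False,
) -> list[Vector]:
    if sequential:
        # Run the members one after another.
        parts = [encode(texts) for encode in encoders]
    else:
        # Run the members concurrently; map() keeps the input order.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda encode: encode(texts), encoders))
    # Assumed combination rule: concatenate each text's vectors across members.
    return [
        [value for part in parts for value in part[i]]
        for i in range(len(texts))
    ]


def by_length(texts: Sequence[str]) -> list[Vector]:
    return [[float(len(t))] for t in texts]


def by_vowels(texts: Sequence[str]) -> list[Vector]:
    return [[float(sum(c in "aeiou" for c in t)), 1.0] for t in texts]


combined = ensemble_encode([by_length, by_vowels], ["abc", "hello"])
```

Under this assumption the ensemble's dims would be the sum of its members' dims (here 1 + 2 = 3), and both execution modes produce the same result.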
encode
encode(text: Sequence[str]) -> NDArray
Calls the underlying _encode implementation and performs some checks.
Source code in bocoel/corpora/embedders/interfaces.py