Embedders - Bayesian Optimization as a Coverage Tool for Evaluating LLM

Embedders

bocoel.Embedder

Bases: Protocol

Embedders are responsible for encoding text into vectors. Embedders in this project are considered volatile because encoding requires CPU time, unless a database that already provides this functionality is found.

batch abstractmethod property

batch: int

The batch size to use when encoding.

dims abstractmethod property

dims: int

The dimensions of the embeddings.

encode

encode(text: Sequence[str]) -> NDArray

Calls the encode function and performs some checks.

Source code in bocoel/corpora/embedders/interfaces.py
def encode(self, text: Sequence[str], /) -> NDArray:
    """
    Calls the encode function and performs some checks.
    """

    with torch.no_grad():
        encoded = self._encode(text)

    if (dim := encoded.shape[-1]) != self.dims:
        raise ValueError(
            f"Expected the encoded embeddings to have dimension {self.dims}, got {dim}"
        )

    return encoded.cpu().numpy()
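
For illustration, a minimal sketch of the calling contract, assuming any concrete implementation from this page (SbertEmbedder is used here only as an example and requires sentence_transformers):

from bocoel import Embedder, SbertEmbedder

embedder: Embedder = SbertEmbedder()  # any concrete Embedder behaves the same way
vectors = embedder.encode(["first sentence", "second sentence"])
# vectors is a CPU numpy array of shape (2, embedder.dims);
# if the last dimension does not match embedder.dims, a ValueError is raised.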

_encode abstractmethod

_encode(texts: Sequence[str]) -> Tensor

Implements the encode function.

Parameters

texts: Sequence[str] The texts to encode. If a string is given, it is treated as a singleton batch. If a list is given, all entries are encoded together in a single batch.

Returns

A tensor of shape [batch, dims]. If the input is a string, the shape would be [dims].

Source code in bocoel/corpora/embedders/interfaces.py
@abc.abstractmethod
def _encode(self, texts: Sequence[str], /) -> Tensor:
    """
    Implements the encode function.

    Parameters
    ----------

    `texts: Sequence[str]`
    The texts to encode.
    If a string is given, it is treated as a singleton batch.
    If a list is given, all entries are encoded together in a single batch.

    Returns
    -------

    A tensor of shape [batch, dims]. If the input is a string, the shape would be [dims].
    """

    ...
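
As a rough sketch of how a custom embedder might satisfy this interface (the class below is illustrative and not part of bocoel; it assumes the protocol can be subclassed directly so that the concrete encode method is inherited):

import torch
from torch import Tensor
from collections.abc import Sequence

from bocoel import Embedder


class ConstantEmbedder(Embedder):
    """Illustrative embedder that returns zero vectors. Not part of bocoel."""

    def __init__(self, dims: int = 768, batch: int = 32) -> None:
        self._dims = dims
        self._batch = batch

    @property
    def batch(self) -> int:
        return self._batch

    @property
    def dims(self) -> int:
        return self._dims

    def _encode(self, texts: Sequence[str], /) -> Tensor:
        # Returns a [batch, dims] tensor; the inherited encode() checks the
        # last dimension against self.dims and converts the result to numpy.
        return torch.zeros(len(texts), self._dims)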

bocoel.SbertEmbedder

SbertEmbedder(
    model_name: str = "all-mpnet-base-v2",
    device: str = "cpu",
    batch_size: int = 64,
)

Bases: Embedder

Sentence-BERT embedder. Uses the sentence_transformers library.

Source code in bocoel/corpora/embedders/sberts.py
def __init__(
    self,
    model_name: str = "all-mpnet-base-v2",
    device: str = "cpu",
    batch_size: int = 64,
) -> None:
    # Optional dependency.
    from sentence_transformers import SentenceTransformer

    self._name = model_name
    self._sbert = SentenceTransformer(model_name, device=device)

    self._batch_size = batch_size
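
A brief usage sketch (the model weights are downloaded on first use; the texts are placeholders):

from bocoel import SbertEmbedder

embedder = SbertEmbedder(model_name="all-mpnet-base-v2", device="cpu", batch_size=64)
vectors = embedder.encode(["What is Bayesian optimization?", "How does coverage relate to evaluation?"])
# all-mpnet-base-v2 produces 768-dimensional sentence embeddings,
# so vectors.shape should be (2, 768) here.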

encode

encode(text: Sequence[str]) -> NDArray

Calls the encode function and performs some checks.


bocoel.HuggingfaceEmbedder

HuggingfaceEmbedder(
    path: str,
    device: str = "cpu",
    batch_size: int = 64,
    transform: Callable[[Any], Tensor] = lambda output: output.logits,
)

Bases: Embedder

Source code in bocoel/corpora/embedders/huggingface.py
def __init__(
    self,
    path: str,
    device: str = "cpu",
    batch_size: int = 64,
    transform: Callable[[Any], Tensor] = lambda output: output.logits,
) -> None:
    self._path = path
    self._model = AutoModelForSequenceClassification.from_pretrained(path)
    self._tokenizer = AutoTokenizer.from_pretrained(path)
    self._batch_size = batch_size

    self._device = device
    self._model = self._model.to(device)
    self._transform = transform
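
A usage sketch, assuming a sequence classification checkpoint (the model name below is only an example; with the default logits transform, the embedding dimension equals the size of the classifier head):

from bocoel import HuggingfaceEmbedder

embedder = HuggingfaceEmbedder(
    path="distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    device="cpu",
    batch_size=32,
    transform=lambda output: output.logits,  # the default: embed with the classifier logits
)
vectors = embedder.encode(["an example sentence"])
# A 2-label sentiment head yields 2 logits per input,
# so vectors.shape would be (1, 2) in this case.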

encode

encode(text: Sequence[str]) -> NDArray

Calls the encode function and performs some checks.


bocoel.EnsembleEmbedder

EnsembleEmbedder(embedders: Sequence[Embedder], sequential: bool = False)

Bases: Embedder

Source code in bocoel/corpora/embedders/ensemble.py
def __init__(self, embedders: Sequence[Embedder], sequential: bool = False) -> None:
    # Check if all embedders have the same batch size.
    self._embedders = embedders
    self._batch_size = embedders[0].batch
    if len(set(emb.batch for emb in embedders)) != 1:
        raise ValueError("All embedders must have the same batch size")

    self._sequential = sequential

    cpus = os.cpu_count()
    assert cpus is not None
    self._cpus = cpus
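
A sketch of combining two Sentence-BERT embedders; how the member outputs are merged is handled by the ensemble's _encode, which is not shown in this listing. All members must share one batch size:

from bocoel import EnsembleEmbedder, SbertEmbedder

first = SbertEmbedder(model_name="all-mpnet-base-v2", batch_size=64)
second = SbertEmbedder(model_name="all-MiniLM-L6-v2", batch_size=64)  # example second model

ensemble = EnsembleEmbedder([first, second], sequential=True)
vectors = ensemble.encode(["a sentence to embed"])
# Mismatched batch sizes among the members raise a ValueError at construction time.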

encode

encode(text: Sequence[str]) -> NDArray

Calls the encode function and performs some checks.
