Indices - Bayesian Optimization as a Coverage Tool for Evaluating LLM

Indices

bocoel.Index

Index(*args: Any, **kwargs: Any)

Bases: Protocol

Index is responsible for fast retrieval given a vector query.

Source code in bocoel/corpora/indices/interfaces/indices.py
def __init__(self, *args: Any, **kwargs: Any) -> None:
    # Included s.t. constructors of Index can be used.
    ...

_embeddings abstractmethod property

_embeddings: NDArray | IndexedArray

The embeddings used by the index.

boundary abstractmethod property

boundary: Boundary

The boundary of the input: a hyperrectangle in the embedding space (see Boundary).

distance abstractmethod property

distance: Distance

The distance metric used by the index.

dims property

dims: int

The number of dimensions that the query vector must have.

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Validates the query and k, runs _search over the query in batches, and assembles the results into a SearchResultBatch.

Parameters

query: ArrayLike The query vector. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

A SearchResultBatch instance. See SearchResultBatch for details.

Source code in bocoel/corpora/indices/interfaces/indices.py
def search(self, query: ArrayLike, k: int = 1) -> SearchResultBatch:
    """
    Calls the search function and performs some checks.

    Parameters
    ----------

    `query: ArrayLike`
    The query vector. Must be of shape `[batch, dims]`.

    `k: int`
    The number of nearest neighbors to return.

    Returns
    -------

    A `SearchResultBatch` instance. See `SearchResultBatch` for details.
    """

    query = np.array(query)

    if (ndim := query.ndim) != 2:
        raise ValueError(
            f"Expected query to be a 2D vector, got a vector of dim {ndim}."
        )

    if (dim := query.shape[1]) != self.dims:
        raise ValueError(f"Expected query to have dimension {self.dims}, got {dim}")

    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    results: list[InternalResult] = []
    for idx in range(0, len(query), self.batch):
        query_batch = query[idx : idx + self.batch]
        result = self._search(query_batch, k=k)
        results.append(result)

    indices = np.concatenate([res.indices for res in results], axis=0)
    distances = np.concatenate([res.distances for res in results], axis=0)
    vectors = self._embeddings[indices]

    return SearchResultBatch(
        query=query, vectors=vectors, distances=distances, indices=indices
    )
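
The validation and batching steps in search can be sketched in isolation. This is a minimal illustration rather than bocoel's code: `batched_search`, `fake_search`, and the `(indices, distances)` tuple are hypothetical stand-ins for the method, a backend `_search`, and `InternalResult`.

```python
import numpy as np


def batched_search(query, dims, batch, search_fn, k=1):
    """Validate `query`, then call `search_fn` on chunks of at most `batch` rows."""
    query = np.array(query)
    if query.ndim != 2:
        raise ValueError(f"Expected a 2D query, got {query.ndim} dimension(s).")
    if query.shape[1] != dims:
        raise ValueError(f"Expected dimension {dims}, got {query.shape[1]}")
    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    # Each chunk yields (indices, distances), both of shape [chunk, k].
    chunks = [
        search_fn(query[start : start + batch], k)
        for start in range(0, len(query), batch)
    ]
    indices = np.concatenate([idx for idx, _ in chunks], axis=0)
    distances = np.concatenate([dist for _, dist in chunks], axis=0)
    return indices, distances


def fake_search(chunk, k):
    # Hypothetical backend: always returns neighbours 0..k-1 at distance 0.
    n = len(chunk)
    return np.tile(np.arange(k), (n, 1)), np.zeros((n, k))


indices, distances = batched_search(
    np.zeros((5, 3)), dims=3, batch=2, search_fn=fake_search, k=2
)
print(indices.shape, distances.shape)  # (5, 2) (5, 2)
```

Five queries with a batch size of 2 are processed as chunks of 2, 2, and 1 rows, and the concatenated output keeps one row per query.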

_search abstractmethod

_search(query: NDArray, k: int = 1) -> InternalResult

Search the index with a given query.

Parameters

query: NDArray The query batch. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

An InternalResult holding the indices and distances of the nearest neighbors, each of shape [batch, k].

Source code in bocoel/corpora/indices/interfaces/indices.py
@abc.abstractmethod
def _search(self, query: NDArray, k: int = 1) -> InternalResult:
    """
    Search the index with a given query.

    Parameters
    ----------

    `query: NDArray`
    The query batch. Must be of shape `[batch, dims]`.

    `k: int`
    The number of nearest neighbors to return.

    Returns
    -------

    An `InternalResult` holding the indices and distances of the nearest neighbors,
    each of shape `[batch, k]`.
    """

    ...
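
As an illustration of what a concrete `_search` might compute, here is a brute-force nearest-neighbour sketch. It is not one of bocoel's backends: `Result` is a hypothetical stand-in for `InternalResult`, and Euclidean distance is assumed.

```python
from typing import NamedTuple

import numpy as np


class Result(NamedTuple):
    # Hypothetical stand-in for bocoel's InternalResult.
    indices: np.ndarray    # [batch, k]
    distances: np.ndarray  # [batch, k]


def brute_force_search(embeddings, query, k=1):
    """Exact k-nearest-neighbour search under Euclidean distance."""
    # Pairwise distances between every query and every embedding: [batch, n].
    diff = query[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Indices of the k smallest distances per query, nearest first.
    order = np.argsort(dists, axis=1)[:, :k]
    return Result(indices=order, distances=np.take_along_axis(dists, order, axis=1))


embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
result = brute_force_search(embeddings, np.array([[0.9, 0.0]]), k=2)
print(result.indices)  # [[1 0]]
```

Approximate backends trade this exhaustive scan for speed, but the returned shapes follow the same pattern.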

bocoel.HnswlibIndex

HnswlibIndex(
    embeddings: NDArray,
    distance: str | Distance,
    threads: int = -1,
    batch_size: int = 64,
)

Bases: Index

HNSWLIB index. Uses the hnswlib library.

Note that hnswlib calculates its distance scores slightly differently; see https://github.com/nmslib/hnswlib#supported-distances for the definitions.

Source code in bocoel/corpora/indices/backend/hnswlib.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    threads: int = -1,
    batch_size: int = 64,
) -> None:
    utils.validate_embeddings(embeddings)
    embeddings = utils.normalize(embeddings)

    self._emb = embeddings

    # Would raise ValueError if not a valid distance.
    self._dist = Distance.lookup(distance)
    self._batch_size = batch_size

    self._boundary = utils.boundaries(embeddings)

    # A public attribute because this can be changed at anytime.
    self.threads = threads

    self._init_index()
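
To make the note about score calculation concrete, the sketch below reproduces the three distance definitions listed in the hnswlib README linked above (squared L2, one minus inner product, one minus cosine similarity) in plain NumPy. `hnswlib_style_distance` is a hypothetical helper written for this page and does not call hnswlib.

```python
import numpy as np


def hnswlib_style_distance(a, b, space):
    """Distances as defined in the hnswlib README for each `space` name."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if space == "l2":
        return float(np.sum((a - b) ** 2))  # squared L2, no square root
    if space == "ip":
        return float(1.0 - a @ b)
    if space == "cosine":
        return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    raise ValueError(f"Unknown space: {space}")


# The Euclidean distance between these points is 5.0; the "l2" score is its square.
print(hnswlib_style_distance([3.0, 4.0], [0.0, 0.0], "l2"))  # 25.0
```

Neighbor rankings are unchanged by squaring, but the returned values should not be compared directly with true Euclidean distances.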

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Inherited from Index. Validates the query and k, runs _search over the query in batches, and assembles the results into a SearchResultBatch.

Parameters

query: ArrayLike The query vector. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

A SearchResultBatch instance. See SearchResultBatch for details.

bocoel.FaissIndex

FaissIndex(
    embeddings: NDArray,
    distance: str | Distance,
    index_string: str,
    cuda: bool = False,
    batch_size: int = 64,
)

Bases: Index

Faiss index. Uses the faiss library.

Source code in bocoel/corpora/indices/backend/faiss.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    index_string: str,
    cuda: bool = False,
    batch_size: int = 64,
) -> None:
    utils.validate_embeddings(embeddings)
    embeddings = utils.normalize(embeddings)
    self._emb = embeddings

    self._batch_size = batch_size
    self._dist = Distance.lookup(distance)
    self._boundary = utils.boundaries(embeddings)

    self._index_string = index_string
    self._init_index(index_string=index_string, cuda=cuda)
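
Both backend constructors pass the embeddings through `utils.normalize` before indexing. Assuming that this scales each row to unit Euclidean norm (an assumption; its implementation is not shown on this page), the following sketch shows why squared L2 and inner product then rank neighbors identically. `l2_normalize` is a hypothetical stand-in.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm (assumed behaviour of utils.normalize)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
emb = l2_normalize(rng.normal(size=(50, 8)))
query = l2_normalize(rng.normal(size=(1, 8)))[0]

squared_l2 = np.sum((emb - query) ** 2, axis=1)
inner = emb @ query

# For unit vectors, ||a - b||^2 == 2 - 2 * (a . b),
# so the closest vector in L2 is also the one with the largest inner product.
identity_holds = np.allclose(squared_l2, 2.0 - 2.0 * inner)
same_nearest = int(np.argmin(squared_l2)) == int(np.argmax(inner))
print(identity_holds, same_nearest)
```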

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Inherited from Index. Validates the query and k, runs _search over the query in batches, and assembles the results into a SearchResultBatch.

Parameters

query: ArrayLike The query vector. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

A SearchResultBatch instance. See SearchResultBatch for details.

bocoel.WhiteningIndex

WhiteningIndex(
    embeddings: NDArray,
    distance: str | Distance,
    reduced: int,
    whitening_backend: type[Index],
    **backend_kwargs: Any
)

Bases: Index

Whitening index. Whitens the data before indexing. See https://arxiv.org/abs/2103.15316 for more info.

Source code in bocoel/corpora/indices/whitening.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    reduced: int,
    whitening_backend: type[Index],
    **backend_kwargs: Any,
) -> None:
    # `reduced` might be larger than the embedding dimensionality.
    # In that case, no dimensionality reduction is performed.
    if reduced > embeddings.shape[1]:
        reduced = embeddings.shape[1]
        LOGGER.info(
            "Reduced dimensionality is larger than embeddings. Using full dimensionality",
            reduced=reduced,
            embeddings=embeddings.shape,
        )

    white = self.whiten(embeddings, reduced)
    assert white.shape[1] == reduced, {
        "whitened": white.shape,
        "reduced": reduced,
    }
    self._index = whitening_backend(
        embeddings=white, distance=distance, **backend_kwargs
    )
    assert reduced == self._index.dims
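
The whitening step can be sketched as follows, using the SVD-based transform described in the paper linked above. This is a minimal sketch under that assumption; the `whiten` function here is a stand-in, and the method on WhiteningIndex may differ in detail.

```python
import numpy as np


def whiten(embeddings, reduced):
    """Whiten `embeddings` and keep the first `reduced` components."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / len(embeddings)
    # cov = U diag(S) U^T, so W = U diag(S^-1/2) maps the covariance to the identity.
    u, s, _ = np.linalg.svd(cov)
    w = u / np.sqrt(s)
    # Singular values are sorted in descending order, so the first columns
    # correspond to the directions of largest variance.
    return centered @ w[:, :reduced]


rng = np.random.default_rng(0)
data = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))  # correlated features
white = whiten(data, reduced=3)
print(white.shape)  # (500, 3)
```

After the transform, the retained components are zero-mean, uncorrelated, and have unit variance, which is what lets the backend index operate in the reduced space.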

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Inherited from Index. Validates the query and k, runs _search over the query in batches, and assembles the results into a SearchResultBatch.

Parameters

query: ArrayLike The query vector. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

A SearchResultBatch instance. See SearchResultBatch for details.

bocoel.PolarIndex

PolarIndex(
    embeddings: NDArray,
    distance: str | Distance,
    polar_backend: type[Index],
    **backend_kwargs: Any
)

Bases: Index

Index that uses N-sphere (spherical) coordinates as its interface.

See https://en.wikipedia.org/wiki/N-sphere#Spherical_coordinates

Source code in bocoel/corpora/indices/polar.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    polar_backend: type[Index],
    **backend_kwargs: Any,
) -> None:
    embeddings = utils.normalize(embeddings)
    self._index = polar_backend(
        embeddings=embeddings,
        distance=distance,
        **backend_kwargs,
    )
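
For reference, the standard conversion from N-sphere coordinates to Cartesian coordinates, as given on the linked Wikipedia page, can be sketched as below. `spherical_to_cartesian` is a hypothetical helper; how PolarIndex applies the conversion internally is not shown on this page.

```python
import numpy as np


def spherical_to_cartesian(r, angles):
    """Map a radius and n-1 angles to a point in n-dimensional Cartesian space."""
    angles = np.asarray(angles, dtype=float)
    # x_i = r * sin(a_1) * ... * sin(a_{i-1}) * cos(a_i); the last coordinate has no cosine.
    sines = np.concatenate([[1.0], np.cumprod(np.sin(angles))])
    cosines = np.concatenate([np.cos(angles), [1.0]])
    return r * sines * cosines


point = spherical_to_cartesian(2.0, [0.3, 1.2, 2.5])
print(point.shape)            # (4,)
print(np.linalg.norm(point))  # approximately 2.0, the radius
```

With a fixed radius, a point on the sphere in n dimensions is described by n - 1 angles, one coordinate fewer than its Cartesian form.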

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Inherited from Index. Validates the query and k, runs _search over the query in batches, and assembles the results into a SearchResultBatch.

Parameters

query: ArrayLike The query vector. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

A SearchResultBatch instance. See SearchResultBatch for details.

bocoel.StatefulIndex

StatefulIndex(index: Index)

Bases: Index

An index wrapper that records the history of its searches.

Source code in bocoel/corpora/indices/stateful.py
def __init__(self, index: Index) -> None:
    self._index = index
    self._clear_history()

dims property

dims: int

The number of dimensions that the query vector must have.

history property

history: Sequence[SearchResult]

History for looking up the results of previous searches with index handles.

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Inherited from Index. Validates the query and k, runs _search over the query in batches, and assembles the results into a SearchResultBatch.

Parameters

query: ArrayLike The query vector. Must be of shape [batch, dims].

k: int The number of nearest neighbors to return.

Returns

A SearchResultBatch instance. See SearchResultBatch for details.

stateful_search

stateful_search(query: ArrayLike, k: int = 1) -> Mapping[int, SearchResult]

Search while tracking states.

Parameters

query: ArrayLike The query to search for.

k: int = 1 The number of nearest neighbors to return.

Returns

Mapping[int, SearchResult] A mapping from the index of the search result to the search result. The index is used for retrieving the search result later. Do so by indexing the history property of this object.

Source code in bocoel/corpora/indices/stateful.py
def stateful_search(
    self, query: ArrayLike, k: int = 1
) -> Mapping[int, SearchResult]:
    """
    Search while tracking states.

    Parameters
    ----------

    `query: ArrayLike`
    The query to search for.

    `k: int = 1`
    The number of nearest neighbors to return.

    Returns
    -------

    `Mapping[int, SearchResult]`
    A mapping from the index of the search result to the search result.
    The index is used for retrieving the search result later.
    Do so by indexing the `history` property of this object.
    """

    result = self.search(query=query, k=k)
    prev_len = len(self._history)
    splitted = utils.split_search_result_batch(result)
    self._history.extend(splitted)
    return dict(zip(range(prev_len, len(self._history)), splitted))
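
The handle bookkeeping above can be seen in a toy version that stores plain strings instead of search results. `History` and `record` are hypothetical names used only for this illustration.

```python
class History:
    """Toy stand-in for the bookkeeping in stateful_search."""

    def __init__(self):
        self._history = []

    def record(self, results):
        prev_len = len(self._history)
        self._history.extend(results)
        # Each handle is the position of a result in the history list.
        return dict(zip(range(prev_len, len(self._history)), results))


tracker = History()
first = tracker.record(["a", "b"])
second = tracker.record(["c"])
print(first)   # {0: 'a', 1: 'b'}
print(second)  # {2: 'c'}
```

Because handles are list positions, they stay valid across later searches for as long as the history is not cleared.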

bocoel.Boundary dataclass

The boundary of embeddings in a corpus. The boundary is defined as a hyperrectangle in the embedding space.

bounds instance-attribute

bounds: NDArray

The boundary arrays of the corpus. Must be of shape [dims, 2], where dims is the number of dimensions. The first column is the lower bound, the second column is the upper bound.

dims property

dims: int

The number of dimensions.

lower property

lower: NDArray

The lower bounds. Must be of shape [dims].

upper property

upper: NDArray

The upper bounds. Must be of shape [dims].

fixed classmethod

fixed(lower: float, upper: float, dims: int) -> Self

Create a boundary with fixed bounds. If lower > upper, a ValueError is raised.

Parameters

lower: float The lower bound.

upper: float The upper bound.

dims: int The number of dimensions.

Returns

A Boundary instance.

Source code in bocoel/corpora/indices/interfaces/boundaries.py
@classmethod
def fixed(cls, lower: float, upper: float, dims: int) -> Self:
    """
    Create a boundary with fixed bounds.
    If `lower > upper`, a `ValueError` is raised.

    Parameters
    ----------

    `lower: float`
    The lower bound.

    `upper: float`
    The upper bound.

    `dims: int`
    The number of dimensions.

    Returns
    -------

    A `Boundary` instance.
    """

    if lower > upper:
        raise ValueError("Expected lower <= upper")

    return cls(bounds=np.array([[lower, upper]] * dims))
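
The layout of bounds and the behaviour of fixed can be tried out with a small stand-in class. `Box` is hypothetical and only mirrors the attributes documented here; the clipping step at the end is an illustrative use, not something this page says bocoel does.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """Hypothetical stand-in for Boundary: one [lower, upper] row per dimension."""

    bounds: np.ndarray  # shape [dims, 2]

    @property
    def lower(self):
        return self.bounds[:, 0]

    @property
    def upper(self):
        return self.bounds[:, 1]

    @classmethod
    def fixed(cls, lower, upper, dims):
        if lower > upper:
            raise ValueError("Expected lower <= upper")
        return cls(bounds=np.array([[lower, upper]] * dims))


box = Box.fixed(-1.0, 1.0, dims=3)
print(box.bounds.shape)  # (3, 2)

# Illustrative use: clip an out-of-range point back into the hyperrectangle.
clipped = np.clip([2.0, 0.5, -3.0], box.lower, box.upper)
print(clipped.tolist())  # [1.0, 0.5, -1.0]
```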

bocoel.Distance

Bases: StrEnum
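
The constructors above call `Distance.lookup(distance)`, which according to their comments raises a ValueError for an invalid distance. A string-valued enum with such a lookup could look like the sketch below. `Metric` and its members are hypothetical, since the actual Distance members are not listed on this page, and `(str, Enum)` is used in place of StrEnum so the example also runs on Python versions before 3.11.

```python
from enum import Enum


class Metric(str, Enum):
    # Hypothetical members, chosen only for illustration.
    L2 = "l2"
    INNER_PRODUCT = "ip"

    @classmethod
    def lookup(cls, name):
        """Accept a member or its string value; raise ValueError otherwise."""
        return cls(name)


print(Metric.lookup("l2") is Metric.L2)       # True
print(Metric.lookup(Metric.L2) is Metric.L2)  # True
```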

bocoel.corpora.indices.interfaces.results._SearchResult dataclass

query instance-attribute

query: NDArray

Query vector. Shape [batch, dims] if batched, otherwise [dims].

vectors instance-attribute

vectors: NDArray

Nearest neighbors. Shape [batch, k, dims] if batched, otherwise [k, dims].

distances instance-attribute

distances: NDArray

Calculated distances. Shape [batch, k] if batched, otherwise [k].

indices instance-attribute

indices: NDArray

Indices into the original embeddings. Must be integers. Shape [batch, k] if batched, otherwise [k].

bocoel.corpora.SearchResultBatch dataclass

Bases: _SearchResult

A batched version of search result.

query instance-attribute

query: NDArray

Query vectors, of shape [batch, dims].

vectors instance-attribute

vectors: NDArray

Nearest neighbors, of shape [batch, k, dims].

distances instance-attribute

distances: NDArray

Calculated distances, of shape [batch, k].

indices instance-attribute

indices: NDArray

Indices into the original embeddings, of shape [batch, k]. Must be integers.

bocoel.corpora.SearchResult dataclass

Bases: _SearchResult

A non-batched version of search result.

query instance-attribute

query: NDArray

Query vector, of shape [dims].

vectors instance-attribute

vectors: NDArray

Nearest neighbors, of shape [k, dims].

distances instance-attribute

distances: NDArray

Calculated distances, of shape [k].

indices instance-attribute

indices: NDArray

Indices into the original embeddings, of shape [k]. Must be integers.
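
The relationship between the batched and non-batched results comes down to dropping the leading batch axis. The sketch below splits one batched result into per-query results; `Result` and `split_batch` are hypothetical stand-ins and not bocoel's `utils.split_search_result_batch`.

```python
from typing import NamedTuple

import numpy as np


class Result(NamedTuple):
    # Hypothetical container with the four documented fields.
    query: np.ndarray
    vectors: np.ndarray
    distances: np.ndarray
    indices: np.ndarray


def split_batch(batch):
    """Turn one batched result into a list of per-query results."""
    rows = zip(batch.query, batch.vectors, batch.distances, batch.indices)
    return [Result(*row) for row in rows]


batch_size, k, dims = 4, 2, 3
batch = Result(
    query=np.zeros((batch_size, dims)),
    vectors=np.zeros((batch_size, k, dims)),
    distances=np.zeros((batch_size, k)),
    indices=np.zeros((batch_size, k), dtype=int),
)
singles = split_batch(batch)
print(len(singles), singles[0].vectors.shape)  # 4 (2, 3)
```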