Corpus - Bayesian Optimization as a Coverage Tool for Evaluating LLM

Corpus

bocoel.Corpus

Bases: Protocol

Corpus is the entry point to handling the data in this library.

A corpus has 3 main components:

- Index: Searches one particular column in the storage. Provides fast retrieval.
- Storage: Used to store the questions / answers / texts.
- Embedder: Embeds the text into vectors for faster access.

An index only corresponds to one key. If search over multiple keys is desired, a new column or a new corpus (with shared storage) should be created.
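The one-index-per-key rule can be illustrated with a minimal sketch. `ToyStorage`, `ToyIndex` and `ToyCorpus` below are hypothetical stand-ins, not bocoel classes, and the index does a plain substring match rather than a vector search. The point is only that two corpora can share one storage while each index covers a single column.

```python
from dataclasses import dataclass


class ToyStorage:
    """Hypothetical stand-in for a Storage: a column-oriented table of texts."""

    def __init__(self, columns: dict[str, list[str]]) -> None:
        self.columns = columns


class ToyIndex:
    """Hypothetical stand-in for an Index: covers exactly one column (key)."""

    def __init__(self, storage: ToyStorage, key: str) -> None:
        self.key = key
        self.entries = list(storage.columns[key])

    def search(self, query: str) -> list[int]:
        # Substring match replaces vector search to keep the sketch dependency-free.
        return [i for i, text in enumerate(self.entries) if query in text]


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a corpus: a storage plus one index."""

    storage: ToyStorage
    index: ToyIndex


storage = ToyStorage(
    {
        "question": ["What is the capital of France?", "Who wrote Hamlet?"],
        "answer": ["Paris", "William Shakespeare"],
    }
)

# Searching over two keys means two corpora that share the same storage.
by_question = ToyCorpus(storage=storage, index=ToyIndex(storage, "question"))
by_answer = ToyCorpus(storage=storage, index=ToyIndex(storage, "answer"))

question_hits = by_question.index.search("Hamlet")
answer_hits = by_answer.index.search("Paris")
```

Each index returns row positions, so results from either corpus can be looked up in the shared storage.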

storage instance-attribute

storage: Storage

Storage is used to store the questions / answers / etc. Can be viewed as a dataframe of texts.

index instance-attribute

Index searches over the vector embeddings of one particular column in the storage.

bocoel.ComposedCorpus dataclass

Bases: Corpus

Simply a collection of components.

index_storage classmethod

index_storage(
    storage: Storage,
    embedder: Embedder,
    keys: Sequence[str],
    index_backend: type[Index],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
    **index_kwargs: Any
) -> Self

Creates a corpus from the given storage, embedder, keys and index class.

Parameters

storage: Storage
Storage is used to store the questions / answers / etc. Can be viewed as a dataframe of texts.

embedder: Embedder
Embedder is used to embed the texts into vectors. It should provide the number of dims for the index to look into.

keys: Sequence[str]
The keys of the columns to search over.

index_backend: type[Index]
The index class to use. Creates an index from the embeddings generated by the embedder.

concat: Callable[[Iterable[Any]], str] = " [SEP] ".join
Function that joins the values of the selected columns into a single string for each row.

**index_kwargs: Any
Additional keyword arguments to pass to the index class.
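How `keys` and `concat` interact is easiest to see on a small batch. The sketch below mirrors the inner `transform` function from the source listing for this method. The wrapper `make_transform` and the column names are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Iterable, Mapping, Sequence


def make_transform(
    keys: Sequence[str],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
) -> Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]:
    """Hypothetical helper that builds the batched transform used for indexing."""

    def transform(mapping: Mapping[str, Sequence[Any]]) -> Sequence[str]:
        # Select the requested columns, then join them row by row.
        data = [mapping[k] for k in keys]
        return [concat(datum) for datum in zip(*data)]

    return transform


# A batch is a mapping from column name to a sequence of values.
batch = {
    "question": ["What is 2 + 2?", "Capital of France?"],
    "answer": ["4", "Paris"],
    "source": ["math", "geo"],
}

default_texts = make_transform(["question", "answer"])(batch)
custom_texts = make_transform(["question", "answer"], concat=" -> ".join)(batch)
```

With the default `concat`, each row becomes one string such as `What is 2 + 2? [SEP] 4`. Columns not listed in `keys` (here `source`) are ignored.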

Source code in bocoel/corpora/corpora/composed.py
@classmethod
def index_storage(
    cls,
    storage: Storage,
    embedder: Embedder,
    keys: Sequence[str],
    index_backend: type[Index],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
    **index_kwargs: Any,
) -> Self:
    """
    Creates a corpus from the given storage, embedder, keys and index class.

    Parameters
    ----------

    `storage: Storage`
    Storage is used to store the questions / answers / etc.
    Can be viewed as a dataframe of texts.

    `embedder: Embedder`
    Embedder is used to embed the texts into vectors.
    It should provide the number of dims for the index to look into.

    `keys: Sequence[str]`
    The keys of the columns to search over.

    `concat: Callable[[Iterable[Any]], str] = " [SEP] ".join`
    Function that joins the values of the selected columns
    into a single string for each row.

    `index_backend: type[Index]`
    The index class to use.
    Creates an index from the embeddings generated by the embedder.

    `**index_kwargs: Any`
    Additional keyword arguments to pass to the index class.
    """

    def transform(mapping: Mapping[str, Sequence[Any]]) -> Sequence[str]:
        data = [mapping[k] for k in keys]
        return [concat(datum) for datum in zip(*data)]

    return cls.index_mapped(
        storage=storage,
        embedder=embedder,
        transform=transform,
        index_backend=index_backend,
        **index_kwargs,
    )

index_mapped classmethod

index_mapped(
    storage: Storage,
    embedder: Embedder,
    transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]],
    index_backend: type[Index],
    **index_kwargs: Any
) -> Self

Creates a corpus from the given storage, embedder and index class. Storage entries are mapped to strings by the specified batched transform function before being embedded.
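A batched transform receives a mapping from column names to sequences of values and returns one string per row. The following is a minimal sketch of such a function. The name `qa_prompt_transform` and the `question` / `answer` columns are assumptions made for illustration, not part of the library.

```python
from typing import Any, Mapping, Sequence


def qa_prompt_transform(batch: Mapping[str, Sequence[Any]]) -> Sequence[str]:
    """Hypothetical batched transform: one formatted string per storage row."""
    questions = batch["question"]
    answers = batch["answer"]
    # Strip stray whitespace so that it does not influence the embeddings.
    return [
        f"Q: {str(q).strip()} A: {str(a).strip()}"
        for q, a in zip(questions, answers)
    ]


batch = {
    "question": ["  Who wrote Hamlet?", "What is 2 + 2?"],
    "answer": ["William Shakespeare ", 4],
}
prompts = qa_prompt_transform(batch)
```

A function of this shape could be passed as the `transform` argument when joining columns with a fixed separator is not enough.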

Source code in bocoel/corpora/corpora/composed.py
@classmethod
def index_mapped(
    cls,
    storage: Storage,
    embedder: Embedder,
    transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]],
    index_backend: type[Index],
    **index_kwargs: Any,
) -> Self:
    """
    Creates a corpus from the given storage, embedder and index class.
    Storage entries are mapped to strings by the specified
    batched transform function before being embedded.
    """

    embeddings = embedder.encode_storage(storage, transform=transform)
    return cls.index_embeddings(
        embeddings=embeddings,
        storage=storage,
        index_backend=index_backend,
        **index_kwargs,
    )

index_embeddings classmethod

index_embeddings(
    storage: Storage,
    embeddings: NDArray,
    index_backend: type[Index],
    **index_kwargs: Any
) -> Self

Create the corpus with the given embeddings. This can be used to save time by encoding once and caching embeddings.
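The encode-once-and-cache pattern mentioned above can be sketched as follows. To stay dependency-free, this example stores embeddings as nested lists in a JSON file. With real `NDArray` embeddings a NumPy file format would be the more natural choice. `load_or_encode` and `toy_encode` are hypothetical names, and the toy encoder is not a real embedder.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Sequence

Embeddings = list[list[float]]


def load_or_encode(
    cache_path: Path,
    texts: Sequence[str],
    encode: Callable[[Sequence[str]], Embeddings],
) -> Embeddings:
    """Hypothetical helper: reuse cached embeddings if present, else encode and save."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    embeddings = encode(texts)
    cache_path.write_text(json.dumps(embeddings))
    return embeddings


calls: list[int] = []


def toy_encode(texts: Sequence[str]) -> Embeddings:
    # Stand-in for an expensive embedder; records each call for inspection.
    calls.append(len(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings.json"
    first = load_or_encode(cache, ["hello world", "hi"], toy_encode)
    # The second call reads the cache and does not invoke the encoder again.
    second = load_or_encode(cache, ["hello world", "hi"], toy_encode)
```

The cached embeddings would then be handed to `index_embeddings` together with the storage, so that several index backends can be tried without re-encoding.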

Source code in bocoel/corpora/corpora/composed.py
@classmethod
def index_embeddings(
    cls,
    storage: Storage,
    embeddings: NDArray,
    index_backend: type[Index],
    **index_kwargs: Any,
) -> Self:
    """
    Create the corpus with the given embeddings.
    This can be used to save time by encoding once and caching embeddings.
    """

    index = index_backend(embeddings, **index_kwargs)
    return cls(index=StatefulIndex(index), storage=storage)