Corpus
bocoel.Corpus
Bases: Protocol
Corpus is the entry point to handling the data in this library.
A corpus has 3 main components:
- Index: Searches one particular column in the storage. Provides fast retrieval.
- Storage: Used to store the questions / answers / texts.
- Embedder: Embeds the text into vectors for faster access.
An index only corresponds to one key. If search over multiple keys is desired, a new column or a new corpus (with shared storage) should be created.
storage instance-attribute
storage: Storage
Storage is used to store the questions / answers / etc. Can be viewed as a dataframe of texts.
index instance-attribute
index: StatefulIndex
Index searches one particular column in the storage, using the vectors that column was embedded into.
bocoel.ComposedCorpus dataclass
Bases: Corpus
Simply a collection of components.
index_storage classmethod
index_storage(
storage: Storage,
embedder: Embedder,
keys: Sequence[str],
index_backend: type[Index],
concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
**index_kwargs: Any
) -> Self
Creates a corpus from the given storage, embedder, keys and index class.
Parameters
storage: Storage
Storage is used to store the questions / answers / etc. Can be viewed as a dataframe of texts.
embedder: Embedder
Embedder is used to embed the texts into vectors. It should provide the number of dims for the index to look into.
keys: Sequence[str]
The keys of the columns to search over.
concat: Callable[[Iterable[Any]], str] = " [SEP] ".join
Function to concatenate the values of the selected columns into a single string.
index_backend: type[Index]
The index class to use. Creates an index from the embeddings generated by the embedder.
**index_kwargs: Any
Additional keyword arguments to pass to the index class.
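As a rough illustration of what the concat parameter does, the sketch below applies the default from the signature (`" [SEP] ".join`) to the values of the selected keys for a single row. It does not call bocoel itself. The `row` dictionary, the `keys` list and the `concat_any` helper are hypothetical names introduced only for this example.

```python
from typing import Any, Iterable

# Default from the signature above: a bound str.join,
# so every value passed to it must already be a string.
default_concat = " [SEP] ".join


def concat_any(values: Iterable[Any]) -> str:
    """Hypothetical alternative that tolerates non-string values."""
    return " [SEP] ".join(str(v) for v in values)


# A single made-up row, viewed as a mapping from column key to value.
row = {"question": "What is 2 + 2?", "answer": 4, "topic": "arithmetic"}
keys = ["question", "topic"]

# Join the values of the selected keys into one string to embed.
text = default_concat(row[k] for k in keys)

# The integer answer would make str.join raise TypeError,
# so the custom function converts each value first.
mixed = concat_any(row[k] for k in ["question", "answer"])
```

A custom concat like this may be useful when the selected columns hold numbers or other non-string values.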
Source code in bocoel/corpora/corpora/composed.py
index_mapped classmethod
index_mapped(
storage: Storage,
embedder: Embedder,
transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]],
index_backend: type[Index],
**index_kwargs: Any
) -> Self
Creates a corpus from the given storage, embedder and index class. Storage entries are mapped to strings by the specified batched transform function.
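A minimal sketch of a function matching the transform type in the signature, `Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]`: it receives a batch as a mapping from column key to a sequence of values and returns one string per entry. The column names `question` and `answer` and the output format are assumptions made for illustration, not part of the library.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def question_with_answer(batch: Mapping[str, Sequence[Any]]) -> Sequence[str]:
    """Hypothetical batched transform: one output string per storage entry."""
    questions = batch["question"]
    answers = batch["answer"]
    # Pair the columns entry by entry and format each pair as one text.
    return [f"Q: {q} A: {a}" for q, a in zip(questions, answers)]


# A made-up batch of two entries, column-oriented.
batch = {
    "question": ["What is 2 + 2?", "What is the capital of France?"],
    "answer": [4, "Paris"],
}
texts = question_with_answer(batch)
```

Because the function sees whole columns at once, it can combine or reformat several columns freely, which a plain join over keys cannot do.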
Source code in bocoel/corpora/corpora/composed.py
index_embeddings classmethod
index_embeddings(
storage: Storage,
embeddings: NDArray,
index_backend: type[Index],
**index_kwargs: Any
) -> Self
Creates the corpus with the given embeddings. This can be used to save time by encoding once and caching the embeddings.
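The encode-once-and-cache pattern mentioned above could look roughly like the sketch below, which stores the array with NumPy and reloads it on later runs. `toy_embed` is a hypothetical stand-in for a real embedder, and `load_or_encode` and the cache file name are likewise assumptions. The resulting array is the kind of value that would be passed as the embeddings argument.

```python
import os
import tempfile

import numpy as np


def toy_embed(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in for an embedder: one 4-dim vector per text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 4)).astype(np.float32)


def load_or_encode(texts: list[str], cache_path: str) -> np.ndarray:
    """Return cached embeddings if present, otherwise encode and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = toy_embed(texts)
    # cache_path ends in .npy, so np.save writes to exactly this path.
    np.save(cache_path, embeddings)
    return embeddings


texts = ["first text", "second text", "third text"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npy")
    first = load_or_encode(texts, path)   # encodes and writes the cache
    second = load_or_encode(texts, path)  # reads the cache, no encoding
```

This sketch does not detect changes to the texts; a real cache would need some way to invalidate stale embeddings.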
Source code in bocoel/corpora/corpora/composed.py