Text Embeddings
Embeddings are a numerical representation of the meaning of some data. Text embeddings represent something about the meaning of some text. Embeddings can be used either directly or as input to machine learning models. In this chapter, we will learn how to use Kalosm to create text embeddings and integrate them into a database. We will also learn how to search the embedding database for documents that are similar to a given query.
Creating an Embedding model
First, we need to create an embedding model. An embedding model is a machine learning model that can be used to create embeddings. Kalosm provides a Bert
struct that can be used to create an embedding model.
use kalosm::language::*; let bert = Bert::new().await?;
Bonus: Download progress
If you need to update progress while you are downloading the model, you can use the bert builder with the
build_with_loading_handler
method.let bert = Bert::builder() .build_with_loading_handler(|progress| match progress { ModelLoadingProgress::Downloading { source, start_time, progress, } => { let progress = (progress * 100.0) as u32; let elapsed = start_time.elapsed().as_secs_f32(); println!("Downloading file {source} {progress}% ({elapsed}s)"); } ModelLoadingProgress::Loading { progress } => { let progress = (progress * 100.0) as u32; println!("Loading model {progress}%"); } }) .await?;
Creating Embeddings
Once we have created an embedding model, we can use it to create embeddings. Kalosm provides a Bert
struct that can be used to create embeddings.
let text = "Hello, world!"; let embeddings = bert.embed(text).await?; println!("{:?}", embeddings);
Try different values for the text we are embedding. How does the embedding change?
Creating an Embedding Database
Now that we know how to create embeddings, we can use them to create an embedding database. An embedding database is a data structure that stores embeddings and allows you to search for documents that are similar to a given query. Kalosm provides a DocumentTable
struct that can be used to create an embedding database linked to a table in a Surrealdb database. You can choose a chunk strategy to use when creating the embedding database. A chunk strategy determines how documents are split into chunks before being embedded. In this example, we will use the Sentence
chunk strategy, which splits documents into sentences before embedding them. The bert embedding model tends to work best with single sentence chunks.
use kalosm::language::*; // Create database connection let db = surrealdb::Surreal::new::<surrealdb::engine::local::RocksDb>("./db/temp.db").await?; // Select a specific namespace / database db.use_ns("test").use_db("test").await?; // Create a document table let document_table = db .document_table_builder("documents") // Store the embedding database in the ./db/embeddings.db file .at("./db/embeddings.db") .build() .await?;
Adding Documents to the Embedding Database
Once you have created an embedding database, you can add documents to it with the extend
method. The extend
method takes something that can be turned into documents and adds them to the embedding database. In this example, we will add documents from a RSS feed to the embedding database.
let nyt = RssFeed::new(Url::parse( "https://rss.nytimes.com/services/xml/rss/nyt/US.xml", )?); // Fetch the documents from the feed let documents = nyt.into_documents().await?; // And insert them into the database for document in documents { document_table.insert(document).await?; }
This example uses rss context, but you can also use audio, filesystem, or search engine contextYou can also use a fuzzy search engine with the same api if you prefer traditional search
Searching the Embedding Database
Next, you can use search through the documents you embedded with the search
method. The search
method takes a query and returns a list of documents that are similar to the query. The search
method also takes a limit
parameter that determines how many documents to return.
loop { let user_question = prompt_input("Query: ")?; let user_question_embedding = document_table .embedding_model() .embed(&user_question) .await?; println!( "vector: {:?}", document_table .select_nearest(user_question_embedding, 5) .await? ); }
Conclusion
In this chapter, we learned how to use Kalosm to create embeddings and integrate them into a database. We also learned how to search the embedding database for similar documents.