Embeddings with sqlite-vector

About a year and a half ago I wrote about using sqlite-vss to store and query embedding vectors in a SQLite database. Much has changed since then, and I’m working on a project that motivated another pass at querying embeddings on a local system for smallish datasets. The sqlite-vector project seemed like an interesting one to try for this purpose.

I am going to use the same news dataset as last time and the nomic-embed-text-v1.5 model to generate 768-dimensional embeddings.

I also downloaded the vector.dylib file from the sqlite-vector GitHub repo and placed it in my working directory for this example. I’ve tried exercises similar to this one with both the macOS and Linux versions of the library.
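
If you’re following along on another platform, the only change should be the library filename. Here’s a minimal sketch of picking the binary by OS; the Linux and Windows filenames are assumptions based on typical shared-library naming, so check the release assets for your platform:

import platform
# Pick the extension binary for the current OS. vector.dylib is the macOS
# build I downloaded; vector.so and vector.dll are assumed names for the
# Linux and Windows builds.
ext_path = {
    'Darwin': './vector.dylib',
    'Linux': './vector.so',
    'Windows': './vector.dll',
}[platform.system()]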

To get started, we’ll install the libraries we need to load the data, create the database, and generate embeddings.

pip install -q pandas scikit-learn sentence-transformers

Next, we’ll create the database, load the sqlite-vector extension, and verify that it loaded correctly.

import sqlite3
import json
from pathlib import Path
conn = sqlite3.connect('news.db')
conn.enable_load_extension(True)
conn.load_extension('./vector.dylib')
conn.enable_load_extension(False)
version = conn.execute("SELECT vector_version()").fetchone()[0]
print(f"SQLite Vector extension version: {version}")
SQLite Vector extension version: 0.9.37

Looks good! The query shows the same version I downloaded.

If you haven’t taken a look at the dataset yet, here’s a sample:

import pandas as pd

# Load the full dataset (one JSON object per line), then peek at the first few rows
data_path = Path('data/News_Category_Dataset_v3.json')
articles = []
with open(data_path, 'r') as f:
    for line in f:
        articles.append(json.loads(line))
df = pd.DataFrame(articles)
df.head()
|   | link | headline | category | short_description | authors | date |
|---|------|----------|----------|-------------------|---------|------|
| 0 | https://www.huffpost.com/entry/covid-boosters- | Over 4 Million Americans Roll Up Sleeves For O… | U.S. NEWS | Health experts said it is too early to predict… | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin | American Airlines Flyer Charged, Banned For Li… | U.S. NEWS | He was subdued by passengers and crew when he … | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets | 23 Of The Funniest Tweets About Cats And Dogs … | COMEDY | “Until you have a dog you don’t understand wha… | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent | The Funniest Tweets From Parents This Week (Se… | PARENTING | “Accidentally put grown-up toothpaste on my to… | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose | Woman Who Called Cops On Black Bird-Watcher Lo… | U.S. NEWS | Amy Cooper accused investment firm Franklin Te… | Nina Golgowski | 2022-09-22 |

From here, we’ll use the same database connection to create a table to store the news dataset as well as embeddings for both the short_description and headline fields of the dataset.

I’m only using the first 1000 entries because it takes a few minutes to compute all the embeddings.

conn.execute("""
CREATE TABLE news_articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
link TEXT,
headline TEXT,
category TEXT,
short_description TEXT,
authors TEXT,
date TEXT,
headline_embedding BLOB,
description_embedding BLOB
)
""")
for _, article in df.head(1000).iterrows():
conn.execute("""
INSERT INTO news_articles (link, headline, category, short_description, authors, date)
VALUES (?, ?, ?, ?, ?, ?)
""", (
article['link'],
article['headline'],
article['category'],
article['short_description'],
article['authors'],
article['date']
))
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM news_articles").fetchone()[0]
print(f"Loaded {count:,} articles")
print("\nSample articles:")
for row in conn.execute("SELECT id, headline, category FROM news_articles LIMIT 5"):
print(f" [{row[2]}] {row[1][:70]}...")
Loaded 1,000 articles
Sample articles:
[U.S. NEWS] Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Bo...
[U.S. NEWS] American Airlines Flyer Charged, Banned For Life After Punching Flight...
[COMEDY] 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)...
[PARENTING] The Funniest Tweets From Parents This Week (Sept. 17-23)...
[U.S. NEWS] Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-E...
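
As an aside, the row-by-row insert loop above could be collapsed into a single executemany call, which is usually faster for bulk loads. A minimal sketch using the same columns:

cols = ['link', 'headline', 'category', 'short_description', 'authors', 'date']
rows = df.head(1000)[cols].itertuples(index=False, name=None)
# executemany runs the prepared INSERT once per tuple in rows
conn.executemany("""
    INSERT INTO news_articles (link, headline, category, short_description, authors, date)
    VALUES (?, ?, ?, ?, ?, ?)
""", rows)
conn.commit()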

With the source data loaded, we’ll now generate embeddings for the headline and short_description fields using the nomic-embed-text-v1.5 model and store them in the database as well.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
print(f"Model loaded (embedding dimension: {model.get_sentence_embedding_dimension()})")
Model loaded (embedding dimension: 768)

articles = conn.execute("""
    SELECT id, headline, short_description
    FROM news_articles
""").fetchall()

batch_size = 32
for i in range(0, len(articles), batch_size):
    batch = articles[i:i + batch_size]
    article_ids = [article[0] for article in batch]
    headlines = [article[1] for article in batch]
    descriptions = [article[2] if article[2] else "" for article in batch]
    headline_embeddings = model.encode(headlines, normalize_embeddings=True)
    description_embeddings = model.encode(descriptions, normalize_embeddings=True)
    # Store each embedding in the database as a BLOB of raw float32 bytes
    for article_id, h_emb, d_emb in zip(article_ids, headline_embeddings, description_embeddings):
        h_blob = h_emb.tobytes()
        d_blob = d_emb.tobytes()
        conn.execute(
            "UPDATE news_articles SET headline_embedding = ?, description_embedding = ? WHERE id = ?",
            (h_blob, d_blob, article_id)
        )
    # Commit and show progress every 4,000 articles (helpful if you load the full dataset)
    if (i + batch_size) % 4000 == 0 or (i + batch_size) >= len(articles):
        conn.commit()
        print(f"Processed {min(i + batch_size, len(articles)):,}/{len(articles):,} articles")
Processed 1,000/1,000 articles

With the embeddings generated and inserted into the database, we’ll now initialize the vector search for both the headline and short_description fields. According to the documentation:

Only tables explicitly initialized using vector_init are eligible for vector search.

So let’s do that:

conn.execute("""
SELECT vector_init(
'news_articles',
'headline_embedding',
'type=FLOAT32,dimension=768,distance=COSINE'
)
""")
conn.execute("""
SELECT vector_init(
'news_articles',
'description_embedding',
'type=FLOAT32,dimension=768,distance=COSINE'
)
""")

We could also quantize the vectors for faster and more memory-efficient search, but let’s save that for another time.

With our embeddings generated and the vector search initialized, we can now define the search functions over the two fields. Before we do, let’s briefly walk through how they work. We start by taking query_text and encoding it with the same model we used to generate the embeddings.

encoded = model.encode(["hello world!"])[0]
encoded[:20] # truncated for brevity
array([ 0.12390442, -0.06160246, -3.9840953 , -0.24632797, -0.3041943 ,
1.5918531 , -0.20516105, -0.6006584 , -0.22573042, -1.319676 ,
0.31299376, 1.2585417 , 0.49875298, 1.381727 , 0.5602437 ,
-1.0751574 , 0.27402204, -1.1879947 , -0.98643625, 0.46151334],
dtype=float32)

Specifically, the length of the vector equals the embedding dimension of the model.

encoded.shape
(768,)

Given this approach, let’s show how we can calculate the distance of a query from one of the headlines in the dataset. As we specified in vector_init, we will use cosine distance (distance=COSINE).

from sklearn.metrics.pairwise import cosine_similarity

query = "sports"
headline = "New York City To Let Unvaccinated Athletes Play Home Games"
query_embedding = model.encode([query], normalize_embeddings=True)
headline_embedding = model.encode([headline], normalize_embeddings=True)
similarity = cosine_similarity(query_embedding, headline_embedding)
# Cosine distance is 1 - cosine similarity
print(1 - similarity)
[[0.34292662]]
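
Because both vectors are normalized to unit length, cosine similarity reduces to a plain dot product, so we can get the same distance without scikit-learn:

import numpy as np
# For unit-length vectors, cosine similarity is just the dot product,
# so cosine distance is 1 - dot(q, h). Prints ~0.3429, matching the value above.
print(1 - np.dot(query_embedding[0], headline_embedding[0]))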

Back to our dataset. Since I normalized the embeddings when I encoded them above, we need to normalize the query embedding the same way. Finally, we get the output as an array that we convert to a BLOB (the column type of our embeddings) and pass to the vector_full_scan function.

f"{model.encode(["hello world!"], normalize_embeddings=True).tobytes()[:20]}..."
"b'\\r-\\xb4;\\xa8(3\\xbb\\xdf\\x0b5\\xbej\\x193\\xbc*,]\\xbc'..."
def search_by_headline(conn, query_text, top_k=5):
    query_embedding = model.encode([query_text], normalize_embeddings=True)
    query_blob = query_embedding.tobytes()
    results = conn.execute("""
        SELECT a.id, a.headline, a.category, a.short_description, v.distance
        FROM news_articles AS a
        JOIN vector_full_scan('news_articles', 'headline_embedding', ?, ?) AS v
          ON a.id = v.rowid
        ORDER BY v.distance ASC
    """, (query_blob, top_k)).fetchall()
    return results

def search_by_description(conn, query_text, top_k=5):
    query_embedding = model.encode([query_text], normalize_embeddings=True)
    query_blob = query_embedding.tobytes()
    results = conn.execute("""
        SELECT a.id, a.headline, a.category, a.short_description, v.distance
        FROM news_articles AS a
        JOIN vector_full_scan('news_articles', 'description_embedding', ?, ?) AS v
          ON a.id = v.rowid
        ORDER BY v.distance ASC
    """, (query_blob, top_k)).fetchall()
    return results
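
The two functions differ only in the embedding column they scan, so they could also be folded into one parameterized helper. A sketch (the column name is interpolated into the SQL string because SQLite parameters can’t bind identifiers; both values here come from our own code, not user input):

def search(conn, query_text, column, top_k=5):
    query_blob = model.encode([query_text], normalize_embeddings=True).tobytes()
    return conn.execute(f"""
        SELECT a.id, a.headline, a.category, a.short_description, v.distance
        FROM news_articles AS a
        JOIN vector_full_scan('news_articles', '{column}', ?, ?) AS v
          ON a.id = v.rowid
        ORDER BY v.distance ASC
    """, (query_blob, top_k)).fetchall()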

Let’s run some semantic queries and see what we find! This should look familiar — we’re using the same query we embedded and calculated cosine distance for above.

query = "sports"
headline_results = search_by_headline(conn, query, top_k=3)
for i, (article_id, headline, category, description, distance) in enumerate(headline_results, 1):
print(f"{i}. [{category}] {headline} (distance: {distance:.4f})")
1. [U.S. NEWS] New York City To Let Unvaccinated Athletes Play Home Games (distance: 0.3429)
2. [SPORTS] Video Shows NFL Player Hopping Into Stands To Stop Fighting Fans (distance: 0.3828)
3. [POLITICS] NBA Won't Host Election Day Games In Effort To Increase Voter Turnout (distance: 0.4232)

The first (and closest) headline match has exactly the distance we computed by hand above (0.3429).

Now we can run a similar query on the description field:

description_results = search_by_description(conn, query, top_k=3)
for i, (article_id, headline, category, description, distance) in enumerate(description_results, 1):
    print(f"{i}. [{category}] {headline} (distance: {distance:.4f})")
    print(f"   Description: {description}")
1. [SPORTS] Las Vegas Aces Win First WNBA Title, Chelsea Gray Named MVP (distance: 0.3580)
Description: Las Vegas never had a professional sports champion — until Sunday.
2. [SPORTS] Lucius Fox Makes Sickening Exit Just 2 Pitches Into Game (distance: 0.3966)
Description: The Washington Nationals infielder became a hurler for one gross moment — which was caught on video — against the San Francisco Giants.
3. [SPORTS] Diego Maradona's 'Hand Of God' Jersey Sells For Record Sum (distance: 0.4160)
Description: The Argentine soccer superstar wore the shirt when he scored the controversial goal against England in the 1986 World Cup.
headline_results = search_by_headline(conn, "artificial intelligence", top_k=3)
description_results = search_by_description(conn, "artificial intelligence", top_k=3)
for i, (article_id, headline, category, description, distance) in enumerate(headline_results, 1):
    print(f"{i}. [{category}] {headline} (distance: {distance:.4f})")
print()
for i, (article_id, headline, category, description, distance) in enumerate(description_results, 1):
    print(f"{i}. [{category}] {description} (distance: {distance:.4f})")
1. [TECH] Google Engineer On Leave After He Claims AI Program Has Gone Sentient (distance: 0.3774)
2. [CULTURE & ARTS] How Ani Liu Is Brilliantly Disguising Her Art As Science (distance: 0.5151)
3. [ENTERTAINMENT] Andrew Garfield Confirms Method Acting Is Possible Without 'Being An Asshole' (distance: 0.5217)
1. [TECH] Artificially intelligent chatbot generator LaMDA wants “to be acknowledged as an employee of Google rather than as property," says engineer Blake Lemoine. (distance: 0.4230)
2. [POLITICS] The House Intelligence Committee hearing follows a 2021 report of a possible national security "challenge" from UFOs. (distance: 0.4950)
3. [U.S. NEWS] The American Academy of Pediatrics says it is putting all its guidance under the microscope to eliminate “race-based” medicine and resulting health disparities. (distance: 0.4967)

Embeddings and cosine distance let us run semantic search over the headline and short_description fields of the dataset, surfacing results that are semantically similar to the query even when they don’t contain the query text (or any part of it) verbatim.