Week 11: Vector Databases

DSAN 6000: Big Data and Cloud Computing
Fall 2025

Authors: Amit Arora and Jeff Jacobs
Published: Monday, November 10, 2025


Vector Spaces: The Geometry of AI!

  • Why vector spaces over keyword matching?
  • What if words have different meanings in different contexts?

Word Counts: Good Enough?

Just four keywords per section:

Section       Keywords
U.S.          state, court, federal, republican
World         government, country, officials, minister
Arts          music, show, art, dance
Sports        game, league, team, coach
Real Estate   home, bedrooms, bathrooms, building

For each article, vote for section with highest keyword count:

            Arts    Real Estate   Sports   U.S. News   World News   Total
Correct     3020    690           4860     1330        1730         11630
Incorrect   750     60            370      1100        590          2870
Accuracy    0.801   0.920         0.929    0.547       0.746        0.802
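The voting scheme above can be sketched in a few lines. This is a minimal illustration, assuming simple whitespace tokenization; the section names and keyword sets come from the table above.

```python
from collections import Counter

# Keyword lists per section (from the table above)
SECTION_KEYWORDS = {
    "U.S.": {"state", "court", "federal", "republican"},
    "World": {"government", "country", "officials", "minister"},
    "Arts": {"music", "show", "art", "dance"},
    "Sports": {"game", "league", "team", "coach"},
    "Real Estate": {"home", "bedrooms", "bathrooms", "building"},
}

def classify(article: str) -> str:
    """Vote for the section whose keywords appear most often in the article."""
    counts = Counter(article.lower().split())
    scores = {
        section: sum(counts[kw] for kw in keywords)
        for section, keywords in SECTION_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

print(classify("The team won the game in the league playoffs, said the coach"))
# → Sports
```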

What’s Missing? Context!

You shall know a word by the company it keeps. (Firth 1957)

Article A (World News):

President Bush said he was trying to convince President Vladimir Putin of Russia that cooperation was “in Russia’s security interests,” even though Defense Secretary Robert M. Gates did not win Mr. Putin’s support during a trip to Moscow last week.

“Bush Steps Up Effort to Persuade Putin on Missile Defense Plan”, NYTimes, 1 May 2007.

Article B (U.S. News):

President Bush began his day yesterday at dawn on the golf course. He began Saturday on the golf course, too. A weekend earlier, the president played two rounds of 18 holes on the course at Andrews Air Force Base just outside Washington.

“White House Letter; Bush Makes Quick Work of Relaxing”, NYTimes, 5 August 2002.

Basic Context Engineering

  • What exactly do we mean when we say “context”?
  • Different training objectives give different types of contexts!
  • Example: Closest vectors to Turing, trained on Wikipedia in two ways:


  Dependency Tree        5-Word Window
  (Paradigmatic:         (Syntagmatic:
  same category)         topically related)
  ─────────────────      ─────────────────
  Pauling                nondeterministic
  Hotelling              computability
  Lessing                deterministic
  Hamming                finite-state

Example from Uber AI Labs’ amazing Piero Molino
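The linear-window side of this contrast is easy to sketch (the dependency-tree side requires a syntactic parser, so it is omitted here). A minimal context extractor, assuming pre-tokenized input:

```python
def window_contexts(tokens, k=5):
    """For each word, collect up to k tokens on each side as its contexts."""
    contexts = {}
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - k):i]
        right = tokens[i + 1:i + 1 + k]
        contexts.setdefault(word, []).extend(left + right)
    return contexts

sent = "turing proved the halting problem is undecidable".split()
print(window_contexts(sent, k=2)["turing"])
# → ['proved', 'the']
```

Training on these (word, context) pairs yields the syntagmatic, topically related neighbors shown in the right column; dependency-based contexts yield the paradigmatic neighbors on the left.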

Pretentious But Helpful Terms!

  • Paradigmatic \(\approx\) Grammatical “substitutes”, drawn from same category/type
  • Syntagmatic \(\approx\) semantic relationships between words in a sentence


    apples  ── Syntagmatic ──  sweet
      ↕ Paradigmatic            ↕ Paradigmatic
    oranges ── Syntagmatic ──  sour

These \(\underline{\hspace{32mm}}\) taste very \(\underline{\hspace{32mm}}\)!

(Both terms from Saussure (1916))

How Do We Actually Get These Vectors?

  • High-level goal: Retain information about word-context relationships while reducing the \(M\)-dimensional representation of each word down to a much smaller number of dimensions (here, \(K = 3\)).
  • Low-level goal: Generate a rank-\(K\) matrix \(\mathbf{W}\) that best approximates distances between words in the \(M\)-dimensional space (rows of \(\mathbf{X}\))
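One standard way to get such a rank-\(K\) approximation is truncated SVD, which is optimal in the least-squares (Frobenius) sense. A minimal sketch on a toy word-context count matrix; the matrix shape and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6000)  # arbitrary seed
# Toy word-context count matrix X: 6 words x M=8 context features
X = rng.poisson(2.0, size=(6, 8)).astype(float)

# Truncated SVD: W = U_K diag(S_K) Vt_K is the best rank-K
# approximation to X under the Frobenius norm
K = 3
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :K] @ np.diag(S[:K]) @ Vt[:K, :]

# K-dimensional word embeddings: rows of U_K scaled by the singular values
embeddings = U[:, :K] * S[:K]
print(embeddings.shape)
# → (6, 3)
```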

Now We Put These Embeddings (Vectors) into Databases

  • The goal: identify similar objects, i.e., vectors “close to” the location (in vector space) of a given query
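The core operation a vector database performs is nearest-neighbor retrieval under a similarity metric, most commonly cosine similarity. A brute-force sketch (real systems use approximate indexes such as HNSW to scale):

```python
import numpy as np

def cosine_top_k(query, vectors, k=2):
    """Return indices of the k vectors closest to query by cosine similarity."""
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                      # cosine similarity to each stored vector
    return np.argsort(-sims)[:k]      # indices, most similar first

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(cosine_top_k(np.array([1.0, 0.05]), docs))
# → [0 1]
```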

Vector Search Workflow

Unstructured Data \(\leadsto\) Structured Geometric Query Space


More Data \(\leadsto\) “Increasing Returns”

Learn about vehicles from one dataset…

Learn about foods from another dataset…

Learn about color from both 🤯

Queries Embedded via the Same Model!
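The key invariant: documents and queries must pass through the same embedding function, or distances in the index are meaningless. A toy sketch, using a hypothetical letter-frequency "model" as a stand-in for a trained neural encoder:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy 'embedding model': normalized letter-frequency vector.
    A real system would use a trained neural encoder here."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / (np.linalg.norm(v) or 1.0)

docs = ["jazz music and dance", "playoff game tonight", "federal court ruling"]
doc_vecs = np.stack([embed(d) for d in docs])   # index time

# Query time: the SAME embed() used to index the documents
query_vec = embed("court decision")
best = int(np.argmax(doc_vecs @ query_vec))
print(docs[best])
# → federal court ruling
```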

Lab Time!

References

Firth, John Rupert. 1957. Papers in Linguistics, 1934-1951. Oxford University Press.
Saussure, Ferdinand de. 1916. Course in General Linguistics. Open Court.