Vector databases: Shiny object syndrome and the case of a missing unicorn


Welcome to 2024, where if you’re not riding the generative AI wave, you might as well be stuck in 2022 – practically ancient history in the AI timeline. Every organization has an AI roadmap now, from AI pillows to AI toothbrushes, and if you still have not hurriedly put a plan together, let me suggest a three-step roadmap for you.

Step 1: Assemble a team that’s completed the Andrew Ng course, because nothing says cutting-edge like a certificate of completion.

Step 2: Get the API keys from OpenAI. No, you cannot call ChatGPT, it is not a thing.

Step 3: Vector database, embeddings, tech sorcery!

Now, let the show begin: Dump all the data into the vector DB, add a bit of RAG architecture, sprinkle in a bit of prompt engineering, and voila! The gen AI wave has officially arrived in your company. Now, sit back, relax and enjoy the suspenseful waiting game for the magic to happen. Waiting, waiting… still waiting. Ah, the sweet anticipation of gen AI greatness!

In the chaotic sprint to embrace gen AI and its seemingly straightforward large language model (LLM) architectures, the hiccup comes when organizations forget about use cases and start chasing technology. When AI is your hammer, every problem appears solvable. 

Figure 1: Word embeddings, the seasoned veterans with a longer history, quietly stand amidst the limelight on LLMs and the distant cousins, vector DBs

And while LLMs and vector databases seem to be on-trend (Taylor Swift is trendier), the notion of vector-based representations, crucial in modern natural language processing, has deep roots.

Word Associations: Looking back at “Who wants a million dollars?”

George Miller’s book Language and Communication, published in 1951 and deriving from his earlier works, expands the concept of distributional semantics. Miller suggested that words appearing in similar contexts likely have similar meanings, laying the foundation for vector-based representations.

He further demonstrated that associations between words have strengths, stating, “On a more molecular level, ‘I’ seems to vary widely in strength from instant to instant. It is a very improbable response to ‘Who was the first king of England?’ and a very probable response to ‘Who wants a million dollars?’” While “dog” may readily elicit the associative response “animal,” the association from “animal” to “dog” is weak, as Miller concluded: “The association, like a vector, has both magnitude and direction.”
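Miller’s suggestion that words appearing in similar contexts have similar meanings can be sketched in a few lines. This is a minimal illustration, not Miller’s method: the toy corpus, the window size and the helper names are all assumptions made up for the example. Each word is represented by counts of its neighboring words, and two words are compared by the cosine of the angle between their count vectors.

```python
from collections import Counter
from math import sqrt

# Toy corpus invented for illustration only.
CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = context_vectors(CORPUS)
cat_dog = cosine(vectors["cat"], vectors["dog"])    # share "drinks" and "chases" contexts
cat_fuel = cosine(vectors["cat"], vectors["fuel"])  # share no context words
print(f"cat~dog: {cat_dog:.3f}, cat~fuel: {cat_fuel:.3f}")
```

In this toy corpus “cat” and “dog” end up close together because they occur in the same contexts, while “cat” and “fuel” share none. Modern embedding models learn dense vectors rather than raw counts, but the underlying intuition is the same.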

Word associations go back even further, as can be seen in a study conducted by Kent and Rosanoff in which participants were asked about “the first word that occurs to you other than the stimulus word.”

Figure 2 (Left) Associated word response and its frequency when the stimulus is “chair” by 1,000 men and women (Kent and Rosanoff, 1910). (Right) The top 10 occupations most closely associated with each ethnic group in the Google News embedding.

Thomas K. Landauer and Susan T. Dumais’s paper, “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge,” published in 1997, delves into the details of vector-based representation of concepts. Latent semantic analysis (LSA), which Landauer helped develop, employs mathematical techniques like singular value decomposition to create vector spaces where words with similar meanings are positioned close together. This facilitates efficient computation of semantic relatedness, contributing to tasks such as information retrieval and document categorization.

In 2003, Yoshua Bengio, Réjean Ducharme and Pascal Vincent published “A Neural Probabilistic Language Model,” introducing a neural network model capable of learning word embeddings. This paper marked a notable shift towards neural network-based approaches for word representation and laid the foundation for word2vec, GloVe, ELMo, BERT and the current suite of embedding models.

Vector-based representations of text aren’t something new and have seen constant evolution, but when does the vector DB show start?

When does the Vector DB show start?

The Vector DB space is getting crowded, and each vendor strives to stand out amidst a sea of features. Performance, scalability, ease of use, and pre-built integrations are just a few of the factors shaping their differentiation. However, the crux lies in relevance — getting the right result in a few seconds, or even minutes, is always better than getting the wrong answer at lightning speed.

Delving into the intricacies of strict vector search (never a good idea, see below), the linchpin is approximate nearest neighbor (ANN) search. Vector DBs provide a variety of ANN algorithms, each with its own flavor.

As the terms and the details become fuzzy, the seemingly straightforward LLM architecture doesn’t seem simple anymore. Nonetheless, if the choice were to generate embeddings of your data using OpenAI APIs and retrieve them using the same ANN algorithms, such as HNSW, wouldn’t the relevance (or irrelevance) be the same?
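To make the point concrete, here is a minimal sketch of the exact (brute-force) nearest neighbor search that ANN algorithms approximate. It is not how any particular vector DB is implemented; the three-dimensional vectors and document IDs are hypothetical. An ANN index trades a little of this accuracy for speed, so for a given set of embeddings the brute-force result is the ceiling on relevance, whichever product sits on top.

```python
import heapq
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_top_k(query, index, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)

# Hypothetical 3-dimensional vectors; real embeddings have hundreds of dimensions.
INDEX = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

top = exact_top_k([1.0, 0.1, 0.0], INDEX, k=2)
top_ids = [doc_id for _, doc_id in top]
print(top_ids)
```

The scan is linear in the number of stored vectors, which is why large collections use approximate indexes instead. The ranking itself, though, is decided by the embeddings, not by the index.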

“Can you fix my computer?” No, but I can tell you that bananas are berries and strawberries aren’t.

Let’s dig into how someone might use the system and if turning the data into vectors really adds up. Take this scenario: A user types in a straightforward query such as “Error 221” with the intent to find the manuals that may help in resolution. We do the usual: convert the query into its embedding, fetch the nearest documents using a variation of ANN and score them using cosine similarity. Standard stuff, right? The twist: The results end up giving a document about Error 222 a higher score than the one about Error 221.

Figure 3 Embeddings were created using the sentence transformer model “all-MiniLM-L6-v2”

Yeah, it’s like saying, “Find Error 221,” and the system goes, “Here’s something about Error 222; hope that helps!” Not exactly what the user signed up for. So, let’s not just dive headfirst into the world of vectors without figuring out if it’s the right move.
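One common guard against this failure mode is to blend the similarity score with an exact-token check. The sketch below is an assumption-laden illustration, not a recommendation of specific weights: the document titles, the similarity scores and the `alpha` blend factor are placeholders chosen to mimic the situation described above, not the values behind Figure 3.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens, so '221' and '222' stay distinct."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query, document):
    """Fraction of query tokens that appear verbatim in the document."""
    q = tokens(query)
    return len(q & tokens(document)) / len(q) if q else 0.0

def hybrid_score(semantic, query, document, alpha=0.5):
    """Blend a semantic score with exact-token overlap; alpha is a tunable guess."""
    return alpha * semantic + (1 - alpha) * keyword_overlap(query, document)

# Placeholder similarity scores (hypothetical, not measured): the wrong
# document is given a slightly higher semantic score, as in the scenario above.
DOCS = {
    "Troubleshooting Error 221": 0.80,
    "Troubleshooting Error 222": 0.82,
}

query = "Error 221"
ranked = sorted(DOCS, key=lambda d: hybrid_score(DOCS[d], query, d), reverse=True)
print(ranked)
```

With the exact-token term in the mix, the Error 221 document moves back to the top even though its placeholder semantic score is lower. Production systems typically get the keyword side from a full-text engine rather than a token set, but the blending idea is the same.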

Beyond the hype, what’s the deal?

What’s up with vector databases, anyway? They’re all about information retrieval, but let’s be real, that’s nothing new, even though it may feel like it with all the hype around it. We’ve got SQL databases, NoSQL databases, full-text search apps and vector libraries already tackling that job. Sure, vector databases offer semantic retrieval, which is great, but SQL databases like SingleStore and Postgres (with the pgvector extension) can handle semantic retrieval too, all while providing standard DB features like ACID transactions. Full-text search applications like Apache Solr, Elasticsearch and OpenSearch also rock the vector search scene, along with search products like Coveo, and bring some serious text-processing capabilities for hybrid searching.

But here’s the thing about vector databases: They’re kind of stuck in the middle. They can’t fully replace traditional databases, and they’re still playing catch-up in terms of supporting the text processing features needed for comprehensive search functionality. Milvus considers hybrid search to be merely attribute filtering using boolean expressions! 

Pinecone’s hybrid search comes with a warning as well as limitations, and while some may argue it was ahead of its time, being early to the party doesn’t matter much if the festivities had to wait for the OpenAI revolution a couple of years later.

It wasn’t that early either: Weaviate, Vespa and Milvus were already around with their vector DB offerings, and Elasticsearch, OpenSearch and Solr were ready around the same time. When technology isn’t your differentiator, opt for hype. Pinecone’s $100 million Series B funding was led by Andreessen Horowitz, which in many ways is living by the playbook it created for the boom times in tech. And with all the hype around the AI revolution and gen AI, the gen AI enterprise party still hasn’t started. Time will reveal whether Pinecone turns out to be the case of a missing unicorn, but distinguishing itself from other vector databases will pose an increasing challenge.

Shiny object syndrome

Enterprise search is hard. Rarely does the solution involve simply dumping data into a vector store and expecting miracles to happen. From chunking the PDFs to the right size to setting up the right access controls, everything requires meticulous planning and execution to ensure optimal performance and usability. If your organization’s use case revolves around searching a limited number of documents, scalability might not be a pressing concern. Similarly, if your use case leans heavily towards keyword search, as illustrated in Figure 3, diving into vector implementation may backfire. 

Ultimately, the end user isn’t concerned about the intricacies of whether it’s a vector search, keyword search, rule-driven search or even a “phone a friend” search. What matters most to the user is getting the right answer. Rarely does this come from relying solely on one methodology. Understand your use case, validate your test scenarios and don’t be lured by shiny objects just because they’re popular.

Amit Verma is the head of AI labs and engineering and founding member at Neuron7.
