Similarity Searches: The Neurons of the Vector Database

I recently wrote a Finextra piece entitled 3 GenAI Use Cases for Capital Markets; The Power of the Vector. In it, I discussed the increasing importance of the so-called vector database and vectors more generally to a whole range of quantitative finance applications.

The term vector database, as I discussed in that piece, carries multiple, overloaded meanings, like the words bat, flat, duck or prompt. With GenAI and so-called Large Language Models (LLMs), the term has come to hold the specific meaning of a "memory store" centered around "vector embeddings": model-encoded outputs that follow prescribed mathematical vector formats and dimensions (magnitude, distance, etc.) to allow easy indexing and search when run live. However, for me, as someone brought up in the "vectors as a stream of data to manipulate in one operation" school, which is how I and any R, MATLAB, NumPy, q, or Julia programmer would describe a vector native application (or data type), a vector database can mean something different. That said, something vector native, like any of the MATLAB-like applications referenced, can hold the same vector embeddings too, given sufficient memory.

But I'm not here to semantically deconstruct the term vector database, not this time at least. I do want to explore what happens within them when you prompt, though. Perhaps you ask, "describe my next cat in the style of Demis Roussos," or "draw a picture of my future spouse standing by a bus stop." Your words are searched across those stored memories - vectors - and indexes of those vectors. It’s like finding a book in a big library, using the catalogs within. Such vectors are beautifully geared to "similarity searches," which are relatively simple and respond quickly with context (albeit with a lot of compute when carried out at scale). Then the output gets compiled: the cat description built, the lovely picture of your future spouse by the bus stop drawn, music created, code submitted, or whatever.

Thus vector embeddings aim to capture relevant semantic, contextual, or structural information, with embedding models employing techniques and algorithms appropriate to the type of data dealt with and the key characteristics of the data ultimately represented. Text embeddings, for example, capture the semantic meaning of words and their relationships within a language. They can encode semantic similarities between words, such as "king" being closer to "queen" than to "chicken," with "Elvis" somewhere in between.

Embedding technologies are not new, and mathematical vectors themselves are certainly pretty ancient. I will later refer to Euclid, a Greek gentleman from olden times. Thus the technologies on which Generative AI stands can be said to stand on the shoulders of giants, some of them ancient. A decade ago, and prior to the Christmas 2022 ChatGPT LinkedIn Pokemon-like craze, I ran a Natural Language Processing (NLP) sentiment demo that determined sentiment from Twitter feeds to parameterize trade decisions. I also used NLP to scrutinize the Madoff reports, looking for unusual patterns that signified his fraudulence - over-use of adjectives for example. Under the hood, we made use of the Word2Vec model, which stands for, you guessed it, Word to Vector. This creates dense vector representations that capture semantic relationships by training a neural network to predict words in context. Tools like Ravenpack News Analytics and MarketPsych were frequently used and predicated on such methods. Others are and were available, but I recall these best. They were tested, though perhaps not always production deployed (they didn't always work), on many trading desks. Good times were had by many amidst the NLP hype of a decade ago! But that's the same, or similar, vector thing that goes into your new GenAI-type vector database or vector native processing environment today, as it did with MATLAB for me a decade back.

Today, large language models (LLMs) offer "pre-trained" meaning, which you just run via a prompt, with no need to build locally. Then again, they are big, broad, generalized models, way way way bigger and managing more dimensions than the teeny tiny models I ran a decade ago. You can use them directly, as you probably do with ChatGPT, or, if appropriately tokenized, take the model output vector embeddings into a vector database. This gives you control to apply and augment with your own data, manage prompts, facilitate additional embeddings for new data, and, when managed well, apply "guardrails" against those hallucinations everyone warned you about on LinkedIn.

The embeddings and stores of meaning do matter, but for the remainder of this blog I want to focus on the searches that expedite meaning, create information, and, ideally, answer hard questions that add value to your organization. I liken such search and “similarity search”-type processes to neurons kicking in, infusing the vector database with proper on-the-fly intelligence. The interesting thing here is that the traditional search and similarity search techniques - or neurons as I think of them - are not new to finance, or to anyone who has used a search engine, deployed a tool like ElasticSearch, Solr or the Lucene project that underpins them, or used any sort of recommendation engine - think Netflix, Spotify, Amazon.

So let's dive in. Some maths will follow, but hopefully it gets explained simply enough.

As noted, by understanding the similarity between vectors, we understand similarity across the data objects themselves. Similarity measures help to understand relationships, identify patterns, and make informed decisions, for example:

  • Anomaly Detection: Identify deviations from normal patterns
  • Clustering and Classification: Cluster similar data points or classify objects into distinct categories, grouping together similar points
  • Information Retrieval: Using search engines to measure the similarity between user queries and indexed documents to retrieve the most relevant results
  • Recommendation Systems: Find similar items or products to recommend based on user preferences

The similarity measure you choose depends on the nature of the data and the specific application at hand. Your data scientists can best advise. Below I describe three commonly used measures, their strengths and weaknesses, and outline how I see them deployed in financial services. Given my experience, that normally means quantitative finance, capital markets, risk management, and fraud detection. I'm not in any way suggesting you pick up a vector database tomorrow and change all your workflows, but I am trying to illuminate and de-mystify some quite complicated mathematical names to show how, in plain terms, they're sensible, actually quite simple and pretty commonplace already.

1) Euclidean distance assesses the similarity of two vectors by measuring the straight-line distance between the two vector points. Vectors that are more similar will have a shorter absolute distance between them, while dissimilar vectors have a larger distance between one another. The measure reflects both relative magnitude and direction. The same calculation extends beyond 2 or 3 dimensions (i.e. more than you can visualize on a regular 3 dimensional plot), where it is known as the "L2 norm" of the difference between the two vectors, and vectors are often normalized to unit length first so that magnitude alone does not dominate.
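
As a minimal sketch of that calculation, the Python function below takes the square root of the summed squared differences between corresponding elements. The function name and the sample vectors are illustrative assumptions, not taken from any particular vector database.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-dimensional points forming a 3-4-5 right triangle
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

The same function works unchanged for vectors of any length, which is why the measure carries over to embeddings with hundreds or thousands of dimensions.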

Euclidean distance tends to apply to applications like:

  • Clustering Analysis: Clustering, like k-means, groups data points based on their proximity in vector space. Clustering analysis applications are well established in index calculations and credit scoring, and (with some variability) in ESG analyses.
  • Anomaly and Fraud Detection: Here, unusual data points get detected through unusually large distances from the centroid of normal transactions. Applications in finance are ubiquitous: they range from anti-money laundering and insider dealing to credit card transaction fraud and fraudulent loan applications.
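
The centroid idea in the fraud detection bullet can be sketched in a few lines of Python. Everything here is a toy assumption for illustration: the (amount, hour-of-day) features, the sample transactions and the threshold of 50 are all hypothetical, and a real system would scale its features and calibrate the threshold on historical data.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def flag_anomalies(points, reference, threshold):
    """Return the points lying further than `threshold` from the centroid of `reference`."""
    center = centroid(reference)
    return [p for p in points if euclidean_distance(p, center) > threshold]

# Hypothetical "normal" transactions as (amount, hour-of-day) pairs
normal = [[20.0, 9.0], [25.0, 10.0], [22.0, 11.0], [21.0, 10.0]]
incoming = [[23.0, 10.0], [950.0, 3.0]]

print(flag_anomalies(incoming, normal, threshold=50.0))  # [[950.0, 3.0]]
```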

2) The dot product is a simple measure used to see how aligned two vectors are with one another, a bit like a score. It tells us if the vectors point in the same direction, in opposite directions, or are perpendicular to each other. It is calculated by multiplying the corresponding elements of the vectors and adding up the results to get a single scalar number. It lends itself well to applications such as:

  • Image Retrieval and Matching: Images with similar visual content will have closely aligned vectors, resulting in higher dot product values. This makes the dot product a good choice when you want to find images similar to a given query image. It could be useful for digital activities such as signature verification.
  • Neural Networks and Deep Learning: In neural networks, fully connected layers use the dot product to combine input features with learnable weights. This captures relationships between features and is helpful for tasks like classification and regression. Their use for financial modeling of multiple types is well documented. My oddball one is identifying cars in supermarket and hotel car parks from satellite images, which we counted, distributing as a data set through alternative data providers and onto hedge funds. Happy though stressful times!
  • Portfolio Recommendation: Dot product similarity helps identify assets with similar characteristics, making it valuable in portfolio recommendation systems. Robo-advisors, anyone?
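
A minimal Python sketch of the dot product calculation described above follows: multiply corresponding elements and add up the results. The function name and example vectors are illustrative assumptions.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors chosen to show the three cases
print(dot_product([1.0, 2.0], [2.0, 4.0]))    # 10.0 -> same direction (positive)
print(dot_product([1.0, 2.0], [-1.0, -2.0]))  # -5.0 -> opposite directions (negative)
print(dot_product([1.0, 0.0], [0.0, 3.0]))    # 0.0  -> perpendicular
```

Note that the score grows with the size of the values as well as with alignment, so two long vectors pointing roughly the same way can outscore two short vectors pointing exactly the same way.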

3) Cosine similarity measures the similarity of two vectors by using the angle between these two vectors. The magnitude of the vectors themselves does not matter and only the angle is considered in this calculation, so if one vector contains small values and the other contains large values, this will not affect the resulting similarity value.
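
To illustrate the point about magnitude, here is a minimal Python sketch that divides the dot product by the product of the two vector lengths, which gives the cosine of the angle between them. The function name and the sample vectors are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical vectors: same direction, very different magnitudes
small = [1.0, 2.0, 3.0]
large = [100.0, 200.0, 300.0]

print(round(cosine_similarity(small, large), 6))  # 1.0
```

A result of 1 means the vectors point the same way, 0 means they are perpendicular, and -1 means they point in opposite directions.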

Cosine similarity, with its premise that "similar vectors will likely point in the same direction," therefore contrasts nicely with the Euclidean "as-the-crow-flies" distance. It thus applies well to use cases such as:

  • Topic Modeling: In document embeddings, each dimension can represent a word's frequency. Two documents of different lengths can have drastically different word frequencies yet the same word distribution. Since this places them in similar directions in vector space but at different distances, cosine similarity is a great choice. Think of noting sentiment in tweets, like my trading example earlier, and possibly concentration analysis in portfolio management and compliance monitoring from, say, functional specification documents that insist the portfolio stays within certain rules, for example excluding or including particular sectors, types or geographies of assets. Word2Vec was a great library for topic modeling and still is.
  • Document Similarity: Another application of Topic Modeling, and also of Word2Vec from the good old days! Similar document embeddings have similar directions but can have different distances. Think of the exaggerated use of adjectives in exaggerated (perhaps fraudulent) financial reporting, like my Madoff example earlier. As it happened, he did not use more adjectives than normal - we recognized the fraud in valuation-related anomalies rather than textual ones - but we tested for it because it is a common characteristic of frauds. Two great related phrases to throw into your next dinner conversation are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), with the latter particularly prominent for document similarity. Read Baeldung for the details.
  • Collaborative Filtering: An approach in recommendation systems which uses the collective preferences and behaviors of users (or items) to make personalized recommendations based on their interactions. Since overall ratings and popularity can create different distances while the direction of similar vectors remains close, cosine similarity is often used. Think market infrastructure models and agent-based modeling perhaps.
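
The document-length point in the bullets above can be made concrete with a toy Python sketch. It builds simple word-count vectors (a deliberate simplification, not a real embedding model) for a short text and for the same text repeated ten times, then compares the two measures. The sample text and helper names are hypothetical.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Word-frequency vector for `text`, one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Hypothetical documents: same word distribution, very different lengths
short_doc = "rates rise bonds fall"
long_doc = " ".join([short_doc] * 10)
vocabulary = sorted(set(short_doc.split()))

a = count_vector(short_doc, vocabulary)  # [1, 1, 1, 1]
b = count_vector(long_doc, vocabulary)   # [10, 10, 10, 10]

print(euclidean_distance(a, b))  # 18.0 -> far apart by distance
print(cosine_similarity(a, b))   # 1.0  -> identical by direction
```

By distance the two documents look quite different, while by direction they are indistinguishable, which is the behavior you want when length should not count against similarity.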

Now, there's much more, to which I will return in a later blog: the role of indexes and index search, and the application of the other types of vectors I alluded to, the sequences of data, like time-series information, that can be operated on for speed, simplicity and efficiency. Some of this I talk about in 3 GenAI Use Cases for Capital Markets; The Power of the Vector. But I shall return.

A final comment. It's okay to be confused by this stuff. I spoke with two exceptionally qualified quants this week. Both admitted to being completely overwhelmed by the changes taking place right now in our industry with GenAI. I totally feel the same way. On the flip side, the hype cycle obfuscates, and sometimes what lies beneath is shallower than it might appear. I hope my article helps simplify. Let me know.

 

 

With thanks to my colleagues Nathan Crone and Neil Kanungo. Their great article, How Vector Similarity Drives Contextual Search, inspired this one. If there are faults in my interpretation, those faults are mine alone, and any opinions expressed are mine alone and not those of my employer. Thanks also to PJ O’Kane for his thoughtful review.

 
