In the world of artificial intelligence, vectors are as crucial as gold was to King Midas. However, just as Midas discovered the downside of his golden touch, data scientists must be cautious when applying cosine similarity to vectors. Embeddings capture similarities, but sometimes the wrong ones: they may match questions to other questions rather than to answers, or pick up on writing style rather than meaning. This article explores how to use similarity metrics deliberately, for better outcomes.

Understanding Embeddings

Embeddings map entities to vectors so that machine learning models can work with their relationships numerically. Popular examples include word2vec for words and node2vec for graph nodes. The focus here, however, is on sentence embeddings from Large Language Models (LLMs), which capture the meaning of a text out of the box, without fine-tuning. Powerful as these embeddings are, using them responsibly also means paying attention to data privacy, especially when text is sent to third-party APIs.

Example of Sentence Comparisons

Consider three sentences:
A: “Python can make you rich.”
B: “Python can make you itch.”
C: “Mastering Python can fill your pockets.”

By string similarity (Levenshtein edit distance), A and B differ by just two characters, while A and C are 21 characters apart. Semantically, however, A is much closer to C than to B. With OpenAI's text-embedding-3-large, the cosine similarities align with this semantic closeness, showing that the embeddings capture meaning rather than spelling.
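The string-level gap is easy to check with a classic Levenshtein distance; a minimal sketch follows (the embedding similarities themselves require a call to an embedding API such as OpenAI's and are not reproduced here):

```python
def levenshtein(s: str, t: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (cs != ct)))    # substitution
        prev = curr
    return prev[-1]

a = "Python can make you rich."
b = "Python can make you itch."
c = "Mastering Python can fill your pockets."

print(levenshtein(a, b))  # 2 -- A and B are nearly identical as strings
print(levenshtein(a, c))  # much larger, despite the closer meaning
```

Edit distance treats text as raw characters, which is exactly why it ranks B closer to A than C is.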

The Nature of Cosine Similarity

Cosine similarity is an easy measure to reach for: it is the cosine of the angle between two vectors, or equivalently their dot product divided by the product of their norms. The geometric intuition fades in high-dimensional spaces, though. And despite its simplicity, it can mislead: for embedding vectors the values typically land between 0 and 1, but they are not probabilities, and a given value has no absolute meaning on its own.
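The definition fits in a few lines; a minimal sketch in plain Python:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))   # 0.0 -- orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))   # ~1.0 -- same direction
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 -- opposite direction
```

Note that the full range is -1 to 1; embedding vectors merely tend to land in the upper part of it.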

Cosine Similarity and Correlation

Pearson correlation and cosine similarity become identical once the vectors are centered (mean subtracted); normalizing them to unit length further reduces both to a plain dot product. Practical embedding pipelines often skip the centering step and rely on dot products of normalized vectors. When using cosine similarity, it's crucial to understand whether the model you rely on was actually trained with this measure in mind.
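The relationship can be made concrete in a few lines; a sketch with made-up numbers, where Pearson correlation is computed as the dot product of centered-and-normalized vectors:

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

u = [1.0, 3.0, 2.0, 5.0]
v = [2.0, 2.0, 4.0, 6.0]

# Pearson correlation: center, normalize, then take the dot product.
pearson = dot(normalize(center(u)), normalize(center(v)))
# Cosine similarity: the same expression without the centering step.
cosine = dot(normalize(u), normalize(v))
```

For uncentered vectors the two quantities generally differ; they coincide exactly when the inputs already have zero mean.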

Limitations and Alternatives

Cosine similarity is well-founded when the model was trained with it, that is, when the cost function directly optimized it. For models trained with a different objective, cosine similarity between their vectors may still correlate with relatedness, but there is no guarantee it measures what you need. Moreover, "similarity" itself depends on context: similar style, similar topic, or a question and its matching answer are different notions, which complicates reuse of one metric across domains.

Examples of Misuse

A practical example: querying "What did I do with my keys?" against a collection of notes. Cosine similarity with various off-the-shelf embeddings tends to return another, similarly phrased question rather than the note that answers it. The problem worsens with large datasets, where many near-duplicate phrasings compete with the actual answer, underscoring the need for more robust solutions.
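The failure mode can be sketched with a toy nearest-neighbor lookup. The 2-d vectors below are hand-picked for illustration, not real embeddings; they merely mimic how a paraphrased question can sit closer to the query than the answer does:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u))
                  * math.sqrt(sum(x * x for x in v)))

# Hand-picked toy vectors, NOT model output: axis 0 loosely encodes
# "question about keys", axis 1 "statement about a key location".
entries = [
    ("Where are my keys?",                      [0.95, 0.15]),
    ("You left the keys on the kitchen table.", [0.50, 0.90]),
]
query = ("What did I do with my keys?", [1.00, 0.10])

# Rank entries by cosine similarity to the query vector.
ranked = sorted(entries, key=lambda e: cosine(query[1], e[1]), reverse=True)
print(ranked[0][0])  # the paraphrased question outranks the actual answer
```

With these vectors the stylistically similar question wins, even though the second entry is the one the user wants.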

Effective Approaches

The most direct method is to ask an LLM itself whether two entries match. This works well but can be impractical for extensive datasets because of cost and latency. A more scalable alternative is task-specific embeddings, developed through fine-tuning or transfer learning, so that the notion of similarity is optimized for the data and task at hand.
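The LLM-query approach amounts to building a relevance prompt per candidate entry. A minimal sketch of such a template follows; the wording and the model call itself (any LLM API) are assumptions left to the reader, and note that scoring n entries this way costs n model calls:

```python
def relevance_prompt(query: str, entry: str) -> str:
    # Hypothetical prompt template for LLM-as-judge relevance checks;
    # the actual model call and wording are implementation choices.
    return (
        "Does the entry below answer the query? Reply YES or NO.\n"
        f"Query: {query}\n"
        f"Entry: {entry}\n"
    )

prompt = relevance_prompt(
    "What did I do with my keys?",
    "You left the keys on the kitchen table.",
)
print(prompt)
```

The per-query cost is what makes this approach expensive at scale, and why precomputed task-specific embeddings are the usual compromise.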

Enhancing Accuracy with Preprocessing

Embedding accuracy improves with preprocessing: rewriting text to focus on content rather than style, or extracting the essential information from a conversation before embedding it. In practice this often means converting entries into a standardized, structured format aligned with what users actually ask about, a technique that has proven beneficial across numerous projects.
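In practice this rewriting step is often done with an LLM; the crude, rule-based stand-in below only illustrates the idea of stripping stylistic noise before embedding (the filler-word list is a made-up example):

```python
import re

# Illustrative filler words; a real pipeline would use an LLM rewrite
# or a much richer normalization step.
FILLER = {"um", "uh", "well", "basically", "like"}

def preprocess(text: str) -> str:
    # Lowercase, strip punctuation, and drop filler words so the
    # downstream embedding focuses on content rather than style.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(w for w in words if w not in FILLER)

print(preprocess("Well, um, Python can basically make you rich!"))
# -> python can make you rich
```

Two stylistically different phrasings of the same fact now map to much closer strings, and hence to closer embeddings.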

Conclusion

Cosine similarity is a valuable tool, but it must be applied deliberately: check whether the model was trained for this measure, consider task-specific embeddings, and preprocess text before embedding it. These strategies turn vector similarity into meaningful comparisons and robust solutions in real-world applications.

Appreciation for insights and feedback goes to the Warsaw AI Breakfast community, Python Summit 2024 Warsaw, and contributors on LinkedIn. Further exploration on this topic is available in related blog posts.