Analyzing AI and Machine Learning in E-commerce Search Relevance
Leveraging Local LLMs for Search Relevance
In the quest to enhance search relevance in e-commerce, a local large language model (LLM) is employed to judge product relevance for specific search queries. By comparing the LLM's preferences to human labels from the Wayfair open dataset (WANDS), the goal is to use a laptop as a cost-effective search relevance evaluator, bypassing the hefty costs associated with the OpenAI API.
The primary objective is not to replace human labels but to quickly flag anomalies or promising results. This post explores how multiple basic LLM decisions regarding product attributes can culminate in a smarter decision, comparing this approach with using comprehensive product metadata.
Example Prompt for Evaluation
Consider the following example, where a search relevance judge running on a laptop evaluates product names for the query “entrance table”:
- Product LHS name: Aleah coffee table
- Product RHS name: Marta coffee table
- Response: Neither
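A pairwise judgment like the one above can be reduced to two small pieces: building the prompt and normalizing the model's reply. The sketch below is a hypothetical illustration; the exact prompt wording and the hook into a local LLM are assumptions, not the setup used in the experiment.

```python
# Hypothetical prompt builder and reply parser for a pairwise judge.
# A real run would send the prompt to a locally hosted LLM; here only
# the string handling around that call is shown.

def build_prompt(query: str, lhs_name: str, rhs_name: str) -> str:
    return (
        f"Which product is more relevant to the search query '{query}'?\n"
        f"Product LHS name: {lhs_name}\n"
        f"Product RHS name: {rhs_name}\n"
        "Respond with exactly one of: LHS, RHS, Neither."
    )

def parse_preference(response: str) -> str:
    """Normalize a free-form reply to one of LHS / RHS / Neither."""
    answer = response.strip().split()[0].rstrip(".,").upper()
    if answer in ("LHS", "RHS"):
        return answer
    return "Neither"

prompt = build_prompt("entrance table",
                      "Aleah coffee table", "Marta coffee table")
print(parse_preference("Neither"))  # -> Neither
```

Forcing the reply into a closed vocabulary keeps downstream aggregation simple, since every judgment becomes one of three tokens.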
Combining Decisions for Accuracy
After gathering 1,000 agent preferences, their effectiveness at predicting relevance is compared against human ratings. The analysis covers several prompt variants:
- Forcing a decision or allowing “Neither / I don’t know”
- Single pass vs. double-checking by swapping LHS and RHS
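The double-checking variant can be sketched as asking the judge twice with the products swapped and keeping the preference only when the two answers are consistent. The `stub_judge` below is a stand-in for a real local-LLM call, included only so the logic is runnable.

```python
# Swap double-check: a judge free of positional bias should flip its
# answer when LHS and RHS are reversed. Inconsistent pairs are
# discarded as "Neither".

def double_checked_preference(judge, query, product_a, product_b):
    first = judge(query, product_a, product_b)    # product_a as LHS
    second = judge(query, product_b, product_a)   # product_a as RHS
    flipped = {"LHS": "RHS", "RHS": "LHS", "Neither": "Neither"}
    if flipped[first] == second:
        return first      # stable preference, keep it
    return "Neither"      # positional bias detected, discard

# Stand-in judge that prefers whichever side mentions "sofa table".
def stub_judge(query, lhs, rhs):
    if "sofa table" in lhs:
        return "LHS"
    if "sofa table" in rhs:
        return "RHS"
    return "Neither"

print(double_checked_preference(stub_judge, "entrance table",
                                "Mersey sofa table",
                                "Marta coffee table"))  # -> LHS
```

This trades recall (some judgments get thrown away) for precision (the ones kept are order-stable).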
Prompts crafted for four product attributes include:
- Product name
- Product taxonomic categorization
- Product classification
- Product description
Sample Product Description Prompt
For the same query “entrance table”:
- Product LHS description: This coffee table is perfect for your entrance…
- Product RHS description: You’ll love this table from Lazy Boy…
- Response: LHS
Insights from Permutations
Experimenting with permutations, such as which field is judged and whether decisions are double-checked, reveals varying levels of precision and recall. The table below details these results:
Creating an Overall Decision System
Integrating individual attribute decisions into a comprehensive decision system is proposed, leveraging an ensemble approach where aggregated votes guide the final decision. The repeated analysis over thousands of query-product pairs reveals systemic patterns, suggesting a machine learning classification problem using decision trees or similar classifiers.
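One minimal way to picture this ensemble, under the assumption that each of the four attribute agents casts one vote, is to encode the votes as numeric features and take a simple aggregated decision. The field names and encoding here are illustrative, not the exact scheme from the experiment.

```python
# Sketch: turn per-attribute LLM preferences into a feature vector for
# a downstream classifier, plus a naive majority-vote baseline.

from collections import Counter

VOTE_ENCODING = {"LHS": 1, "Neither": 0, "RHS": -1}

def votes_to_features(votes: dict) -> list:
    """Encode per-attribute preferences as numeric classifier features."""
    fields = ["name", "category", "classification", "description"]
    return [VOTE_ENCODING[votes[f]] for f in fields]

def majority_decision(votes: dict) -> str:
    """Aggregated vote: the most common preference wins."""
    return Counter(votes.values()).most_common(1)[0][0]

votes = {"name": "LHS", "category": "LHS",
         "classification": "Neither", "description": "LHS"}
print(votes_to_features(votes))   # -> [1, 1, 0, 1]
print(majority_decision(votes))   # -> LHS
```

The learned classifier mentioned above replaces `majority_decision` once enough labeled query-product pairs are available, since it can weight reliable attributes (say, description) more heavily than noisy ones.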
Building a Training Pipeline
A script is used to gather extensive LLM evaluations, serving as features for training models. A basic Scikit-learn decision tree classifier attempts to predict human preferences from these agent preferences, striking a balance between precision and recall via probabilistic thresholds.
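A minimal sketch of that training step, under the assumption that agent votes are encoded as -1/0/1 features and the human label is the binary target, might look like the following. The data is synthetic and the threshold value is illustrative.

```python
# Sketch: fit a scikit-learn decision tree on encoded agent votes and
# use a probability threshold to trade precision against recall.

from sklearn.tree import DecisionTreeClassifier

# Rows: [name_vote, category_vote, classification_vote, description_vote]
X = [[1, 1, 0, 1], [1, 0, 1, 1], [-1, -1, 0, -1],
     [0, -1, -1, 0], [1, 1, 1, 1], [-1, 0, -1, -1]]
y = [1, 1, 0, 0, 1, 0]  # synthetic human relevance labels

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Raising the threshold above 0.5 favors precision over recall: only
# confident "relevant" predictions survive.
threshold = 0.7
proba = clf.predict_proba([[1, 0, 0, 1]])[0][1]  # P(relevant)
prediction = int(proba >= threshold)
```

Sweeping `threshold` over the predicted probabilities is what produces the precision/recall trade-off described above.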
Script examples illustrate various feature combinations and their predictive accuracy, opening discussions on the potential need for more sophisticated models like gradient boosting.
Conclusion
This approach demonstrates that rudimentary decision-making by LLMs, when aggregated, can yield intelligent outcomes with traditional machine learning methods. Such local LLMs could serve as feature generators, offering insights into the rationales behind human labels and aiding in the strategic development of search solutions.
Doug Turnbull
Explore more from Doug’s work at OpenSource Connections and the Shopify Engineering Blog.