
[RFC] OpenSearch Search Quality Evaluation Framework #15354

Open · jzonthemtn opened this issue Aug 22, 2024 · 5 comments

Labels: discuss, enhancement, feedback needed, RFC, Search:Relevance, Search:User Behavior Insights

Comments

jzonthemtn commented Aug 22, 2024

[RFC] OpenSearch Search Quality Evaluation Framework

Introduction

User Behavior Insights (UBI) provides OpenSearch users with the ability to capture user behavior data that can be used to improve search relevance. Implemented as an OpenSearch plugin, UBI can connect queries with user behaviors per its defined schema. This data provides insight into relevance judgments that can be derived from observed user behaviors.

This RFC proposes development of an evaluation framework that uses the UBI-collected data to improve search result quality through the calculation of implicit judgments.

Thanks to the following collaborators on this RFC:

Problem Statement

As a search relevance engineer, understanding the quality of search results over time, as changes to data, algorithms, and the underlying platform occur, is extremely difficult, yet also critical to building robust search experiences. This is a common, long-standing problem that is notoriously difficult to solve. This is especially true for small organizations or any organization without a dedicated search team. Collecting the data and making effective use of it can be a time-consuming activity.

With the collected user data being the source of implicit judgments, there are challenges to keep in mind when calculating them, e.g., position bias (users tend to click on documents presented at the top) and presentation bias (users cannot click on what is not presented, so no data is collected for it).

Proposal

We propose developing a framework for evaluating search quality by calculating implicit judgments based on data collected by UBI to optimize and improve search result quality. Ultimately, we would like for the framework to perform automatic optimization by consuming UBI data, calculating implicit judgments and then providing search tuning without manual interaction. This automation will help make the functionality usable by organizations of all sizes.

We propose modeling implicit judgments on the statistic “Clicks Over Expected Clicks” (COEC) (H. Cheng and E. Cantú-Paz, 2010). We chose this model for its confirmability and for how it deals with position bias, a bias omnipresent in search applications.

For teams that already have an approach to calculating implicit judgments, or that want to calculate implicit judgments with a different approach, we will provide ways to integrate those judgments into the framework. In the future we envision supporting extensions that enable such calculations inside the framework.

Collecting Implicit Judgments

The UBI plugin already captures the information needed to derive implicit judgments, storing the information in two OpenSearch indexes: ubi_queries for search requests and search responses, and ubi_events for events.

Data Transformation and Calculations

The data required includes the query, the position of the search result, whether or not the search result was clicked, and a user-selectable field whose value consistently and uniquely identifies the search result. The data to be used, whether collected by UBI or not, will need to be transformed into this format.
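To make this concrete, a single transformed record might look like the following sketch (the field names here are illustrative only, not a finalized schema):

```python
# Illustrative only: field names are hypothetical, not part of a finalized schema.
# One record describes one search result impression and whether it was clicked.
example_record = {
    "user_query": "laptop backpack",  # the query issued by the user
    "object_id": "B01ABCD123",        # user-selectable field uniquely identifying the result
    "position": 3,                    # rank at which the result was shown (1-based)
    "clicked": True,                  # whether the user clicked this result
}
```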

These operations will take place outside of OpenSearch to avoid tight coupling with OpenSearch. Not requiring the installation of a plugin will let users keep using their existing judgment pipelines if they already have them.

The illustration below gives an overview of how implicit judgments are calculated. The behavioral data necessary for implicit judgments comes from users interacting with the search platform (searching, clicking on results). In the illustration we assume UBI is used as the tool for collecting this data.

The search quality evaluation framework initially retrieves all seen and clicked documents for a configurable window of historical time from the user-configurable source, which by default will be the UBI indexes but will also support an Amazon S3 bucket. To calculate implicit judgments with COEC as the underlying model, two statistics are calculated:

  1. The rank-aggregated clickthrough rate is calculated first and the results are stored in an index. This is essentially a key-value store, with the key being the rank position and the value being the clickthrough rate at that position across all click events.
  2. Next, for every query-document pair with interactions, the clickthrough rate is calculated. The results are stored in a separate index.

With these two intermediate indexes, the final judgments are calculated and stored in a third index.
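As a rough illustration of these steps, the following minimal Python sketch computes the two intermediate statistics in memory and derives a COEC-style judgment per query-document pair. It assumes flattened impression records in the shape sketched earlier; the real framework would read from and write to the intermediate OpenSearch indexes instead.

```python
from collections import defaultdict

def coec_judgments(impressions):
    """Sketch: COEC-style implicit judgments from flattened impression records.

    Each record: {"user_query": str, "object_id": str, "position": int, "clicked": bool}.
    Returns {(user_query, object_id): clicks_over_expected_clicks}.
    """
    # Step 1: rank-aggregated clickthrough rate (CTR per position across all events).
    views_by_rank = defaultdict(int)
    clicks_by_rank = defaultdict(int)
    for rec in impressions:
        views_by_rank[rec["position"]] += 1
        clicks_by_rank[rec["position"]] += int(rec["clicked"])
    rank_ctr = {pos: clicks_by_rank[pos] / views_by_rank[pos] for pos in views_by_rank}

    # Step 2: per query-document pair, actual clicks and expected clicks
    # (expected clicks = sum of the rank CTR over every impression of the pair).
    clicks = defaultdict(int)
    expected = defaultdict(float)
    for rec in impressions:
        key = (rec["user_query"], rec["object_id"])
        clicks[key] += int(rec["clicked"])
        expected[key] += rank_ctr[rec["position"]]

    # Final judgment: clicks over expected clicks (skip pairs with zero expectation).
    return {key: clicks[key] / expected[key] for key in clicks if expected[key] > 0}
```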

[Image: overview of how implicit judgments are calculated from UBI data]

For all three indexes, a new index is created for each of the above-mentioned steps. An alias is created to point at the latest successful calculation. After successfully calculating implicit judgments, the previous indexes can be removed when no longer needed. This is configurable to enable OpenSearch users to store implicit judgments calculated from different source data.
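As a small sketch of the alias handling, the alias for the latest successful calculation could be repointed atomically with the opensearch-py client (the index and alias names below are made up for illustration):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Atomically move the alias from the previous judgment index to the newly built one.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "judgments-2024-10-01", "alias": "judgments-latest"}},
        {"add": {"index": "judgments-2024-10-22", "alias": "judgments-latest"}},
    ]
})
```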

Using the Implicit Judgments

The implicit judgments can be used to calculate search metrics (e.g., nDCG), enable offline evaluation, and are designed to be used to train LTR models.

The primary goal of the search quality evaluation framework is to assess the search result quality with the calculated implicit judgments.

As such, the framework provides several features:

  • Create a query sample
  • Download, upload or change a query sample
  • Calculate metrics based on a query sample
  • Track calculated search result quality over time

Create a query sample

To assess the search result quality of a system, a subset of real-world queries is typically chosen and evaluated. The search quality evaluation framework can take queries stored in the ubi_queries index (or another index with data stored in a compatible way) and apply Probability-Proportional-to-Size sampling (PPTSS) to generate a frequency-weighted query sample. The size of the resulting sample is configurable, with the default value set to 3,000.
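A minimal sketch of such frequency-weighted sampling, assuming the queries have already been aggregated into per-query counts (illustrative only, not the framework's actual implementation):

```python
import random

def pptss_sample(query_counts, sample_size=3000, seed=42):
    """Sketch: draw a frequency-weighted query sample (probability proportional to size).

    query_counts: dict mapping query string -> number of times it was issued.
    Frequent queries are proportionally more likely to be drawn; sampling is with
    replacement, so deduplicate afterwards if unique queries are required.
    """
    rng = random.Random(seed)
    queries = list(query_counts)
    weights = list(query_counts.values())
    return rng.choices(queries, weights=weights, k=sample_size)
```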

Download, upload or change a query sample

OpenSearch users who already have a query sample can upload it directly to the search quality evaluation framework. An existing query sample can be changed by downloading it, making the desired changes, and uploading it again. Storing multiple query samples is possible.

Calculate metrics based on a query sample

Having a query sample and a set of implicit judgments enables calculating search result quality metrics. Supported metrics are the classic information retrieval metrics such as nDCG, AP, and related measures. We are looking at the metrics used in the trec_eval project, and will either look for a Java library or reimplement the metrics ourselves.
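For example, nDCG for a single query could be computed directly from the implicit judgments along the lines of the sketch below (illustrative only; the framework may end up reusing an existing trec_eval-compatible library instead):

```python
import math

def ndcg_at_k(judgments, ranked_doc_ids, k=10):
    """Sketch: nDCG@k for one query.

    judgments: dict mapping doc_id -> graded relevance (e.g. a COEC-based judgment).
    ranked_doc_ids: doc ids in the order returned by the system under test.
    """
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [judgments.get(doc_id, 0.0) for doc_id in ranked_doc_ids[:k]]
    ideal_dcg = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```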

Together with the actual search metric, the search quality evaluation framework calculates statistical measures that let users assess statistical significance by performing a t-test or calculating the p-value. Users specify which runs to compare and the system calculates the t-score and p-value. This is done by passing the unique IDs under which the test results are stored in the corresponding OpenSearch index.
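As an illustration of the run comparison, a paired t-test over the per-query metric values of two runs could look like this (a sketch only; SciPy is used here just as an example, and the per-query values are assumed to have already been fetched by their run IDs):

```python
from scipy import stats

def compare_runs(metric_run_a, metric_run_b):
    """Sketch: paired t-test over per-query metric values (e.g. nDCG@10) of two runs.

    Both lists must be aligned on the same query sample.
    Returns the t-score and p-value.
    """
    t_score, p_value = stats.ttest_rel(metric_run_a, metric_run_b)
    return t_score, p_value
```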

By choosing the query sample to evaluate, users can run evaluation jobs on different samples, e.g., to see the effects of changes on specialized query samples.

Track calculated search result quality over time

Every metric that is calculated is stored in an index within OpenSearch. That way it is possible to measure and visualize the progression of search metrics for a query sample over time, e.g. with a dedicated dashboard.
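One way such a metric document could be written, as a sketch only (the index name and fields below are hypothetical, not the framework's actual schema):

```python
from datetime import datetime, timezone
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Hypothetical metric document: one per evaluation run, query sample, and metric.
client.index(index="search-quality-metrics", body={
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "query_sample_id": "sample-2024-10",
    "run_id": "run-42",
    "metric": "ndcg@10",
    "value": 0.42,
})
```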

Changes to UBI

We may find that changes in UBI are necessary. We will strive not to change UBI's current data formats, instead opting to use UBI "out of the box" as much as possible so as not to interfere with other users of UBI.

We do NOT require that the OpenSearch UBI plugin be enabled in order to use this tooling. As long as you have data conforming to the UBI schema and provide it to the tooling, you can use these features. However, having UBI collect the signal data is the most seamless way to get implicit judgments calculated.

GitHub Repository

This work will initially be built outside of the OpenSearch GitHub organization, but we hope to transfer the repository into the OpenSearch GitHub organization as soon as possible. We are tentatively calling the repository search-quality-evaluation.

Roadmap

We will attempt to follow the OpenSearch release schedule with incremental progress implemented in OpenSearch releases. This will enable real-time use and feedback.

Conclusion

The availability of the UBI plugin and the data it collects provides opportunities for improving search result quality. The method presented in this RFC is not the only way the data can be leveraged; rather, it describes one method that has shown success in several industries and types of search environments. We believe this makes it a good method for first implementation. Once executed, we hope this will lead to the implementation of additional "pluggable" methods that give OpenSearch users a choice of how to model their data and improve search result quality.

jzonthemtn added the enhancement and untriaged labels Aug 22, 2024
jzonthemtn changed the title from "[RFC] OpenSearch Search Quality Evaluation Framework RFC" to "[RFC] OpenSearch Search Quality Evaluation Framework" Aug 22, 2024
msfroh (Collaborator) commented Aug 28, 2024

I'm wondering if we can leverage some of the join stuff that we're doing to build the indices for the implicit judgement calculations. Can construction of those indices be modeled as a SELECT INTO based on a join? They'd essentially be materialized views.

@penghuo -- you've been integrating the SQL plugin with Spark. Is there something we could do to spin up a Spark job to do these calculations and persist the results back into OpenSearch?

mch2 added the RFC, discuss, feedback needed, and Search:User Behavior Insights labels and removed the untriaged label Sep 4, 2024
aswath86 commented

A question I have is: where will this be implemented, and how tightly or loosely coupled will the implementation be with OpenSearch?

I understand that UBI is a project and any interested search engine can implement UBI. Currently OpenSearch is the only search engine that has an initial implementation of UBI. But technically, I could use the UBI that is available in OpenSearch even if my current search engine is Solr or something else.

Let's say my search engine is Solr and I have an OpenSearch cluster just for UBI. I can create a dummy index in OpenSearch and send an empty-result query to OpenSearch in addition to my actual search query to Solr that serves my user. The empty-result query would be sent to OpenSearch after Solr has responded with the docIds. See the example code block below. Another alternative is to not use the OpenSearch UBI plugin but to send this information to the ubi_queries index just like how we send information to the ubi_events index.

Why should I do this? For the following reasons:

  1. If I'm using Solr or Vespa or any other search engine and I'm interested in the UBI project, I don't want to wait and see if UBI will ever be implemented in the search engine I use.
  2. If UBI is implemented in my search engine, but my search engine doesn't have a native dashboarding feature (e.g., Solr/Vespa), then I lose the insights that I can get via visualization.
  3. UBI events can grow significantly, and I would not be surprised if they are multiple times larger than my actual inventory index. UBI data on my cluster should not impact my actual search latency, so I could imagine proposing a separate cluster for UBI. In that case I can't use the UBI plugin in OpenSearch anyway. So why should it matter whether my actual inventory index is in OpenSearch or Solr or XYZ when I cannot rely on the UBI plugin anyway?
  4. I don't know the exact implementation of the UBI plugin, but would it cause an increase in search latency (however negligible it may be)? For super latency-sensitive applications, the UBI plugin should not contribute to the actual search latency. All the more reason I want the UBI data in a separate UBI-dedicated OpenSearch cluster.
  5. Search engines such as Solr may not have cost-saving features such as policy-based index rollover, warm nodes to archive older data, and index rollup to reduce the cost of storing the UBI data. One more reason why OpenSearch should be where the UBI data lives, regardless of the search engine being used.
  6. Search engines such as Solr are not as good as OpenSearch at handling time-series data, and UBI data is actually time-series data.

And I'm saying all of this without any bias towards OpenSearch, thinking about the recommendations I would give if I were an independent search consultant.

So going back to my initial question: if this search quality framework can be implemented search-engine agnostic, OpenSearch can be used as a UBI hub (for lack of a better word). So the question "what/how is this going to be implemented" takes precedence over "where is this going to be implemented". By "where", I mean: is it going to be an OpenSearch plugin, OR an extension of the OpenSearch UBI plugin, OR OpenSearch core, OR an independent repo (as independent as the opensearch-benchmark repo)?

# Example of an empty-result Solr query with docids sent to OpenSearch.
# Another way of doing this is sending the payload to the ubi_queries index just like I would to the ubi_events index.
GET /ecommerce/_search
{
  "ext": {
    "ubi": {
      "user_query": "Apache Solr",
      "client_id": "devtools for Solr",
      "query_attributes": {
        "application": "from_solr",
        "query_response_object_ids": [
            "0840056120075",
            "0840056120082",
            "0840056122345"
          ]
      }
    }
  },
  "query": {
    "match": {
      "*": """q=Apache Solr&rows=3&
   json.facet={
     categories:{
       type : terms,
       field : cat,
       sort : { x : desc},
       facet:{
         x : "avg(price)",
         y : "sum(price)"
       }
     }
   }"""
    }
  }
}

smacrakis commented

Some quick comments

Re "Currently OpenSearch is the only search engine so far that has the initial implementation for UBI."

The server-side component of UBI collects queries and responses at the server. This has been implemented so far for OpenSearch, Solr, and Elasticsearch.

On the analysis side, you can send the data anywhere you want. We generally recommend sending it to mass storage (S3 in the AWS context) because as you say it can get very voluminous. You can then load it into OpenSearch (or for that matter Redshift if you like) for analysis. We will be building tools on top of OpenSearch, but since the schema is well-defined, others can build tools where they like. We are still working on how exactly to implement analysis functionality.

My current hypothesis is that most analysts will want to use Python as their main tool, querying OpenSearch or a DBMS for bulk data operations. Where exactly the Python code will run is still an open question.

epugh commented Oct 11, 2024

@aswath86 thanks for following up, and I think that you asked two big things, one about the value proposition of UBI, and then one about the Search Quality Eval Framework.

To your first question... Yes! In talking about UBI, I've met two large organizations that use Solr as the search engine for their application, and have OpenSearch for all their logs and analytics and other data sources. Both of them are interested in the prospect of using UBI without needing to throw out their existing Solr investment, and leveraging their OpenSearch setup even more! I'm hoping that UBI support will ship in Solr in the near future with a pipeline to send that data to an OpenSearch backend (apache/solr#2452 if you are curious). This can be expanded to many other search engines.

To the second question:

OpenSearch plugin, OR extension of the OpenSearch UBI plugin, OR OpenSearch core, OR independent repo (as independent as the opensearch-benchmark repo)

In some ways, I think this is a question to be answered by more experienced OpenSearch maintainers ;-). From my perspective, I see some real tension in the community on this exact question applied to many areas. If you recall, we wanted to ship UBI as part of core OpenSearch, and what we heard is "let's put LESS into core and more into plugins". However, then you look at ML Commons, which, while technically a single "plugin", really looks like a very rich independent ecosystem bundled up into a plugin.

I can see a path that involves us expanding out the UI aspects (dashboards, visualizations, etc.) into the existing dashboards-search-relevance project, while we add a new plugin that supports the APIs and other work in a search-relevance project.

We are aiming to wrap up the first phase by end of year, so we need to figure out what is shippable by then, while also starting to think about what we do next year, and how big do we dream?

epugh commented Oct 22, 2024

I want to share that we had a good discussion on the COEC calculations, and that work is closing in on done. We'd like to get the documentation about how to calculate COEC-based implicit judgments into the 2.18 release if possible.
