SKG on elasticsearch #196

Open
lschneidpro opened this issue Sep 9, 2024 · 5 comments

@lschneidpro

Hi everyone,

I'm currently reading the book but using Elasticsearch instead of Solr. I attempted to reimplement the Semantic Knowledge Graph (SKG) on ES, and developed a custom scoring script for Elasticsearch's significant text aggregation, inspired by the original Solr code found here. So far, I've been able to achieve the same scores as those in the health dataset example. I haven't tested the other cases from the book yet, but I wanted to share my implementation to see if it aligns with the authors' intentions.

script = """
double sigmoid(double x, double offset, double scale) {
    return (x+offset) / (scale + Math.abs(x+offset));
}

double bgProb = params._superset_freq*1.0/params._superset_size;
double num = (params._subset_freq - params._subset_size * bgProb);
double denom = Math.sqrt(params._subset_size * bgProb * (1 - bgProb));
denom = (denom == 0) ? 1e-10 : denom;
double z = num / denom;
double result = 0.2*sigmoid(z, -80, 50)
                + 0.2*sigmoid(z, -30, 30)
                + 0.2*sigmoid(z, 0, 30)
                + 0.2*sigmoid(z, 30, 30)
                + 0.2*sigmoid(z, 80, 50);
return Math.round(result * 1e5)/1e5;
"""

script_heuristic = {
    "script": {
        "lang": "painless",
        "source": script,
    }
}

query_string = "advil"
query = {"match": {"body": query_string}}
aggs = {
    "keywords": {
        "significant_text": {
            "field": "body",
            "min_doc_count": 2,
            "script_heuristic": script_heuristic,
        }
    }
}
# `client` (an Elasticsearch Python client) and `alias` (the index name) are defined elsewhere.
resp = client.search(index=alias, query=query, aggs=aggs, size=0)

resulting in:

{'took': 92,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 15, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 15,
   'bg_count': 12892,
   'buckets': [{'key': 'advil',
     'doc_count': 15,
     'score': 0.70986,
     'bg_count': 15},
    {'key': 'motrin', 'doc_count': 9, 'score': 0.59897, 'bg_count': 10},
    {'key': 'aleve', 'doc_count': 4, 'score': 0.4662, 'bg_count': 4},
    {'key': 'ibuprofen', 'doc_count': 13, 'score': 0.38264, 'bg_count': 75},
    {'key': 'alleve', 'doc_count': 2, 'score': 0.36649, 'bg_count': 2},
    {'key': 'tylenol', 'doc_count': 6, 'score': 0.33048, 'bg_count': 23},
    {'key': 'naproxen', 'doc_count': 6, 'score': 0.31226, 'bg_count': 26}]}}}
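
For checking scores offline, the Painless script can be mirrored in plain Python. This is only a minimal sketch, not the Solr or book implementation: the name `skg_relatedness` is hypothetical, and the mapping of `_subset_freq`, `_subset_size`, `_superset_freq`, and `_superset_size` to the bucket `doc_count`, the aggregation `doc_count`, the bucket `bg_count`, and the aggregation `bg_count` is an assumption that appears consistent with the scores listed above.

```python
import math


def sigmoid(x, offset, scale):
    """Same squashing function as in the Painless script."""
    return (x + offset) / (scale + abs(x + offset))


def skg_relatedness(subset_freq, subset_size, superset_freq, superset_size):
    """Z-score of the foreground count against the background rate,
    averaged over five shifted sigmoids (mirrors the script above)."""
    bg_prob = superset_freq / superset_size
    num = subset_freq - subset_size * bg_prob
    denom = math.sqrt(subset_size * bg_prob * (1 - bg_prob))
    if denom == 0:
        denom = 1e-10  # same guard against division by zero as the script
    z = num / denom
    result = (0.2 * sigmoid(z, -80, 50)
              + 0.2 * sigmoid(z, -30, 30)
              + 0.2 * sigmoid(z, 0, 30)
              + 0.2 * sigmoid(z, 30, 30)
              + 0.2 * sigmoid(z, 80, 50))
    # Mimic Java's Math.round (round half up) at five decimal places.
    return math.floor(result * 1e5 + 0.5) / 1e5


# 'motrin' bucket from the advil response: doc_count=9, bg_count=10,
# foreground size 15, background size 12892 (the thread lists 0.59897).
print(skg_relatedness(9, 15, 10, 12892))
```

Having the formula outside the cluster makes it easier to compare individual buckets against Solr's output when the counts differ between engines.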

I appreciate any feedback—thanks!

@lschneidpro
Author

I’ve been testing the various cases from Chapter 5, and for the vibranium results, the scores are starting to change but remain quite similar overall (example attached).

query_string = "vibranium"
query = {"match": {"body": query_string}}
aggs = {
    "keywords": {
        "significant_text": {
            "field": "body",
            "min_doc_count": 2,
            "script_heuristic": script_heuristic,
        }
    }
}
alias="stackexchange"
resp = client.search(index=alias, query=query, aggs=aggs, size=0)
resp.body
{'took': 8,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 281, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 281,
   'bg_count': 1950545,
   'buckets': [{'key': 'vibranium',
     'doc_count': 280,
     'score': 0.95473,
     'bg_count': 843},
    {'key': 'wakandan', 'doc_count': 35, 'score': 0.87018, 'bg_count': 122},
    {'key': 'wakanda', 'doc_count': 48, 'score': 0.85652, 'bg_count': 284},
    {'key': "panther's", 'doc_count': 14, 'score': 0.85428, 'bg_count': 25},
    {'key': 'klaue', 'doc_count': 12, 'score': 0.85196, 'bg_count': 19},
    {'key': 'maclain', 'doc_count': 11, 'score': 0.84754, 'bg_count': 17},
    {'key': 'adamantium', 'doc_count': 93, 'score': 0.847, 'bg_count': 1221},
    {'key': 'klaw', 'doc_count': 15, 'score': 0.82973, 'bg_count': 40},
    {'key': 'panther', 'doc_count': 36, 'score': 0.82165, 'bg_count': 254},
    {'key': 'alloy', 'doc_count': 53, 'score': 0.81535, 'bg_count': 592}]}}}

As for the Star Wars content-based recommendation, I wasn’t entirely sure how to approach it. I tried restricting the aggregation to the tokens of the parsed document (via `include`), though that isn’t ideal because phrases like ‘Princess Leia’ have to be split into separate tokens. Still, I’m seeing similar results (example attached). Let me know your thoughts.

parsed_document = ["this", "doc", "contains", "the", "words", "luke", 
            "magneto", "cyclops", "darth", "vader", "princess","leia", 
            "wolverine", "apple", "banana", "galaxy", "force", 
            "blaster", "and", "chloe"]

query_string = "star wars"
query = {
    "match": {
        "body": {
            "query": query_string,
            "operator": "AND",
        }
    }
}
aggs = {
    "keywords": {
        "significant_text": {
            "field": "body",
            "script_heuristic": script_heuristic,
            "include": parsed_document,
        }
    }
}
alias="stackexchange"
resp = client.search(index=alias, query=query, aggs=aggs, size=0)
resp.body
{'took': 446,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 6829, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 6829,
   'bg_count': 1950545,
   'buckets': [{'key': 'luke',
     'doc_count': 1157,
     'score': 0.77982,
     'bg_count': 15452},
    {'key': 'force', 'doc_count': 1960, 'score': 0.76475, 'bg_count': 47672},
    {'key': 'darth', 'doc_count': 923, 'score': 0.73712, 'bg_count': 13985},
    {'key': 'vader', 'doc_count': 936, 'score': 0.72128, 'bg_count': 15980},
    {'key': 'leia', 'doc_count': 533, 'score': 0.70443, 'bg_count': 6048},
    {'key': 'galaxy', 'doc_count': 858, 'score': 0.64305, 'bg_count': 20692},
    {'key': 'blaster', 'doc_count': 211, 'score': 0.51115, 'bg_count': 2572},
    {'key': 'princess', 'doc_count': 225, 'score': 0.38521, 'bg_count': 6076},
    {'key': 'this', 'doc_count': 4136, 'score': 0.19193, 'bg_count': 927850},
    {'key': 'the',
     'doc_count': 6735,
     'score': 0.17519,
     'bg_count': 1801230}]}}}

Based on the Solr documentation and the code, it appears that in the Star Wars case, relatedness is scored by comparing the context filter for each token, if I understand correctly. Let me know if it’s worth exploring the Elasticsearch API further to implement the exact method for calculating relatedness.
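
One way to keep phrases such as "princess leia" intact might be to count each phrase with a `filters` aggregation, once under the foreground query and once under `match_all`, and then apply the scoring formula from the script on the client. This is a sketch under assumptions, not the Solr method: the helper names are hypothetical, and it only builds request bodies and reads counts, so no cluster is needed to try it.

```python
def build_phrase_count_request(query, phrases, field="body"):
    """Build a search body whose `filters` aggregation counts the documents
    matching each phrase within the given query."""
    named_filters = {p: {"match_phrase": {field: p}} for p in phrases}
    return {
        "size": 0,
        "track_total_hits": True,  # exact totals, needed as set sizes
        "query": query,
        "aggs": {"phrases": {"filters": {"filters": named_filters}}},
    }


def phrase_counts(response):
    """Return {phrase: doc_count} from a response to the body above."""
    buckets = response["aggregations"]["phrases"]["buckets"]
    return {name: bucket["doc_count"] for name, bucket in buckets.items()}


def total_hits(response):
    """Number of documents matched by the query (foreground or background size)."""
    return response["hits"]["total"]["value"]


phrases = ["princess leia", "darth vader"]
foreground_body = build_phrase_count_request(
    {"match": {"body": {"query": "star wars", "operator": "AND"}}}, phrases
)
background_body = build_phrase_count_request({"match_all": {}}, phrases)
```

The foreground response would supply the per-phrase subset counts and subset size, and the background response the superset counts and size. The cost is two requests instead of one.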

@treygrainger
Owner

treygrainger commented Sep 14, 2024

That's really cool, @lschneidpro! I probably won't have time to review this for the next month (the book is being released and I'll be traveling to speak at a bunch of conferences), but I'll definitely add this to my list of things to review once I free up.

Out of curiosity, does this (or could it conceivably) handle the multi-level traversals (like the query disambiguation examples in chapter 7)?

I'd definitely be interested in getting code for this working for Elasticsearch and OpenSearch users. If you can get the multi-level aggregations working, and this could work consistently between Solr and Elasticsearch/OpenSearch, I think there would be a lot of people interested.

@lschneidpro
Author

Hi @treygrainger,

Thanks for your feedback!

I'm about to go on vacation, so no worries. Currently, the implementation doesn't support multi-level traversals. To fully understand the functionality, I'll need to dive deeper into the SKG academic paper and the Solr code.

So far, I've been using Elasticsearch's Significant Terms Aggregation and Significant Text Aggregation. These compute foreground and background statistics based on the query, and I use a custom script (your SKG code) to derive a custom score. By the way, in my tests, the Solr implementation runs faster than the Elasticsearch options.

I’m not an Elasticsearch expert, so I’m unsure how to implement SKG fully without developing a dedicated plugin. I can reach out to Elasticsearch support, or perhaps you or someone on your team with more Elasticsearch expertise could provide some guidance.

Here are the options I see moving forward:

  • Handle multi-level traversal on the client side using a similar approach.
  • Use another Elasticsearch API, such as Term Vectors, or something else I might be overlooking.
  • Develop a custom plugin, though this would require significant effort.
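
The first option, client-side traversal, could look roughly like the sketch below. Everything here is an assumption for illustration: `traverse` and `fake_search` are hypothetical names, and `search` stands for any callable that runs the significant text request for a query and returns `(keyword, score)` pairs.

```python
def traverse(search, start_query, depth=2, top_n=3, field="body"):
    """Expand each level's top keywords by re-querying with the keyword
    added to the conditions collected along the path."""

    def expand(must_clauses, level):
        if level == 0:
            return []
        query = {"bool": {"must": must_clauses}}
        nodes = []
        for keyword, score in search(query)[:top_n]:
            child_clauses = must_clauses + [{"match": {field: keyword}}]
            nodes.append({
                "key": keyword,
                "score": score,
                "children": expand(child_clauses, level - 1),
            })
        return nodes

    return expand([{"match": {field: start_query}}], depth)


def fake_search(query):
    """Stand-in for a real request; returns (keyword, score) pairs."""
    n = len(query["bool"]["must"])
    return [(f"term{n}a", 0.9), (f"term{n}b", 0.5)]


tree = traverse(fake_search, "advil", depth=2, top_n=2)
print(tree[0]["key"], [child["key"] for child in tree[0]["children"]])
```

The trade-off is one request per expanded node, so `depth` and `top_n` would need to stay small.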

As for query disambiguation, I think sub-aggregations could work. The first aggregation level would target categories, while the second level would apply the classic significant text aggregation within each category bucket. I'll experiment with this when I'm back.
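
A rough sketch of that sub-aggregation idea follows. It is untested against a cluster and rests on assumptions: the `category` keyword field and the helper name are hypothetical, and the first level uses a plain `terms` aggregation, which ranks categories by document count rather than by relatedness.

```python
def build_disambiguation_aggs(script_source, category_field="category",
                              text_field="body", max_categories=3):
    """Nest a scripted significant_text aggregation inside each category bucket."""
    return {
        "categories": {
            "terms": {"field": category_field, "size": max_categories},
            "aggs": {
                "keywords": {
                    "significant_text": {
                        "field": text_field,
                        "min_doc_count": 2,
                        "script_heuristic": {
                            "script": {"lang": "painless", "source": script_source}
                        },
                    }
                }
            },
        }
    }


# Placeholder source; in practice this would be the SKG scoring script.
aggs = build_disambiguation_aggs("return 0.0;")
```

Within each category bucket the foreground set would be the query results in that category, while the background stays the whole index unless a `background_filter` is set.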

Best,

@lschneidpro
Author

@treygrainger any updates? Thanks

@david-albrecht-xometry

FWIW I ran this in OpenSearch and got the same results for the health collection and yet another set of similar results for scifi. Thanks for sharing this!

import json

import requests
from engines.opensearch.config import OPENSEARCH_URL


def traverse_skg_opensearch(query, collection):
    script = """
    double sigmoid(double x, double offset, double scale) {
        return (x + offset) / (scale + Math.abs(x + offset));
    }

    double bgProb = params._superset_freq * 1.0 / params._superset_size;
    double num = (params._subset_freq - params._subset_size * bgProb);
    double denom = Math.sqrt(params._subset_size * bgProb * (1 - bgProb));
    denom = (denom == 0) ? 1e-10 : denom;
    double z = num / denom;
    double result = 0.2 * sigmoid(z, -80, 50)
                    + 0.2 * sigmoid(z, -30, 30)
                    + 0.2 * sigmoid(z, 0, 30)
                    + 0.2 * sigmoid(z, 30, 30)
                    + 0.2 * sigmoid(z, 80, 50);
    return Math.round(result * 1e5) / 1e5;
    """
    payload = {
        "query": {
            "match": {
                "body": query
            }
        },
        "aggs": {
            "keywords": {
                "significant_text": {
                    "field": "body",
                    "min_doc_count": 2,
                    "script_heuristic": {
                        "script": {
                            "lang": "painless",
                            "source": script
                        }
                    }
                }
            }
        },
        "size": 0
    }
    response = requests.post(
        f"{OPENSEARCH_URL}/{collection}/_search",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload)
    )
    # Raise on HTTP errors rather than asserting, so the check survives `python -O`.
    response.raise_for_status()

    return response.json()

traverse_skg_opensearch('advil', 'health')

{'took': 2,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 15, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 15,
   'bg_count': 12892,
   'buckets': [{'key': 'advil',
     'doc_count': 15,
     'score': 0.70986,
     'bg_count': 15},
    {'key': 'motrin', 'doc_count': 9, 'score': 0.59897, 'bg_count': 10},
    {'key': 'aleve', 'doc_count': 4, 'score': 0.4662, 'bg_count': 4},
    {'key': 'ibuprofen', 'doc_count': 13, 'score': 0.38264, 'bg_count': 75},
    {'key': 'alleve', 'doc_count': 2, 'score': 0.36649, 'bg_count': 2},
    {'key': 'tylenol', 'doc_count': 6, 'score': 0.33048, 'bg_count': 23},
    {'key': 'naproxen', 'doc_count': 6, 'score': 0.31226, 'bg_count': 26}]}}}

traverse_skg_opensearch('vibranium', 'scifi')

{'took': 2,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 281, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'keywords': {'doc_count': 281,
   'bg_count': 177547,
   'buckets': [{'key': 'vibranium',
     'doc_count': 281,
     'score': 0.91625,
     'bg_count': 281},
    {'key': 'wakandan', 'doc_count': 39, 'score': 0.73685, 'bg_count': 61},
    {'key': "america's", 'doc_count': 70, 'score': 0.72504, 'bg_count': 214},
    {'key': 'adamantium', 'doc_count': 94, 'score': 0.71726, 'bg_count': 407},
    {'key': 'wakanda', 'doc_count': 51, 'score': 0.69257, 'bg_count': 142},
    {'key': 'alloy', 'doc_count': 63, 'score': 0.67211, 'bg_count': 245},
    {'key': 'maclain', 'doc_count': 15, 'score': 0.63679, 'bg_count': 17},
    {'key': 'klaw', 'doc_count': 16, 'score': 0.63014, 'bg_count': 20},
    {'key': "panther's", 'doc_count': 16, 'score': 0.58142, 'bg_count': 25},
    {'key': 'shield', 'doc_count': 151, 'score': 0.56559, 'bg_count': 2312}]}}}
