feat(gms): store update events in a new index in ElasticSearch #135

danielkoh94 · 2023-11-07T09:50:33Z

The objective is to capture update events in ElasticSearch for metric-gathering purposes.

This PR makes the following changes:

- Create an ElasticSearch index, datahub-update-index.
- Convert UPSERT events into a JSON document that captures information such as the time of the event, actor, aspect and entity urn.
- Create a request for ElasticSearch to insert the new event into its index.
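
As a rough illustration of the second step, the sketch below flattens an update event into a JSON string using only the JDK. It is not the code from this PR: the class name `UpdateEventDocument`, the method names, and the field names (`urn`, `entityType`, `aspect`, `actor`, `timestamp`) are assumptions chosen to match the description above. In the real hook the resulting string would be handed to the ElasticSearch index request.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateEventDocument {

    // Escapes characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds a flat JSON document for one update event.
    // All field names here are hypothetical, not taken from the PR.
    static String toJson(String urn, String entityType, String aspect,
                         String actor, long timestampMillis) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("urn", urn);
        fields.put("entityType", entityType);
        fields.put("aspect", aspect);
        fields.put("actor", actor);
        fields.put("timestamp", timestampMillis);

        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            first = false;
            sb.append('"').append(e.getKey()).append("\":");
            Object v = e.getValue();
            if (v instanceof Number) {
                sb.append(v); // numbers are written unquoted
            } else {
                sb.append('"').append(escape(v.toString())).append('"');
            }
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(
            "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
            "dataset", "datasetProperties", "urn:li:corpuser:jdoe",
            Instant.now().toEpochMilli()));
    }
}
```

Keeping the document flat, with one top-level key per piece of information, is what makes each value addressable as its own field in the index mapping.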


neojunjie · 2023-11-15T00:55:38Z

...-service/restli-servlet-impl/src/main/resources/index/usage-event/update_event_template.json

+        },
+        "type": {
+          "type": "keyword"
+        },


Where are you storing your document? In _source?
_source is not indexed, so it may be hard for you to run aggregations and sorting in the future.
Is there any reason not to store urn, event_type and other important info as fields?

Fields such as urn and other info are automatically indexed by ES when the document is inserted into the index, so I did not define them here. Should I define the fields here so it is more transparent?

Are you saying that it is using dynamic mapping? I think we should define the fields since the schema is known.
I am wondering if we should use static mapping for this index. With dynamic mapping, there is no control over what is sent to ES for this index. It only takes one person sending one erroneous document with a large number of fields to cause a field explosion in this index. The mapping is also not optimized: I believe each string field will get two types, "text" and "keyword", which wastes RAM and storage.
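
To make the static-mapping suggestion concrete, here is a minimal sketch of what an explicit mapping with dynamic mapping turned off could look like, held as a Java string constant. It is an assumption, not the template merged in this PR: the field names other than `type` are hypothetical, and the PR only states that dynamic mappings were disabled, not whether `"dynamic"` was set to `false` (unknown fields are ignored) or `"strict"` (documents with unknown fields are rejected). The sketch uses `"strict"`.

```java
public class UpdateIndexMapping {

    // Explicit mapping in Elasticsearch 7.x typeless format.
    // "dynamic": "strict" rejects any document containing a field not listed
    // under "properties". Field names are hypothetical except "type".
    static final String MAPPINGS = String.join("\n",
        "{",
        "  \"mappings\": {",
        "    \"dynamic\": \"strict\",",
        "    \"properties\": {",
        "      \"urn\":        { \"type\": \"keyword\" },",
        "      \"entityType\": { \"type\": \"keyword\" },",
        "      \"aspect\":     { \"type\": \"keyword\" },",
        "      \"actor\":      { \"type\": \"keyword\" },",
        "      \"type\":       { \"type\": \"keyword\" },",
        "      \"timestamp\":  { \"type\": \"date\" }",
        "    }",
        "  }",
        "}");

    public static void main(String[] args) {
        System.out.println(MAPPINGS);
    }
}
```

Mapping identifiers as `keyword` only avoids the default `text` plus `keyword` sub-field pair that dynamic mapping creates for strings, and keeps the values usable for aggregations and sorting.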

neojunjie · 2023-11-15T00:58:30Z

...data-jobs/mae-consumer/src/main/java/com/linkedin/metadata/kafka/hook/UpdateIndicesHook.java

+
+    // Generating a hash using event urn, event content and time, to be used as a unique id for ES documents
+    String stringToBeHashed = event.getEntityType().toString() + '_'
+        + event.getAspect().getValue().asAvroString()


Why do we need to create a unique id for the ES documents? Are we not storing every event related to the dataset?
Can we just use the _id generated by ES?

Since you are using time and content as the hash input, it is almost certain to result in a new doc. It is unlikely that there will be any update to the document.

> Why do we need to create a unique id for the ES documents? Are we not storing every event related to the dataset? Can we just use the _id generated by ES?

The ES client in Java does not automatically assign an _id to the document, so I need to create a unique _id for each event; otherwise there is an error when uploading the document to the index.

> Since you are using time and content as the hash input, it is almost certain to result in a new doc. It is unlikely that there will be any update to the document.

Yes, that is the idea here, where each update event is an individual document so we can track all updates over a time period.

https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/7.8.0/org/elasticsearch/action/index/IndexRequest.html#id()

What if you do not set the id for the document? Will ES generate the id for you? Let's see if we can use the autogenerated ID.

Hmm, the auto-generated id works. I have removed the relevant parts to make use of the auto-generated ID.
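
For context on why the hand-rolled id added nothing, the sketch below reconstructs the general idea of the removed approach using only the JDK: hashing entity type, aspect content and time with SHA-256. The class and method names are hypothetical and the exact inputs and separators used in the PR are not fully visible in the diff, so treat this as an approximation. Because the timestamp is part of the input, two events practically never share an id, which is the same uniqueness guarantee an auto-generated `_id` already gives.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class EventIdHash {

    // Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of input.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical reconstruction of the removed id scheme:
    // entity type, aspect content and event time joined with '_'.
    static String documentId(String entityType, String aspectContent, long timeMillis) {
        return sha256Hex(entityType + '_' + aspectContent + '_' + timeMillis);
    }

    public static void main(String[] args) {
        String content = "{\"description\":\"sales table\"}";
        // Same entity and content, one millisecond apart: the ids differ.
        System.out.println(documentId("dataset", content, 1700000000000L));
        System.out.println(documentId("dataset", content, 1700000000001L));
    }
}
```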

danielkoh94 added 6 commits November 7, 2023 17:15

added creation of new index datahub_update_event during elasticsearch setup

fecad02

Added function to extract information from update event into a JSON string

4e228e7

Added template for datahub_update_event index

b16f90c

Added function to generate SHA256 hashes for document IDs

47a2191

Updated hook to process UPSERT events into datahub_update_index in ES

64fea48

Cleaned up comments and added descriptions for functions

19b80dc

danielkoh94 requested a review from neojunjie November 7, 2023 09:50

github-actions bot added product devops labels Nov 7, 2023

danielkoh94 added 2 commits November 14, 2023 10:09

Removed unused imports and comments

3fa5385

added entityType to the ES update event document

40b9a52

neojunjie requested changes Nov 15, 2023

danielkoh94 added 2 commits November 20, 2023 17:11

Removed creation of document hash for document id to let ES auto generate docID

56e9cb2

added explicit definition of update index mappings and disabled dynamic mappings

94aef19

neojunjie approved these changes Nov 22, 2023

neojunjie merged commit 743c232 into master Nov 22, 2023
37 of 40 checks passed

danielkoh94 mentioned this pull request Nov 27, 2023

feat(gms): add delete events into es index #138

Merged


