feat(gms): store update events in a new index in ElasticSearch #135

Merged: 10 commits, Nov 22, 2023

danielkoh94 (Collaborator) commented on Nov 7, 2023

The objective is to capture update events in Elasticsearch for metric-gathering purposes.

This PR makes the following changes:

  • Creates an Elasticsearch index, datahub-update-index
  • Converts UPSERT events into JSON documents that capture information such as the time of the event, actor, aspect, and entity urn
  • Builds an Elasticsearch request that inserts the new event into the index
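
As an illustration of the second and third items, here is a minimal sketch of how an UPSERT event might be flattened into a JSON document before indexing. The class name `UpdateEventDocument`, the field names (`urn`, `aspect`, `actor`, `event_type`, `timestamp`), and the hand-rolled escaping are assumptions for illustration, not the code in this PR; real code would more likely use a JSON library.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: turns an UPSERT event into a flat JSON document. */
final class UpdateEventDocument {

    /** Escapes backslashes and double quotes only (minimal, not full JSON escaping). */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Field names are assumptions, not the ones used in the PR. */
    static String toJson(String urn, String aspect, String actor, Instant timestamp) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("urn", urn);
        fields.put("aspect", aspect);
        fields.put("actor", actor);
        fields.put("event_type", "UPSERT");
        fields.put("timestamp", timestamp.toString()); // ISO-8601 instant
        return fields.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }
}
```

Calling `UpdateEventDocument.toJson(...)` with an entity urn, aspect name, actor urn, and event time returns a single-line JSON object with those five fields, which could then be used as the source of an index request.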

},
"type": {
"type": "keyword"
},
Collaborator

Where are you storing the document? Only in _source?
_source is not indexed, so it might be hard to do aggregations and sorting in the future.
Is there any reason not to store urn, event_type, and other important info as fields?

Collaborator Author

Fields such as urn and other info are automatically indexed by ES when the document is inserted into the index, so I did not define them here. Should I define the fields here to make it more transparent?

Collaborator

Are you saying that it is using dynamic mapping? I think we should define the fields since the schema is known.
I am wondering if we should use static mapping for this index. With dynamic mapping, there is no control over what is sent to ES for this index. It takes just one person sending one erroneous document with many fields to cause a field explosion in this index. The mapping is also not optimized: I believe each string field will be mapped twice, as "text" and as "keyword", which wastes RAM and storage.
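
To make the static-mapping suggestion concrete, here is a sketch of a mapping body with `"dynamic": "strict"`, a setting that makes Elasticsearch reject documents containing unmapped fields. It is built as a Java string, as it might be supplied when the index is created. The class name `UpdateIndexMapping` and the field list are assumptions; the actual fields of datahub-update-index may differ.

```java
/** Hypothetical holder for a static mapping; field names are assumptions. */
final class UpdateIndexMapping {

    /** Builds one property entry of the form "name": {"type": "..."}. */
    static String property(String name, String type) {
        return "\"" + name + "\": {\"type\": \"" + type + "\"}";
    }

    /** Mapping body with dynamic mapping turned off via "strict". */
    static String mappingJson() {
        return "{\"dynamic\": \"strict\", \"properties\": {"
                + String.join(", ",
                        property("urn", "keyword"),
                        property("aspect", "keyword"),
                        property("actor", "keyword"),
                        property("event_type", "keyword"),
                        property("timestamp", "date"))
                + "}}";
    }
}
```

Mapping identifiers as `keyword` only, rather than as `text` plus a `keyword` sub-field, keeps them usable for aggregations and sorting without the extra analyzed copy.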


// Generating a hash using event urn, event content and time, to be used as a unique id for ES documents
String stringToBeHashed = event.getEntityType().toString() + '_'
+ event.getAspect().getValue().asAvroString()
Collaborator

Why do we need to create a unique id for the ES documents? Are we not storing each event related to the dataset?
Can we just use the _id generated by ES?

Collaborator

Since you are using time and content as the hash input, it is almost certain to result in a new doc. It is unlikely that any document will ever be updated.
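
The point above can be sketched as follows: when the timestamp is part of the hash input, two otherwise identical events get different ids, so every event becomes a new document. SHA-256, the class name `EventIdSketch`, and the helper `hashId` are assumptions for illustration; the PR excerpt does not show which hash function was used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of a content-plus-time document id. */
final class EventIdSketch {

    /** Hashes entity type, aspect content, and event time into a hex id. */
    static String hashId(String entityType, String aspectJson, long timestampMillis) {
        String input = entityType + '_' + aspectJson + '_' + timestampMillis;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the id changes whenever the time changes, such an id behaves like a random one for indexing purposes, which is why an id generated by ES is a reasonable substitute.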

Collaborator Author

> Why do we need to create a unique id for the ES documents? Are we not storing each event related to the dataset? Can we just use the _id generated by ES?

The ES client in Java does not automatically assign an _id to the document, so I need to create a unique _id for each event; otherwise, uploading the document to the index fails.

Collaborator Author

> Since you are using time and content as the hash input, it is almost certain to result in a new doc. It is unlikely that any document will ever be updated.

Yes, that is the idea: each update event is an individual document, so we can track all updates over a time period.

Collaborator

https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/7.8.0/org/elasticsearch/action/index/IndexRequest.html#id()

What if you do not set the id for the document? Will ES generate the id for you? Let's see if we can use an auto-generated ID.

Collaborator Author

Hmm, the auto-generated ID works. I have removed the relevant parts to make use of the auto-generated ID.

@neojunjie neojunjie merged commit 743c232 into master Nov 22, 2023
37 of 40 checks passed
2 participants