-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(gms): store update events in a new index in ElasticSearch #135
Conversation
}, | ||
"type": { | ||
"type": "keyword" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are you storing your document? _source?
_source is not indexed and it might be hard for you to do aggr and sorting in the future.
any reason not to store urn, event_type and other impt info as fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fields such as urn and other info are automatically indexed by ES when inserting the document into the index so I did not defined them here. Should I define the fields here so it is more transparent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you saying that it is using dynamic-mapping? I think we should define the fields since the schema is known.
I am wondering if we should use static mapping for this index? With dynamic-mapping, there is no control over what is being sent to the ES for this index. Just need 1 person to send 1 erroneous document with many many fields, it will result in fields explosion in this index. And the mapping is not optimized, I believe each field will have 2 types "keyword" and "text" (waste ram and storage).
|
||
// Generating a hash using event urn, event content and time, to be used as a unique id for ES documents | ||
String stringToBeHashed = event.getEntityType().toString() + '_' | ||
+ event.getAspect().getValue().asAvroString() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to create a unique id for the ES documents? are we not storing each events related to the dataset?
can we just use the _id generated by ES?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are using time and content as the hash, it is almost certain that it will result in a new doc. it is unlikely that there will be any update to the document
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to create a unique id for the ES documents? are we not storing each events related to the dataset? can we just use the _id generated by ES?
The ES client in java does not automatically assign a _id to the document, so I need to create a unique _id for each event if not there is an error in uploading the document to the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are using time and content as the hash, it is almost certain that it will result in a new doc. it is unlikely that there will be any update to the document
Yes, that is the idea here, where each update event is an individual document so we can track all updates over a time period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if you do not set the id for the document? will ES generate the id for you? Lets see if we can use autogenerated ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm auto generated id works. I have removed the relevant parts to make use of the auto generated ID
Objective was to capture update events in ElasticSearch for metric-gathering purposes.
This PR makes the following changes: