Write metadata cache data to mappings _meta with refresh time update #805

Open: seankao-az (Collaborator) wants to merge 25 commits into main

@seankao-az seankao-az commented Oct 24, 2024

Description

Metadata Cache Writer

For the most part, same as

In addition to the regular metadata storage through FlintIndexMetadataService, we dual-write additional fields, defined by FlintMetadataCache, to the index mappings _meta field. This lets frontend users read crucial index metadata quickly without making another backend API call.

This PR adds the following fields for all indexes when the Spark config spark.flint.metadataCacheWrite.enabled is set to true:

  • _meta.properties.metadataCacheVersion: "1.0"
  • _meta.properties.refreshInterval: Integer. Refresh interval of the index, in seconds. Added only if the index refresh type is auto refresh and refresh_interval is set.
  • _meta.properties.sourceTables: Array of Strings. Mocked data for now; real values will come in a later PR.
  • _meta.properties.lastRefreshTime: Long. Timestamp, in milliseconds, of the last refresh. Added only if the index has been refreshed at least once.
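
To make the "added only if" rules above concrete, here is a minimal sketch of how the optional fields could be modeled and flattened into the `_meta.properties` entries. `MetadataCacheSketch` and `toProperties` are hypothetical names for illustration, not the actual FlintMetadataCache API, and the sample values in the usage note are made up.

```scala
// Hypothetical sketch: field names follow the PR description,
// but this is NOT the real FlintMetadataCache class.
final case class MetadataCacheSketch(
    metadataCacheVersion: String,
    refreshInterval: Option[Int], // seconds; present only for auto refresh with refresh_interval set
    sourceTables: Seq[String],
    lastRefreshTime: Option[Long] // epoch millis; present only after at least one refresh
) {

  /** Builds the entries for _meta.properties, leaving out absent optional fields. */
  def toProperties: Map[String, Any] = {
    val required: Map[String, Any] = Map(
      "metadataCacheVersion" -> metadataCacheVersion,
      "sourceTables" -> sourceTables)
    val interval: Option[(String, Any)] = refreshInterval.map(v => "refreshInterval" -> v)
    val lastRefresh: Option[(String, Any)] = lastRefreshTime.map(v => "lastRefreshTime" -> v)
    required ++ interval.toList ++ lastRefresh.toList
  }
}
```

For example, `MetadataCacheSketch("1.0", None, Seq.empty, None).toProperties` contains only `metadataCacheVersion` and `sourceTables`, which matches an index that is not auto refreshed and has never been refreshed.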

Last Refresh Time

Added two new fields to FlintMetadataLogEntry and bumped the version of its JSON doc from 1.0 to 1.1 (a minor bump, because new fields are added but existing fields are unchanged):

  • lastRefreshStartTime: Long. Timestamp when last refresh started
  • lastRefreshCompleteTime: Long. Timestamp when last refresh completed

These are accurate only for manual refresh (full, incremental) and for auto refresh with the external scheduler.
With the internal scheduler, jobStartTime (createTime in FlintMetadataLogEntry) is used to track the streaming job's start time instead.

I'm not reusing createTime because the two kinds of timestamp are updated at different times.
createTime (for the internal scheduler) is updated during refreshIndex, recoverIndex, and updateIndexManualToAuto.
lastRefreshStartTime and lastRefreshCompleteTime (for manual refresh and the external scheduler) are updated only in refreshIndex.
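
The separation described above can be sketched as follows. `LogEntrySketch` and `RefreshTimestamps` are hypothetical stand-ins, not the real FlintMetadataLogEntry, and representing "never refreshed" as `None` is an assumption; the PR description does not say how the unset state is stored.

```scala
// Hypothetical stand-in for the relevant FlintMetadataLogEntry fields.
// Assumption: an unset refresh timestamp is modeled as None.
final case class LogEntrySketch(
    createTime: Long,
    lastRefreshStartTime: Option[Long],
    lastRefreshCompleteTime: Option[Long])

object RefreshTimestamps {

  /** Internal scheduler: the streaming job (re)starts in refreshIndex,
    * recoverIndex, or updateIndexManualToAuto, and only createTime moves. */
  def onStreamingJobStart(entry: LogEntrySketch, now: Long): LogEntrySketch =
    entry.copy(createTime = now)

  /** Manual refresh or external scheduler: stamped only from refreshIndex. */
  def onRefreshStart(entry: LogEntrySketch, now: Long): LogEntrySketch =
    entry.copy(lastRefreshStartTime = Some(now))

  /** Manual refresh or external scheduler: stamped when the refresh finishes. */
  def onRefreshComplete(entry: LogEntrySketch, now: Long): LogEntrySketch =
    entry.copy(lastRefreshCompleteTime = Some(now))
}
```

Keeping the two update paths apart means a streaming job restart never overwrites the record of the last bounded refresh, and a bounded refresh never changes the job start time.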

Related Issues

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…rch-project#744)

* write mock metadata cache data to mappings _meta

Signed-off-by: Sean Kao <[email protected]>

* Enable write to cache by default

Signed-off-by: Sean Kao <[email protected]>

* bugfix: _meta.latestId missing when create index

Signed-off-by: Sean Kao <[email protected]>

* set and unset config in test suite

Signed-off-by: Sean Kao <[email protected]>

* fix: use member flintSparkConf

Signed-off-by: Sean Kao <[email protected]>

---------

Signed-off-by: Sean Kao <[email protected]>
@seankao-az seankao-az self-assigned this Oct 24, 2024
@seankao-az seankao-az added the enhancement New feature or request label Oct 24, 2024
@seankao-az
Collaborator Author

Adding the label to backport this to the nexus branch.
To be clear, it shouldn't be backported to 0.5.
The "0.5-" part of the name 0.5-nexus is obsolete.

* Handles refresh for refresh mode AUTO, which is used exclusively by auto refresh index with
* internal scheduler.
*/
private def refreshIndexAuto(
Collaborator

Why don't we update for auto refresh?

Collaborator Author

@seankao-az seankao-az Oct 25, 2024

For now, we only track lastRefreshStartTime and lastRefreshCompleteTime for manual refresh and for auto refresh with the external scheduler.

For a streaming job, we use createTime to track the job's start time.
There's no mechanism yet for tracking the start and end time of each micro-batch update, so updating the two timestamps in this refresh path could be misleading.
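
As a rough illustration of that rule, the check could look like the sketch below. The `RefreshMode` values and `recordsRefreshTimestamps` are hypothetical names for this example. It also assumes that a refresh triggered by the external scheduler runs in a non-AUTO mode, which follows from the code comment above saying AUTO is used exclusively with the internal scheduler.

```scala
object RefreshTrackingSketch {
  // Assumption: three refresh modes, named after the ones mentioned in this PR.
  sealed trait RefreshMode
  case object Full extends RefreshMode
  case object Incremental extends RefreshMode
  case object Auto extends RefreshMode // streaming job run by the internal scheduler

  /** True when a refresh has a well-defined start and end worth recording. */
  def recordsRefreshTimestamps(mode: RefreshMode): Boolean = mode match {
    case Full | Incremental => true
    case Auto => false // no per-micro-batch tracking yet, so stamping would mislead
  }
}
```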

Collaborator Author

I'll add a comment explaining this.

@seankao-az
Collaborator Author

Note to reviewers, in case you're curious: the force push only amended commit 2f58f56 and changed nothing else.

Collaborator

@ykmr1224 ykmr1224 left a comment

Why do we call it a metadata cache? I'm not sure what "cache" is meant to indicate here.

@seankao-az
Collaborator Author

seankao-az commented Oct 25, 2024

I do welcome a better name... I struggled to come up with one, and I'm not convinced that MetadataCache is the best choice.
The main use case is custom index metadata and metadata log storage: in that setup, this metadata isn't available in the OpenSearch index, yet some frontend use cases need it without issuing a query to the backend that executes in Spark. I think of it as a read cache for those users, kept up to date by our dual writes.
