Write metadata cache data to mappings _meta with refresh time update #805

seankao-az · 2024-10-24T00:26:06Z

Description

Metadata Cache Writer

For the most part, same as

[0.5-nexus] Write mock metadata cache data to mappings _meta #744

In addition to the regular metadata storage using FlintIndexMetadataService, we're dual-writing additional fields, defined by FlintMetadataCache, to the index mappings _meta field. It's intended for frontend users to access some crucial metadata for an index quickly without invoking another backend API call.

This PR adds the following fields for all indexes when the Spark config spark.flint.metadataCacheWrite.enabled is set to true:

_meta.properties.metadataCacheVersion: "1.0"
_meta.properties.refreshInterval: Integer. Refresh interval of the index, in seconds. Added only if the index refresh type is auto refresh and refresh_interval is set.
_meta.properties.sourceTables: Array of Strings. Mocked data for now; an update is coming in a later PR.
_meta.properties.lastRefreshTime: Long. Timestamp, in milliseconds, of the last refresh. Added only if the index has been refreshed at least once.
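
To illustrate the field list above, here is a minimal sketch of how the optional fields could be left out of the properties map when they don't apply. The MetadataCache case class, the toProperties helper, and the table name in the usage note are hypothetical stand-ins for illustration, not the actual FlintMetadataCache code.

```scala
object MetadataCacheSketch {
  // Hypothetical stand-in for the cache model; field names follow the list above.
  final case class MetadataCache(
      metadataCacheVersion: String,
      refreshInterval: Option[Int], // seconds; set only for auto refresh with refresh_interval
      sourceTables: Seq[String],
      lastRefreshTime: Option[Long]) // epoch millis; set only after the first refresh

  // Builds the entries that would go under _meta.properties,
  // omitting optional fields that are not set.
  def toProperties(cache: MetadataCache): Map[String, Any] = {
    val required: Map[String, Any] = Map(
      "metadataCacheVersion" -> cache.metadataCacheVersion,
      "sourceTables" -> cache.sourceTables)
    required ++
      cache.refreshInterval.map(v => "refreshInterval" -> v).toList ++
      cache.lastRefreshTime.map(v => "lastRefreshTime" -> v).toList
  }
}
```

With this sketch, an index that has never been refreshed and has no refresh interval, e.g. `MetadataCache("1.0", None, Seq("catalog.db.table"), None)`, yields only metadataCacheVersion and sourceTables.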

Last Refresh Time

Added two new fields to FlintMetadataLogEntry and bumped the version of its JSON doc from 1.0 to 1.1 (because this adds new fields without changing existing ones):

lastRefreshStartTime: Long. Timestamp when last refresh started
lastRefreshCompleteTime: Long. Timestamp when last refresh completed
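
As a rough sketch of why adding fields only needs a 1.0 to 1.1 bump: a reader can fall back to a default when an older doc lacks the new fields. The LogEntry class, the fromFields helper, and the 0L default are assumptions made for illustration; the real FlintMetadataLogEntry has more fields and its own parsing.

```scala
object LogEntrySketch {
  // Hypothetical, trimmed-down model of a metadata log entry.
  final case class LogEntry(
      version: String,
      createTime: Long,
      lastRefreshStartTime: Long,
      lastRefreshCompleteTime: Long)

  val CurrentVersion = "1.1"

  // Assumed sentinel for "never refreshed"; the PR description does not state the default.
  val Unset = 0L

  // Reads timestamp fields from an already-parsed doc. A 1.0 doc has no
  // lastRefresh* fields, so they fall back to Unset instead of failing.
  def fromFields(fields: Map[String, Long]): LogEntry =
    LogEntry(
      version = CurrentVersion,
      createTime = fields.getOrElse("createTime", Unset),
      lastRefreshStartTime = fields.getOrElse("lastRefreshStartTime", Unset),
      lastRefreshCompleteTime = fields.getOrElse("lastRefreshCompleteTime", Unset))
}
```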

These are accurate only for manual refresh (full, incremental) and for auto refresh with the external scheduler.
For the internal scheduler, the jobStartTime (or createTime in FlintMetadataLogEntry) is used to track the streaming job start time.

I'm not reusing createTime because the two should be updated at different times.
createTime (for the internal scheduler) is updated during refreshIndex, recoverIndex, and updateIndexManualToAuto.
lastRefreshStartTime and lastRefreshCompleteTime (for manual refresh and the external scheduler) are updated only in refreshIndex.
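
The rule above can be sketched as a small dispatch on how the refresh is driven. The RefreshPath type, the Timestamps class, and afterRefresh are hypothetical names for illustration only; the actual logic lives in FlintSpark.refreshIndex and may be structured differently.

```scala
object RefreshTimestampSketch {
  // Hypothetical classification of how a refresh is driven.
  sealed trait RefreshPath
  case object Manual extends RefreshPath // full or incremental
  case object AutoExternalScheduler extends RefreshPath
  case object AutoInternalScheduler extends RefreshPath // long-running streaming job

  final case class Timestamps(
      createTime: Long,
      lastRefreshStartTime: Long,
      lastRefreshCompleteTime: Long)

  // Returns the timestamps to record after a refreshIndex call.
  def afterRefresh(
      path: RefreshPath,
      before: Timestamps,
      startedAt: Long,
      completedAt: Long): Timestamps =
    path match {
      case Manual | AutoExternalScheduler =>
        // Each refresh has a clear start and end, so both fields are meaningful.
        before.copy(lastRefreshStartTime = startedAt, lastRefreshCompleteTime = completedAt)
      case AutoInternalScheduler =>
        // Streaming job: only the job start is tracked, via createTime.
        before.copy(createTime = startedAt)
    }
}
```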

Related Issues

[FEATURE] Write metadata to index mappings _meta as read cache for frontend user to access #746

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…rch-project#744)

* write mock metadata cache data to mappings _meta
  Signed-off-by: Sean Kao <[email protected]>
* Enable write to cache by default
  Signed-off-by: Sean Kao <[email protected]>
* bugfix: _meta.latestId missing when create index
  Signed-off-by: Sean Kao <[email protected]>
* set and unset config in test suite
  Signed-off-by: Sean Kao <[email protected]>
* fix: use member flintSparkConf
  Signed-off-by: Sean Kao <[email protected]>

---------

Signed-off-by: Sean Kao <[email protected]>

Signed-off-by: Sean Kao <[email protected]>

seankao-az · 2024-10-24T20:18:05Z

Added the label to backport this to the nexus branch.
To be clear, it shouldn't be backported to 0.5.
The "0.5-" part of the name 0.5-nexus is obsolete.

Signed-off-by: Sean Kao <[email protected]>

ykmr1224 · 2024-10-25T21:59:32Z

flint-spark-integration/src/main/scala/org/opensearch/flint/spark/FlintSpark.scala

+   * Handles refresh for refresh mode AUTO, which is used exclusively by auto refresh index with
+   * internal scheduler.
+   */
+  private def refreshIndexAuto(


Why don't we update for auto refresh?

For now, we only track lastRefreshStartTime and lastRefreshCompleteTime for manual refresh and for auto refresh with the external scheduler.

For a streaming job, we use createTime to track the streaming job start time.
There's no mechanism yet for tracking the start/end time of each micro-batch update, so updating the two timestamps in the refresh could be misleading.

I'll add a comment.

...ration/src/main/scala/org/opensearch/flint/spark/scheduler/util/IntervalSchedulerParser.java

Signed-off-by: Sean Kao <[email protected]>

seankao-az · 2024-10-25T22:50:52Z

Note to any reviewer if curious, the force push only amended commit 2f58f56 and nothing else

ykmr1224

Why do we call it a metadata cache? I wasn't quite sure what "cache" indicates here.

seankao-az · 2024-10-25T23:39:01Z

I do welcome a better name... I was struggling to come up with one, and I'm not convinced that MetadataCache is the best.
The main use case is custom index metadata and metadata log storage: with those, this metadata isn't available in the OpenSearch index, and some frontend use cases need access to it without issuing a query to the backend that executes in Spark. I interpret it as a read cache for such users, kept up to date by our dual-writing.

seankao-az added 17 commits October 10, 2024 15:47

default metadata cache write disabled

38570ce

Signed-off-by: Sean Kao <[email protected]>

remove string literal "external" in index builder

3cea7ea

Signed-off-by: Sean Kao <[email protected]>

track refreshInterval and lastRefreshTime

3d2d095

Signed-off-by: Sean Kao <[email protected]>

add last refresh timestamps to metadata log entry

3e76497

Signed-off-by: Sean Kao <[email protected]>

update metadata cache test case: should pass

9acf7e5

Signed-off-by: Sean Kao <[email protected]>

move to spark package; get refresh interval

830cc2b

Signed-off-by: Sean Kao <[email protected]>

parse refresh interval

7354128

Signed-off-by: Sean Kao <[email protected]>

Merge branch 'main' into write-metadata-cache

cac0af7

Signed-off-by: Sean Kao <[email protected]>

minor syntax fix on FlintSpark.createIndex

83fbe5e

Signed-off-by: Sean Kao <[email protected]>

strategize cache writer interface

8873189

Signed-off-by: Sean Kao <[email protected]>

update refresh timestamps in FlintSpark

8e86912

Signed-off-by: Sean Kao <[email protected]>

add test cases

5b67e96

Signed-off-by: Sean Kao <[email protected]>

IT test for refresh timestamp update

77321fd

Signed-off-by: Sean Kao <[email protected]>

add doc for spark conf

69e675b

Signed-off-by: Sean Kao <[email protected]>

change mock table name

671d5f6

Signed-off-by: Sean Kao <[email protected]>

add IT test at FlintSpark level

681067b

Signed-off-by: Sean Kao <[email protected]>

seankao-az marked this pull request as ready for review October 24, 2024 18:16

seankao-az requested review from dai-chen, rupal-bq, vamsi-amazon, penghuo, anirudha, kaituo, YANG-DB and LantaoJin as code owners October 24, 2024 18:16

seankao-az added 2 commits October 24, 2024 11:16

Merge branch 'main' into write-metadata-cache

5ade12e

test with external scheduler

4823544

Signed-off-by: Sean Kao <[email protected]>

seankao-az self-assigned this Oct 24, 2024

seankao-az added the enhancement New feature or request label Oct 24, 2024

seankao-az added 0.6 backport 0.5-nexus labels Oct 24, 2024

refactor refreshIndex method; add test for modes

79b7b17

Signed-off-by: Sean Kao <[email protected]>

seankao-az requested review from mengweieric, noCharger and ykmr1224 as code owners October 25, 2024 00:01

seankao-az added 4 commits October 24, 2024 17:23

Merge branch 'main' into write-metadata-cache

941d08c

fix typo

b4a9b53

Signed-off-by: Sean Kao <[email protected]>

fix failed test caused by refactoring

2f58f56

Signed-off-by: Sean Kao <[email protected]>

Merge branch 'main' into write-metadata-cache

7a8e1f3

Signed-off-by: Sean Kao <[email protected]>

seankao-az force-pushed the write-metadata-cache branch from 5f3af3b to 7a8e1f3 Compare October 25, 2024 18:29

ykmr1224 reviewed Oct 25, 2024

rename method; add comment

d80d34d

Signed-off-by: Sean Kao <[email protected]>

ykmr1224 reviewed Oct 25, 2024
