Replies: 7 comments 9 replies
-
Overall, I'm a big fan of the approach! A few scattered opinions:
I think in the managed service they'd be named

Not to get too inception-y, but I wonder a bit about AuthZ of the individual log / metric documents (where I'm not allowed to even know metrics of some other team's stuff). Definitely a "not now," but there's a "... but" there. Document-level and even document-location-level access control are big question marks to me. Maybe the former is just a special case of the latter?
-
Here's another crack at collection schemas, taking Johnny's feedback into account.
My proposal for the tenant name is that we hard-code the tenant name of
An alternative that I've been thinking about is partition-level AuthZ. If Flow's AuthZ model allowed for permissions on a per-partition basis, then it would allow for logs and metrics to be permissioned on a per-collection basis (user A can only view the
-
Yeah, agreed, and this is why I had initially suggested adding
-
One other question on metrics: do we actually want to represent `type: COUNT` metrics as deltas, rather than absolutes? Prometheus chose lifetime counts because it's 1) pull-based, and 2) at-least-once. We're push-based, exactly-once, and have a built-in notion of reductions. If the collection is keyed on the full metric name, then a simple materialization is already a total count. Keeping metrics as deltas also makes it easier to express roll-ups when aggregating across metrics, because each delta tells you exactly how much count mass to add to the roll-up, whereas a total count requires that you keep a register.
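To make the trade-off concrete, here's a toy sketch (plain Python, nothing Flow-specific; the function names and event shapes are made up for illustration) contrasting roll-ups over delta counters versus lifetime totals:

```python
# Hypothetical sketch: rolling up a COUNT metric across series.
# Deltas need no per-series state; lifetime totals need a register.
from collections import defaultdict


def rollup_deltas(events):
    """Each event carries exactly how much count mass to add, so a
    roll-up across series is a plain, stateless sum."""
    total = 0
    for _series, delta in events:
        total += delta
    return total


def rollup_totals(events):
    """With lifetime totals, the roll-up must keep a register: the last
    seen total per series, to recover each event's contribution."""
    last = defaultdict(int)
    total = 0
    for series, count in events:
        total += count - last[series]
        last[series] = count
    return total


# The same history, expressed both ways:
as_deltas = [("a", 2), ("b", 5), ("a", 3)]
as_totals = [("a", 2), ("b", 5), ("a", 5)]  # cumulative per series
assert rollup_deltas(as_deltas) == 10
assert rollup_totals(as_totals) == 10
```

The register in `rollup_totals` is exactly the extra state the comment above alludes to; with deltas, a keyed collection plus a sum reduction gets the same answer for free.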
-
Update on my thinking regarding error reporting: the initial idea was for stats to have an

If stats are fundamentally a record of only committed transactions, then a trivial materialization of stats documents gives you correct statistics about the data that is actually in your collection. If we keep
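The "trivial materialization" point can be illustrated with a toy sketch (plain Python, not Flow's actual stats schema; the field names `task`, `docsWritten`, and `bytesWritten` are invented here): if every stats document describes exactly one committed transaction, then materializing stats is just a per-task sum, and the totals necessarily agree with the data that actually committed.

```python
# Hypothetical sketch: summing committed-transaction stats per task.
def materialize_stats(stats_docs):
    """Reduce stats documents into per-task totals."""
    totals = {}
    for doc in stats_docs:
        agg = totals.setdefault(doc["task"], {"docsWritten": 0, "bytesWritten": 0})
        agg["docsWritten"] += doc["docsWritten"]
        agg["bytesWritten"] += doc["bytesWritten"]
    return totals


stats = [
    {"task": "acme/capture", "docsWritten": 10, "bytesWritten": 2048},
    {"task": "acme/capture", "docsWritten": 5, "bytesWritten": 1024},
]
assert materialize_stats(stats)["acme/capture"] == {
    "docsWritten": 15,
    "bytesWritten": 3072,
}
```

If stats instead mixed in failed or retried transactions, this simple sum would over-count, which is the motivation for scoping stats to committed transactions only.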
-
Logs are fundamentally scoped to a task. For captures and materializations, it's pretty easy to capture the stderr of the driver and associate it with the specific task. But for derivations this isn't possible, because there's a single nodejs process that's shared by all derivations with the same commons id. That's a bit of a bummer, because I'd really like to have any

One idea would be to just forget about capturing logs from nodejs and focus on capturing them from deno instead, whenever we get around to supporting it. My hunch is that there won't be a compelling reason to support both long-term anyway, and that deno will simply replace the node runtime altogether. If that turns out to be the case, then this option would seem like a pretty decent idea.

Another option is to spawn separate nodejs processes for each derivation task. This might be worthwhile if we don't expect to add deno support anytime soon.

My own disposition at this point is to plan on supporting deno and see where that gets us. I'm curious if anyone else has thoughts on this.
-
The initial logging PR is now merged. With that, the Flow runtime now publishes logs into a separate collection per tenant. The tenant name is the first path component of a task, so a task named

Whenever a task term is initialized, the runtime will create a

The takeaway for developers is that you should always use the
-
The basic idea is for Flow to expose collection-related metrics and metadata as a Flow collection, rather than only via prometheus metrics. This meta-collection would be created and have documents added to it automatically. The next thing we need to decide is what information should be provided, and how it should be organized into automatic collection(s). What follows is an initial proposal to serve as a starting point for the conversation.
Here's a proposed JSON schema of the collection:
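(The schema attachment didn't survive this export. A rough sketch consistent with the fields discussed below — with property types and enum values guessed for illustration — might look something like:)

```json
{
  "type": "object",
  "properties": {
    "eventType": { "type": "string" },
    "taskType": { "type": "string", "enum": ["capture", "derivation", "materialization"] },
    "name": { "type": "string" },
    "shard": { "type": "string" },
    "ts": { "type": "string", "format": "date-time" }
  },
  "required": ["eventType", "taskType", "name", "shard", "ts"]
}
```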
There are a few different ideas worth discussing:

- This schema models things as if all the events were in a single collection that's logically partitioned on `[/eventType, /taskType, /name]`. The key of the collection would include those fields, and also `/shard` and `/ts`. The thought here is that a logically partitioned collection is essentially as easy to work with as separate collections, and that it feels reasonable to cram everything into one schema. It might not feel so reasonable if we think of a lot more fields we'd want to add, though.
- The shard id is essentially decomposed into separate fields for taskType, name, and the key/rclock ranges. The thought there is that it would be common for users to want to read events related to only captures, for example, and representing them as separate fields would make that easier because we could logically partition on `taskType` and `name`. But doing things that way requires the shard key ranges to be represented separately so that you can disambiguate events that come from different shards, and having `"shard": "00000000-00000000"` feels maybe a little weird. Perhaps making shard more of an opaque id would be better?
- One potential source of bloat here would be error types. I think we can keep this pretty minimal, though. The idea is that the only errors that are really actionable are ones that arise from some code that the user potentially wrote (connectors or derivations), or schema validation errors. Everything else seems like our problem, and so probably not something we should expose to the user.
- In terms of how all this gets wired up, I like the idea of there being essentially one meta-collection per tenant that gets created automatically. But I think this might get a little weird because I don't know what we'd name it. If we hard-code a name like `flow/metrics`, then it violates the assumption that collections exist in a global namespace. Do we need to add a top-level field to our yaml for the user to specify a tenant name, so we can use `<tenant-name>/flow/metrics`?

Edited to update the schema since I realized the `read` and `write` fields didn't actually make any sense 🙃