Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Migrate to vise metrics #2

Merged
merged 15 commits into from
Sep 29, 2023
Merged

Conversation

slowli
Copy link
Contributor

@slowli slowli commented Sep 27, 2023

What ❔

Migrates crates in the repo to use vise metrics.

Why ❔

Uniform metrics handling throughout the codebase.

@slowli slowli force-pushed the aov-bft-333-migrate-to-vise-metrics branch from a727bac to e136bad Compare September 27, 2023 12:59
@@ -12,7 +12,7 @@ on:
env:
CARGO_TERM_COLOR: always
CARGO_INCREMENTAL: "0"
RUSTFLAGS: "-Dwarnings -C linker=clang -C link-arg=-fuse-ld=lld"
RUSTFLAGS: "-Dwarnings -C linker=clang -C link-arg=-fuse-ld=lld -C link-arg=-Wl,-z,nostart-stop-gc"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stupid question: What's the reason of using the ld.lld linker? It requires specifying this additional arg to circumvent an issue with the library that powers metrics registration. In the meanwhile, I'll think about potential registration alternatives.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make CI faster. We can switch to mold if that helps.

@@ -341,7 +341,7 @@
"uid": "ebcc7fad-20b5-44f0-a8ab-7ba2195ef2c0"
},
"editorMode": "code",
"expr": "avg(rate(network_rpc_message_size_sum{test_id=\"$test_id\",type=~\"req_sent|resp_sent\"}[1m])) by (type,method,submethod)",
"expr": "avg(rate(network_rpc_message_size_bytes_sum{test_id=\"$test_id\",type=~\"req_sent|resp_sent\"}[1m])) by (type,method,submethod)",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've edited this file manually, so not 100% sure if it works.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To verify that it works, you can upload this dashboard to grafana, rerun the loadtest from CI and see if this metrics is collected properly.

@@ -59,6 +58,8 @@ time = "0.3.23"
tokio = { version = "1.28.1", features = ["full"] }
tracing = { version = "0.1.37", features = ["attributes"] }
tracing-subscriber = { version = "0.3.16", features = ["env-filter", "fmt"] }
vise = { version = "0.1.0", git = "https://github.com/matter-labs/vise.git", rev = "8322ddc4bb115a7d11127626730b94f93b804cbe" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JFYI, having a non-whitelisted git source, which probably should fail cargo-deny, actually doesn't. Should I add the sources section to deny.toml, or would it be too restrictive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd, it should deny by default. Yes, please add.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, doesn't vise have tags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet; I can create one, but since the vise crates are still in active development, I'm not sure it makes sense ATM.

s.spawn_bg(async {
let addr = addr;
MetricsExporter::default()
.with_graceful_shutdown(ctx.canceled_owned()) // FIXME: support non-'static shutdown
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked by matter-labs/vise#12. IMO, the canceled_owned method could be useful in any case.

@slowli slowli marked this pull request as ready for review September 27, 2023 15:14
@slowli
Copy link
Contributor Author

slowli commented Sep 27, 2023

I'm fairly sure that the coverage job failing isn't related to my changes. Locally, I've got cargo nextest run hang sometimes on the consensus module tests as well (the same tests that hang up the coverage job).

Copy link
Member

@brunoffranca brunoffranca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Approved.

Comment on lines 23 to 51
/// Direction of a TCP connection.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, EncodeLabelSet, EncodeLabelValue)]
#[metrics(label = "direction", rename_all = "snake_case")]
pub(crate) enum Direction {
/// Inbound connection.
Inbound,
/// Outbound connection.
Outbound,
}

/// Collection of metrics. Implements prometheus::core::Collector,
/// it can be used to avoid embedding gauges in objects:
/// instead you can register a collector which will on demand gather
/// the metrics from the specificied object.
pub struct Collector<T> {
t: Weak<T>,
descs: Vec<prometheus::core::Desc>,
builders: Vec<Box<Fetcher<T>>>,
/// Metrics reported for TCP connections.
#[derive(Debug, Metrics)]
#[metrics(prefix = "concurrency_net_tcp")]
pub(crate) struct TcpMetrics {
/// Total bytes sent over all TCP connections.
#[metrics(unit = Unit::Bytes)]
pub(crate) sent: Counter,
/// Total bytes received over all TCP connections.
#[metrics(unit = Unit::Bytes)]
pub(crate) received: Counter,
/// TCP connections established since the process started.
pub(crate) established: Family<Direction, Counter>,
/// Number of currently active TCP connections.
pub(crate) active: Family<Direction, Gauge>,
}

impl<T> Collector<T> {
/// Constructs a new noop Collector.
pub fn new(t: Weak<T>) -> Self {
Self {
t,
descs: vec![],
builders: vec![],
}
}
/// Adds a gauge to the Collector.
/// `f` is expected to fetch the current gauge value.
pub fn gauge(
mut self,
name: &'static str,
help: &'static str,
f: impl 'static + Send + Sync + Fn(&T) -> f64,
) -> Self {
self.descs.push(
prometheus::core::Desc::new(name.to_string(), help.to_string(), vec![], HashMap::new())
.unwrap(),
);
self.builders.push(Box::new(move |t| {
let mut mf = prometheus::proto::MetricFamily::default();
mf.set_field_type(prometheus::proto::MetricType::GAUGE);
mf.set_name(name.to_string());
mf.set_help(help.to_string());
mf.mut_metric().push_default().mut_gauge().set_value(f(t));
mf
}));
self
}
/// TCP metrics instance.
#[vise::register]
pub(crate) static TCP_METRICS: vise::Global<TcpMetrics> = vise::Global::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this belong in the network crate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree; I've moved these metrics to the network crate. Can roll back if @pompon0 disagrees with this decision 🙃

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These metrics are not consensus-network specific. Also with this change you lose coupling between inbound and outbound tcp streams - you have to manually wrap each end into a MeteredStream and specify the direction.

If you want to consider these metrics as consensus-specific, please change the metric prefix accordingly to avoid confusion which TCP streams are accounted for in those metrics. Also please provide connect/accept calls to avoid specifying the direction manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...So @brunoffranca, should this change be reverted? I'm fine with either option at this point, slightly leaning towards specializing metrics.

Btw, by not being consensus-specific, do you mean potential use cases of the concurrency framework (e.g., in the server codebase eventually), or the current use cases?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean potential use cases.

@slowli slowli force-pushed the aov-bft-333-migrate-to-vise-metrics branch from 44b94c9 to e5685d3 Compare September 28, 2023 08:35
@slowli slowli requested a review from pompon0 September 29, 2023 12:14
Copy link
Collaborator

@pompon0 pompon0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@slowli slowli merged commit 68feb40 into main Sep 29, 2023
4 of 5 checks passed
@slowli slowli deleted the aov-bft-333-migrate-to-vise-metrics branch September 29, 2023 13:53
brunoffranca pushed a commit that referenced this pull request Oct 4, 2023
# What ❔

Migrates crates in the repo to use `vise` metrics.

## Why ❔

Uniform metrics handling throughout the codebase.
brunoffranca pushed a commit that referenced this pull request Oct 4, 2023
# What ❔

Migrates crates in the repo to use `vise` metrics.

## Why ❔

Uniform metrics handling throughout the codebase.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants