-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Migrate to vise
metrics
#2
Conversation
a727bac
to
e136bad
Compare
@@ -12,7 +12,7 @@ on: | |||
env: | |||
CARGO_TERM_COLOR: always | |||
CARGO_INCREMENTAL: "0" | |||
RUSTFLAGS: "-Dwarnings -C linker=clang -C link-arg=-fuse-ld=lld" | |||
RUSTFLAGS: "-Dwarnings -C linker=clang -C link-arg=-fuse-ld=lld -C link-arg=-Wl,-z,nostart-stop-gc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stupid question: What's the reason of using the ld.lld
linker? It requires specifying this additional arg to circumvent an issue with the library that powers metrics registration. In the meanwhile, I'll think about potential registration alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make CI faster. We can switch to mold if that helps.
@@ -341,7 +341,7 @@ | |||
"uid": "ebcc7fad-20b5-44f0-a8ab-7ba2195ef2c0" | |||
}, | |||
"editorMode": "code", | |||
"expr": "avg(rate(network_rpc_message_size_sum{test_id=\"$test_id\",type=~\"req_sent|resp_sent\"}[1m])) by (type,method,submethod)", | |||
"expr": "avg(rate(network_rpc_message_size_bytes_sum{test_id=\"$test_id\",type=~\"req_sent|resp_sent\"}[1m])) by (type,method,submethod)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've edited this file manually, so not 100% sure if it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To verify that it works, you can upload this dashboard to grafana, rerun the loadtest from CI and see if this metrics is collected properly.
@@ -59,6 +58,8 @@ time = "0.3.23" | |||
tokio = { version = "1.28.1", features = ["full"] } | |||
tracing = { version = "0.1.37", features = ["attributes"] } | |||
tracing-subscriber = { version = "0.3.16", features = ["env-filter", "fmt"] } | |||
vise = { version = "0.1.0", git = "https://github.com/matter-labs/vise.git", rev = "8322ddc4bb115a7d11127626730b94f93b804cbe" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JFYI, having a non-whitelisted git source, which probably should fail cargo-deny
, actually doesn't. Should I add the sources
section to deny.toml
, or would it be too restrictive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Odd, it should deny by default. Yes, please add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, doesn't vise have tags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet; I can create one, but since the vise
crates are still in active development, I'm not sure it makes sense ATM.
s.spawn_bg(async { | ||
let addr = addr; | ||
MetricsExporter::default() | ||
.with_graceful_shutdown(ctx.canceled_owned()) // FIXME: support non-'static shutdown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked by matter-labs/vise#12. IMO, the canceled_owned
method could be useful in any case.
I'm fairly sure that the coverage job failing isn't related to my changes. Locally, I've got |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Approved.
node/libs/concurrency/src/metrics.rs
Outdated
/// Direction of a TCP connection. | ||
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, EncodeLabelSet, EncodeLabelValue)] | ||
#[metrics(label = "direction", rename_all = "snake_case")] | ||
pub(crate) enum Direction { | ||
/// Inbound connection. | ||
Inbound, | ||
/// Outbound connection. | ||
Outbound, | ||
} | ||
|
||
/// Collection of metrics. Implements prometheus::core::Collector, | ||
/// it can be used to avoid embedding gauges in objects: | ||
/// instead you can register a collector which will on demand gather | ||
/// the metrics from the specificied object. | ||
pub struct Collector<T> { | ||
t: Weak<T>, | ||
descs: Vec<prometheus::core::Desc>, | ||
builders: Vec<Box<Fetcher<T>>>, | ||
/// Metrics reported for TCP connections. | ||
#[derive(Debug, Metrics)] | ||
#[metrics(prefix = "concurrency_net_tcp")] | ||
pub(crate) struct TcpMetrics { | ||
/// Total bytes sent over all TCP connections. | ||
#[metrics(unit = Unit::Bytes)] | ||
pub(crate) sent: Counter, | ||
/// Total bytes received over all TCP connections. | ||
#[metrics(unit = Unit::Bytes)] | ||
pub(crate) received: Counter, | ||
/// TCP connections established since the process started. | ||
pub(crate) established: Family<Direction, Counter>, | ||
/// Number of currently active TCP connections. | ||
pub(crate) active: Family<Direction, Gauge>, | ||
} | ||
|
||
impl<T> Collector<T> { | ||
/// Constructs a new noop Collector. | ||
pub fn new(t: Weak<T>) -> Self { | ||
Self { | ||
t, | ||
descs: vec![], | ||
builders: vec![], | ||
} | ||
} | ||
/// Adds a gauge to the Collector. | ||
/// `f` is expected to fetch the current gauge value. | ||
pub fn gauge( | ||
mut self, | ||
name: &'static str, | ||
help: &'static str, | ||
f: impl 'static + Send + Sync + Fn(&T) -> f64, | ||
) -> Self { | ||
self.descs.push( | ||
prometheus::core::Desc::new(name.to_string(), help.to_string(), vec![], HashMap::new()) | ||
.unwrap(), | ||
); | ||
self.builders.push(Box::new(move |t| { | ||
let mut mf = prometheus::proto::MetricFamily::default(); | ||
mf.set_field_type(prometheus::proto::MetricType::GAUGE); | ||
mf.set_name(name.to_string()); | ||
mf.set_help(help.to_string()); | ||
mf.mut_metric().push_default().mut_gauge().set_value(f(t)); | ||
mf | ||
})); | ||
self | ||
} | ||
/// TCP metrics instance. | ||
#[vise::register] | ||
pub(crate) static TCP_METRICS: vise::Global<TcpMetrics> = vise::Global::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this belong in the network crate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree; I've moved these metrics to the network
crate. Can roll back if @pompon0 disagrees with this decision 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metrics are not consensus-network specific. Also with this change you lose coupling between inbound and outbound tcp streams - you have to manually wrap each end into a MeteredStream and specify the direction.
If you want to consider these metrics as consensus-specific, please change the metric prefix accordingly to avoid confusion which TCP streams are accounted for in those metrics. Also please provide connect/accept calls to avoid specifying the direction manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...So @brunoffranca, should this change be reverted? I'm fine with either option at this point, slightly leaning towards specializing metrics.
Btw, by not being consensus-specific, do you mean potential use cases of the concurrency framework (e.g., in the server codebase eventually), or the current use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean potential use cases.
44b94c9
to
e5685d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What ❔
Migrates crates in the repo to use
vise
metrics.Why ❔
Uniform metrics handling throughout the codebase.