Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: improve aggregator event queue efficiency #457

Open
wants to merge 74 commits into
base: main
Choose a base branch
from

Conversation

h33p
Copy link
Contributor

@h33p h33p commented Aug 17, 2023

  • Implement queue drain pressure that is based on channel's capacity. Pressure usually leads to the system reaching equilibrium without piling the events up.
  • Allow configuring channel's target fill capacity, before flushing.
  • Lower event buffer capacity, because the current value let the buffer grow into gigabytes in size.

Details

This reverts #235, because I found that limit being unreasonably high - adding the following code would lead to tens of gigabytes of size events being piled up to the event queue:

    let spam = tokio::spawn(async {
        loop {
            let _ = tokio::spawn(async {
                futures::future::join_all((0..100).map(|_| {
                    tokio::spawn(async {
                        let now = tokio::time::Instant::now();
                        while now.elapsed() < tokio::time::Duration::from_secs(1) {
                            tokio::task::yield_now().await;
                        }
                    })
                }))
                .await;
            })
            .await;
        }
    });

The problem I found in testing is that the aggregator task would consistently fall behind the number of events being created and eventually the event queue would fill up, regardless of its size (increasing the size merely delays the problem). The reason why it fills up is rather subtle recv().now_or_never() eventually returns Poll::Pending due to task coop budget being used up. My immediate fix was to wrap the future with tokio::task::unconstrained, but I figured there's a reason why it's not doing try_recv() and that is to avoid livelocking the task.

What this PR does is expose the "fullness" level of the queue to the shared vars. This allows the aggregator task to have a secondary timer, alongside regular ticking, that adds pressure to clear out the event queue at the same rate as it's filling up.

Another important thing to note is that event queue filling up faster than it's being processed will lead to time dilation - the rate of events in the console would appear slower than it actually is. Adding pressure eventually leads to an equilibrium where events are processed at roughly equal rate.

One key change was adding a customizable variable event_buffer_flush_fraction - the higher the fraction the sooner this equilibrium will be reached. I set it to 10 for now (before the effective value was 2), which, with the reduction of event buffer capacity, should lead to 50x+ faster reaction by the aggregator task, while keeping 9x extra space for random spikes. Note that this variable may be changed to something slightly easier to understand, like event_buffer_flush_percentage, but that's just a user-facing detail I have no strong opinion about.

I believe this should fix #184, but I haven't done an actual long-term test in production, will see in a couple days. EDIT: it doesn't seem to fix it

hawkw and others added 30 commits August 5, 2022 12:36
This branch updates the console crates' dependency on `tonic` to v0.8.0
and `prost` to 0.11.0.

In addition, I've added a [`cargo xtask`][xtask] command for manually
regenerating the generated protobuf bindings.

This is necessary as the current approach, regenerating the bindings in
an integration test, does not work when the protos fail to compile
(which they do after the Tonic update). Since running the crate's tests
requires compiling the crate, if the proto bindings don't compile, we
can't re-run the test.

[xtask]: https://github.com/matklad/cargo-xtask
This changes the console CLI's error handling so that GitHub issues are
only suggested for panics, not for recoverable errors (such as "no
config file found", "couldn't connect to remote host", etc).
This way, if running a subcommand panics/errors, we still get nice
`color-eyre` reports.
This commit adds `color_eyre`'s default backtrace frame filters, so we
skip stuff that's not relevant.
…rs#365)

This adds a dump of all the console's config options to the issue
metadata for GitHub issues generated using `color_eyre`'s GitHub issue
generation on panics.

<details>

<summary>Example issue Markdown:</summary>

## Error
```
lol
```

## Metadata
|key|value|
|--|--|
|**version**|0.1.6|
|**config.subcmd**|`None`|
|**config.target_addr**|`Some(http://127.0.0.1:6669/)`|
|**config.env_filter**|`None`|
|**config.log_directory**|`Some("/tmp/tokio-console/logs")`|
|**config.retain_for**|`None`|
|**config.view_options.no_colors**|`false`|
|**config.view_options.lang**|`Some("en_US.UTF-8")`|
|**config.view_options.ascii_only**|`Some(false)`|
|**config.view_options.truecolor**|`Some(true)`|
|**config.view_options.palette**|`Some(All)`|
|**config.view_options.toggles.color_durations**|`Some(false)`|
|**config.view_options.toggles.color_terminated**|`Some(false)`|
|**location**|tokio-console/src/main.rs:36:5|

</summary>
<a name="0.4.0"></a>
## 0.4.0 (2022-08-10)

#### Breaking Changes

*  Update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f))

#### Features

*  Update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f))
<a name="0.1.7"></a>
## 0.1.7 (2022-08-10)

#### Features

*  Update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f))
*  Update `console-api` to `0.4` (tokio-rs#364) ([40e2f6f](40e2f6f))
<a name="0.1.7"></a>
## 0.1.7 (2022-08-10)

#### Features

*  include config options in autogenerated issues (tokio-rs#365)
   ([fcb54df](fcb54df))
*  filter out boring frames in backtraces (tokio-rs#365) ([523a44a](523a44a))
*  init error handling before subcmds (tokio-rs#365) ([6646568](6646568))
*  only suggest opening issues for panics (tokio-rs#365) ([23cb6bf](23cb6bf))
*  update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f))
*  update `console-api` to `0.4` (tokio-rs#364) ([40e2f6f](40e2f6f))
Needed for tokio-rs#374.

This configures clippy to ignore most of the generated code in
`console-api`.
Due to a change in the unstable task builder APIs, this no longer
compiles with the latest version of Tokio. Fortunately, it's a simple
fix.
<a name="0.1.7"></a>
## 0.1.8 (2022-09-04)

#### Bug Fixes

*  fix build on tokio 1.21.0 (tokio-rs#374) ([0106407](0106407))
Fixes: tokio-rs#372

`--ascii-only true` shows ascii
`--ascii-only false` shows emojis

if `--ascii-only` is not passed, default value will be true, shows emojis
)

Currently, the documented `cargo-run` in README no longer works, as now
there are two binaries:

```console
$ cargo run
error: `cargo run` could not determine which binary to run. Use the `--bin` option t
o specify a binary, or the `default-run` manifest key.
available binaries: tokio-console, xtask
```

This branch declares the `tokio-console` binary as the [`default-run`]
value in the `Cargo.toml`, so it is now run by default by `cargo run`.

Alternatively, we can just update the README line with `cargo run --bin
tokio-console`

[`default-run`]:
    (https://doc.rust-lang.org/cargo/reference/manifest.html#the-default-run-field)
Found via these commands:

    codespell .
    markdownlint *.md --disable MD013
…o-rs#387)

Currently, the help text for the `r` and `t` keystrokes to switch to the
Resources and Tasks views is shown on the Task Details and Resource
Details views, as well as on the Task List and Resource List views.

See this screenshot for an example:
![image](https://user-images.githubusercontent.com/2345750/202058962-4a8f062c-78fd-47eb-a3b1-5de470976aa8.png)

However, the keystrokes are not actually handled while the console is
showing a details view.

This branch changes the console to handle the `r` and `t` keystrokes on
all views. This also simplifies the keyboard input code somewhat.
Build and add to the released binaries for `aarch64-unknown-linux-gnu`,
`aarch64-apple-darwin` and `aarch64-pc-windows-msvc`
Hey I noticed this out of date in my `Cargo.lock`, and since I didn't
see any PRs/issues open doing so already, figured I'd open one!
Signed-off-by: hi-rustin <[email protected]>

Signed-off-by: hi-rustin <[email protected]>
The durations in the tokio-console UI are shown with a unit, so the
digits after the decimal separator are generally not relevant. Since
space is at a premium in a terminal UI, it makes sense to cut down where
possible.

This change removes all digits after the decimal separator in the tasks
table view. In the task detail view, 2 decimal places are kept.

Additionally, 4 new duration formats are added to represent
minutes-with-secondsi, hours-with-minutes, days-with-hours and days.
These are only applied once the leading unit is greater than zero.

For example, 59 seconds will be shown as seconds: `99s` and then 60
seconds will be shown as minutes-with-seconds: `1m00s`. New colors have
been chosen for each format.

NOTE: I had to make a small unrelated change in
`task::Details::make_percentiles_widget` to make clippy happy.

Fixes: tokio-rs#224
…kio-rs#404)

The `color-eyre` crate was previously on version 0.5. This version
depends on `tracing-subscriber` 0.2, meaning that we had 2 different
veresions of `tracing-subscriber` in our dependency tree.

This change updates `color-eyre` to 0.6, which depends on
`tracing-subscriber` 0.3, the same as `console-subscriber`.
Tokio Console generates its own sequential Id for internal tracking and
indexing of objects (tasks, resources, etc.). However, this Id will be
recreated if Console is restarted.

In order to provide more useful information to the user, the task Id
generated by Tokio can be used in the task list and task details screens
instead. If used in this way, the ID field in the task list and task
detail views will be stable across restarts of Console (assuming the
monitored application is not restarted).

This also frees up horizontal space, as the `task.id` attribute
doesn't need to be displayed separately.

The disadvantage of using Tokio's task Id is that it is not guaranteed
to be present by the type system.

To avoid problems with missing task Ids, the Tokio task Id is store in
addition to the `span::Id` (used to communicate with the
`console-subscriber`) and the sequential console `Id` (used in the
`store`). If a task is missing the `task.id` field for whatever reason
it will still appear, but with an empty ID. If multiple runtimes are
active, then duplicate ID values will appear.

Fixes: tokio-rs#385

Co-authored-by: Eliza Weisman <[email protected]>
Add support for console connections that use Unix domain sockets rather
than TCP.

Closes tokio-rs#296.

Co-authored-by: Eliza Weisman <[email protected]>
The Console API specifies sending task busy duration only for completed
polls, it doesn't include the time spent in the current poll if the task
is active.

Tokio Console then calculates the busy time including the time spent in
the current poll - based on the last poll start and poll end times sent
by the Console Subscriber.

However, there was an error in the logic which determined when a task is
being polled for the purpose of calculating the busy time. The logic
only considered the first poll, when there was no recorded end poll time
at all.

This change adapts the logic so that it also considers the case where
the last recorded poll start is later than the last recorded poll end.
This indicates that the task is being polled.

Co-authored-by: Eliza Weisman <[email protected]>
…s#408)

There are 2 widgets which display the poll times for a task in the
detail view. The poll times percentiles are always displayed and if
UTF-8 is available, then a sparkline histogram is also shown to the
right.

The logic for displaying these two widgets is quite long and is
currently interspersed within the `render` function for the task detail
view plus helper functions. Additionally, it is not easy to add a second
set of widgets showing the time between waking and being polled for a
task which is planned for tokio-rs#409.

This change factors out that logic into separate widgets.

There was already a separate widget `MiniHistogram`. Some of the logic
that was previously in the task detail view has been moved here.

A new widget `Percentiles` has been added to encapsulate the logic for
preparing and displaying the percentiles.

A top level `Durations` widget occupies the entire width of the task
detail view and control the horizontal layout of the `Percentiles` and
`MiniHistogram` widgets. The new widget will also supress the histogram
if there isn't at least enough room to display the legend at the bottom
Each task displays the sum of the time it has been idle and busy, as
well as the total. The idle time includes the time between when a task
is woken, and when the runtime actually polls that task.

There are cases where a task may be scheduled for a long time after
being woken, before it is polled. Especially if many tasks are woken at
the same time and don't yield back to the runtime quickly.

To add visibility to this, the total time that a task is scheduled
before being polled has been added. Additionally, a new task state
`Scheduled` has been added. This is displayed in both the tasks table
and in the task detail view.

In the `console-api`, the total `scheduled_time` for the task has been added to
the `TaskStats` message in `tasks.proto`.

This is the first of two parts. In the second part (tokio-rs#409), a histogram for
scheduled time will be added, the equivalent of the poll time histogram
which is already available on the task detail screen.

To show a pathological case which may lead to needing to see the
scheduled time, a new example has been added to the `console-subscriber`

## PR Notes

This PR does something adjacent to what is described in tokio-rs#50, but not quite.

The unicode character used for a `SCHED` task is ⏫.

The second PR (tokio-rs#409) will record a scheduled time histogram the same as it
recorded for poll times. These two changes should go in together (and certainly
shouldn't be released separately). However, this PR is already quite big, so they'll
be separated out.

The idea is that this PR isn't merged until the next one has also been reviewed
and approved. It would be good to get some feedback at this stage though.

The task list view with the new column for `Sched` time:

<img width="1032" alt="a tasks table view for the long-scheduled example" src="https://user-images.githubusercontent.com/89589/232456977-2921f884-4673-420f-ba4f-3646627d44db.png">

The `Task` block in the task detail view showing the new `Scheduled` time entry.

<img width="510" alt="The task block on the task detail view for the rx task in the long-scheduled example" src="https://user-images.githubusercontent.com/89589/232457332-e455e086-9468-42c9-8fda-7965d8d1e6f8.png">
hds and others added 22 commits June 1, 2023 22:16
Self-wakes were not being detected and displayed in the console. The
`burn` task in the `app` example - which deliberately has many
self-wakes - was not registering any.

It appears that a logic error was present in the self-wake counting in
`console-subscriber` since it was added in tokio-rs#238. When a self wake was
detected, the `wakes` count was incremented a second time (the `wakes`
count is incremented for all wakes before checking for a self wake),
instead of increamenting the `self_wakes` count.

This PR fixes the logic so that when a self wake is detected, the
`self_wakes` count is incremented.
Each view in tokio-console has a widget up the top that lists the
available controls for that view. There was a common implementation of
this for table based views (tasks, resources, and async_ops) and
separate implementations for the task and resource views. The resource
view included two controls widgets, one for the Resource details at the
top of the view, and another above the table of Async Ops at the bottom
of the view.

This change centralises the logic for the creation of this controls
widget. This change is mostly a precursor to also displaying the
controls in the help view (at which point we can revisit whether the
entire list needs to be shown at the top of the screen).

Controls (an action and the key or keys used to invoke it) are defined
in structs so that their definition can be separated from the display
logic (which includes whether or not UTF-8 is supported).

This allows the problem of the text in the controls widget wrapping in
the middle of a control definition to be fixed. Previously a subset of
the controls would have wrapped like this:

```text
controls: select column (sort) = ←→ or h, l, scroll = ↑↓ or k, j, view details
= ↵, invert sort (highest/lowest) = i,
```

Notice how "view details = ↵," was split across multiple lines. The same
list of controls will now wrap at a full control definition.

```text
controls: select column (sort) = ←→ or h, l, scroll = ↑↓ or k, j,
view details = ↵, invert sort (highest/lowest) = i,
```

Additionally, the list of controls on the Resource view has been
consolidated up the top of the screen.

Universal controls, those that are available in all views, are also
defined centrally. As well as the quit action, using the space bar to
pause has been added to that list. This was previously somewhat of an
undocumented feature.
We often receive questions about the meaning of certain columns in the
different views in Tokio Console.

This change adds additional documentation to the `tokio-console` README
- which is also used in the docs.rs main page - describing each column
in the tasks, resources, and async ops tables. Where there are a fixed
set of values possible in the column, those values are also described.

Additionally, the 4 views are enumerated and separated into sections to
aid readability and allow deep linking.

Co-authored-by: Eliza Weisman <[email protected]>
…o-rs#443)

* fix(console): remove trailing space from task/resource location

When a file location is formatted, an extra space is added at the end.
This appears to be the result of some refactoring from a case where we
needed the extra space at the end.

Currently, the extra space results in there being 2 spaces between the
Location column and the Fields column in the tasks list view, and the
same between the Location column and the Attributes column in the
resources list view.

It is also causing issues when attempting to truncate the first part of
the location in tokio-rs#441 (during which this extra space was discovered).

This change removes the trailing space.
…o-rs#440)

When entering and exiting a span the old code was also updating the parent
stats. This was causing excessive polling being reported for the parent tasks.
See issue tokio-rs#378 for more details. The regression was introduced by the refactor
in tokio-rs#238.
This fixes the issue by limiting updates to the current span.

Closes tokio-rs#378

Co-authored-by: Hayden Stainsby <[email protected]>
Co-authored-by: Eliza Weisman <[email protected]>
This closes tokio-rs#411.

This is how it looks like.

<img width="514" alt="Screenshot 2023-07-03 at 12 22 24 PM" src="https://github.com/tokio-rs/console/assets/41150432/878a43e6-1512-49c0-85fd-0e245cc6ee04">

Co-authored-by: Hayden Stainsby <[email protected]>
…-rs#453)

This PR adds an explicit dependency to futures-core for `console-api` to
fix broken builds. Without this fix I get the following errors:

```
error[E0433]: failed to resolve: use of undeclared crate or module `futures_core`
   --> /home/weiznich/.cargo/registry/src/index.crates.io-6f17d22bba15001f/console-api-0.5.0/src/generated/rs.tokio.console.instrument.rs:275:34
    |
275 |         type WatchUpdatesStream: futures_core::Stream<
    |                                  ^^^^^^^^^^^^ use of undeclared crate or module `futures_core`

error[E0433]: failed to resolve: use of undeclared crate or module `futures_core`
   --> /home/weiznich/.cargo/registry/src/index.crates.io-6f17d22bba15001f/console-api-0.5.0/src/generated/rs.tokio.console.instrument.rs:289:38
    |
289 |         type WatchTaskDetailsStream: futures_core::Stream<
    |                                      ^^^^^^^^^^^^ use of undeclared crate or module `futures_core`

error[E0433]: failed to resolve: use of undeclared crate or module `futures_core`
   --> /home/weiznich/.cargo/registry/src/index.crates.io-6f17d22bba15001f/console-api-0.5.0/src/generated/rs.tokio.console.trace.rs:212:27
    |
212 |         type WatchStream: futures_core::Stream<
    |                           ^^^^^^^^^^^^ use of undeclared crate or module `futures_core`

```

The somewhat funny thing is that only happened after I've updated some
unrelated dependencies. It worked before, but after looking at the
relevant code I'm really surprised that it ever compiled without error.
The `ConsoleLayer` builder provides the user with a console layer and
a server, which is used to start the gRPC server.

However, it may be desireable to expose the instrumentation server together
with other services on the same Tonic router. This was requested
explicitly in tokio-rs#449 428.

Additionally, to add tests which make use of the instrumentation server
(as part of improving test coverage for tokio-rs#450), more flexibility is
needed than what is provided by the current API. Specifically we would
like to connect a client and server via an in memory channel, rather
than a TCP connection.

This change adds an additional method to `console_subscriber::Server`
called `into_parts` which allows the user to access the
`InstrumentServer` directly. The `Aggregator` is also returned and
must be set to run for at least as long as the instrument server. This
allows the aggregator to be spawned wherever the user wishes.

To facilitate the addition of functionality which would result in more
"parts" in the future, `into_parts` returns a non-exhaustive struct,
rather than a tuple of parts.

Closes: tokio-rs#428
…okio-rs#449)

Now it is possible while navigate in `async_ops_table` to go directly to
details of the task that you are pointing on the list.

This closes tokio-rs#448
The current logic present in `IdData::drop_closed` marks an item (task,
resource, and async op stats) to be dropped in the case that the item
**is** dirty and there **are** watchers: `(dirty && has_watchers)`.

This causes a case where if an item is first received and then completes
in between the aggregator push cycle, it will be discarded immediately
and never sent.

This logic has been in place since the concepts of watchers and dirty
items was introduced in tokio-rs#77. However since an item that is created and
then dropped within a single update cycle isn't likely to be missed in
the UI, it may never have been noticed.

Instead the logic should be to **retain** an item if **any** of the
following is true:
* there are watchers and the item is dirty: `(dirty && has_watchers)`
* item has been dropped less time than the retention period:
  `dropped_for <= retention`.
I just came across this typo in the Readme when using `tokio-console` for the first time. 
This is a great tool 👍
- Implement queue drain pressure that is based on channel's capacity.
  Pressure usually leads to the system reaching equilibrium without
  piling the events up.
- Allow configuring channel's target fill capacity, before flushing.
- Lower event buffer capacity, because the current value let the buffer
  grow into gigabytes in size.
@h33p h33p requested a review from a team as a code owner August 17, 2023 11:54
@hawkw
Copy link
Member

hawkw commented Aug 17, 2023

Thank you for working on this! The current queue sizes and other parameters were chosen pretty arbitrarily in the interest of "just getting something working", so I had been hoping to improve the queue performance for some time. I'm excited to give this PR a review!

console-subscriber/src/builder.rs Outdated Show resolved Hide resolved
console-subscriber/src/builder.rs Outdated Show resolved Hide resolved
console-subscriber/src/builder.rs Outdated Show resolved Hide resolved
console-subscriber/src/lib.rs Outdated Show resolved Hide resolved
console-subscriber/src/aggregator/mod.rs Outdated Show resolved Hide resolved
Comment on lines 174 to 175
let fill_percentage = self.shared.fill_percentage.load(Relaxed) as u32;
let sleep = tokio::time::sleep(self.publish_interval * (100 - fill_percentage) / 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this actually flush more frequently than the default publish interval? it seems like if the fill percentage is 100, this will be the same as the publish interval, otherwise it will be greater. is that the intention here, or is this supposed to fire more frequently than the publish interval when the queue is more full?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If fill percentage is 100, then the interval will be publish_interval * 0, meaning there will not be any real sleeping happening.

@h33p
Copy link
Contributor Author

h33p commented Aug 29, 2023

@hawkw I have pushed requested changes (can squash it later, but I think it's easier to see changes during review)

@h33p h33p requested a review from hawkw September 4, 2023 06:53
@hds
Copy link
Collaborator

hds commented Sep 13, 2023

@h33p Just letting you know that we haven't forgotten this PR!

I've got a test case that I'd like to run your changes against to compare to the current behaviour. It pretty easily produces a situation where the channel between the subscriber and the aggregator grows forever (at least on my local computer using a single thread).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Continuous memory leak with console_subscriber