Better handling of outliers #273
Comments
By "visualization", do you mean the image output? I do understand the problem, and it's there in both outputs, but it differs a little in nature: the image output has more resolution available, so in some cases it may not feel quite as useless when there are outliers. The image output could also better support logarithmic scaling, since the scale is visible, unlike in the text-based outputs. Having some sort of cut-off (with a + sign to indicate that something exceeded the chart) could be doable in both outputs, but that could then result in the maximum value not being visible.
I'm aware of this issue, as I see it from time to time myself in at least the image outputs, but so far it hasn't annoyed me enough to make me stop and think long enough about a better but still easily readable solution to come up with a concrete plan.
Part of the problem is having to define what an outlier really is. For example, is the rx at 18 in the following an outlier or not? From my perspective, in a way it is, but in the sense of ruining the representation it isn't. It does result in everything else getting scaled far down, but the trend is otherwise still visible well enough that I wouldn't want any special formatting rule applied in that case.
Another example, this time from the 5 minute resolution output: Now this is more annoying, since especially the rx part becomes pretty much a flat zero except for that spike just before 19. The extra problem here is that this isn't a single outlier but 3 separate entries next to each other. Does the rest of the data matter enough that these three should get special treatment? How many of these entries would need to exist for the "outlier remover algorithm" to not touch anything?
It may also depend on the viewer; some may specifically want to see the scale based on that spike. However, this at least would be easy to solve by keeping the current behaviour as the default if a new alternative is introduced. |
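To make the "cut with a + sign" idea above concrete, here is a minimal sketch for a text bar. It is not vnStat code: `clipped_bar_len`, `render_bar`, the `cap` parameter and the `#` bar character are all hypothetical names chosen for illustration. The bar is scaled against `cap` rather than the true maximum, and anything above `cap` fills the bar and gets a trailing `+`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of vnStat): length of a text bar of at most
 * `width` characters for `value`, scaled against `cap` instead of the true
 * maximum. *clipped is set when the value exceeds the cap. */
size_t clipped_bar_len(uint64_t value, uint64_t cap, size_t width, int *clipped)
{
    *clipped = value > cap;
    if (cap == 0 || width == 0) {
        return 0;
    }
    if (*clipped) {
        return width; /* outlier: fill the bar completely */
    }
    /* Sketch only: value * width may overflow for values close to 2^64. */
    return (size_t)((value * (uint64_t)width) / cap);
}

/* Hypothetical helper: writes the bar into `out` as '#' characters and
 * appends '+' when the value was clipped. Output is always NUL-terminated
 * and truncated to fit `outsize`. */
void render_bar(char *out, size_t outsize, uint64_t value, uint64_t cap, size_t width)
{
    int clipped;
    size_t len = clipped_bar_len(value, cap, width, &clipped);
    size_t i = 0;

    if (outsize == 0) {
        return;
    }
    while (i < len && i + 1 < outsize) {
        out[i++] = '#';
    }
    if (clipped && i + 1 < outsize) {
        out[i++] = '+';
    }
    out[i] = '\0';
}
```

As noted above, the trade-off is that the real maximum is no longer readable from the bar itself, so it would probably have to stay an opt-in alternative to the current behaviour.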
I agree with the problems of defining and detecting outliers. Maybe a threshold applied to the logarithmic scale? Like if it is x times the average, it is an outlier? Also, there are probably some definitions from statistics that might be accurate enough as a metric, especially if it can be configured a bit. I had the problem with vnstati and had not tested the terminal output (I assumed it would have some ASCII bars with similar problems).
Also, the point is not to remove the outliers (otherwise one could just remove them from the data), but to adjust the visualization so that all the data remains readable. Some kind of "+" or maybe a zigzag line through the middle of the bar is also a good idea to indicate the clipped peaks for the "too large" outliers. |
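A rough sketch of the "x times the average" threshold could look like the following. The function name `scale_cap` and the `factor` parameter are assumptions made for illustration, not existing vnStat code or configuration. It returns the largest value that is not an outlier, which could then serve as the top of the scale while larger values are drawn clipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of vnStat): returns the largest value that
 * does not exceed `factor` times the arithmetic mean of all values, or 0 if
 * the array is empty. Assumes factor >= 1.0, so at least the smallest value
 * always qualifies. */
uint64_t scale_cap(const uint64_t *values, size_t n, double factor)
{
    double mean = 0.0;
    uint64_t cap = 0;
    size_t i;

    if (n == 0) {
        return 0;
    }
    for (i = 0; i < n; i++) {
        mean += (double)values[i] / (double)n;
    }
    for (i = 0; i < n; i++) {
        if ((double)values[i] <= factor * mean && values[i] > cap) {
            cap = values[i];
        }
    }
    return cap;
}
```

One caveat with this particular choice: the spike itself raises the mean, so several large entries next to each other can push the threshold high enough that nothing is flagged. A median-based threshold would be less sensitive to that, at the cost of sorting the values first.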
Do you have some simple code so I can test it myself? I think plain logarithmic scaling works only for specific cases. Some more can be covered by having constants or different bases, but in the end outliers are a different problem that is only partly alleviated by scaling everything logarithmically. It would still be useful for some use cases, probably. I think outliers should probably have some visualization like this. The other bars can then be scaled to the right. |
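For experimenting with logarithmic scaling without touching the real code, a coarse integer-only sketch might be enough. This is not the proof of concept discussed in this thread; `bit_length`, `log_bar_len` and the `unit` constant are hypothetical. It approximates a base-2 logarithm by counting bits, and `unit` acts as the kind of constant mentioned above: everything below one `unit` maps to an empty bar.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to represent v: floor(log2(v)) + 1, and 0 for v == 0. */
unsigned bit_length(uint64_t v)
{
    unsigned n = 0;

    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Hypothetical helper (not part of vnStat): bar length on a coarse base-2
 * logarithmic scale. `unit` is the smallest amount that should still be
 * visible; 0 is treated as 1. Values above `max` are treated as `max`. */
size_t log_bar_len(uint64_t value, uint64_t max, uint64_t unit, size_t width)
{
    unsigned top;

    if (unit == 0) {
        unit = 1;
    }
    top = bit_length(max / unit);
    if (top == 0) {
        return 0;
    }
    if (value > max) {
        value = max;
    }
    return (size_t)bit_length(value / unit) * width / top;
}
```

With `max` at 1048576, `unit` at 1 and a width of 21, a value of 1024 gets a bar of length 11, where linear scaling would round it down to nothing. The downside is the one already mentioned for text output: without a visible scale, the bar lengths are hard to compare by eye.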
The quick proof of concept didn't require many changes. Only line 254 of src/image_support.c needs changing, after which the code is recompiled (the changed line and the compile command are missing from this copy of the comment). Note that this change affects only the list format image outputs, nothing else. |
When the statistics have an outlier with much more traffic than usual, the visualization of the other data points becomes almost useless. If you have a month with 1-10 GB of traffic per day, and then you have a day with 200 GB of traffic, the visualization of all other days will shrink to a single line or even become invisible.
On the one hand, it would be interesting to have an option for logarithmic scaling, but on the other hand, sometimes you just want to handle outliers, e.g. by rescaling only the days with high traffic, or by cutting off the bars of the outliers while leaving the bars of typical days untouched.