-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heatmap visualization #577
base: main
Are you sure you want to change the base?
Conversation
ashwin-patil
commented
Nov 17, 2022
- new visualization Heatmap
@ashwin-patil - you have a bunch of linting errors to fix but looks like it's otherwise good to go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have simple unit tests for this.
You can follow examples for some of the other plots - just supply or use some canned data and make sure the params are exercised.
Have a look at this one -
msticpy/tests/vis/test_timeline.py
Line 89 in 2dee9c4
TIMELINE_COMMON_ARGS: List[Tuple[Dict[str, List[Any]], List[Any]]] = [ |
There's an example that uses test parameters to supply different expected values for most of the params. Just to make sure that the code is exercised.
Using
pytest tests/vis/test_heatmap --cov:msticpy.vis.heatmap --cov-report:html
will produce a report in htmlcov folder that will show you which bits of the code have been run or missed by the tests.
Also, a short documentation page would be very helpful - otherwise people won't know it exists :-).
{where_clause} | ||
| where {timestampcolumn} >= startofday(start) | ||
| where {timestampcolumn} <= startofday(end) | ||
| extend HourOfLogin = hourofday({timestampcolumn}), DayNumberofWeek = dayofweek({timestampcolumn}) , Date = format_datetime(TimeGenerated, "yyyy-MM-dd") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename output fields for consistency and remove reference to "Login"
HourOfLogin => HourOfDay?
DayofWeek ==> DayOfWeek? (consistent caps)
msticpy/vis/heatmap.py
Outdated
plot.axis.major_label_standoff = 0 | ||
plot.xaxis.major_label_orientation = pi / 3 | ||
|
||
def _sort_days_hours(data: pd.DataFrame, week_column: str, hour_column: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor - should "week_column" be better named "day_column"?
msticpy/vis/heatmap.py
Outdated
|
||
mapper, color_bar = _create_colorbar(data, param) | ||
|
||
plot.rect(x=param.y_column, y=param.x_column, width=1, height=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are x and y inverted? - e.g. x=param.y_column
msticpy/vis/heatmap.py
Outdated
ticker=BasicTicker(desired_num_ticks=len(param.color_pallette)), | ||
formatter=PrintfTickFormatter(format="%d"), | ||
label_standoff=6, border_line_color=None) | ||
return mapper, color_bar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that we could add a function that will calculate the same data format from any DataFrame.
I thought this would take me a few minutes to knock something up but actually consumed a whole afternoon.
from typing import Optional
X_TIME_DIVS = ["day", "hour"]
Y_TIME_DIVS = ["month", "day"]
y_time_division = "month"
x_time_division = "day"
time_index = pd.Series(pd.date_range(start='2018-01-01', freq='D', periods=7))
day_name_df = (
pd.DataFrame(time_index.dt.day_name(), columns=["day_name"])
.reset_index()
)
time_index = pd.Series(pd.date_range(start='2018-01-01', freq='M', periods=12))
month_name_df = (
pd.DataFrame(time_index.dt.month_name(), columns=["month_name"])
.reset_index()
)
_hour_index = pd.DataFrame(pd.Series(list(range(24)))).rename(columns={0: "value"})
_day_index = pd.DataFrame(pd.Series(list(range(1, 32)))).rename(columns={0: "value"})
def calculate_heatmap(
data: pd.DataFrame,
time_column: str = "TimeGenerated",
x_time_division: str = "hour",
y_time_division: str = "day",
agg_func: str = "count",
agg_column: Optional[str] = None
) -> pd.DataFrame:
"""
Return heatmap pivot from DataFrame
Parameters
----------
data : pd.DataFrame
Input DataFrame.
time_column : str, optional
Name of timestamp column, by default "TimeGenerated"
x_time_division : str, optional
Time division to use for X axis, by default "hour"
y_time_division : str, optional
Time division to use for Y axis, by default "day"
agg_func : str, optional
The aggregate function to apply to the data, by default "count"
agg_column : Optional[str], optional
The value column to use to apply the aggregate function.
This is not needed for the "count" function, it
defaults to the `time_column` if not supplied explicitly.
Returns
-------
pd.DataFrame
The pivoted DataFrame to use for Heatmap plot.
Raises
------
ValueError
Invalid time division type or x and y time division
are the same.
NotImplementedError
Time division type not yet supported.
"""
if x_time_division not in X_TIME_DIVS:
raise ValueError(f"x_time_division must be one of {', '.join(X_TIME_DIVS)}")
if y_time_division not in Y_TIME_DIVS:
raise ValueError(f"y_time_division must be one of {', '.join(Y_TIME_DIVS)}")
if x_time_division != "hour":
raise NotImplementedError("day/month chart not yet supported")
data["x_division"] = data[time_column].dt.hour
data = (
data.merge(_hour_index, left_on="x_division", right_on="value", how="right")
.drop(columns=["x_division"])
.rename(columns={"value": x_time_division})
)
if y_time_division != "day":
raise NotImplementedError("day/month chart not yet supported")
if y_time_division == x_time_division:
raise ValueError(f"x and y cannot be the same time division {y_time_division}")
data["y_division"] = data[time_column].dt.dayofweek
y_name_map = day_name_df
agg_column = agg_column or time_column
return (
data[["y_division", x_time_division, agg_column]]
.groupby(["y_division", x_time_division])
.agg(agg_func)
.reset_index()
.merge(y_name_map, left_on="y_division", right_on="index", how="right")
.drop(columns=["index", "y_division"])
.rename(columns={agg_column: "Total"})
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to build it so that it would accept month/day as well as day/hour but realized that we couldn't plot that current anyway.
msticpy/vis/mp_pandas_plot.py
Outdated
The Bokeh plot | ||
|
||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a wrapper for the calculate_heatmap function here as well.
Then use could do"
df.mp_plot.calculate_heatmap(time_column="TimeGenerated").mp_plot.heatmap(x="hour", y="day")