\section{Requirements for a Solution}
\label{s:requirements}
%(from above: In summary the diversity and volume of data, distribution of expertise, risks
%around publication and challenges of discovery have limited our ability to
%extract useful insights from the data we collect. In the next section we explore
%the requirements for a solution to address these constraints.)
Much data is, or can be, collected, but often by different
people for a specific purpose, and in formats chosen to suit the
data source or a particular use.
For example, log data frequently originates
as messages emitted by some software and is thus stored as
text files with a timestamped line per entry, while job records
are typically stored in database tables with well-defined fields.
Making data presentable and accessible beyond the specific needs of
those collecting it is difficult and time-consuming, and is
consequently rarely done.
This implies that, to support the extraction of useful insights from
available data, a solution must be agnostic towards the format and
storage of the data \textbf{(requirement 1)}.
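To make the contrast concrete, the following Python sketch normalizes
a timestamped text log and a database table of job records into a
common record shape; the file names, table schema and field names are
illustrative assumptions rather than part of any existing system:
\begin{verbatim}
# Sketch: reading two differently-stored sources as common records.
# All paths, schemas and field names are assumptions for
# illustration only.
import sqlite3
from datetime import datetime, timezone

def records_from_text_log(path):
    # One timestamped line per entry, e.g.
    # "2016-03-01T12:00:05 node42 kernel: link down"
    with open(path) as f:
        for line in f:
            stamp, rest = line.rstrip("\n").split(" ", 1)
            yield {"time": datetime.fromisoformat(stamp),
                   "message": rest}

def records_from_job_table(db_path):
    # Job records with well-defined fields in a database table.
    conn = sqlite3.connect(db_path)
    query = "SELECT start_time, node, state FROM jobs"
    for start, node, state in conn.execute(query):
        yield {"time": datetime.fromtimestamp(start, timezone.utc),
               "message": "job on %s entered state %s" % (node, state)}
\end{verbatim}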
Furthermore, staff are often unaware of the full suite of data collection
activities at their site, let alone farther afield. This leads to missed
opportunities for gaining insights from what is already collected and
to redundant collections, so \textbf{requirement 2} is to support
discovery with no a priori knowledge of other collection efforts.
An effective solution within a single site is to use a tool like
LogStash~\textcolor{red}{cite?} to convert everything to a common
format and collect it in a single centralized location. This
approach, however, has some limitations:
\begin{itemize}
\item It requires all collection activities to ensure their data can
be converted and stored in the centralized format. Many ad-hoc
collection activities cannot justify the effort required to
comply, so the centralized collection fails to capture the full
suite of available data.
\item The diversity of security domains poses a challenge: should data
be captured to a higher-security domain, filtered to a lower-security
domain or should each security domain have its own storage?
\item A centralized solution requires a significant commitment to
centralized maintenance, and trust that the incoming data
arrives in a useful format, at a useful cadence and in a
tractable volume.
\end{itemize}
Therefore we would like our solution to be decentralized
\textbf{(requirement 3)}.
Access to log datasets is an obvious requirement, but it is well
known in the research community that very few datasets are released,
due to the presence of sensitive data such as user login information,
the difficulty of log anonymization, and an unfavourable trade-off
between the benefit of helping the community and the risk of
mistakenly releasing sensitive information.
The effort and risk of a release paradigm of ``carefully redact
sensitive information before releasing data'' are a roadblock to
publishing datasets, so a solution should instead promote the
opposite paradigm of ``select some non-sensitive data and release
that'' \textbf{(requirement 4)}.
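As an illustration of the ``select and release'' paradigm, the
following minimal Python sketch copies only explicitly allowlisted
fields, so that anything not named is withheld by default; the field
names are assumptions made for the example:
\begin{verbatim}
# Sketch of "select some non-sensitive data and release that":
# only allowlisted fields are copied, everything else is withheld
# by default. Field names are illustrative assumptions.
ALLOWED_FIELDS = {"time", "component", "error_code"}

def releasable(record):
    """Return a copy of record containing only allowlisted fields."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

record = {"time": "2016-03-01T12:00:05", "component": "nic0",
          "error_code": 17, "user": "jbloggs"}  # "user" is sensitive
print(releasable(record))
# -> {'time': '2016-03-01T12:00:05', 'component': 'nic0',
#     'error_code': 17}
\end{verbatim}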
Expert commentary on the meaning of log messages has been identified
as a desirable resource~\cite{CUG2016BoF}, but expertise is
domain-specific and distributed across many individuals, all of whom
have other, primary responsibilities. The solution must therefore
allow domain experts to independently contribute advice in the
formats they already use \textbf{(requirement 5)}: for example, a
spreadsheet of log message definitions, or a ticketing system
containing maintenance requests and notes.
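For instance, advice kept in a spreadsheet could be ingested without
asking the expert to change their workflow. A minimal sketch, assuming
a CSV export with hypothetical columns \texttt{message\_pattern} and
\texttt{meaning}:
\begin{verbatim}
# Sketch: ingesting expert commentary from a CSV export of a
# spreadsheet of log message definitions. The column names are
# hypothetical.
import csv
import re

def load_advice(path):
    with open(path, newline="") as f:
        return [(re.compile(row["message_pattern"]), row["meaning"])
                for row in csv.DictReader(f)]

def annotate(message, advice):
    # Collect every expert note whose pattern matches this message.
    return [meaning for pattern, meaning in advice
            if pattern.search(message)]
\end{verbatim}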
Failure analysis research often looks for relationships among, and
sequences within, log records, and might include comparisons against
``control'' logs from a different period or system. Operational
troubleshooting seeks to identify possibly-contributing events in the
lead-up to an identified failure.
Failures at one component may be triggered by events at a connected
component, so an ideal search would extend to logs from related
components. These relationships may not form a simple hierarchy: for
example, the failure of a link might affect two ``peer'' nodes.
An overly wide search, however, will return an intractable volume of
data, so the solution should support some means of filtering data as
well as of discovering non-obviously related data
\textbf{(requirement 6)}.
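One possible shape for such a filter is sketched below: it restricts
results to a time window before an identified failure and to
components declared related to the failing one. The relationship map
and record fields are assumptions for illustration; note that the
relationships form a graph rather than a hierarchy, since a link
connects two peer nodes:
\begin{verbatim}
# Sketch: records in the lead-up to a failure, limited to components
# related to the failing one. Relationship map and field names are
# illustrative assumptions.
from datetime import timedelta

RELATED = {"node7": {"node7", "link3"},
           "node8": {"node8", "link3"}}  # link3 joins two peer nodes

def lead_up(records, failed_component, failure_time,
            window_minutes=30):
    window_start = failure_time - timedelta(minutes=window_minutes)
    scope = RELATED.get(failed_component, {failed_component})
    return [r for r in records
            if r["component"] in scope
            and window_start <= r["time"] <= failure_time]
\end{verbatim}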
% key para:
In summary, our solution should:
\begin{enumerate}
\item Be agnostic towards the format and storage of data.
\item Not require a priori knowledge of existing collections.
\item Be decentralized.
\item Allow publishing of data to be low effort and low risk.
\item Allow domain experts to independently contribute advice
in a format convenient for them.
\item Support data discovery and filtering across related
components.