\section{Introduction}
\label{s:intro}
As systems have become larger and more complex, the volume and variety of
log data have become formidable. The need to find insights in that data has
grown alongside it: troubleshooting system problems and identifying the root
causes and propagation paths of faults are essential to improving system
resilience. To aid efforts to extract meaningful information from log data,
one goal of the Holistic Measurement Driven Resilience (HMDR)~\cite{HMDRweb}
project is to publish annotated datasets for resilience research. Making log
datasets available to researchers is itself a significant challenge: the data
is large and diverse in format, storage and access. Data beyond system logs
is also valuable, such as the contextual information provided by batch system
history, maintenance logs and user error reports.
In this paper we address this challenge by first exploring the context of the
problem to understand why it is difficult. From this we infer the
requirements of a solution, outlined in section~\ref{s:requirements}. Our
solution comprises a machine-readable vocabulary for describing relevant
data, a schema for collating and using expert and machine-generated
annotations about log data, and tools to make these accessible to
researchers. We describe this in detail in section~\ref{s:solution}. We
describe the methods used to populate a sample annotation database in
section~\ref{s:proof}, and use it for some illustrative case studies in
section~\ref{s:examples}.
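As a concrete preview of the annotations described in
section~\ref{s:solution}, a single expert annotation might resemble the
following Python sketch; the field names and values here are purely
hypothetical illustrations, not the schema defined later in the paper.
\begin{verbatim}
# A hypothetical annotation record; all field names and values are
# illustrative, not the schema this paper defines.
annotation = {
    # regular expression matching the log lines being described
    "pattern":   r"LNet: .* critical hardware error",
    "component": "HSN",       # affected subsystem
    "severity":  "critical",
    "meaning":   "network interface reported an unrecoverable error",
    "action":    "inspect nearby links; affected jobs may need requeueing",
    "source":    "expert",    # expert vs. machine-generated annotation
}
\end{verbatim}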
\section{Context}
\label{s:context}
To illustrate the opportunities and challenges posed by the assortment of
log and related data, consider the NERSC and ACES (LANL/SNL) large-scale
computational environments, with which the authors are affiliated:
NERSC's Cori system is a Cray XC40 with 2388 Xeon nodes, 9688 Xeon Phi nodes,
a host of service nodes providing I/O forwarding, a DataWarp burst-buffer
filesystem, Lustre networking and system management, and an Aries high-speed
network (HSN) in a Dragonfly topology. Cori has a large external Lustre
filesystem, cross-mounted via InfiniBand on Edison, another large Cray system
at NERSC. It has external login nodes and shares GPFS filesystems with other
NERSC servers. There are air and water cooling components and UPS power
circuits. Along with Cray system software and programming environments, Cori
runs multiple compilers and MPI stacks and hundreds of software packages. Its
7,000 users run tens of thousands of jobs per day via the Slurm batch
scheduler.
ACES Trinity is a Cray XC40 with 9436 Xeon and over 9500 Xeon Phi nodes, a
DataWarp burst buffer, an Aries HSN and a Lustre filesystem. Trinity serves
the needs of the National Nuclear Security Administration and has far
stricter access limitations than does Cori.
The associated centers operate multiple production and test-bed systems,
including bleeding-edge technologies, and employ many staff members
responsible for various aspects of each facility and for the several hundred
projects and several thousand users they collectively support.
Data such as system logs, environmental metrics, job history, filesystem state,
outage notices and user reports are already routinely collected, but the
extraction of useful insights from these requires a customized solution for
each investigation and is limited by several constraints:
\begin{itemize}
\item Data is collected by different people in different security domains:
      for example, system logs for each subsystem are typically available
      only to the system administrators for that subsystem.
\item Data is collected in different formats, usually based on the output format
of specific data sources and the needs of a specific investigation.
\item Data is stored in different places with different access mechanisms:
      flat text files, binary formats including HDF5, SQL and NoSQL
      databases, JSON via a RESTful interface, etc., each demanding its own
      access code (sketched after this list).
\item Not all data is suitable for publication. Log files intersperse entries
      relating to events contributing to some failure with unrelated entries
      that may contain sensitive information such as user login details. The
      difficulty of log anonymization and the risk of mistakenly releasing
      sensitive information discourage data owners from exposing data to the
      wider community.
\item The volume of data is very large, requiring an analyst to comb through
many unrelated log entries to identify the significant entries.
\item Much of the collected data requires domain expertise to interpret, and
      this expertise is distributed across many people. For example, at the
      2016 Cray User Group Monitoring BoF~\cite{CUG2016BoF}, the community,
      composed largely of people with substantial Cray experience, concluded
      that it would be valuable to have authoritative, descriptive
      annotations of significant log messages to aid in the discovery and
      understanding of system events.
\item ``Failure'' is often vaguely defined (``my job took 30\% longer today than
usual'') and research into failures and resilience is often exploratory.
\item Users and staff are not necessarily aware of what data is being
      collected within a center, let alone between centers. Data and
      expertise that might positively impact failure analyses therefore go
      overlooked.
\end{itemize}
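To make the storage-and-access constraint above concrete, the sketch below
(in Python) shows the kind of per-source glue code an analyst writes today to
assemble a picture of a single node; the file names, database schema and
monitoring URL are hypothetical stand-ins for site-specific details.
\begin{verbatim}
# Illustrative only: each source needs its own access path. The
# paths, table names and service URL below are hypothetical.
import json, sqlite3, urllib.request

node = "c0-0c0s0n1"

# Flat text: scan a console log for entries mentioning the node.
with open("console.log", errors="replace") as f:
    console = [line for line in f if node in line]

# SQL: pull batch-job history for the same node from a database.
db = sqlite3.connect("jobs.db")
jobs = db.execute("SELECT job_id, start_time, end_time "
                  "FROM jobs WHERE node = ?", (node,)).fetchall()

# REST/JSON: fetch environmental metrics from a monitoring service.
url = "https://monitor.example/api/metrics?node=" + node
with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)
\end{verbatim}
Each fragment embeds source-specific knowledge (paths, schemas, endpoints),
which is one reason every new investigation starts from scratch.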
In summary, the diversity and volume of data, the distribution of expertise,
the risks around publication and the challenges of discovery have limited our
ability to extract useful insights from the data we collect. In the next
section we explore the requirements for a solution that addresses these
constraints.