
Logging Recommendations

Mike Trinkala · February 28, 2014

Working on Mozilla's Heka project, I am seeing more and more user log data, and I have a few recommendations to offer before you even begin processing it in Heka, Hadoop, Splunk, Storm, or whatever. These recommendations may seem obvious, but in most cases they aren't being followed.

  1. Think about what you log, why you log it, and what it will be used for.

    Most logging gets little thought beyond its use in debugging a single subsystem. When logs are used to analyze how an entire system is working, however, all the subsystem messages have to be stitched together to determine whether the behavior is correct. Think about what the system is trying to achieve from an overall perspective and answer those questions as directly as possible in the logs. For example, counting the number of errors generated in a particular 'session' requires the monitoring system to know what a session is and which subsystem messages constitute an error, forcing it to duplicate some of the application state. If the application instead produced a single summary message containing the session id and error count, no stitching or detailed system knowledge would be required (a sketch of such a message follows this item).

    Constantly monitor your log output for duplication, bugs (malformed data, inconsistent naming), and unused data (if you have messages that are not being consumed, why are they being logged?). Add data only as needed, and document the need. At my previous company we were generating about one and a half billion log messages per day but leveraged only about ten percent of them. The other ninety percent contained duplicate, possibly useful, and almost useless data. The worst part is that any of it may have been used by some specialized downstream consumer, forcing us to warehouse everything.
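
    To make the session example concrete, here is a minimal Go sketch; the `SessionSummary` type and its field names are hypothetical, not a schema Heka defines:

    ```go
    package main

    import (
        "encoding/json"
        "log"
    )

    // SessionSummary is a hypothetical per-session roll-up: the application
    // emits it once, when the session ends, instead of many per-error
    // messages a monitoring system would have to stitch back together.
    type SessionSummary struct {
        SessionID  string `json:"session_id"`
        ErrorCount int    `json:"error_count"`
        DurationMs int64  `json:"duration_ms"`
    }

    func main() {
        s := SessionSummary{SessionID: "a1b2c3", ErrorCount: 3, DurationMs: 45200}
        // JSON is used here only to show the record; item 3 argues for a
        // binary representation on the wire.
        b, err := json.Marshal(s)
        if err != nil {
            log.Fatal(err)
        }
        log.Println(string(b))
    }
    ```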

  2. Define a log data schema and strict naming conventions, then leverage them across your organization; this will go a long way toward solving your data transport, transformation, and analysis problems.
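
    As an illustration of what a shared schema might look like, here is a minimal Go sketch; the envelope layout and field names are assumptions for this example, not Heka's actual message definition:

    ```go
    // Package schema would be shared by every producer and consumer in the
    // organization, so transport and analysis code is written once.
    package schema

    // Envelope is one log record; every field is named, typed, and
    // documented exactly once.
    type Envelope struct {
        Timestamp int64             // nanoseconds since the Unix epoch (see item 5)
        Severity  int32             // syslog-style level, 0 (emergency) to 7 (debug)
        Logger    string            // dotted producer name, e.g. "billing.api"
        Type      string            // message kind, e.g. "session.summary"
        Fields    map[string]string // strictly named key/value payload
    }
    ```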

  3. Optimize your data for its main consumer (software): stop using text logs and JSON output. You are going to be processing the data programmatically, so why optimize the output for human readability? Storing logs in a binary representation is more about speed than size. Most of the work Heka performs is data conversion: parsing text logs is slow, and parsing JSON logs is even worse. Timestamp and numeric conversions, data transformations, and all the memory allocation that goes along with them drag throughput down. Chances are the data was in a nicely structured binary format before something spewed it out in human-readable form; leverage that structure and map it to your schema. Common misconceptions used to justify text logs:

    • Text logs make it easier to debug since they can just be looked at/grepped.

      When developing or testing locally, use a utility to dump the binary logs to a human-readable format if you really need to see/grep them. If you are loading the data into a warehouse or search index, use those tools for viewing/debugging.

    • If there were data corruption I could not read the logs.

      With a well-defined binary format it should be easier to recover data from a corrupt log than from its text counterpart. The dump tool referenced above should output the offsets of any corruption to alert you to the issue and aid in debugging it (see the framing sketch below).
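
    The following Go sketch shows the idea behind both points: a minimal length-prefixed binary framing plus a dump routine that prints records in human-readable form and reports the byte offset of any corruption. Real formats (including Heka's protobuf stream) add magic bytes and checksums; this framing is an assumption for illustration only.

    ```go
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "io"
    )

    // Each record is a 4-byte big-endian payload length followed by the
    // payload bytes. Lengths above maxRecord are treated as corruption.
    const maxRecord = 1 << 20

    func writeRecord(w io.Writer, payload []byte) error {
        if err := binary.Write(w, binary.BigEndian, uint32(len(payload))); err != nil {
            return err
        }
        _, err := w.Write(payload)
        return err
    }

    // dump walks the stream, printing each record; on a bad frame it
    // reports the byte offset so the problem can be located and skipped.
    func dump(r io.Reader) {
        var offset int64
        for {
            var n uint32
            err := binary.Read(r, binary.BigEndian, &n)
            if err == io.EOF {
                return
            }
            if err != nil || n > maxRecord {
                fmt.Printf("corrupt frame at offset %d\n", offset)
                return
            }
            buf := make([]byte, n)
            if _, err := io.ReadFull(r, buf); err != nil {
                fmt.Printf("truncated record at offset %d\n", offset)
                return
            }
            fmt.Printf("%d: %s\n", offset, buf)
            offset += 4 + int64(n)
        }
    }

    func main() {
        var stream bytes.Buffer
        writeRecord(&stream, []byte("session=a1b2c3 error_count=3"))
        dump(&stream)
    }
    ```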

  4. Fix your data at the source, not during transformation. I see a lot of data correction and normalization happening in real time: things like "ios", "IOS", "iphone", and "apple" in whatever field all being transformed to "iPhone" for every message going through the system. If you control the data source, fix it there; don't spend the CPU cycles fixing it thousands of times a second (a sketch follows).
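
    Here is what "fix it at the source" can look like in Go; the mapping reuses the device-name example from the text, and the function name is hypothetical:

    ```go
    package main

    import (
        "fmt"
        "strings"
    )

    // canonical maps the variants seen in the wild to one canonical value.
    var canonical = map[string]string{
        "ios":    "iPhone",
        "iphone": "iPhone",
        "apple":  "iPhone",
    }

    // normalizeDevice runs once, where the field is produced, so the
    // pipeline never has to repeat the correction per message.
    func normalizeDevice(raw string) string {
        if v, ok := canonical[strings.ToLower(raw)]; ok {
            return v
        }
        return raw
    }

    func main() {
        fmt.Println(normalizeDevice("IOS")) // iPhone
    }
    ```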

  5. Timestamps.

    • Use a high-resolution int64 timestamp (milli-, micro-, or nanoseconds since the Unix epoch).
    • If you have to use timestamp strings, use a single standard (RFC 3339) and ideally log everything in UTC.
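
    Both options, sketched in Go using only the standard library `time` package:

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Preferred: a single int64, here nanoseconds since the Unix epoch.
        ns := time.Now().UnixNano()
        fmt.Println(ns)

        // If a string is unavoidable: one standard (RFC 3339), always UTC.
        fmt.Println(time.Unix(0, ns).UTC().Format(time.RFC3339Nano))
    }
    ```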