Docs

zix99 · Jan 3, 2025 · c0d3d62
1 parent 57d1cfc
commit c0d3d62
Showing 10 changed files with 156 additions and 27 deletions.
diff --git a/README.md b/README.md
@@ -20,6 +20,7 @@ See [rare.zdyn.net](https://rare.zdyn.net) or the [docs/ folder](docs/) for the
 ## Features
 
  * Multiple summary formats including: filter (like grep), histogram, bar graphs, tables, heatmaps, reduce, and numerical analysis
+ * Parse using regex (`-m`) or dissect tokenizer (`-d`)
  * File glob expansions (eg `/var/log/*` or `/var/log/*/*.log`) and `-R`
  * Optional gzip decompression (with `-z`)
  * Following `-f` or re-open following `-F` (use `--poll` to poll, and `--tail` to tail)

diff --git a/cmd/helpers/extractorBuilder.go b/cmd/helpers/extractorBuilder.go
@@ -211,7 +211,7 @@ func getExtractorFlags() []cli.Flag {
 			Name:     "ignore-case",
 			Aliases:  []string{"I"},
 			Category: cliCategoryMatching,
-			Usage:    "Augment regex to be case insensitive",
+			Usage:    "Augment matcher to be case insensitive",
 		},
 		&cli.IntFlag{
 			Name:     "batch",

diff --git a/docs/cli-help.md b/docs/cli-help.md
@@ -67,6 +67,8 @@ Filter incoming results with search criteria, and output raw matches
 
 **--batch-buffer**="": Specifies how many batches to read-ahead. Impacts memory usage, can improve performance (default: 6)
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
 
 **--follow, -f**: Read appended data as file grows
@@ -75,7 +77,7 @@ Filter incoming results with search criteria, and output raw matches
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--line, -l**: Output source file and line number
 
@@ -113,6 +115,8 @@ Summarize results by extracting them to a histogram
 
 **--csv, -o**="": Write final results to csv. Use - to output to stdout
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extra, -x**: Alias for -b --percentage
 
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
@@ -123,7 +127,7 @@ Summarize results by extracting them to a histogram
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--match, -m**="": Regex to create match groups to summarize on (default: .*)
 
@@ -167,6 +171,8 @@ Create a 2D heatmap of extracted data
 
 **--delim**="": Character to tabulate on. Use {$} helper by default (default: \x00)
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
 
 **--follow, -f**: Read appended data as file grows
@@ -175,7 +181,7 @@ Create a 2D heatmap of extracted data
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--match, -m**="": Regex to create match groups to summarize on (default: .*)
 
@@ -223,6 +229,8 @@ Create rows of sparkline graphs
 
 **--delim**="": Character to tabulate on. Use {$} helper by default (default: \x00)
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
 
 **--follow, -f**: Read appended data as file grows
@@ -231,7 +239,7 @@ Create rows of sparkline graphs
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--match, -m**="": Regex to create match groups to summarize on (default: .*)
 
@@ -273,6 +281,8 @@ Create a bargraph of the given 1 or 2 dimension data
 
 **--csv, -o**="": Write final results to csv. Use - to output to stdout
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
 
 **--follow, -f**: Read appended data as file grows
@@ -281,7 +291,7 @@ Create a bargraph of the given 1 or 2 dimension data
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--match, -m**="": Regex to create match groups to summarize on (default: .*)
 
@@ -317,6 +327,8 @@ Numerical analysis on a set of filtered data
 
 **--batch-buffer**="": Specifies how many batches to read-ahead. Impacts memory usage, can improve performance (default: 6)
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extra, -x**: Displays extra analysis on the data (Requires more memory and cpu)
 
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
@@ -327,7 +339,7 @@ Numerical analysis on a set of filtered data
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--match, -m**="": Regex to create match groups to summarize on (default: .*)
 
@@ -367,6 +379,8 @@ Create a 2D summarizing table of extracted data
 
 **--delim**="": Character to tabulate on. Use {$} helper by default (default: \x00)
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extra, -x**: Display row and column totals
 
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{0}])
@@ -377,7 +391,7 @@ Create a 2D summarizing table of extracted data
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--match, -m**="": Regex to create match groups to summarize on (default: .*)
 
@@ -421,6 +435,8 @@ Aggregate the results of a query based on an expression, pulling customized summ
 
 **--csv, -o**="": Write final results to csv. Use - to output to stdout
 
+**--dissect, -d**="": Dissect expression to create match groups to summarize on
+
 **--extract, -e**="": Expression that will generate the key to group by. Specify multiple times for multi-dimensions or use {$} helper (default: [{@}])
 
 **--follow, -f**: Read appended data as file grows
@@ -431,7 +447,7 @@ Aggregate the results of a query based on an expression, pulling customized summ
 
 **--ignore, -i**="": Ignore a match given a truthy expression (Can have multiple)
 
-**--ignore-case, -I**: Augment regex to be case insensitive
+**--ignore-case, -I**: Augment matcher to be case insensitive
 
 **--initial**="": Specify the default initial value for any accumulators that don't specify (default: 0)
 

diff --git a/docs/index.md b/docs/index.md
@@ -18,6 +18,7 @@ Supports various CLI-based graphing and metric formats (filter (grep-like), hist
 ## Features
 
  * Multiple summary formats including: filter (like grep), histogram, bar graphs, tables, heatmaps, sparklines, reduce, and numerical analysis
+ * Parse using regex (`-m`) or dissect tokenizer (`-d`)
  * File glob expansions (eg `/var/log/*` or `/var/log/*/*.log`) and `-R`
  * Optional gzip decompression (with `-z`)
  * Following `-f` or re-open following `-F` (use `--poll` to poll, and `--tail` to tail)

diff --git a/docs/usage/dissect.md b/docs/usage/dissect.md
@@ -0,0 +1,68 @@
+# Dissect Syntax
+
+*Dissect* is a simple token-based search algorithm, and can
+be up to 10x faster than regex (and 40% faster than PCRE).
+
+It works by searching for constant delimiters in a string
+and extracting the text between the tokens as named keys.
+
+*rare* implements a subset of the full dissect algorithm.
+
+**Syntax Example:**
+```
+prefix %{name} : %{value} - %{?ignored}
+```
+
+## Syntax
+
+- Anything in a `%{}` is a variable token.
+- A blank token, or a token that starts with `?` is skipped. eg `%{}` or `%{?skipped}`
+- Tokens are extracted by both name and index (in the order they appear).
+- Index `{0}` is the full match, including the delimiters
+- Patterns don't need to match the entire line
+
+## Examples
+
+### Simple
+
+```
+prefix %{name} : %{value}
+```
+
+Will match:
+```
+prefix bob : 123
+```
+
+And will extract two keys:
+```
+name=bob
+value=123
+```
+
+### Nginx Logs
+
+As a simple example, to parse nginx logs that look like:
+
+```
+104.238.185.46 - - [19/Aug/2019:02:26:25 +0000] "GET / HTTP/1.1" 200 546 "-" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4 (StatusCake)"
+```
+
+The following dissect expression can be used:
+
+```
+%{ip} - - [%{timestamp}] "%{verb} %{path} HTTP/%{?http-version}" %{status} %{size} "-" "%{useragent}"
+```
+
+Which, as json, will return:
+```json
+{
+    "timestamp": "19/Aug/2019:02:26:25 +0000",
+    "verb": "GET",
+    "path": "/",
+    "status": 200,
+    "size": 546,
+    "useragent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4 (StatusCake)",
+    "ip": "104.238.185.46"
+}
+```
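
Note on the new `docs/usage/dissect.md`: the delimiter search it describes can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not rare's actual implementation. The `token` and `pattern` types and the `compile` and `match` functions are hypothetical names, and the sketch leaves out index-based extraction (`{0}`, `{1}`, ...) and `--ignore-case`.

```go
package main

import (
	"fmt"
	"strings"
)

// token is one %{name} capture plus the literal text that terminates it.
// (Hypothetical type, for illustration only.)
type token struct {
	name  string // empty when the token is skipped: %{} or %{?name}
	until string // literal delimiter after the token; "" means "rest of line"
}

// pattern is a hypothetical compiled form: a literal prefix, then tokens.
type pattern struct {
	prefix string
	tokens []token
}

// compile splits an expression such as "a %{x} b %{y}" into its
// constant delimiters and its variable tokens.
func compile(expr string) (pattern, error) {
	var p pattern
	start := strings.Index(expr, "%{")
	if start < 0 {
		p.prefix = expr
		return p, nil
	}
	p.prefix = expr[:start]
	rest := expr[start:]
	for len(rest) > 0 {
		// rest always begins with "%{" at this point.
		end := strings.Index(rest, "}")
		if end < 0 {
			return p, fmt.Errorf("unclosed token in %q", expr)
		}
		name := rest[2:end]
		if strings.HasPrefix(name, "?") {
			name = "" // skipped token
		}
		rest = rest[end+1:]
		next := strings.Index(rest, "%{")
		if next < 0 {
			next = len(rest)
		}
		p.tokens = append(p.tokens, token{name: name, until: rest[:next]})
		rest = rest[next:]
	}
	return p, nil
}

// match finds the prefix anywhere in the line, then walks the tokens,
// taking the text up to each following delimiter as that token's value.
func (p pattern) match(line string) (map[string]string, bool) {
	pos := strings.Index(line, p.prefix)
	if pos < 0 {
		return nil, false
	}
	pos += len(p.prefix)
	keys := make(map[string]string)
	for _, t := range p.tokens {
		end := len(line)
		if t.until != "" {
			i := strings.Index(line[pos:], t.until)
			if i < 0 {
				return nil, false // delimiter missing: non-match
			}
			end = pos + i
		}
		if t.name != "" {
			keys[t.name] = line[pos:end]
		}
		pos = end + len(t.until)
	}
	return keys, true
}

func main() {
	p, err := compile("prefix %{name} : %{value}")
	if err != nil {
		panic(err)
	}
	keys, ok := p.match("prefix bob : 123")
	fmt.Println(ok, keys["name"], keys["value"]) // prints: true bob 123
}
```

Because matching only needs substring searches for fixed delimiters, there is no backtracking, which is one plausible reason a dissect-style matcher can outperform a general regex engine.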
diff --git a/docs/usage/examples.md b/docs/usage/examples.md
@@ -183,10 +183,10 @@ Matched: 1,035,666 / 1,035,666 (R: 8; C: 61)
 **NOTE:** For stacking (`-s`), the results will be color-coded (not shown here)
 
 ```sh
-$ rare bars -z -m "\[(.+?)\].*\" (\d+)" -e "{buckettime {1} year}" -e "{2}" testdata/*
+$ rare bars -z -m "\[(.+?)\].*\" (\d+)" -e "{buckettime {1} year}" -e "{2}" -s testdata/*
 
-        | 200  | 206  | 301  | 304  | 400  | 404  | 405  | 408
-2019  |||||||||||||||||||||||||||||||||||||||  3,741,444
-2020  |||||||||||||||||||||||||||||||||||||||||||||||||  4,631,884
-Matched: 8,373,328 / 8,383,717
+        0 200  1 206  2 301  3 304  4 400  5 404  6 405  7 408
+2019  000000000555555555555555555555555555555  3,742,444
+2020  0000000000000000004455555555555555555555555555555  4,631,884
+Matched: 8,374,328 / 8,384,811
 ```
diff --git a/docs/usage/expressions.md b/docs/usage/expressions.md
@@ -16,7 +16,7 @@ The basic syntax structure is as follows:
  * Characters can be escaped with `\`, including `\{` or `\n`
  * Expressions are surrounded by `{}`.
  * An integer in an expression denotes a matched value from the regex (or other input) eg. `{2}`. The entire match will always be `{0}`
- * A string in an expression is a special key or a named regex group eg. `{src}` or `{group1}`
+ * A string in an expression is a special key or a named regex/dissect group eg. `{src}` or `{group1}`
  * When an expression has space(s), the first literal will be the name of a helper function.
    From there, the logic is nested. eg `{coalesce {4} {3} notfound}`
  * Quotes in an argument create a single argument eg. `{coalesce {4} {3} "not found"}`
@@ -59,7 +59,7 @@ rare histo \
 	-b access.log
 ```
 
-The above parses the method `{1}`, url `{2}`, status `{3}`, and response size `{4}` in the regex.
+The above parses the method `{1}`, url `{2}`, status `{3}`, and response size `{4}` in the matcher.
 
 It extracts the `<method> <url> <bytesize bucketed to 10k>`. It will ignore `-i` if response size `{4}` is less-than `1024*1024` (1MB).
 

diff --git a/docs/usage/extractor.md b/docs/usage/extractor.md
@@ -3,14 +3,54 @@
 The main component of *rare* is the extractor (or matcher).  There are
 three fundamental concepts around the parser:
 
- * Each line of an input (separated by `\n`) is matched to a regex
- * A regex is used to parse a line into a match (and optionally, groups)
+ * Each line of an input (separated by `\n`) is matched to a matcher
+ * A matcher is used to parse a line into a match (and optionally, groups)
  * An expression (see: [expression](expressions.md)) is used to format an
-   output from a regex group
- * Optionally, one or more ignore filter can be applied to silent matches
+   output from the matched groups
+ * Optionally, one or more ignore expressions can be applied to silence matches
    that satisfy a truthy-comparison
 
-## Decomposing a Filter
+## Matcher Types
+
+If no matcher is specified, by default, the entire line is always matched
+and passed-through to the expression-stage.
+
+More than one matcher can **not** be specified at the same time.
+
+### Regex
+
+A regex expression is specified with `--match` or `-m`, and follows common
+[regex syntax](regexp.md).
+
+When matching a regex, groups are extracted by index and, if named,
+by name as well.
+
+Set ignore-case with `-I` or `--ignore-case`.
+
+**Example:**
+
+```bash
+rare filter -m '"(\w{3,4}) ([A-Za-z0-9/.@_-]+)' access.log
+```
+
+### Dissect
+
+A dissect expression is specified with `--dissect` or `-d`, and follows
+[dissect syntax](dissect.md).
+
+Like regex, groups are extracted by both index and name.
+
+Set ignore-case with `-I` or `--ignore-case`.
+
+**Example:**
+
+```bash
+rare filter -d 'HTTP/1.1" %{code} %{size}' -e '{code}' access.log
+```
+
+## Examples
+
+### Decomposing a Matcher
 
 The most primitive way to use rare is to filter lines in an input.  We'll
 be using an example nginx log for our example.
@@ -34,7 +74,7 @@ If you want it to only output the matched portion, you can add `-e "{0}"`
 Lastly, lets say we want to ignore all paths that equal "/", we could do that by adding
 an ignore pattern: `-i {eq {1} /}`
 
-## Histograms
+### Histograms
 
 Histograms are like filters, but rather than outputting every match, it will
 create an aggregated count based on the extracted expression.
@@ -48,4 +88,5 @@ rare histogram -m '"(\w{3,4}) ([A-Za-z0-9/.@_-]+)' -e '{1} {2}' -b access.log
 
 ## See Also
 
-* [Regular Expressions](regexp.md)
+* [Regular Expressions](regexp.md)
+* [Examples](examples.md)
diff --git a/docs/usage/overview.md b/docs/usage/overview.md
@@ -23,11 +23,11 @@ Read more at:
 
 ## Extraction (Matching)
 
-Extraction is denoted with `-m` (match) and is the process of reading a line in
-a file or set of files and parsing it with a regular expression into the
-match-groups denoted by the regex.
+Extraction is denoted with `-m` (regex) or `-d` (dissect) and is the process of reading
+a line in a file or set of files and parsing it with a regular or dissect expression into the
+match-groups denoted by the matcher.
 
-If the regex doesn't match, the line is discarded (a non-match)
+If the matcher doesn't match, the line is discarded (a non-match)
 
 These match groups are then fed into the next stage, the expression.
 
@@ -62,6 +62,7 @@ Aggregator types:
 * `histogram` will count instances of the extracted key
 * `table` will count the key in 2 dimensions
 * `heatmap` will generate a 2D visualization using colored blocks to denote value
+* `sparkline` will generate a 2D visualization with the results being a sparkline
 * `bargraph` will create either a stacked or non-stacked bargraph based on 2 dimensions
 * `analyze` will use the key as a numeric value and compute mean/median/mode/stddev/percentiles
 * `reduce` allows evaluating data using expressions, and grouping/sorting the output

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -19,6 +19,7 @@ nav:
       - JSON: usage/json.md
       - Funcs File: usage/funcsfile.md
       - Regular Expressions: usage/regexp.md
+      - Dissect Expressions: usage/dissect.md
     - CLI Docs: cli-help.md
   - Benchmarks: benchmarks.md
   - Contributing: contributing.md