Skip to main content
Version: Next

summarize

Groups events and applies aggregate functions on each group.

Synopsis

summarize <[field=]aggregation>... 
[by <extractor>... [resolution <duration>]]
[timeout <duration>]
[update-timeout <duration>]

Description

The summarize operator groups events according to a grouping expression and applies an aggregation function over each group. The operator consumes the entire input before producing an output.

Fields that neither occur in an aggregation function nor in the by list are dropped from the output.

[field=]aggregation

Aggregation functions compute a single value of one or more columns in a given group. Syntactically, aggregation has the form f(x) where f is the aggregation function and x is a field.

By default, the name for the new field aggregation is its string representation, e.g., min(timestamp). You can specify a different name by prepending a field assignment, e.g., min_ts=min(timestamp).

The following aggregation functions are available:

  • sum: Computes the sum of all grouped values.
  • min: Computes the minimum of all grouped values.
  • max: Computes the maximum of all grouped values.
  • any: Computes the disjunction (OR) of all grouped values. Requires the values to be booleans.
  • all: Computes the conjunction (AND) of all grouped values. Requires the values to be booleans.
  • mean: Computes the mean of all grouped values.
  • approximate_median: Computes the apprximate median of all grouped values with a T-Digest algorithm.
  • stddev: Computes the standard deviation of all grouped values.
  • variance: Computes the variance of all grouped values.
  • distinct: Creates a sorted list of all unique grouped values that are not null.
  • collect: Creates a list of all grouped values that are not null.
  • sample: Takes the first of all grouped values that is not null.
  • count: Counts all grouped values that are not null.
  • count_distinct: Counts all distinct grouped values that are not null.

by <extractor>

The extractors specified after the optional by clause partition the input into groups. If by is omitted, all events are assigned to the same group.

resolution <duration>

The resolution option specifies an optional duration value that specifies the tolerance when comparing time values in the by section. For example, 01:48 is rounded down to 01:00 when a 1-hour resolution is used.

NB: we introduced the resolution option as a stop-gap measure to compensate for the lack of a rounding function. The ability to apply functions in the grouping expression will replace this option in the future.

timeout <duration>

The timeout option specifies how long an aggregation may take, measured per group in the by section from when the group is created, or if no group exists from the time when first event arrived at the operator.

If values occur again after the timeout, a new group with an independent aggregation will be created.

update-timeout <duration>

The update-timeout functions just like the timeout option, but instead of measuring from the first event of a group the timeout refreshes whenever an element is added to a group.

Examples

Group the input by src_ip and aggregate all unique dest_port values into a list:

summarize distinct(dest_port) by src_ip

Same as above, but produce a count of the unique number of values instead of a list:

summarize count_distinct(dest_port) by src_ip

Compute minimum, maximum of the timestamp field per src_ip group:

summarize min(timestamp), max(timestamp) by src_ip

Compute minimum, maximum of the timestamp field over all events:

summarize min(timestamp), max(timestamp)

Create a boolean flag originator that is true if any value in the group is true:

summarize originator=any(is_orig) by src_ip

Create 1-hour groups and produce a summary of network traffic between host pairs:

summarize sum(bytes_in), sum(bytes_out) by ts, src_ip, dest_ip resolution 1 hour