
Monitor

Minimal overhead

Collecting metrics is optional and incurs minimal overhead. We recommend enabling the accountant unless disk space is scarce or you need every last bit of performance for other VAST components.

VAST keeps detailed track of system metrics that reflect runtime state, such as ingestion performance, query latencies, and resource usage.

Components send their metrics to a central accountant that relays the telemetry to a configured sink. The accountant is disabled by default; when enabled, it waits for metrics reports from other components and represents the telemetry as regular vast.metrics events with the following schema:

metrics:
  record:
    - ts: timestamp
    - version: string
    - actor: string
    - key: string
    - value: string
    - metadata:
        map:
          key: string
          value: string

The ts field is always rendered in Coordinated Universal Time (UTC) without a timezone offset. If you want to correlate metrics data with VAST log messages, add the local timezone offset to arrive at the correct time window for the matching logs.
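For example, a shell with GNU coreutils can do the conversion for you (a sketch, assuming GNU date; the timestamp is taken from the example output later in this section):

date -d '2023-02-28T17:21:50Z'
# prints, e.g., "Tue Feb 28 18:21:50 CET 2023" on a host at UTC+1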

The version field is the version of VAST.

Enable metrics collection

Enable the accountant to start metrics collection in your configuration:

vast:
  enable-metrics: true

Alternatively, pass the corresponding command-line option when starting VAST: vast --enable-metrics start.
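If you deploy VAST via systemd or containers and prefer environment variables, the same option should be reachable through VAST's usual environment-variable mapping (a sketch, assuming the standard VAST_* scheme in which dots and dashes become underscores):

VAST_ENABLE_METRICS=true vast start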

Write metrics to a file or UNIX domain socket

VAST also supports writing metrics to a file or UNIX domain socket (UDS). You can enable them individually or at the same time:

vast:
  metrics:
    # Configures if and where metrics should be written to a file.
    file-sink:
      enable: false
      real-time: false
      path: /tmp/vast-metrics.log
    # Configures if and where metrics should be written to a socket.
    uds-sink:
      enable: false
      real-time: false
      path: /tmp/vast-metrics.sock
      type: datagram # possible values are "stream" or "datagram"
    # Configures if and where metrics should be written to VAST itself.
    self-sink:
      enable: false
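To consume metrics from the datagram socket, a reader must bind the socket before VAST starts sending. A minimal sketch using socat (assuming socat is installed, VAST acts as the sending client, and the path matches the configuration above):

socat -u UNIX-RECVFROM:/tmp/vast-metrics.sock,fork -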

Both the file and UDS sinks write metrics as NDJSON and inline the metadata key-value pairs into the top-level object. VAST buffers metrics for these sinks to batch I/O operations. To report metrics in real time instead, set vast.metrics.file-sink.real-time or vast.metrics.uds-sink.real-time in your configuration file.
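A single line in the file sink might then look as follows (a hypothetical record; every value, including the query UUID and issuer, is made up for illustration):

{"ts": "2023-02-28T17:21:50.000000", "version": "v3.0.0", "actor": "catalog", "key": "catalog.lookup.runtime", "value": "1.25", "query": "0c3b454f-87fc-4101-817d-0bcfa75da3c7", "issuer": "vast export"}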

Self Sink ❤️ Pipelines

The self-sink routes metrics as events into VAST's internal storage engine, allowing you to work with metrics using VAST's pipelines. The schema for the self-sink is slightly different, with the key being embedded in the schema name:

vast.metrics.<key>:
  record:
    - ts: timestamp
    - version: string
    - actor: string
    - key: string
    - value: string
    - <metadata...>
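For instance, for the ingest.rate key, whose 🗂️ metadata carries the schema name (see the reference table below), the materialized schema would look roughly like this:

vast.metrics.ingest.rate:
  record:
    - ts: timestamp
    - version: string
    - actor: string
    - key: string
    - value: string
    - schema: string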

Here's an example that shows the startup latency of VAST's stores, grouped into 10-second buckets, with the minimum and maximum latency for each bucket:

vast export json '#type == "vast.metrics.passive-store.init.runtime"
| select ts, value
| summarize min(value), max(value) by ts resolution 10s'
{"ts": "2023-02-28T17:21:50.000000", "min(value)": 0.218875, "max(value)": 107.280125}
{"ts": "2023-02-28T17:20:00.000000", "min(value)": 0.549292, "max(value)": 0.991235}
// ...
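The inlined metadata columns compose with pipelines in the same way. For example, to break down catalog lookup latency by query issuer (a sketch, assuming the 🪪 issuer metadata is materialized as a column as described above):

vast export json '#type == "vast.metrics.catalog.lookup.runtime"
| select ts, value, issuer
| summarize max(value) by issuer'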

Reference

The following table describes all available metrics keys:

| Key | Description | Unit | Metadata |
|:----|:------------|:-----|:---------|
| accountant.startup | The first event in the lifetime of VAST. | constant 0 | |
| accountant.shutdown | The last event in the lifetime of VAST. | constant 0 | |
| archive.rate | The rate of events processed by the archive component. | #events/second | |
| arrow-writer.rate | The rate of events processed by the Arrow sink. | #events/second | |
| ascii-writer.rate | The rate of events processed by the ASCII sink. | #events/second | |
| csv-reader.rate | The rate of events processed by the CSV source. | #events/second | |
| csv-writer.rate | The rate of events processed by the CSV sink. | #events/second | |
| exporter.processed | The number of processed events for the current query. | #events | 🔎 |
| exporter.results | The number of results for the current query. | #events | 🔎 |
| exporter.runtime | The runtime of the current query. | nanoseconds | 🔎 |
| exporter.selectivity | The ratio of results to processed events. | #events-results/#events-processed | 🔎 |
| exporter.shipped | The number of shipped events for the current query. | #events | 🔎 |
| importer.rate | The rate of events processed by the importer component. | #events/second | |
| index.memory-usage | A rough estimate of the memory used by the index. | #bytes | |
| ingest.rate | The ingest rate keyed by the schema name. | #events/second | 🗂️ |
| ingest-total.rate | The total ingest rate across all schemas. | #events/second | |
| json-reader.invalid-line | The number of invalid NDJSON lines. | #events | |
| json-reader.rate | The rate of events processed by the JSON source. | #events/second | |
| json-reader.unknown-layout | The number of NDJSON lines with an unknown layout. | #events | |
| json-writer.rate | The rate of events processed by the JSON sink. | #events/second | |
| catalog.lookup.candidates | The number of candidate partitions considered for a query. | #partitions | 🔎🪪 |
| catalog.lookup.runtime | The duration of a query evaluation in the catalog. | #milliseconds | 🔎🪪 |
| catalog.lookup.hits | The number of results of a query in the catalog. | #events | 🔎🪪 |
| catalog.memory-usage | A rough estimate of the memory used by the catalog. | #bytes | |
| catalog.num-partitions | The number of partitions registered in the catalog per schema. | #partitions | 🗂️#️⃣ |
| catalog.num-events | The number of events registered in the catalog per schema. | #events | 🗂️#️⃣ |
| catalog.num-partitions-total | The sum of all partitions registered in the catalog. | #partitions | |
| catalog.num-events-total | The sum of all events registered in the catalog. | #events | |
| node_throughput.rate | The rate of events processed by the node component. | #events/second | |
| null-writer.rate | The rate of events processed by the null sink. | #events/second | |
| partition.events-written | The number of events written in one partition. | #events | 🗂️ |
| partition.lookup.runtime | The duration of a query evaluation in one partition. | #milliseconds | 🔎🪪💽 |
| partition.lookup.hits | The number of results of a query in one partition. | #events | 🔎🪪💽 |
| pcap-reader.discard-rate | The rate of packets discarded. | #events-dropped/#events-received | |
| pcap-reader.discard | The number of packets discarded by the reader. | #events | |
| pcap-reader.drop-rate | The rate of packets dropped. | #events-dropped/#events-received | |
| pcap-reader.drop | The number of packets dropped by the reader. | #events | |
| pcap-reader.ifdrop | The number of packets dropped by the network interface. | #events | |
| pcap-reader.rate | The rate of events processed by the PCAP source. | #events/second | |
| pcap-reader.recv | The number of packets received. | #events | |
| pcap-writer.rate | The rate of events processed by the PCAP sink. | #events/second | |
| rebuilder.partitions.remaining | The number of partitions scheduled for rebuilding. | #partitions | |
| rebuilder.partitions.rebuilding | The number of partitions currently being rebuilt. | #partitions | |
| rebuilder.partitions.completed | The number of partitions rebuilt in the current run. | #partitions | |
| scheduler.backlog.custom | The number of custom-priority queries in the backlog. | #queries | |
| scheduler.backlog.low | The number of low-priority queries in the backlog. | #queries | |
| scheduler.backlog.normal | The number of normal-priority queries in the backlog. | #queries | |
| scheduler.backlog.high | The number of high-priority queries in the backlog. | #queries | |
| scheduler.partition.current-lookups | The number of partition lookups currently running. | #workers | |
| scheduler.partition.lookups | The number of query lookups executed on individual partitions. | #partition-lookups | |
| scheduler.partition.materializations | The number of partitions loaded from disk. | #partitions | |
| scheduler.partition.pending | The number of queued partitions. | #partitions | |
| scheduler.partition.remaining-capacity | The number of partition lookups that could be scheduled immediately. | #workers | |
| scheduler.partition.scheduled | The number of scheduled partitions. | #partitions | |
| active-store.lookup.runtime | The duration of a query evaluation in an active store. | #milliseconds | 🔎🪪💾 |
| active-store.lookup.hits | The number of results of a query in an active store. | #events | 🔎🪪💾 |
| passive-store.lookup.runtime | The duration of a query evaluation in a passive store. | #milliseconds | 🔎🪪💾 |
| passive-store.lookup.hits | The number of results of a query in a passive store. | #events | 🔎🪪💾 |
| passive-store.init.runtime | The time until the store is ready to serve queries. | nanoseconds | 💾 |
| posix-filesystem.checks.failed | The number of failed file checks since process start. | | |
| posix-filesystem.checks.successful | The number of successful file checks since process start. | | |
| posix-filesystem.erases.bytes | The number of bytes erased since process start. | #bytes | |
| posix-filesystem.erases.failed | The number of failed file erasures since process start. | | |
| posix-filesystem.erases.successful | The number of successful file erasures since process start. | | |
| posix-filesystem.mmaps.bytes | The number of bytes memory-mapped since process start. | #bytes | |
| posix-filesystem.mmaps.failed | The number of failed file memory-maps since process start. | | |
| posix-filesystem.mmaps.successful | The number of successful file memory-maps since process start. | | |
| posix-filesystem.moves.failed | The number of failed file moves since process start. | | |
| posix-filesystem.moves.successful | The number of successful file moves since process start. | | |
| posix-filesystem.reads.bytes | The number of bytes read since process start. | #bytes | |
| posix-filesystem.reads.failed | The number of failed file reads since process start. | | |
| posix-filesystem.reads.successful | The number of successful file reads since process start. | | |
| posix-filesystem.writes.bytes | The number of bytes written since process start. | #bytes | |
| posix-filesystem.writes.failed | The number of failed file writes since process start. | | |
| posix-filesystem.writes.successful | The number of successful file writes since process start. | | |
| source.start | The time point when the source started. | nanoseconds since epoch | |
| source.stop | The time point when the source stopped. | nanoseconds since epoch | |
| syslog-reader.rate | The rate of events processed by the syslog source. | #events/second | |
| test-reader.rate | The rate of events processed by the test source. | #events/second | |
| zeek-reader.rate | The rate of events processed by the Zeek source. | #events/second | |

The metadata symbols have the following meaning:

| Symbol | Key | Value |
|:-------|:----|:------|
| 🔎 | query | A UUID to identify the query. |
| 🪪 | issuer | A human-readable identifier of the query issuer. |
| 💽 | partition-type | One of "active" or "passive". |
| #️⃣ | partition-version | The internal partition version. |
| 💾 | store-type | One of "parquet", "feather", or "segment-store". |
| 🗂️ | schema | The schema name. |

For all keys that report throughput rates in #events/second, e.g., <component>.rate, the keys <component>.events and <component>.duration contain the dividend and divisor, respectively. They are not listed explicitly in the table above.
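As a worked example with made-up numbers (assuming the duration is expressed in seconds, consistent with the #events/second unit of the rate):

json-reader.events   = 4200          # dividend: events in the report window
json-reader.duration = 2             # divisor: window length in seconds
json-reader.rate     = 4200 / 2      # = 2100 #events/second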

Generally, counts reset after a component sends out a telemetry report. For example, the total number of invalid lines the JSON reader encountered is the sum of all json-reader.invalid-line events.
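With the self-sink enabled, a pipeline can compute such a total directly (a sketch; value is typed as a string, so this assumes the same numeric coercion as the min/max example above):

vast export json '#type == "vast.metrics.json-reader.invalid-line"
| summarize sum(value) by ts resolution 1h'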