Expression Documentation
This section documents Bosun's expression language, which is used to define the trigger condition for an alert. At the highest level the expression language takes various time series and reduces them to a single number. True or false indicates whether the alert should trigger or not; 0 represents false (don't trigger an alert) and any other number represents true (trigger an alert). An alert can also produce one or more groups which define the alert's scope or dimensionality. For example, you could have one alert per host, service, or cluster, or a single alert for your entire environment.
There are three data types in Bosun's expression language:

- **scalar**: a single numeric value with no group attached to it.
- **numberSet**: a set of numbers, one per unique group (tag set).
- **seriesSet**: a set of time series (timestamp/value pairs), one per unique group.

Keep in mind that an empty group `{}` is still a group. In the vast majority of your alerts you will be getting seriesSets back from your time series database and reducing them into numberSets.
Groups are generally provided by your time series database. We also sometimes refer to groups as "tags". When you query your time series database and get multiple time series back, each time series needs an identifier. So for example if I make a query with something like `host=*` then I will get one time series per host. Host is the tag key, and the various values returned, i.e. `host1`, `host2`, `host3`…, are the tag values. Therefore the group for a single time series is something like `{host=host1}`. A group can have multiple tag keys, and will have one tag value for each key.
Each group can become its own alert instance. This is what we mean by scope or dimensionality. Thus, you can do things like `avg(q("sum:sys.cpu{host=ny-*}", "5m", "")) > 0.8` to check the CPU usage for many New York hosts at once. The dimensions can be manipulated with our expression language.
Various metrics can be combined by operators as long as one group is a subset of the other. A subset is when one of the groups contains all of the tag key-value pairs in the other. An empty group `{}` is a subset of all groups. `{host=foo}` is a subset of `{host=foo,interface=eth0}`, and neither `{host=foo,interface=eth0}` nor `{host=foo,partition=/}` is a subset of the other. Equal groups are considered subsets of each other.
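For illustration, a hedged sketch (a hypothetical sys.cpu metric with host and core tags) of an operator joining sets whose groups are subsets of one another:

$perCore = q("sum:sys.cpu{host=ny-*,core=*}", "5m", "")   # groups like {host=ny-web01,core=0}
$perHost = avg(q("sum:sys.cpu{host=ny-*}", "5m", ""))     # groups like {host=ny-web01}
# {host=ny-web01} is a subset of {host=ny-web01,core=0}, so the division
# joins each per-core series with its host's average:
$perCore / $perHost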
The standard arithmetic (`+`, binary and unary `-`, `*`, `/`, `%`), relational (`<`, `>`, `==`, `!=`, `>=`, `<=`), and logical (`&&`, `||`, and unary `!`) operators are supported. Examples:
q("q") + 1
, which adds one to every element of the result of the query "q"
-q("q")
, the negation of the results of the query5 > q("q")
, a series of numbers indicating whether each data point is more than five6 / 8
, the scalar value three-quartersIf you combine two seriesSets with an operator (i.e. q(..)
+ q(..)
), then operations are applied for each point in the series if there is a corresponding datapoint on the right hand side (RH). A corresponding datapoint is one which has the same timestamp (and normal group subset rules apply). If there is no corresponding datapoint on the left side, then the datapoint is dropped. This is a new feature as of 0.5.0.
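A minimal sketch of this point-wise join, using the series() function (documented later in this section) so the timestamps are explicit:

$a = series("host=web01", 0, 1, 60, 2, 120, 3)
$b = series("host=web01", 0, 10, 120, 30)
# Timestamps 0 and 120 exist on both sides; the 60 datapoint in $a is
# dropped, yielding {host=web01}: 11 at 0 and 33 at 120
$a + $b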
From highest to lowest:

1. `()` and the unary operators `!` and `-`
2. `*`, `/`, `%`
3. `+`, `-`
4. `==`, `!=`, `>`, `>=`, `<`, `<=`
5. `&&`
6. `||`
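For example, the following sketch (hypothetical metric) relies on the relational operators binding tighter than `&&`, so no parentheses are needed:

# Parsed as (avg(...) > 0.8) && (since(...) < 600):
avg(q("avg:rate:os.cpu{host=*}", "5m", "")) > 0.8 && since(q("avg:rate:os.cpu{host=*}", "5m", "")) < 600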
Numbers may be specified in decimal (e.g., `123.45`), octal (with a leading zero like `072`), or hex (with a leading 0x like `0x2A`). Exponentials and signs are supported (e.g., `-0.8e-2`).
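For instance, these literal forms denote equal values:

# 072 (octal) and 0x3A (hex) are both 58, and -0.8e-2 is -0.008:
072 == 0x3A && -0.8e-2 == -0.008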
alert haproxy_session_limit {
    template = generic
    $notes = This alert monitors the percentage of sessions against the session limit in haproxy (maxconn) and alerts when we are getting close to that limit and will need to raise that limit. This alert was created due to a socket outage we experienced for that reason
    $current_sessions = max(q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", ""))
    $session_limit = max(q("sum:haproxy.frontend.slim{host=*,pxname=*,tier=*}", "5m", ""))
    $query = ($current_sessions / $session_limit) * 100
    warn = $query > 80
    crit = $query > 95
    warnNotification = default
    critNotification = default
}
We don't need to understand everything in this alert, but it is worth highlighting a few things to get oriented:

- `haproxy_session_limit` is the name of the alert. An alert instance is uniquely identified by its alert name and group, i.e. `haproxy_session_limit{host=lb,pxname=http-in,tier=2}`.
- `$notes` is a variable. Variables are not smart, they are just text replacement. If you are familiar with macros in C, this is a similar concept. These variables can be referenced in notification templates, which is why we have a generic one for notes.
- `q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", "")` is an OpenTSDB query function. It returns N series; based on the query, we know each series will have the host, pxname, and tier tag keys in its group.
- `max(...)` is a reduction function. It takes each series and reduces it to a number (see the Data Types section above).
- `$current_sessions / $session_limit`: these variables represent numbers and will have subset group matches, therefore you can use the `/` operator between them.
- `warn = $query > 80`: if this is true (non-zero) then the `warnNotification` will be triggered.

These functions are considered preview as of August 2018. The names, signatures, and behavior of these functions might change as they are tested in real world usage.
The Azure Monitor datasource queries Azure for metric and resource information. These functions are available when AzureMonitorConf is defined in the system configuration.
These requests are subject to the Azure Resource Manager Request Limits, so when using the `az` and `azmulti` functions you should be mindful of how many API calls your alerts are making given your configured check interval. Also, using the historical testing feature to query multiple intervals of time could quickly eat through your request limit.

Currently there is no special treatment or instrumentation of the rate limit by Bosun, other than that errors are expected once the rate limit is hit, and a warning will be logged when a request responds with fewer than 100 reads remaining.
PrefixKey is a quoted string used to query Azure with different clients from a single instance of Bosun. It can be passed as a prefix to Azure query functions as in the example below. If no prefix is used then the query will be made with the default Azure client.
$resources = ["foo"]azrt("Microsoft.Compute/virtualMachines")
$filteresRes = azrf($resources, "client:.*")
["foo"]azmulti("Percentage CPU", "", $resources, "max", "5m", "1h", "")
az queries the Azure Monitor REST API for time series data for a specific metric and resource. Responses will include at least two tags: `name=<resourceName>,rsg=<resourceGroupName>`. If the metric supports multiple dimensions and tagKeysCSV is non-empty, additional tag keys are added to the response.
- `namespace` is the Azure namespace that the metric lives under. The "Supported metrics with Azure Monitor" documentation contains a list of those namespaces, for example `Microsoft.Cache/redis` and `Microsoft.Compute/virtualMachines`.
- `metric` is the name of the metric under the corresponding `namespace` that you want to query, for example `Percentage CPU`.
- `tagKeysCSV` is a comma-separated list of dimension keys that you want the response to group by. For example, the `Per Disk Read Bytes/sec` metric under `Microsoft.Compute/virtualMachines` has a `SlotId` dimension, so if you pass `"SlotId"` for this argument, `SlotId` will become a tag key in the response with the values corresponding to each slot (i.e. `0`).
- `rsg` is the name of the Azure resource group that the resource is in.
- `resName` is the name of the resource.
- `agType` is the type of aggregation to use; it can be `avg`, `min`, `max`, `total`, or `count`. If an empty string, the default is `avg`.
- `interval` is the Azure timegrain to use, without "PT" and in lower case (ISO 8601 duration format). Common supported timegrains are `1m`, `5m`, `15m`, `30m`, `1h`, `6h`, `12h`, and `1d`.
- `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.

Examples:
az("Microsoft.Compute/virtualMachines", "Percentage CPU", "", "myResourceGroup", "myFavoriteVM", "avg", "5m", "1h", "")
az("Microsoft.Compute/virtualMachines", "Per Disk Read Bytes/sec", "SlotId", "myResourceGroup", "myFavoriteVM", "max", "5m", "1h", "")
azrt (Azure Resources By Type) gets a list of Azure Resources that exist for a certain type. For example, `azrt("Microsoft.Compute/virtualMachines")` would return all virtualMachine resources. This list of resources can then be passed to `azrf()` (Azure Resource Filter) for additional filtering or to a query function that takes AzureResources as an argument like `azmulti()`.

An error will be returned if you attempt to pass resources fetched with one Azure client to a query using a different client. In other words, the resources call (e.g. `azrt()`) must use the same prefix as the time series query (e.g. `azmulti()`).

The underlying implementation of this fetches all resources and caches that information, so additional azrt calls within a scheduled check cycle will not result in additional calls to Azure's API.
azrf (Azure Resource Filter) takes a resource list and filters it to fewer resources based on the filter. The resources argument would usually be an `azrt()` call or another `azrf` call.

The filter argument supports joining terms in `()` as well as the `AND`, `OR`, and `!` operators. The following query terms are supported and are always in the format of something:something. The first part of each term (the key) is case insensitive.

- `name:<regex>` where the resource name matches the regular expression.
- `rsg:<regex>` where the resource group of the resource matches the regular expression.
- `otherText:<regex>` will match resources based on Azure tags. `otherText` would be the tag key and the regex will match against the tag's value. If the tag key does not exist on the resource then there will be no match.

Regular expressions use Go's regular expressions, which use the RE2 syntax. If you want an exact match and not a substring, be sure to anchor the term with something like `rsg:^myRSG$`.
Example:
$resources = azrt("Microsoft.Compute/virtualMachines")
# Filter resources to those with client azure tag that has any value
$filteredRes = azrf($resources, "client:.*")
azmulti("Percentage CPU", "", $filteredRes, "max", "5m", "1h", "")
Note that `azrf()` does not take a prefix key since it is filtering resources that have already been retrieved. The resulting azureResources will still be associated with the correct client/prefix.
azmulti (Azure Multiple Query) queries a metric for multiple resources and returns them as a single series set. The arguments metric, tagKeysCSV, agType, interval, startDuration, and endDuration all behave the same as in the `az` function. Also like the `az` function, the result will be tagged with `rsg`, `name`, and any dimensions from tagKeysCSV.

The resources argument is a list of resources (an azureResourcesType) as returned by `azrt` and `azrf`.
Each resource queried requires an Azure Monitor API call, so if there are 20 items in the resource set, 20 calls are made that count toward the rate limit. This function exists because most metrics do not have dimensions on primary attributes like the machine name.
Example:
$resources = azrt("Microsoft.Compute/virtualMachines")
azmulti("Percentage CPU", "", $resources, "max", "PT5M", "1h", "")
Queries for Azure Application Insights use the same system configuration as the Azure Monitor query functions. Therefore these functions are available when AzureMonitorConf is defined in the system configuration. However, a different API is used to query these metrics. In order for these to work you will have to have AAD Auth set up for the client user.

Currently only Application Insights metrics are supported; events are not supported.

These queries share the same Prefix Key as Azure Monitor queries.
aiapp (Application Insights Apps) gets a list of Azure Application Insights applications/resources to query. This can be passed to the `ai()` function, or filtered to a subset of applications using the `aiappf()` function, which can then also be passed to the `ai()` function.
The implementation for getting the list of applications uses the Azure components/list REST API.
aiappf (Application Insights Apps Filter) filters a list of applications from `aiapp()` to a subset of applications based on the `filter` string. The result can then be passed to the `ai()` function. The filter behaves in a similar way to the way `azrf()` filters resources.

The filter argument supports joining terms in `()` as well as the `AND`, `OR`, and `!` operators. The following query terms are supported and are always in the format of something:something. The first part of each term (the key) is case insensitive.

- `name:<regex>` where the resource name of the insights application matches the regular expression.
- `otherText:<regex>` will match insights applications based on the Azure tags on the insights application resource. `otherText` would be the tag key and the regex will match against the tag's value. If the tag key does not exist on the resource then there will be no match.

Regular expressions use Go's regular expressions, which use the RE2 syntax. If you want an exact match and not a substring, be sure to anchor the term with something like `name:^myApp$`.
ai (Application Insights) queries Application Insights metrics from multiple Application Insights applications, tagging the values with the `app=AppName` key-value pair, where AppName is the name of the Application Insights resource. The response will also be tagged by segments if any are requested.
- `metric` is the name of the metric you wish to query. A list of "Default Metrics" is listed in the API Documentation. You can also use the `aimd()` function to see what metrics are available.
- `segmentsCSV` is a comma-separated list of "segments" that you want the response to group by. For example, with the default metric `requests/count` you might have `client/countryOrRegion,cloud/roleInstance`. You can also use the `aimd()` function to see what segments/dimensions are available.
- `filter` is an OData filter that can be used to refine results. See more information below.
- `apps` is a list of Azure applications to query, returned by `aiapp()` or `aiappf()`.
- `agType` is the aggregation type to use. Common values are `avg`, `min`, `max`, `sum`, or `count`. If the aggregation type is not available, the error will indicate what types are. You can use the `aimd()` function to see what aggregations are available.
- `interval` is the Azure timegrain to use, without "PT" and in lower case (ISO 8601 duration format). Common supported timegrains are `1m`, `5m`, `15m`, `30m`, `1h`, `6h`, `12h`, and `1d`. If empty the value will be `1m`.
- `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.

Regarding the `filter` argument, Azure's documentation is not clear on supported OData operations. That being said, here are some observations:

- `startswith` and `contains` are valid string operations in the filter.
- Fields used in the filter do not have to be included in `segmentsCSV`.

These requests are subject to a different rate limit.
Using Azure Active Directory for authentication, throttling rules are applied per AAD client user. Each AAD user is able to make up to 200 requests per 30 seconds, with no cap on the total calls per day.

An HTTP request is made per application. Unlike `azmulti()`, these requests are serial and not parallelized since the rate limit has a relatively short duration (30 seconds). That means you can expect this query to be slow in proportion to the number of applications you are querying.
Example:
$selectedApps = aiappf(aiapp(), "environment:prd")
$filter = "startswith(operation/name, 'POST')"
ai("requests/duration", "cloud/roleInstance", $filter, $selectedApps, "avg", "1h", "3d", "")
aimd (Application Insights Metadata) returns metrics and their related aggregations and dimensions/segments per application. The list of applications should be provided with `aiapp()` or `aiappf()`. For most use cases, filtering to a single app is ideal since the metadata object for each application is generally fairly large.

This is not meant to be used in the normal expression workflow (e.g. not for alerting or templates); rather, it exists so that in Bosun's expression editor UI you can get a list of what can be queried with the `ai()` function.
Performs a Graphite query. The duration format is the internal Bosun format (which happens to be the same as OpenTSDB's format). It functions pretty much the same as q() (see that for more info) but for Graphite. The format string lets you annotate how to parse series as returned by Graphite, so as to yield tags in the format that Bosun expects. The tags are dot-separated and the number of "nodes" (dot-separated words) should match what Graphite returns. Irrelevant nodes can be left empty.
For example:
- `groupByNode(collectd.*.cpu.*.cpu.idle,1,'avg')` returns a seriesSet named like `host1`, `host2`, etc., in which case the format string can simply be `host`.
- `collectd.web15.cpu.*.cpu.*` returns a seriesSet named like `collectd.web15.cpu.3.idle`, requiring a format like `.host..core..cpu_type`.
For advanced cases, you can use graphite’s alias(), aliasSub(), etc to compose the exact parseable output format you need. This happens when the outer graphite function is something like “avg()” or “sum()” in which case graphite’s output series will be identified as “avg(some.string.here)”.
Like band() but for graphite queries.
Queries InfluxDB.
All tags returned by InfluxDB will be included in the results.
- `db` is the database name in InfluxDB.
- `query` is an InfluxDB select statement. NB: WHERE clauses for `time` are inserted automatically, and it is thus an error to specify `time` conditions in the query.
- `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details. They will be merged into the existing WHERE clause in the `query`.
- `groupByInterval` is the `time.Duration` window which will be passed as an argument to a GROUP BY time() clause if given. This groups values into the given time buckets; it groups (or in OpenTSDB lingo, "downsamples") the results to this timeframe. Full documentation is in the InfluxDB Group by docs. Use `fill(none)` to filter out any nil rows.

You may need triple single quotes (`'''`) for many queries. When using single quotes in triple single quotes, you may need a space. So for example `'''select max(value) from "my.measurement" where key = 'val''''` is not valid but `'''select max(value) from "my.measurement" where key = 'val' '''` is.

These influx and opentsdb queries should give roughly the same results:
influx("db", '''SELECT non_negative_derivative(mean(value)) FROM "os.cpu" GROUP BY host''', "30m", "", "2m")
q("sum:2m-avg:rate{counter,,1}:os.cpu{host=*}", "30m", "")
Querying graphite sent to influx (note the quoting):
influx("graphite", '''select sum(value) from "df-root_df_complex-free" where env='prod' and node='web' ''', "2h", "1m", "1m")
Elastic replaces the deprecated logstash (ls) functions. It only works with Elastic v2+. It is meant to work with any elastic documents that have a time field, not just logstash documents. It introduces two new types to allow for greater flexibility in querying. The ESIndexer type generates index names to query (based on the date range); there are different functions to generate indexers for people with different configurations. The ESQuery type generates elastic queries so you can filter your results. By making these new types, new Indexers and Elastic queries can be added over time.

You can view the generated JSON for queries on the expr page by bringing up miniprofiler with Alt-P.
PrefixKey is a quoted string used to query different elastic clusters from a single instance of Bosun; it can be passed as a prefix to the elastic query functions mentioned below. If not used, the query will be made on the default cluster.
Querying foo cluster:
$index = esindices("timestamp", "errors")
$filter = esquery("nginx", "POST")
crit = max(["foo"]escount($index, "host", $filter, "1h", "30m", "")) > 2
escount returns a time-bucketed count of matching documents. It uses the keyString, indexRoot, interval, and durations to create an elastic Date Histogram Aggregation.

- `indexRoot` will always be a function that returns an ESIndexer, such as `esdaily`.
- `keyString` is a comma-separated list of fields. The fields will become tag keys, and the values returned for the fields become the corresponding tag values, for example `host,errorCode`. If an empty string is given, then the result set will have a single series with an empty tagset `{}`. These keys become terms filters for the date histogram.
- `filter` will be a function that returns an ESQuery. The queries further refine the results. The fields you filter on can match the fields in the keyString, but don't have to. If you don't want to filter your results, use `esall()` here.
- `bucketDuration` is an opentsdb duration string. It sets the span of time to bucket the count of documents. For example, "1m" will give you the count of documents per minute.
- `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.

estat returns various summary stats per bucket for the specified `field`. The field must be numeric in elastic. rStat can be one of `avg`, `min`, `max`, `sum`, `sum_of_squares`, `variance`, `std_deviation`. The rest of the fields behave the same as escount.
esdaily is for elastic indexes that have a date name for each day. It uses the timeframe of the enclosing es function (i.e. estat and escount) to generate which indexes should be included in the query. It gets all indexes and won't include indices that don't exist. The layout specifier uses Go's time specification format. The timeField is the name of the field in elastic that contains timestamps for the documents.

esmonthly is like esdaily except that it is for monthly indices. It expects the index name to be based on the first day of every month.

esindices takes one or more literal indices for the enclosing query to use. It does not check for existence of the index, and passes back the elastic error if the index does not exist. The timeField is the name of the field in elastic that contains timestamps for the documents.
esls is a shortcut for esdaily("@timestamp", indexRoot+"-", "2006.01.02") and is for the default daily format that logstash creates.
esall returns an elastic matchall query, use this when you don’t want to filter any documents.
esregexp creates an elastic regexp query for the specified field.
esquery creates a full-text elastic query string query.
esand takes one or more ESQueries and combines them into an elastic bool query where all the queries “must” be true.
esor takes one or more ESQueries and combines them into an elastic bool query so that at least one must be true.
esnot takes a query and inverts the logic using must_not from an elastic bool query.
esexists is true when the specified field exists.
### esgt(field string, value Scalar) ESQuery

esgt takes a field (expected to be a numeric field in elastic) and returns results where the value of that field is greater than the specified value. It creates an elastic range query.

esgte takes a field (expected to be a numeric field in elastic) and returns results where the value of that field is greater than or equal to the specified value. It creates an elastic range query.

eslt takes a field (expected to be a numeric field in elastic) and returns results where the value of that field is less than the specified value. It creates an elastic range query.

eslte takes a field (expected to be a numeric field in elastic) and returns results where the value of that field is less than or equal to the specified value. It creates an elastic range query.
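As a combined sketch tying these builders together (the index root, field names, and values here are hypothetical, and the signatures follow the examples above):

$index = esls("logstash")
# Count error documents per host, excluding test hosts and statuses below 500:
$filter = esand(esquery("message", "error"), esnot(esregexp("host", "test.*")), esgt("status", 499))
escount($index, "host", $filter, "5m", "1h", "")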
Query functions take a query string (like `sum:os.cpu{host=*}`) and return a seriesSet.
Generic query from endDuration to startDuration ago. If endDuration is the empty string (`""`), now is used. Supported duration units are listed in the docs. Refer to the docs for query syntax. The query argument is the value part of the `m=...` expressions. `*` and `|` are fully supported. In addition, queries like `sys.cpu.user{host=ny-*}` are supported. These are performed by an additional step which determines valid matches, and replaces `ny-*` with `ny-web01|ny-web02|...|ny-web10` to achieve the same result. This lookup is kept in memory by the system and does not incur any additional OpenTSDB API requests, but does require scollector instances pointed to the bosun server.
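A minimal sketch, assuming a hypothetical os.cpu metric:

# One series per matching host, covering the window from 1h ago to now:
avg(q("avg:rate:os.cpu{host=ny-web*}", "1h", ""))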
Band performs `num` queries of `duration` each, `period` apart, and concatenates them together, starting `period` ago. So `band("avg:os.cpu", "1h", "1d", 7)` will return a series comprising the given metric from 1d to 1d-1h-ago, 2d to 2d-1h-ago, etc., until 8d. This is a good way to get a time block from a certain hour of a day or certain day of a week over a long time period.
Note: this function wraps a more general version `bandQuery(query string, duration string, period string, eduration string, num scalar) seriesSet`, where `eduration` specifies the end duration for the query to stop at, as with `q()`.
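As a usage sketch (hypothetical metric and host), comparing the current hour against the same hour's history from the past week:

$history = band("avg:os.cpu{host=ny-web01}", "1h", "1d", 7)
$current = avg(q("avg:os.cpu{host=ny-web01}", "1h", ""))
# True when the current hourly average is more than double the median of
# the seven historical one-hour windows:
$current > 2 * median($history)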
Over's arguments behave the same way as band's. However, over shifts the time of previous periods to be now, tags them with the duration that each period was shifted, and merges those shifted periods into a single seriesSet, which includes the most recent period. This is useful for displaying time-over-time graphs. For example, the same day week over week would be `over("avg:1h-avg:rate:os.cpu{host=ny-bosun01}", "1d", "1w", 4)`.

Note: this function wraps a more general version `overQuery(query string, duration string, period string, eduration string, num scalar) seriesSet`, where `eduration` specifies the end duration for the query to stop at, as with `q`. Results are still shifted to end at the current time.
shiftBand's behaviour is very similar to `over`, however the most recent period is not included in the seriesSet. This function could be useful for anomaly detection when used with `aggr`, to calculate historical distributions to compare against.
Change is a way to determine the change of a query from startDuration to endDuration. If endDuration is the empty string (`""`), now is used. The query must either be a rate or a counter converted to a rate with the `agg:rate:metric` flag.
For example, assume you have a metric `net.bytes` that records the number of bytes that have been sent on some interface since boot. We could just subtract the end number from the start number, but if a reboot or counter rollover occurred during that time our result will be incorrect. Instead, we ask OpenTSDB to convert our metric to a rate and handle all of that for us. So, to get the number of bytes in the last hour, we could use:
change("avg:rate:net.bytes", "60m", "")
Note that this is implemented using Bosun's `avg` function. The following is exactly the same as the above example:
avg(q("avg:rate:net.bytes", "60m", "")) * 60 * 60
Count returns the number of groups in the query as an ungrouped scalar.
Window performs `num` queries of `duration` each, `period` apart, starting `period` ago. The results of the queries are run through `funcName`, which must be a reduction function taking only one argument (that is, a function that takes a series and returns a number); a series is then made from those numbers. So `window("avg:os.cpu{host=*}", "1h", "1d", 7, "dev")` will return a series comprising the standard deviation of the given metric from 1d to 1d-1h-ago, 2d to 2d-1h-ago, etc., until 8d. It is similar to the band function, except that instead of concatenating series together, each series is reduced to a number, and those numbers are made into a series.
In addition to supporting Bosun's reduction functions that take one argument, percentile operations may be done by setting `funcName` to `p` followed by a number that is between 0 and 1 (inclusive). For example, `"p.25"` will be the 25th percentile, `"p.999"` will be the 99.9th percentile. `"p0"` and `"p1"` are min and max respectively (however, in these cases it is recommended to use `"min"` and `"max"` for the sake of clarity).
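For instance, a sketch of a percentile reduction over daily windows (hypothetical metric):

# One datapoint per day: the 99.9th percentile of that day's one-hour window
window("avg:os.cpu{host=*}", "1h", "1d", 7, "p.999")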
Prometheus query functions query Prometheus TSDB(s) using the Prometheus HTTP v1 API. When `PromConf` in the system configuration is added, these functions become available.
There are currently two types of functions: functions that return time series sets (seriesSet) and information functions that are meant to be used interactively in the expression editor for information about metrics and tags.
The PrefixKey is a quoted string used to query the different Prometheus backends defined in `PromConf` in the system configuration. If the PrefixKey is missing (there are no brackets before the function), then "default" is used. For example, the prefix in the following is `["it"]`:
["it"]prom("up", "namespace", "", "sum", "5m", "1h", "")
In the case of `promm` and `promratem`, the prefix may have multiple keys separated by commas to allow for querying multiple prom datasources at once, for example:
["it,default"]promm("up", "namespace", "", "sum", "5m", "1h", "")
When a Prometheus query is made, all time series in the response do not have to have the same set of tag keys. For example, when making a PromQL request that has a group by `(host,interface)`, results may be included in the response that contain only `host`, only `interface`, or no tag keys at all. Bosun requires that the tag keys be consistent for each series within a seriesSet. Therefore, these results are removed from the responses when using functions like `prom`, `promrate`, `promm`, and `promratem`.
Note: This behavior may change in the future to an alternative design. Instead of dropping these series, the series could be retained but the missing tag keys would be added to the response with some sort of value to represent that the tag is missing.
prom queries a Prometheus TSDB for time series data. It accomplishes this by generating a PromQL query from the given arguments.
- `metric` is the name of the metric to query. To get a list of available metrics use the `prommetrics()` function.
- `groupByTags` is a comma-separated list of tag keys to aggregate the response by.
- `filter` filters the results using Prometheus Time Series Selectors. This is analogous to a WHERE clause in SQL. For example: `job=~".*",method="get"`. Operators are `=`, `!=`, `=~`, and `!~` for equals, not equals, RE2 match, and not RE2 match respectively. This string is inserted into the generated PromQL query directly.
- `agType` is the aggregation function to perform, such as `"sum"` or `"avg"`. It can be any Prometheus aggregation operator.
- `stepDuration` is Prometheus's evaluation step duration. This is like downsampling, except that it takes the datapoint that is most recently before (or matching) the step based on the start time. If there are no samples in that duration, the sample will be repeated. See [Prometheus Docs Issue #699](https://github.com/prometheus/docs/issues/699).
- `startDuration` and `endDuration` determine the start and end time based on the current time (or the currently selected time in the expression/rule editor). They are then used to send an absolute time range for the Prometheus request.

Example:
$metric = "up"
$groupByTags = "namespace"
$filter = ''' service !~ "kubl.*" '''
$agg = "sum"
$step = "1m"
prom($metric, $groupByTags, $filter, $agg, $step, "1h", "")
The above example would generate a PromQL query of `sum( up { service !~ "kubl.*" } ) by ( namespace )`. The time range and step are sent via HTTP query parameters.
promrate is like the `prom` function, except that it is for per-second rate calculations on metrics that are counters. It therefore includes the extra `rateStepDuration` argument, which is the step of the rate calculation. The `stepDuration` is then the step of the aggregation operation that is on top of the calculated rate. This is performed using the `rate()` function in PromQL.
Example:
$metric = "container_memory_working_set_bytes"
$groupByTags = "container_name,namespace"
$filter = ''' container_name !~ "pvc-.*$" '''
$agg = "sum"
$rateStep = "1m"
$step = "5m"
promrate($metric, $groupByTags, $filter, $agg, $rateStep, $step, "1h", "")
The above example would generate a PromQL query of `sum(rate( container_memory_working_set_bytes { container_name !~ "pvc-.*$" } [1m] )) by ( container_name,namespace )`. The time range and step are sent via HTTP query parameters.
promm (Prometheus Multiple) is like the `prom` function, except that it queries multiple Prometheus TSDBs and combines the result into a single seriesSet. A tag key of `bosun_prefix` with the tag value set to the prefix is added to the results to ensure that series are unique in the result.
Example:
$metric = "container_memory_working_set_bytes"
$groupByTags = "container_name,namespace"
$filter = ''' container_name !~ "pvc-.*$" '''
$agg = "sum"
$step = "5m"
$q = ["it,default"]promm($metric, $groupByTags, $filter, $agg, $step, "1h", "")
max($q)
# You could use the aggr function to aggregate across clusters if you like
# aggr($q, $groupByTags, $agg)
In the above example `$q` will be a seriesSet with the tag keys of `container_name`, `namespace`, and `bosun_prefix`. The values for the `bosun_prefix` key will be either `it` or `default` for each series in the set.
promratem (Prometheus Rate Multiple) is to the `promrate` function as the `promm` function is to the `prom` function. It allows you to do a per-second rate query against multiple Prometheus TSDBs and combines the result into a single seriesSet, adding the `bosun_prefix` tag key to the result. It behaves the same as the `promm` function, but like `promrate`, it has the extra `rateStepDuration` argument.
Instead of building a PromQL query like the `prom` and `promrate` functions, promras (Prometheus Raw Aggregate Series) allows you to query Prometheus using PromQL with some restrictions:

- The query must contain an aggregation operation with a `by` clause.

Example:
promras(''' sum(rate(container_fs_reads_total[1m]) + rate(container_fs_writes_total[1m])) by (namespace) ''', "2m", "2h", "")
prommras (Prometheus Multiple Raw Aggregate Series) is like the `promras` function except that it queries multiple Prometheus instances and adds the "bosun_prefix" tag to the results like the `promm` and `promratem` functions.
Example:
# You can still use string interpolation of $variables in promras and prommras
$step = 1m
$reads = container_fs_reads_total[$step]
$writes = container_fs_writes_total[$step]
["default,it"]prommras(''' sum(rate($reads) + rate($writes)) by (namespace) ''', "2m", "2h", "")
prommetrics returns a list of metrics that are available in the Prometheus TSDB. This is not meant to be used in alerting; it is for use in the expression editor for getting information to build queries. For example, you might open up another expression tab in bosun and use the output as a reference. This function supports a prefix, so examples would be `prommetrics()` and `["it"]prommetrics()`.

It gets the list of metrics by using the Prometheus Label Values HTTP API to get the values for `__name__`.
promtags returns various tag information for the metric (“tag” ~= “Label” in Prometheus terminology). It does a raw query (querying the metric only) for the provided duration and returns the tag information for the metric in that given time period. This is not meant to be used in alerting, it is for use in the expression editor for getting information to build queries.
The result has the following Properties:
Examples: promtags("up", "10", "")
, ["it"]promtags("container_memory_working_set_bytes")
.
These functions are available when CloudWatch is enabled via Bosun's configuration. Query syntax is potentially subject to change in later releases.
The parameters are as follows:
- `region`: the Amazon region(s) for the service metrics you are interested in, e.g. `eu-west-1,eu-central-1`
- `namespace`: the CloudWatch namespace which the metric you want to query exists under, e.g. `AWS/S3`
- `metric`: the CloudWatch metric you wish to query, e.g. `NumberOfObjects`
- `dimensions`: a string containing dimension key/value pairs separated by `:`
- `period`: the size of the bucket to use for grouping datapoints, expressed as a time string, e.g. `1m`
- `statistic`: which aggregator to use to combine the datapoints in each bucket, e.g. `Sum`
- `startDuration` and `endDuration`: set the time window from now - see the OpenTSDB q() function for more details

A complete example returning the counts of infrequent access objects in our s3 bucket over the last hour:
$region = "eu-west-1"
$namespace = "AWS/S3"
$metric = "NumberOfObjects"
$period = "1m"
$statistics = "Average"
$dimensions = "BucketName:my-s3-bucket,StorageType:STANDARD_IA"
$objectCount = cw($region, $namespace, $metric, $period, $statistics, $dimensions, "1h", "")
You can use * as a wildcard character in dimensions to match multiple series
$region = "eu-west-1,eu-central-1"
$namespace = "AWS/ELB"
$metric = "HealthyHostCount"
$period = "5m"
$statistics = "Minimum"
$dimensions = "LoadBalancerName:web-*,AvailabilityZone:*"
$healthyHosts = cw($region, $namespace, $metric, $period, $statistics, $dimensions, "7d", "")
PrefixKey is a quoted string used to query different AWS accounts by passing the name of a profile from the Amazon credentials file. If omitted, the query will be made using the default credentials chain.
Credentials file example:
[prod]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
[test]
aws_access_key_id=BKyfyfIAIDNN7EXAMPLE
aws_secret_access_key=Ays6tnFEMI/ASD7D6/bPxRfiCYEXAMPLEKEY
Example of querying using multiple accounts
$region = "eu-west-1"
$namespace = "AWS/EC2"
$metric = "CPUUtilization"
$period = "1m"
$statistics = "Average"
$prodDim = "InstanceId":"i-1234567890abcdef0"
$testDim = "InstanceId":"i-0598c7d356eba48d7"
$p = ["prod"]cw($region, $namespace, $metric, $period, $statistics, $prodDim, "1h" ,"")
$t = ["test"]cw($region, $namespace, $metric, $period, $statistics, $testDim, "1h" ,"")
These functions are available when annotate is enabled via Bosun's configuration.
For the following annotation functions, `filter` is a string with the following specification.

Items in a filter are in the format `keyword:value`. The value is either a glob pattern or literal string to match, or the reserved word `empty`, which means that the value of the field is an empty string.

Possible keywords are: `owner`, `user`, `host`, `category`, `url`, and `message`.

All items can be combined with boolean logic by using parenthesis grouping, `!` as not, `AND` as logical and, and `OR` as logical or.

For example, `"owner:sre AND ! user:empty"` would show things that belong to sre and have a username specified. (When annotations are created by a process, we don't specify a user.)
Antable is meant for showing annotations in a Grafana table, where Grafana's "To Table Transform" under options is set to type "Table".

See Annotation Filters above to understand filters. FieldsCSV is a list of columns to display in the table. They can be in any order. The possible columns you can include are: `start`, `end`, `owner`, `user`, `host`, `category`, `url`, `link`, `message`, and `duration`. At least one column must be specified.

`link` is unlike the others in that it actually returns the HTML to construct a link, whereas `url` is the text of the link. This is so that when using a Grafana table and Grafana v3.1.1 or later, you can have a link in a table as long as you enable sanitize HTML within the Grafana Column Styles.

For example: `antable("owner:sre AND category:outage", "start,end,user,owner,category,message", "8w", "")` will return a table of annotations with the selected columns in FieldsCSV going back 8 weeks from the time of the query.
ancounts returns a series representing the number of annotations that matched the filter for the specified period. One might expect a number instead of a series, but by having a series it has a useful property: we can count outages that spanned across the requested time frame and count them as fractional outages.

If an annotation's timespan is contained entirely within the request timespan, or the timespan of the request is within the timespan of the annotation, a 1 is added to the series.

If an annotation either starts before the requested start time or ends after the requested end time, then it is counted as a fractional outage (assuming the annotation ended or started, respectively, within the requested time frame).

If there are no annotations within the requested time period, then the value `NaN` will be returned.
For example:
The following request is made at `2016-09-21 14:49:00`.
$filter = "owner:sre AND category:outage"
$back = "1n"
$count = ancounts($filter, $back, "")
# TimeFrame of the Fractional annotation: "2016-09-21T14:47:56Z", "2016-09-21T14:50:53Z" (Duration: 2m56 sec)
$count
Returns:
{
"0": 1,
"1": 1,
"2": 0.3615819209039548
}
The float value means that about 36% of that annotation fell within the requested time frame. One can get the sum of these by doing `sum($count)` (result: `2.36...`) to get the fractional sum, or `len($count)` (result: `3`) to get the count.

Note: The index values above (0, 1, and 2) are disregarded and are just there so we can use the same underlying type as a time series.
andurations behaves in a similar way to ancounts. The difference is that the values returned will be the duration of each annotation in seconds.

If an annotation spans part of the requested time frame, only the duration of the annotation that falls within the timerange will be returned as the value for that annotation. If the annotation starts before the request and ends after the request, the duration of the request timeframe will be returned.

If there are no annotations within the requested time period, then the value `NaN` will be returned.

For example, an identical query to the example in ancounts but using andurations instead:
$filter = "owner:sre AND category:outage"
$back = "1n"
$durations = andurations($filter, $back, "")
# TimeFrame of the Fractional Outage: "2016-09-21T14:47:56Z", "2016-09-21T14:50:53Z",
$durations
Returns:
{
"0": 402,
"1": 758,
"2": 64
}
All reduction functions take a seriesSet and return a numberSet with one element per unique group.
Average (arithmetic mean).
Returns the change count which is the number of times in the series a value was not equal to the immediate previous value. Useful for checking if things that should be at a steady value are “flapping”. For example, a series with values [0, 1, 0, 1] would return 3.
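A minimal sketch using the series() function documented later in this section (in Bosun this reduction is the `cc()` function):

# Values change 0 -> 1 -> 0 -> 1, so the change count is 3:
cc(series("host=web01", 0, 0, 60, 1, 120, 0, 180, 1))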
Standard deviation.
Diff returns the last point of each series minus the first point.
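A quick sketch with the series() function:

# Last point (3) minus first point (1) yields 2:
diff(series("host=web01", 0, 1, 60, 5, 120, 3))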
Returns the first (least recent) data point in each series.
Returns the number of seconds until a linear regression of each series will reach y_val.
Linelr returns the linear regression line from the end of each series to end+duration (an OpenTSDB duration string). It adds `regression=line` to the group/tagset. It is meant for graphing with expressions, for example:
$d = "1w"
$q = q("avg:1h-avg:os.disk.fs.percent_free{}{host=ny-tsdb*,disk=/mnt*}", "2w", "")
$line = linelr($q, "3n")
$m = merge($q, $line)
$m
Returns the last (most recent) data point in each series.
Returns the length of each series.
Returns the maximum value of each series, same as calling percentile(series, 1).
Returns the median value of each series, same as calling percentile(series, .5).
Returns the minimum value of each series, same as calling percentile(series, 0).
Returns the value from each series at the percentile p. Min and Max can be simulated using `p <= 0` and `p >= 1`, respectively.
Returns the number of seconds since the most recent data point in each series.
Returns the length of the longest streak of values that evaluate to true for each series in the set (i.e. max amount of contiguous non-zero values found). A single true value in the series returns 1.
This is useful to create an expression that is true if a certain number of consecutive observations exceeded a threshold - as in the following example:
$seriesA = series("host=server01", 0,0, 60,35, 120,35, 180,35, 240,5)
$seriesB = series("host=server02", 0,0, 60,35, 120, 5, 180, 5, 240,5)
$sSet = merge($seriesA, $seriesB)
$isAbove = $sSet > 30
$consecutiveCount = streak($isAbove)
# $consecutiveCount: a numberSet where server01 has a value of 3, server02 has a value of 1
# Are there 3 or more adjacent/consecutive/contiguous observations greater than 30?
$consecutiveCount >= 3
Sum returns the sum (a.k.a. “total”) for each series in the set.
Aggregation functions take a seriesSet, and return a new seriesSet.
Takes a seriesSet and combines it into a new seriesSet with the groups specified, using an aggregator to merge any series that share the matching group values. If the groups argument is an empty string, all series are combined into a single series, regardless of existing groups.
The available aggregator functions are: `"avg"` (average), `"min"` (minimum), `"max"` (maximum), `"sum"`, and `"pN"` (percentile), where N is a floating point number between 0 and 1 inclusive. For example, `"p.25"` will be the 25th percentile, `"p.999"` will be the 99.9th percentile. `"p0"` and `"p1"` are min and max respectively (however, in these cases it is recommended to use `"min"` and `"max"` for the sake of clarity).
The aggr function can be particularly useful for removing anomalies when comparing timeseries over periods using the over function.
Example:
$weeks = over("sum:1m-avg:os.cpu{region=*,color=*}", "24h", "1w", 3)
$agg = aggr($weeks, "region,color", "p.50")
The above example uses `over` to load a 24 hour period over the past 3 weeks. We then use the aggr function to combine the three weeks into one, selecting the median (`p.50`) value of the 3 weeks at each timestamp. This results in a new seriesSet, grouped by region and color, that represents a "normal" 24 hour period with anomalies removed.
An error will be returned if a group is specified to aggregate on that does not exist in the original seriesSet.
The aggr function expects points in the original series to be aligned by timestamp. If points are not aligned, they are aggregated separately. For example, if we had a seriesSet,
Group | Timestamp | Value |
---|---|---|
{host=web01} | 1 | 1 |
{host=web01} | 2 | 7 |
{host=web01} | 1 | 4 |
and applied the following aggregation:
aggr($series, "host", "max")
we would receive the following aggregated result:
Group | Timestamp | Value | Timestamp | Value |
---|---|---|---|---|
{host=web01} | 1 | 4 | 2 | 7 |
aggr also does not attempt to deal with NaN values in a consistent manner. If all values for a specific timestamp are NaN, the result for that timestamp will be NaN. If a particular timestamp has a mix of NaN and non-NaN values, the result may or may not be NaN, depending on the aggregation function specified.
Group functions modify the OpenTSDB groups.
Accepts a series and a set of tags to add to the set in `Key1=Value1,Key2=Value2` format. This is useful when you want to add series to a set with merge and have tag collisions.
Accepts a series and a set of tags to rename in `Key1=NewK1,Key2=NewK2` format. All data points will have the tag keys renamed according to the spec provided, in order. This can be useful for combining results from separate queries that have similar tagsets with different tag keys.
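A hedged sketch, assuming two hypothetical metrics that tag the machine as host and server respectively:

$a = q("sum:os.cpu{host=*}", "5m", "")
# Rename the legacy metric's server tag key to host so the groups line up:
$b = rename(q("sum:legacy.cpu{server=*}", "5m", ""), "server=host")
$a - $b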
Accepts a tag key to remove from the set. The function will error if removing the tag key from the set would cause the resulting set to have a duplicate item in it.
Transposes N series of length 1 to 1 series of length N. If the group parameter is not the empty string, the number of series returned is equal to the number of tagks passed. This is useful for performing scalar aggregation across multiple results from a query. For example, to get the total memory used on the web tier: `sum(t(avg(q("avg:os.mem.used{host=*-web*}", "5m", "")), ""))`. See Understanding the Transpose Function for more explanation.
How transpose works conceptually
Transpose Grouped results into a Single Result:
Before Transpose (Value Type is NumberSet):
Group | Value |
---|---|
{host=web01} | 1 |
{host=web02} | 7 |
{host=web03} | 4 |
After Transpose (Value Type is SeriesSet):
Group | Value |
---|---|
{} | 1,7,4 |
Transpose Groups results into Multiple Results:
Before Transpose by host (Value Type is NumberSet)
Group | Value |
---|---|
{host=web01,disk=c} | 1 |
{host=web01,disk=d} | 3 |
{host=web02,disk=c} | 4 |
After Transpose by “host” (Value type is SeriesSet)
Group | Value |
---|---|
{host=web01} | 1,3 |
{host=web02} | 4 |
Useful example of transpose: alert if more than 25% of servers in a group have ping timeouts
alert or_down {
$group = host=or-*
# bosun.ping.timeout is 0 for no timeout, 1 for timeout
$timeout = q("sum:bosun.ping.timeout{$group}", "5m", "")
# timeout will have multiple groups, such as or-web01,or-web02,or-web03.
# each group has a series type (the observations in the past 5 minutes)
# so we need to *reduce* each series values of each group into a single number:
$max_timeout = max($timeout)
# Max timeout is now a group of results where the value of each group is a number. Since each
# group is an alert instance, we need to regroup this into a single alert. We can do that by
# transposing with t()
$max_timeout_series = t($max_timeout, "")
# $max_timeout_series is now a single group with a value of type series. We need to reduce
# that series into a single number in order to trigger an alert.
$number_down_servers = sum($max_timeout_series)
$total_servers = len($max_timeout_series)
$percent_down = ($number_down_servers / $total_servers) * 100
warn = $percent_down > 25
}
Since our templates can reference any variable in this alert, we can show which servers are down in the notification, even though the alert just triggers on 25% of or-* servers being down.
Returns the input with its group removed. Used to combine queries from two differing groups.
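A minimal sketch with hypothetical hosts a and b:

$a = avg(q("avg:rate:os.cpu{host=a}", "5m", ""))   # group {host=a}
$b = avg(q("avg:rate:os.cpu{host=b}", "5m", ""))   # group {host=b}
# Neither group is a subset of the other, so $a / $b would not join;
# stripping one side's group lets it join with anything:
$a / ungroup($b)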
Executes and returns the `key` expression from alert `name` (which must be `warn` or `crit`). Any alert of the same name that is unknown or unevaluated is also returned with a value of `1`. Primarily for use with the `depends` alert keyword.

Example: `alert("host.down", "crit")` returns the `crit` expression from the host.down alert.
Returns the absolute value of each value in the set.
Returns a seriesSet where each series has datapoints removed if the datapoint is before start (from now, in seconds) or after end (also from now, in seconds). This is useful if you want to alert on different timespans for different items in a set, for example:
lookup test {
entry host=ny-bosun01 {
start = 30
}
entry host=* {
start = 60
}
}
alert test {
template = test
$q = q("avg:rate:os.cpu{host=ny-bosun*}", "5m", "")
$c = crop($q, lookup("test", "start") , 0)
crit = avg($c)
}
Returns the number of seconds of the OpenTSDB duration string.
Returns an OpenTSDB duration string that represents the given number of seconds. This lets you do math on durations and then pass the result to the duration arguments in functions like `q()`.
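For example, a small sketch (hypothetical metric) that doubles a base window numerically before converting it back to a duration string:

$base = d("1h")
# $base * 2 is 7200 seconds; tod converts it back to a duration string:
q("avg:rate:os.cpu{host=*}", tod($base * 2), "")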
Returns series smoothed using Holt-Winters double exponential smoothing. Alpha (scalar) is the data smoothing factor. Beta (scalar) is the trend smoothing factor.
Remove any values greater than number from a series. Will error if this operation results in an empty series.
Remove any values greater than or equal to number from a series. Will error if this operation results in an empty series.
Remove any values lower than number from a series. Will error if this operation results in an empty series.
Remove any values lower than or equal to number from a series. Will error if this operation results in an empty series.
Remove any NaN or Inf values from a series. Will error if this operation results in an empty series.
Drop datapoints where the corresponding value in the second seriesSet is zero (see Series Operations above for what corresponding means). The following example drops tr_avg (avg response time per bucket) datapoints if the count in that bucket was more than 100 above or below the average count over the time period.
Example:
$count = q("sum:traffic.haproxy.route_tr_count{host=literal_or(ny-logsql01),route=Questions/Show}", "30m", "")
$avg = q("sum:traffic.haproxy.route_tr_avg{host=literal_or(ny-logsql01),route=Questions/Show}", "30m", "")
$avgCount = avg($count)
dropbool($avg, !($count < $avgCount-100 || $count > $avgCount+100))
Returns the Unix epoch in seconds of the expression start time (scalar).
Returns all results in variantSet that are a subset of numberSet and have a non-zero value. Useful with the limit and sort functions to return the top X results of a query.
Returns the first count (scalar) items of the set.
Returns the first key from the given lookup table with matching tags, this searches the built-in index and so only makes sense when using OpenTSDB and sending data to /index or relaying through bosun.
Using the lookup function will set unJoinedOk to true for the alert.
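A hedged sketch of per-host thresholds via a lookup table (table and metric names are hypothetical):

lookup cpu_limits {
    entry host=ny-bosun01 {
        threshold = 90
    }
    entry host=* {
        threshold = 80
    }
}
alert high_cpu {
    # Each host is compared against the first lookup entry its tags match:
    crit = avg(q("avg:rate:os.cpu{host=*}", "5m", "")) > lookup("cpu_limits", "threshold")
}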
Returns the first key from the given lookup table with matching tags. The first argument is a series to use from which to derive the tag information. This is good for alternative storage backends such as graphite and influxdb.
Using the lookupSeries function will set unJoinedOk to true for the alert.
map applies the subExpr to each value in each series in the set. A special function `v()`, available only within the numberSetExpr, gives you the value of each item in the series.

For example, you can do something like the following to get the absolute value for each item in the series (since the normal `abs()` function works on numbers, not series):
$q = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
map($q, expr(abs(v())))
Or for another example, this would get you the absolute difference of each datapoint from the series average as a new series:
$q = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
map($q, expr(abs(v()-avg($q))))
Since this function is not optimized for a particular operation on a seriesSet, it may not be very efficient. If you find yourself writing more complex expressions within the `expr(...)` inside map (for example, having query functions in there), then you may want to consider requesting a new function to be added to bosun's DSL.
expr takes an expression and returns either a numberSetExpr or a seriesSetExpr depending on the resulting type of the inner expression. This exists for functions like `map` - it is currently not valid in the expression language outside of function arguments.
Returns the epoch of either the start or end of the month. Offset is the timezone offset from UTC that the month starts/ends at (but the returned epoch is representative of UTC). startEnd must be either `"start"` or `"end"`. Useful for things like monthly billing, for example:
$hostInt = host=ny-nexus01,iname=Ethernet1/46
$inMetric = "sum:5m-avg:rate{counter,,1}:__ny-nexus01.os.net.bytes{$hostInt,direction=in}"
$outMetric = "sum:5m-avg:rate{counter,,1}:__ny-nexus01.os.net.bytes{$hostInt,direction=out}"
$commit = 100
$monthStart = month(-4, "start")
$monthEnd = month(-4, "end")
$monthLength = $monthEnd - $monthStart
$burstTime = ($monthLength)*.05
$burstableObservations = $burstTime / d("5m")
$in = q($inMetric, tod(epoch()-$monthStart), "") * 8 / 1e6
$out = q($outMetric, tod(epoch()-$monthStart), "") * 8 / 1e6
$inOverCount = sum($in > $commit)
$outOverCount = sum($out > $commit)
$inOverCount > $burstableObservations || $outOverCount > $burstableObservations
Returns a seriesSet with one series. The series will have a group (a.k.a. tagset). The tagset can be "" for the empty group, or in "key=value,key=value" format. You can then optionally pass epoch/value pairs (if none are provided, the series will be empty). This can be used for testing or drawing arbitrary lines. For example:
$now = epoch()
$hourAgo = $now-d("1h")
merge(series("foo=bar", $hourAgo, 5, $now, 10), series("foo=bar2", $hourAgo, 6, $now, 11))
Shift takes a seriesSet and shifts the time forward by the value of dur (an OpenTSDB duration string) and adds a tag representing the shift duration. This is meant so you can overlay times visually in a graph.
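A usage sketch (hypothetical host) overlaying today with the same window one week ago:

$today = q("avg:1h-avg:os.cpu{host=web01}", "24h", "")
# The shifted series gains a shift-duration tag, so merge() sees distinct
# groups and both lines can be drawn on one graph:
$lastWeek = shift(q("avg:1h-avg:os.cpu{host=web01}", "8d", "7d"), "7d")
merge($today, $lastWeek)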
leftjoin takes multiple numberSets and joins them to the first numberSet to form a table. tagsCSV is a comma-delimited string and should match tags from the query that you want to display (i.e., "host,disk"). dataCSV is a list of column names for each numberSet, so it should have the same number of labels as there are numberSets.

The only current intended use case is for constructing "Table" panels in Grafana.
For example, the following in Grafana would create a table that shows the CPU of each host for the current period, the CPU for the adjacent previous period, and the difference between them:
$cpuMetric = "avg:$ds-avg:rate{counter,,1}:os.cpu{host=*bosun*}{}"
$currentCPU = avg(q($cpuMetric, "$start", ""))
$span = (epoch() - (epoch() - d("$start")))
$previousCPU = avg(q($cpuMetric, tod($span*2), "$start"))
$delta = $currentCPU - $previousCPU
leftjoin("host", "Current CPU,Previous CPU,Change", $currentCPU, $previousCPU, $delta)
Note that the above example is intended to be used in Grafana via the Bosun datasource, so `$start` and `$ds` are replaced by Grafana before the query is sent to Bosun.
Merge takes multiple seriesSets and merges them into a single seriesSet. The function will error if any of the tag sets (groups) are identical. This is meant so you can display multiple seriesSets in a single expression graph.
Change the NaN value during binary operations (when joining two queries) of unknown groups to the scalar. This is useful to prevent unknown group and other errors from bubbling up.
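A hedged sketch (hypothetical metrics) of guarding a ratio against groups missing from one side:

$errors = sum(q("sum:app.errors{host=*}", "5m", ""))
$requests = sum(q("sum:app.requests{host=*}", "5m", ""))
# Hosts with no error datapoints are treated as 0 instead of raising an
# unknown-group error during the join:
nv($errors, 0) / $requests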
Returns the results sorted by value in ascending (“asc”) or descending (“desc”) order. Results are first sorted by groupname and then stably sorted so that results with identical values are always in the same order.
Returns the difference between successive timestamps in a series. For example:
timedelta(series("foo=bar", 1466133600, 1, 1466133610, 1, 1466133710, 1))
Would return a seriesSet equal to:
series("foo=bar", 1466133610, 10, 1466133710, 100)
Returns the most recent num points from a series. If the series is shorter than the number of requested points, the series is unchanged as all points are in the requested window. This function is useful for making calculations on the leading edge. For example:
tail(series("foo=bar", 1466133600, 1, 1466133610, 1, 1466133710, 1), 2)
Would return a seriesSet equal to:
series("foo=bar", 1466133610, 1, 1466133710, 1)