Usage

This part of the documentation covers using Bosun’s user interface and the incident workflow.

Alerts and Incidents

Overview

Each alert definition has the potential to turn into multiple incidents (an instantiation of the alert). Incidents get a unique global ID and are also associated with an Alert Key. The Alert Key is made up of the alert name and the tagset. Every possible group in your top level expression is evaluated independently. As an example, with an expression like avg(q("avg:rate{counter,,1}:os.cpu{host=*}", "5m", "")) you can have the potential to create an incident for every tag-value of the “host” tag-key that has sent data for the os.cpu metric.
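As a rough illustration of how an Alert Key combines the alert name with a tagset, here is a minimal Python sketch. The function name `alert_key` and the choice to sort tag keys are assumptions made for the example, not Bosun's actual implementation; the alert name `high.cpu` is only an example value.

```python
def alert_key(alert_name: str, tags: dict) -> str:
    """Build an alert key string from an alert name and a tagset."""
    # Sorting the tag keys (an assumption) makes the same tagset always
    # produce the same key, regardless of insertion order.
    tagset = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{alert_name}{{{tagset}}}"


# One incident can exist per distinct tagset returned by the expression,
# for example one per value of the "host" tag.
hosts = ["web01", "web02"]
keys = [alert_key("high.cpu", {"host": h}) for h in hosts]
print(keys)  # ['high.cpu{host=web01}', 'high.cpu{host=web02}']
```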

The lifetime of an incident

An incident gets created when the warn or crit expression evaluates to non-zero, or the alert goes unknown. Once an incident has been created it will notify users only when the lifetime severity of the incident increases. An exception to this is if you have set up notification chains, in which case the alert will send more notifications until someone acknowledges the alert.

Example:

You have an alert named high.cpu defined, and it has a warn expression like avg(q(os.cpu{host=*} ...)) > 50. One of your hosts (web01) triggers the warn condition of the alert.
We now have an incident. It gets a global ID such as #23412, an alert key of high.cpu{host=web01}, and a current severity state of warn. Assuming a notification has been set up, the notification is sent (e.g. an email).
The incident then goes back to normal severity, and then to warn again. When this happens, no notifications are sent, because notifications are only sent when the lifetime severity of an incident increases. The lifetime of the incident continues until the incident has been closed, which is generally done by a user.
The incident can be closed when it goes back to normal state. Once the incident is closed, it is possible for a new incident to be created for the same Alert Key (high.cpu{host=web01}).
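The notification rule in the example above can be sketched in Python as follows. This is a simplified model, not Bosun's code: the `Incident` class and its method names are hypothetical, and notification chains and acknowledgements are deliberately left out.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ordered from lowest to highest, following the severity list in this document.
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2
    ERROR = 3
    UNKNOWN = 4


class Incident:
    """Hypothetical model of one incident for a single alert key."""

    def __init__(self, alert_key):
        self.alert_key = alert_key
        self.current = Severity.NORMAL
        self.worst = Severity.NORMAL  # lifetime severity
        self.open = True

    def update(self, status):
        """Record a new status; return True if a notification would be sent."""
        self.current = status
        if status > self.worst:
            self.worst = status
            return True
        return False

    def close(self):
        """Close the incident; only allowed once it is back to normal."""
        if self.current != Severity.NORMAL:
            raise ValueError("active incidents cannot be closed")
        self.open = False


incident = Incident("high.cpu{host=web01}")
statuses = [Severity.WARNING, Severity.NORMAL, Severity.WARNING, Severity.CRITICAL]
print([incident.update(s) for s in statuses])  # [True, False, False, True]
```

The second warn produces no notification because the lifetime severity is already warn; only the later move to critical raises it.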

Severity States

Incidents can be in one of the following severity levels (from highest to lowest):

Unknown: A warn or crit expression cannot be evaluated because data is missing. When you define an alert, Bosun tracks each resulting tagset from the warn/crit expressions. If a tagset is no longer present, that instance goes into an unknown state. Since Bosun has data pushed to it, unknown can mean either that data collection has failed or that the source is down. Unknown triggers when there has been no data for the tagset for 2x the check frequency duration. This means that if a query spans an hour, it will be one hour + 2x the check frequency before it triggers.
Error: There is some sort of Bosun internal error with the alert, such as divide by zero or “response too large”. The error can be viewed by clicking the Errors button on the dashboard.
Critical: The expression that crit is equal to in the alert definition is non-zero (true). It is recommended that “Critical” be thought of as “has failed”.
Warning: The expression that warn is equal to in the alert definition is non-zero (true) and critical is not true. It is recommended that “Warning” be thought of as “could lead to failure”.
Normal: None of the above states.
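The timing rule in the Unknown entry above is simple arithmetic. The sketch below works it through in Python; the function name and the 5 minute check frequency are example assumptions, not Bosun defaults taken from this document.

```python
from datetime import timedelta


def time_until_unknown(query_span: timedelta, check_frequency: timedelta) -> timedelta:
    """Time after data stops before a tagset is reported as unknown."""
    # The query span plus twice the check frequency, as described above.
    return query_span + 2 * check_frequency


# A query spanning one hour, checked every 5 minutes (example value).
print(time_until_unknown(timedelta(hours=1), timedelta(minutes=5)))  # 1:10:00
```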

Additional States

Active: The alert is currently in a non-normal state. This is indicated by an exclamation icon on the dashboard. Alerts don’t disappear from the dashboard when they are no longer active; they remain until they are closed. This ensures that all alerts get handled, which reduces alert noise and fatigue.
Silenced: Someone has created a silence rule that stops this alert from triggering any notification. It will also automatically close when the alert is no longer active. This is indicated by a volume-off speaker icon.
Acknowledged: Someone has acknowledged the alert; the reason and person should be available via the web interface. Acknowledged alerts stop sending notification chains as long as the severity doesn’t increase.
Unacknowledged: Nobody has acknowledged the alert yet at its current severity level.
Unevaluated: An incident is unevaluated if the dependency expression as defined in the alert’s depends keyword is non-zero. Unevaluated alerts do not change state or become unknown. If an incident is open then it will still show up on the dashboard, but with a question mark icon. New incidents will not be created.

Dashboard

Indicators

Colors

The color of the majority of the bar is the incident’s last abnormal status. The color that makes up the sliver on the left side of the bar is the incident’s current status.

Blue: Unknown
Red: Critical
Yellow: Warning
Green: Normal

Icons

An exclamation icon means the alert is currently in an active state.
A silence icon means the alert has been silenced.
A question icon means the alert is unevaluated.
A fire icon means the alert is in an error state.

Actions

Acknowledge: Prevent further notifications unless there is a state increase. This also moves it to the acknowledged section of the dashboard. When you acknowledge something you enter a name and a reason. This signals that the person has committed to fixing the problem or the alert.
Close: Make it disappear from the dashboard. This should be used when an alert is handled. Active (non-normal) alerts cannot be closed (since all that would happen is that they would reappear on the dashboard after the next schedule run).
Forget: Make Bosun forget about this instance of the alert. This is used on active unknown alerts. It is useful when something is not coming back (e.g. you have decommissioned a host). This action is non-destructive because if that data gets sent to Bosun again everything will come back.
Force Close: Like close, but does not require the alert to be in a normal state. In a few circumstances an alert can be “open” and “active” at the same time. This can occur when a host is decommissioned and an alert has ignoreUnknown set, for example. This may help to clear some of those “stuck” alerts.
Purge: Deletes an active alert and all history for that alert key. It should only be used when you absolutely want to forget all data about a host, such as when shutting it down. It is like forget, but does not require an alert to be unknown.
History: View a timeline of history for the selected alert instances.
Note: Attach a note to an incident. This has no impact on the behavior of the alert and is purely for communication.

Incident Filters

The open incident filter supports joining terms in () as well as the AND, OR, and ! operators. The following query terms are supported and are always in the format of something:something:

Term Spec	Description
`ack:(true|false)`	If `ack:true`, incidents that have been acknowledged are returned; if `ack:false`, incidents that have not been acknowledged are returned.
`ackTime:[<|>](1d)`	Returns incidents that were acknowledged before (`<`) or after (`>`) the time given by the duration relative to now. Duration can be in units of s (seconds), m (minutes), h (hours), d (days), w (weeks), n (months), y (years). If neither `<` nor `>` is part of the value, it defaults to `>` (after). Now is clock time and is not related to the time range specified in Grafana. For example, `ackTime:<24h` shows incidents that were acknowledged more than 24 hours ago.
`hasTag:(tagKey|tagKey=|=tagValue|tagKey=tagValue)`	Returns incidents that have the given tag key, tag value, or key=value pair. If there is no equals sign, the value is treated as a tag key. Tag values may contain globs, such as `hasTag:host=ny-*`.
`hidden:(true|false)`	If `hidden:false`, incidents that are hidden will not be shown. An incident is hidden if it is in a silenced or unevaluated state.
`name:(something*)`	Returns incidents where the alert name (not including the tagset) matches the value. Globs can be used in the value.
`user:(username*)`	Returns incidents where a user has taken any action on that incident. Globs can be used in the value.
`notify:(notificationName*)`	Returns incidents where the notificationName is somewhere in either the crit or warn notification chains. Globs can be used in the value.
`silenced:(true|false)`	If `silenced:false`, incidents that have not been silenced are returned; if `silenced:true`, incidents that have been silenced are returned.
`start:[<|>](1d)`	Returns incidents that started before (`<`) or after (`>`) the time given by the duration relative to now. Duration can be in units of s (seconds), m (minutes), h (hours), d (days), w (weeks), n (months), y (years). If neither `<` nor `>` is part of the value, it defaults to `>` (after). Now is clock time and is not related to the time range specified in Grafana.
`unevaluated:(true|false)`	If `unevaluated:false`, incidents that are not in an unevaluated state are returned; if `unevaluated:true`, incidents that are unevaluated are returned.
`status:(normal|warning|critical|unknown)`	Returns incidents that are currently in the requested state.
`worstStatus:(normal|warning|critical|unknown)`	Returns incidents that have a worst status equal to the requested state.
`lastAbnormalStatus:(warning|critical|unknown)`	Returns incidents that have a last abnormal status equal to the requested state.
`subject:(something*)`	Returns incidents where the subject string matches the value. Globs can be used in the value.
`since:[<|>](1d)`	Returns incidents that have been in their current `status` for more than (`<`) or less than (`>`) the duration, relative to now. Duration can be in units of s (seconds), m (minutes), h (hours), d (days), w (weeks), n (months), y (years). If neither `<` nor `>` is part of the value, it defaults to `>` (after). Now is clock time and is not related to the time range specified in Grafana. For example, `status:normal AND since:<15d` returns incidents that have been in `normal` for more than 15 days.
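To make the shape of the time-based terms (`ackTime`, `start`, `since`) concrete, here is a small Python sketch that parses one such term, including the default to `>` when no operator is given. It is an illustration only, not Bosun's parser. The function and table names are hypothetical, and the lengths used for months (30 days) and years (365 days) are assumptions, since the table only names the units.

```python
import re
from datetime import timedelta

# Seconds per unit. The month (n) and year (y) lengths are assumptions.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "n": 30 * 86400,
    "y": 365 * 86400,
}

TERM_RE = re.compile(r"^(ackTime|start|since):([<>]?)(\d+)([smhdwny])$")


def parse_time_term(term):
    """Split a term like 'ackTime:<24h' into (key, operator, duration)."""
    match = TERM_RE.match(term)
    if match is None:
        raise ValueError(f"not a time filter term: {term!r}")
    key, op, amount, unit = match.groups()
    # With no explicit operator the term defaults to '>' (after).
    return key, op or ">", timedelta(seconds=int(amount) * UNIT_SECONDS[unit])


key, op, duration = parse_time_term("start:2d")
print(key, op, duration)  # start > 2 days, 0:00:00
```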

Rule Editor

The rule editor allows you to edit the definitions in the RuleConf, preview rendered templates, and test alerts against historical data.

Rule Editor Image

Textarea

The text area will be loaded with the running config when the Rule Editor view is loaded. A hash of the config as it was when you started editing is saved. If someone else edits the config in the UI and saves it, Bosun will detect that the config hash has changed and show a warning above the text area.
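The hash comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions: the document does not say which hash function Bosun uses, so SHA-256 is a stand-in, and the `EditSession` class is hypothetical.

```python
import hashlib


def config_hash(text: str) -> str:
    """Hash the config text (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EditSession:
    """Remembers the hash of the running config when editing started."""

    def __init__(self, running_config: str):
        self.base_hash = config_hash(running_config)

    def is_stale(self, running_config_now: str) -> bool:
        """True if the running config changed since editing started."""
        return config_hash(running_config_now) != self.base_hash


session = EditSession("alert a {}\n")
print(session.is_stale("alert a {}\n"))  # False
print(session.is_stale("alert b {}\n"))  # True
```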

When you run a test, your version of the config is saved in Bosun, and you can link to it so others can see it.

The editor is built using the open source Ace editor.

Jump Buttons

The Jump drop downs ① will take you to defined sections within the config. In particular, the alert drop down selects which alert will be used for testing.

At the end there is a switcher that can be used when you are working on an alert. It allows you to jump back and forth between the alert and the template referenced in the alert.

Download / Validate

The download button ② will download the config file as a text file. Validate makes sure that Bosun considers the config valid using the same validation that is required for Bosun to start.

Definition [Rule] Saving

The save button ② will bring up a dialogue that lets you save the config. This only appears if you have permission to save the config, and the system configuration’s EnableSave has been set to true.

The save dialogue will show you a contextual diff of your config and the running config. There are several protections in place to prevent you from overwriting someone else’s configuration changes:

The Rule Editor will show a warning if the config has been saved since you started editing it
A contextual diff is shown of your changes versus the running config (and the save will fail if the contextual diff happens to change in the time window before you hit save)
When the file is being saved, a global lock is taken in Bosun so nobody else can save while the save is happening

If the config file is successfully saved then Bosun will reload the new definitions. Alerts that are currently being processed will be cancelled and restarted. In other words, a restart of the Bosun process is not required for the new changes to take effect.

An external command to run on saves can also be defined with the CommandHookPath setting in the system configuration. This can be used to do things like create backups of the file or check the changes into version control. If this command returns a non-zero exit code, saving will also fail.

In all cases where a save fails, a reload will not happen and the save will not be persisted (the definitions file will not be changed).
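The save protections described in this section can be summarized in a short Python sketch. It is a simplified model, not Bosun's implementation: `save_config`, `SaveError`, and the `state` dict are hypothetical, the command hook is represented by a callable that returns an exit code rather than a real external command, and the order of the checks is an assumption. What it does preserve is the documented outcome that a failed save neither persists nor reloads anything.

```python
import threading


class SaveError(Exception):
    """Raised when a save is rejected; nothing is persisted or reloaded."""


_save_lock = threading.Lock()  # stand-in for the global save lock


def save_config(state, new_text, diff_base, run_hook):
    """Try to save new_text into state["running"].

    diff_base is the running config the user saw in the diff dialogue.
    run_hook is a callable returning an exit code (hypothetical stand-in
    for the external command configured with CommandHookPath).
    """
    with _save_lock:  # only one save can proceed at a time
        if state["running"] != diff_base:
            raise SaveError("running config changed after the diff was shown")
        if run_hook(new_text) != 0:
            raise SaveError("command hook returned a non-zero exit code")
        state["running"] = new_text  # persist the new definitions
        state["reloads"] = state.get("reloads", 0) + 1  # then reload them


state = {"running": "alert a {}"}
save_config(state, "alert b {}", "alert a {}", run_hook=lambda text: 0)
print(state["running"])  # alert b {}
```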

Alert Testing

Alerts can be tested before they are committed to production. This allows you to refine the trigger conditions to control the signal to noise and to preview the rendered templates to make sure alerts are informative. This is done by selecting the alert from the Jump Alert drop down at ① and then clicking the test alert button at ④.

There are two ways you can test alerts:

A single iteration (a snapshot of time)
Multiple iterations over a period of time.

Which behavior is used depends on the From and To fields at ③. If From is left blank, then a single iteration is tested at the current time. If From is set to a time and To is unset, a single iteration will be done at that time. When doing single iteration testing, the Results and Template tabs at ⑤ will be populated. The Results tab shows the warn/crit results for each set, and a rendered template will be shown in the Template tab.

Which item from the result set is rendered in the Template tab is controlled by the Template Group field at ④. The result to use for the template is picked by specifying a tagset in the format of key=value,key=value. The first result that has the specified tags will be used. If no results match, then the first result is chosen.
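The selection rule for the Template Group field can be sketched in Python as below. The function names are hypothetical and each result is modeled simply as a dict of tags; this is only meant to show the "first match, else first result" behavior described above.

```python
def parse_tagset(spec):
    """Parse 'key=value,key=value' into a dict; blank input gives {}."""
    tags = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        tags[key.strip()] = value.strip()
    return tags


def pick_template_result(results, template_group):
    """Return the first result that has all the requested tags.

    Falls back to the first result when nothing matches, and to None
    when there are no results at all.
    """
    if not results:
        return None
    wanted = parse_tagset(template_group)
    for tags in results:
        if all(tags.get(k) == v for k, v in wanted.items()):
            return tags
    return results[0]


results = [{"host": "web01", "dc": "ny"}, {"host": "web02", "dc": "ny"}]
print(pick_template_result(results, "host=web02"))  # {'host': 'web02', 'dc': 'ny'}
print(pick_template_result(results, "host=db01"))   # {'host': 'web01', 'dc': 'ny'}
```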

Tip

When working on a template it is good to set the From time to a fixed date. That way when expressions are rerun they will likely hit Bosun's query cache and things will be faster.

The Email field at ④ makes it so that when an alert is tested, the rendered template is emailed to the address specified in the field. This lets you check for any differences between what you see in the Template tab and the email that is actually delivered.

Setting both From and To enables testing multiple iterations of the selected alert over time. The number of iterations depends on the settings of the two linked fields Intervals and Step Duration at ③. Changing one changes the other. Intervals is the number of runs to do, evenly spaced over the duration from From to To, and Step Duration is how much time in minutes should be between intervals. Doing a test over time will populate the Timeline tab ⑤, which draws a clickable graphic of severity states for each item in the set:

Rule Editor Timeline Image

Each row in the image is one of the items in the result set. The colored squares represent the severity of that instance. The X-axis is time. When you click a square on the image, it will take you to the event you clicked and show you what the template would look like at that time for that particular item.
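The link between the Intervals and Step Duration fields described above can be illustrated with a little arithmetic. The sketch below assumes the step is the From-to-To duration divided by the number of intervals; the document does not spell out the exact formula or rounding, so treat both as assumptions. The function names and dates are example values.

```python
from datetime import datetime


def step_minutes(start, end, intervals):
    """Minutes between runs for a given number of intervals."""
    return (end - start).total_seconds() / 60 / intervals


def intervals_for(start, end, step_min):
    """Number of intervals for a given step, rounded, and at least one."""
    return max(1, round((end - start).total_seconds() / 60 / step_min))


def run_times(start, end, intervals):
    """Evenly spaced evaluation times beginning at start."""
    step = (end - start) / intervals
    return [start + i * step for i in range(intervals)]


start = datetime(2024, 1, 1, 0, 0)
end = datetime(2024, 1, 1, 6, 0)
print(step_minutes(start, end, 12))  # 30.0
print(intervals_for(start, end, 30))  # 12
print(run_times(start, end, 12)[1])  # 2024-01-01 00:30:00
```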

Annotations

Annotations are currently stored in Elastic. When annotations are enabled you can create, edit, and visualize them on the Graph page. There is also a Submit Annotations page that allows for creating and editing annotations. The API described in this README gets injected into Bosun under /api/ - you can also find a description of the schema there.