2017-11-16 09:13:43 +00:00
Description
===========

reed-alert is a small and simple monitoring tool for your server,
written in Common LISP.

reed-alert checks the status of various processes on a server and
triggers self-defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized with parameters.


Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
tested with both **sbcl** and **ecl**, which should be available for
most distributions.

(On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external
libraries. reed-alert only requires a Common LISP interpreter and
its own files.

An attempt to use quicklisp libraries to write more sophisticated
checks like "does this URL contain a pattern?" was started and then
abandoned. Instead, if you need more elaborate checks, write a shell
command in the **command** probe.

Code-Readability
================

Although the code is very rough for now, I think it's already fairly
understandable by people who need this kind of tool.

I will try to improve the readability of the config file in future
commits. NOTE: declaring notifiers is easier now.


Usage
=====

Start reed-alert
----------------

To start reed-alert:

+ sbcl : **sbcl --script config_file.lisp**
+ ecl : **ecl -shell config_file.lisp**

Personal Configuration File
---------------------------

You may want to rename **config.lisp.sample** to **config.lisp** in
order to create your own configuration file.

The configuration is explained below.


The Notification System
=======================

When a check returns an error, a previously defined notifier is
called. A notifier is a named shell command. The shell command
can contain variables from reed-alert:

+ %function% : the name of the probe
+ %date% : the current date with format YYYY/MM/DD hh:mm:ss
+ %params% : the parameters of the probe
+ %hostname% : the hostname of the server
+ %result% : the error returned (the value exceeding the limit, file not found)
+ %description% : an arbitrary description naming a check
+ %level% : the type of notification used
+ %os% : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline% : a newline character

Example Probe 1: 'Check For Load Average'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to send a mail with a message like:

    "On 2016/10/06 11:11:12 server.foo.com has encountered a problem
    during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"

write the following at the top of the file and use **pretty-mail** in your checks:

    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
    %params% with a value of %result%' | mail yourmail@foo.bar")

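To picture how the variables get filled in, here is a minimal, self-contained sketch in Common LISP. It is not reed-alert's actual code: `replace-all` and `expand-placeholders` are hypothetical names used only for illustration, and the values are the ones from the example message above.

```lisp
;; Illustration only, not reed-alert's implementation.

(defun replace-all (text old new)
  "Return a copy of TEXT where every occurrence of OLD is replaced by NEW."
  (with-output-to-string (out)
    (loop with start = 0
          for pos = (search old text :start2 start)
          ;; copy everything up to the next match (or to the end)
          do (write-string text out :start start :end (or pos (length text)))
          while pos
          ;; emit the replacement and continue after the match
          do (write-string new out)
             (setf start (+ pos (length old))))))

(defun expand-placeholders (template values)
  "VALUES is an alist of (placeholder . replacement) strings."
  (loop for (name . value) in values
        do (setf template (replace-all template name value)))
  template)

(format t "~a~%"
        (expand-placeholders
         "On %date% %hostname% has encountered a problem during %function% %params% with a value of %result%"
         '(("%date%" . "2016/10/06 11:11:12")
           ("%hostname%" . "server.foo.com")
           ("%function%" . "LOAD-AVERAGE-15")
           ("%params%" . "(:LIMIT 10)")
           ("%result%" . "30"))))
```

Run as a script, this prints the example message on a single line. In reed-alert itself the expanded string is then handed to the shell as the notifier command.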

Example Probe 2: 'Don't do anything'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you don't want anything to be done when an error occurs, use the following:

    (alert nothing-to-send "")

Example Probe 3: 'Send SMS'
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to use an external service to send an SMS. This is entirely
possible because notifiers are shell commands:

    (alert sms "echo 'error on %hostname% : %function% %result%'
    | curl -u login:pass http://api.sendsms.com/")


The Probes
==========

Probes are predefined checks written in Common LISP.

The :desc Parameter
-------------------

The :desc parameter lets you describe what a check does. It can be
used with every probe.

    :desc "STRING"


Overview
--------

As of this commit, reed-alert ships with the following probes:

(1) number-of-processes
(2) pid-running
(3) disk-usage
(4) file-exists
(5) file-updated
(6) load-average-1
(7) load-average-5
(8) load-average-15
(9) ping
(10) command
(11) service
(12) file-less-than


number-of-processes
-------------------
Check if the number of processes on the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert number-of-processes (:limit 200))`


pid-running
-----------
Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".
    :path "STRING"

Example: `(=> alert pid-running (:path "/var/run/nginx.pid"))`


disk-usage
----------
Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.
    :path "STRING"

> Set the limit (in percent) that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert disk-usage (:path "/tmp" :limit 50))`


file-exists
-----------
Check if a file exists.

> Set the path of the file to check.
    :path "STRING"

Example: `(=> alert file-exists (:path "/var/postgresql/standby"))`


file-updated
------------
Check if a file exists and has been updated within a defined period.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in minutes since the last modification time before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-updated (:path "/var/log/nginx/access.log" :limit 60))`

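The :limit of file-updated is an age in minutes. As a rough sketch, this is one way such an age could be computed with the standard Common LISP functions `file-write-date` and `get-universal-time`. `file-age-in-minutes` and `file-fresh-p` are hypothetical helpers, not reed-alert functions, and treating an age exactly equal to the limit as still fresh is an assumption of this sketch.

```lisp
;; Illustration only, not reed-alert's implementation.

(defun file-age-in-minutes (path)
  "Whole minutes elapsed since PATH was last modified, or NIL if PATH does not exist."
  (when (probe-file path)
    ;; both values are universal times, i.e. counted in seconds
    (floor (- (get-universal-time) (file-write-date path)) 60)))

(defun file-fresh-p (path limit)
  "True when PATH exists and was modified at most LIMIT minutes ago."
  (let ((age (file-age-in-minutes path)))
    (and age (<= age limit))))
```

With these helpers, `(file-fresh-p "/var/log/nginx/access.log" 60)` would be false both when the log is missing and when nothing has been written to it for more than an hour.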


load-average-1
--------------
Check if the load average during the last minute exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-1 (:limit 2))`


load-average-5
--------------
Check if the load average during the last five minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-5 (:limit 2))`


load-average-15
---------------
Check if the load average during the last fifteen minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-15 (:limit 2))`


ping
----
Check if a remote host answers 2 ICMP pings.

> Set the host to ping. Return an error if the ping command returns non-zero.
    :host "STRING" (can be IP or hostname)

Example: `(=> alert ping (:host "8.8.8.8"))`


command
-------
Execute an arbitrary command and trigger an alert if it returns a non-zero value.
This may be the most useful probe because it lets the user do any check needed.

> Command to execute; commands with pipes are accepted.
    :command "STRING"

Example: `(=> alert command (:command "tail -n 10 /var/log/messages | grep -v CRITICAL"))`

service
-------
Check if a service is started on the system.

> Set the name of the service to test.
    :name "STRING"

Example: `(=> alert service (:name "mysql-server"))`

file-less-than
--------------
Check if a file has a size less than a specified limit.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in bytes before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-less-than (:path "/var/log/nginx.log" :limit 60))`


The configuration file
======================

The configuration file is Common LISP code, so it is evaluated. This
makes it possible to write some logic within it.


Loops
-----
It's possible to write loops if you don't want to repeat code:

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
          do
          (=> mail ping (:host host)))

or another example:

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
          do
          (=> mail service (:name service)))

and another example using lines from a file to check remote hosts:

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
            while line
            do
            (=> mail ping (:host line))))

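The same idea can be stretched to probes that take more than one parameter. The fragment below is only a sketch of a config-file excerpt: it assumes reed-alert is already loaded, that an alert named `mail` has been declared earlier with `(alert mail "...")`, and that `=>` evaluates several variable parameters the same way it evaluates `host` in the loops above. The mount points and limits are made-up values.

```lisp
;; Config-file fragment (sketch): one disk-usage check per mount point,
;; each with its own limit. Paths and limits are illustrative only.
(loop for (path limit) in '(("/" 90) ("/tmp" 50) ("/var" 80))
      do
      (=> mail disk-usage (:path path :limit limit)))
```

Keeping the pairs in one list means a new partition is a one-line change instead of another copy of the probe call.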


Conditional
-----------
It is also possible to write conditionals. Two conditional groups are
particularly useful.


Dependency
~~~~~~~~~~
Sometimes it is a good idea to skip the remaining probes when one probe
fails. For example, you may need to check a path through a network, from
the nearest machine to the remote target. If we can't reach our local
router, probes requiring the router to work will trigger errors, so we
should skip them.

    (stop-if-error
      (=> mail ping (:host "192.168.1.1" :desc "My local router"))
      (=> mail ping (:host "89.89.89.89" :desc "My ISP DNS server"))
      (=> mail ping (:host "kernel.org" :desc "Remote website")))

Note: stop-if-error is an alias for the **and** macro.

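To see why an alias for **and** skips the remaining probes, here is a small self-contained sketch. `fake-probe` and `run-chain` are stand-ins invented for this illustration, not reed-alert functions, and the sketch assumes that a probe returns true on success and NIL on failure, which is what the alias suggests.

```lisp
;; Illustration only: stand-in probes that record their name and
;; return T (success) or NIL (failure).
(defvar *ran* '())

(defun fake-probe (name ok)
  (push name *ran*)
  ok)

(defun run-chain ()
  "Run three chained stand-in probes and return the names of those evaluated."
  (setf *ran* '())
  (and (fake-probe "local router" nil)   ; fails, so AND stops here
       (fake-probe "ISP DNS server" t)   ; never evaluated
       (fake-probe "remote website" t))  ; never evaluated
  (reverse *ran*))

(format t "~s~%" (run-chain))
```

Only the first stand-in probe runs, so only one alert would be sent instead of three.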


Escalation
~~~~~~~~~~
It can be a good idea to use different alerts
depending on how critical a check is, but sometimes the criticality
depends on the value of the error and/or the delay between
detecting a problem and fixing it. You may want to receive a mail when
things can be fixed in your spare time, but notify other people if
things aren't fixed beyond some level.

    (escalation
      (=> mail-me disk-usage (:path "/" :limit 70))
      (=> sms-me disk-usage (:path "/" :limit 90))
      (=> buzzer disk-usage (:path "/" :limit 98)))

In this example, we check the disk usage. I will get a mail through the
"mail-me" alert if the disk usage exceeds 70%. Once it goes
that far, reed-alert checks whether the disk usage exceeds 90%; if so,
I'll receive an SMS through the "sms-me" alert. Then, if it exceeds
98%, the "buzzer" alert will make some bad noises in the room to
warn me about this.

Note: escalation is an alias for the **or** macro.

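The escalation behaviour follows from how **or** evaluates its arguments: it stops at the first one that returns true. The sketch below replays the disk-usage example with stand-in code. `fake-disk-usage` and `fired-alerts` are hypothetical names, not reed-alert functions, and the sketch assumes a probe returns true when the value is within the limit and NIL after triggering its alert.

```lisp
;; Illustration only: a stand-in probe that records the alert it would send.
(defvar *alerts* '())

(defun fake-disk-usage (alert-name usage limit)
  "Return T when USAGE is within LIMIT; otherwise record ALERT-NAME and return NIL."
  (if (<= usage limit)
      t
      (progn (push alert-name *alerts*) nil)))

(defun fired-alerts (usage)
  "Return the alerts that would fire, in order, for a disk USAGE in percent."
  (setf *alerts* '())
  (or (fake-disk-usage "mail-me" usage 70)   ; OR stops at the first success
      (fake-disk-usage "sms-me" usage 90)
      (fake-disk-usage "buzzer" usage 98))
  (reverse *alerts*))
```

For instance, `(fired-alerts 60)` returns an empty list, `(fired-alerts 85)` returns `("mail-me")`, and `(fired-alerts 99)` returns all three alert names.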