Simple unix alerting system
Solene Rapenne 3f03224030 New syntax allowing the use of code in parameters 2018-01-11 15:03:46 +01:00

Description
===========

reed-alert is a small and simple monitoring tool for your server,
written in Common LISP.

reed-alert checks the status of various processes on a server and
triggers user-defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.


Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
tested with both **sbcl** and **ecl** - which should be available for
most distributions.

(On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external
libraries. reed-alert only requires a Common LISP interpreter and
its own files.

An attempt to use quicklisp libraries to write more sophisticated
checks, such as "does this URL contain a pattern ?", was started and
then abandoned. Instead, users who need more elaborate checks should
write a shell command in the **command** probe.


Code-Readability
================

Although the code is very rough for now, I think it's already fairly
understandable by people who do need this kind of tool.

I will try to improve on the readability of the config file in future
commits. NOTE : declaration of notifiers is easier now.


Usage
=====

Start reed-alert
----------------
To start reed-alert, run one of the following :

+ sbcl : **sbcl --script config_file.lisp**
+ ecl  : **ecl -shell config_file.lisp**

Personal Configuration File
---------------------------
You may want to rename **config.lisp.sample** to **config.lisp** in
order to create your own configuration file.

The configuration is explained below.


The Notification System
=======================

When a check returns an error, a previously defined notifier is
called. A notifier is a named shell command. The shell command can
contain the following variables, which reed-alert substitutes :

+ %function%    : the name of the probe
+ %date%        : the current date with format YYYY/MM/DD hh:mm:ss
+ %params%      : the parameters of the probe
+ %hostname%    : the hostname of the server
+ %result%      : the error returned (the value exceeding the limit, file not found)
+ %description% : an arbitrary description naming a check
+ %level%       : the type of notification used
+ %os%          : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline%     : a newline character


Example Probe 1: 'Check For Load Average'
---------------------------------------
If you want to send a mail with a message like:

	"On 2016/10/06 11:11:12 server.foo.com has encountered a problem
	during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"


write the following at the top of the file and use **pretty-mail** in your checks:

   (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
	                 %params% with a value of %result%' | mail yourmail@foo.bar")
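
A check that would produce the message above could look like the
following sketch (the limit of 10 simply mirrors the sample message) :

    (=> pretty-mail load-average-15 :limit 10)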

Example Probe 2: 'Don't do anything'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you don't want anything to be done when an error occurs, use the following :

    (alert nothing-to-send "")

Example Probe 3: 'Send SMS'
~~~~~~~~~~~~~~~~~~~~~~~~~~~
You may want to use an external service to send an SMS. This is
entirely possible because notifiers rely on a shell command :

    (alert sms "echo 'error on %hostname% : %function% %result%' |
                curl -u login:pass http://api.sendsms.com/")
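
The remaining variables can be combined in the same way. As a sketch,
the following notifier writes one line per alert to the system log.
The name **syslog** is only an illustration, and this assumes the
standard logger(1) utility is installed :

    (alert syslog "logger -t reed-alert '%level% %hostname% %function% %params% : %result% (%description%)'")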


The Probes
==========

Probes are written in Common LISP. They are predefined checks.

The :desc Parameter
-------------------
The :desc parameter lets you describe what a specific check does. It
is accepted by every probe.

    :desc "STRING"
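
For instance, in the following sketch the text given to :desc is
presumably what %description% expands to in the notifier. The
notifier name **mail** is an assumption and must have been declared
with (alert mail "...") beforehand :

    (=> mail ping :host "192.168.1.1" :desc "My local router")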


Overview
--------
As of this commit, reed-alert ships with the following probes:

	(1) 	number-of-processes
	(2) 	pid-running
	(3) 	disk-usage
	(4) 	file-exists
	(5) 	file-updated
	(6) 	load-average-1
	(7) 	load-average-5
	(8) 	load-average-15
	(9)	ping
	(10)	command
	(11)	service
	(12)	file-less-than


number-of-processes
-------------------
Check if the current number of processes on the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example : `(=> alert number-of-processes :limit 200)`


pid-running
-----------
Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".
    :path "STRING"

Example : `(=> alert pid-running :path "/var/run/nginx.pid")`


disk-usage
----------
Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.
    :path "STRING"

> Set the limit, in percent, that will trigger an alert when exceeded.
    :limit INTEGER

Example : `(=> alert disk-usage :path "/tmp" :limit 50)`


file-exists
-----------
Check if a file exists.

> Set the path of the file to check.
    :path "STRING"

Example : `(=> alert file-exists :path "/var/postgresql/standby")`


file-updated
------------
Check if a file exists and has been updated within a given time.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in minutes since the last modification time before triggering an alert.
    :limit INTEGER

Example : `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`


load-average-1
--------------
Check if the load average during the last minute exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example : `(=> alert load-average-1 :limit 2)`


load-average-5
--------------
Check if the load average during the last five minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example : `(=> alert load-average-5 :limit 2)`


load-average-15
---------------
Check if the load average during the last fifteen minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example : `(=> alert load-average-15 :limit 2)`


ping
----
Check if a remote host answers 2 ICMP pings.

> Set the host to ping. Return an error if ping command returns non-zero.
    :host "STRING" (can be IP or hostname)

Example : `(=> alert ping :host "8.8.8.8")`


command
-------
Execute an arbitrary command which triggers an alert if it returns a non-zero value.
This may be the most useful probe because it lets the user do any check needed.

> Command to execute. Commands with pipes are accepted.
    :command "STRING"

Example : `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`
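
As a sketch of a more elaborate check, testing whether a URL contains
a pattern can be written with curl and grep (the URL and the pattern
are placeholders). grep -q exits with a non-zero status when the
pattern is not found, which triggers the alert :

    (=> alert command :command "curl -s http://example.org/ | grep -q 'Welcome'")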

service
-------
Check if a service is started on the system.

> Set the name of the service to test.
    :name "STRING"

Example : `(=> alert service :name "mysql-server")`

file-less-than
--------------
Check if a file has a size less than a specified limit.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in bytes before triggering an alert.
    :limit INTEGER

Example : `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`


The configuration file
======================

The configuration file is Common LISP code, so it's evaluated. It's
possible to write some logic within it.
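
As a minimal sketch, a threshold can be stored in a variable and
reused in several checks. The variable name *load-limit* is only an
illustration. This assumes a notifier named **mail** has already been
declared with (alert mail "..."), and that parameters are evaluated as
ordinary Lisp expressions :

    ;; hypothetical shared threshold, not part of reed-alert
    (defparameter *load-limit* 4)

    (=> mail load-average-1 :limit *load-limit*)
    ;; a parameter may also be computed
    (=> mail load-average-15 :limit (/ *load-limit* 2))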


Loops
-----
It's possible to write loops if you don't want to repeat code :

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
     do
       (=> mail ping :host host))

or another example

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
     do
       (=> mail service :name service))

and another example using rows from a file to check remote hosts

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
        while line
        do
          (=> mail ping :host line)))
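
A loop can also walk over several parameters at once. The following
sketch (mount points and limits are arbitrary) uses the destructuring
form of loop to check a different disk-usage limit for each partition :

    (loop for (path limit) in '(("/" 80) ("/var" 90) ("/tmp" 50))
     do
       (=> mail disk-usage :path path :limit limit))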


Conditional
-----------
It is also possible to write conditionals. Two groups of conditionals
are especially useful.


Dependency
~~~~~~~~~~
Sometimes it is a good idea to stop running further probes once one
probe fails. A typical case is checking a path through a network, from
the nearest machine to the remote target: if we can't reach our local
router, every probe that depends on the router will also fail, so we
should skip them.

    (stop-if-error
      (=> mail ping :host "192.168.1.1" :desc "My local router")
      (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
      (=> mail ping :host "kernel.org"  :desc "Remote website"))

Note : stop-if-error is an alias for the **and** macro.
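
The same pattern applies to any probe. In this sketch (host name, URL
and descriptions are placeholders, and curl is assumed to be
installed), the HTTP check only runs if the host answers to ping :

    (stop-if-error
      (=> mail ping :host "www.example.org" :desc "Web server")
      (=> mail command :command "curl -sf http://www.example.org/ > /dev/null"
                       :desc "Web page"))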


Escalation
~~~~~~~~~~
It can be a good idea to use different alerts depending on how
critical a check is. Sometimes, however, the critical level depends on
the value of the error and/or on the delay between detecting a problem
and fixing it. You may want to receive a mail when something can be
fixed in your spare time, but to alert other people if it gets past
some level without being fixed.

    (escalation
      (=> mail-me disk-usage :path "/" :limit 70)
      (=> sms-me  disk-usage :path "/" :limit 90)
      (=> buzzer  disk-usage :path "/" :limit 98))

In this example, we check the disk usage. I will get a mail through
the "mail-me" alert if the disk usage goes above 70%. Once it has gone
that far, reed-alert checks whether the disk usage is above 90%; if
so, I'll receive an SMS through the "sms-me" alert. Then, if it goes
above 98%, the "buzzer" alert will make some bad noises in the room to
warn me about this.

Note : escalation is an alias for the **or** macro.
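
Escalation works with any probe. As a sketch (the limits are
arbitrary, and the notifiers are those of the example above), the
second check below is only evaluated when the first one has raised an
alert :

    (escalation
      (=> mail-me load-average-5 :limit 4)
      (=> sms-me  load-average-5 :limit 8))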