Description
===========

reed-alert is a small and simple monitoring tool for your server,
written in Common LISP.

reed-alert checks the status of various processes on a server and
triggers user-defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.


Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
tested with both **sbcl** and **ecl** - which should be available for
most distributions.

(On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external
libraries. reed-alert only requires a Common LISP interpreter and
its own files.

Work on using quicklisp libraries to write more sophisticated checks,
like "does this URL contain a pattern?", was started and then
abandoned. If you need more elaborate checks, write a shell command
in the **command** probe instead.


Code-Readability
================

Although the code is very rough for now, I think it's already fairly
understandable by people who do need this kind of tool.

I will try to improve on the readability of the config file in future
commits. NOTE : declaration of notifiers is easier now.


Usage
=====

Install reed-alert
------------------

    $ cd reed-alert
    $ make
    $ sudo make install
    $ /usr/local/bin/reed-alert ~/monitoring/my_config.lisp


Special folder
--------------

reed-alert creates a folder at the following path in order to save
the state of each probe between invocations.

    ~/.reed-alert/states/

If you delete it, you will lose the failure states of previous runs.


Reed-alert start automation
---------------------------

You can use cron to start reed-alert every n minutes (or whatever time
range you want). The frequency depends on what you check: if you only
want to check that the daily backup worked, running reed-alert once a
day is fine, but if you need to monitor a critical service then every
minute seems more appropriate.

As always with cron jobs, be sure that either you call the interpreter
using its full path or that $PATH inside the crontab contains it.

A cron job running every 5 minutes using ecl would look like this:

    */5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )


Personal Configuration File
---------------------------
You may want to rename **example-simple.lisp** to **config.lisp** in
order to create your own configuration file.

The configuration is explained below.


The Notification System
=======================

When a check returns a failure, a previously defined notifier is
called. By default this happens only after reed-alert finds **3**
failures in a row for this check. The default can be changed globally
by modifying the *tries* variable, or per probe with the :try parameter
as explained later in this document. Notifying at exactly this count
(not more or less) prevents reed-alert from spamming notifications
when the number of failures gets very high (like a disk space usage
that can't be fixed quickly), and from sending notifications about a
check on the edge of its limit, like a ping almost working but failing
from time to time or a load average hovering around the limit.

reed-alert uses the notifier when a check reaches its try number and
again when the problem is fixed, so you know when it begins and when
it ends.

It is possible to be reminded about a failure every n tries by setting
the keyword :reminder to a number. This is useful if you want to be
reminded from time to time that a problem is not fixed, since alerts
like mails can easily be overlooked or lost among a large number of
mails. The :reminder keyword is a per-check setting. For a global
reminder setting, set the *reminder* variable.
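
As a rough sketch, the global and per-check settings could be combined
like this in a configuration file. The alert name `mail`, the address
and the numbers are placeholders, and changing the global variables
with `setf` from the configuration file is an assumption about how
they are meant to be modified.

```lisp
;; Assumption: global defaults are set with setf near the top of the config.
(setf *tries* 5)      ; notify after 5 consecutive failures instead of 3
(setf *reminder* 60)  ; remind every 60 tries while a problem persists

;; Placeholder notifier; replace the address with your own.
(alert mail "echo '%state% : %function% %params% on %hostname%' | mail admin@example.org")

;; This check keeps the global try count but gets a more frequent reminder.
(=> mail disk-usage :path "/" :limit 90 :reminder 10)
```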

reed-alert keeps track of the failure count with one file per failing
probe in the "states" folder. To ensure unique filenames, the
following format is used (+ means concatenation):

    alert-name + probe-name + hash of probe parameters

The notifier is a shell command with a name. The shell command can
contain the following variables from reed-alert:

+ %function%    : the name of the probe
+ %date%        : the current date with format YYYY/MM/DD hh:mm:ss
+ %params%      : the parameters of the probe
+ %hostname%    : the hostname of the server
+ %result%      : the error returned (the value exceeding the limit, file not found)
+ %desc%        : an arbitrary description naming a check, defaults to an empty string
+ %level%       : the type of notification used
+ %os%          : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline%     : a newline character
+ %state%       : "start" when the problem happens, "end" when it is solved
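
To show several of these variables working together, here is a sketch
of a notifier that puts the state in the mail subject. The name
`mail-state` and the address are made up for the example, and it
assumes a `mail` command that accepts `-s` for the subject.

```lisp
;; Hypothetical notifier: one mail when the problem starts, one when it ends.
;; The pipe is kept at the end of the first line so the shell reads
;; both lines as a single pipeline.
(alert mail-state "echo '[%state%] %date% %hostname% (%os%): %function% %params% %desc% : %result%' |
                   mail -s 'reed-alert %level% %state%' admin@example.org")
```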


Example Probe 1: 'Check For Load Average'
-----------------------------------------
If you want to send a mail with a message like:

	"On 2016/10/06 11:11:12 server.foo.com has encountered a problem
	during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"


write the following at the top of the file and use **pretty-mail** in your checks:

    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
                        %params% with a value of %result%' | mail yourmail@foo.bar")

Example Probe 2: 'Don't do anything'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you don't want anything to be done when an error occurs, use the following:

    (alert nothing-to-send "")

Example Probe 3: 'Send SMS'
~~~~~~~~~~~~~~~~~~~~~~~~~~~
You may want to use an external service to send an SMS. This is
possible because notifiers are shell commands:

    (alert sms "echo 'error on %hostname% : %function% %result%' |
                curl -u login:pass http://api.sendsms.com/")


The Probes
==========

Probes are written in Common LISP. They are predefined checks.

The :desc Parameter
-------------------
The :desc parameter allows you to describe specifically what your check
does. It can be put in every probe.

    :desc "STRING"


The :try Parameter
------------------
The :try parameter allows you to change how many consecutive failures
to wait for before the alert is triggered. By default, it's triggered
after 3 failures. Sometimes, when using ping for example, you want to
be notified only when it fails for a few cycles and not at the first
failure.

    :try INTEGER
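
As an illustration, :desc and :try can be combined on any probe. In
this sketch `mail` stands for an alert you have already defined, and
the hosts, limits and descriptions are placeholders.

```lisp
;; Tolerate a few lost pings: notify only after 5 failures in a row.
(=> mail ping :host "192.168.1.1" :try 5 :desc "My local router")

;; Notify on the very first failure for this check.
(=> mail load-average-5 :limit 4 :try 1 :desc "Load on the build server")
```

The description is available to the notifier through %desc%, so it
only shows up in a message if the notifier command uses that variable.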


Overview
--------
As of this commit, reed-alert ships with the following probes:

	(1) 	number-of-processes
	(2) 	pid-running
	(3) 	disk-usage
	(4) 	check-file-exists
	(5) 	file-updated
	(6) 	load-average-1
	(7) 	load-average-5
	(8) 	load-average-15
	(9)	ping
	(10)	command
	(11)	service
	(12)	file-less-than
	(13)	curl-http-status
	(14)	ssl-expiration
	(15)	write-to-file


number-of-processes
-------------------
Check if the actual number of processes of the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert number-of-processes :limit 200)`


pid-running
-----------
Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, the probe returns "file not found".
    :path "STRING"

Example: `(=> alert pid-running :path "/var/run/nginx.pid")`


disk-usage
----------
Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.
    :path "STRING"

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert disk-usage :path "/tmp" :limit 50)`


check-file-exists
-----------------
Check if a file exists.

> Set the path of the file to check.
    :path "STRING"

Example: `(=> alert check-file-exists :path "/var/postgresql/standby")`


file-updated
------------
Check if a file exists and has been updated since a defined time.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in minutes since the last modification time before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`


load-average-1
--------------
Check if the load average during the last minute exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-1 :limit 2)`


load-average-5
--------------
Check if the load average during the last five minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-5 :limit 2)`


load-average-15
---------------
Check if the load average during the last fifteen minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-15 :limit 2)`


ping
----
Check if a remote host answers 2 ICMP pings.

> Set the host to ping. Returns an error if the ping command returns non-zero.
    :host "STRING" (can be IP or hostname)

Example: `(=> alert ping :host "8.8.8.8")`


command
-------
Execute an arbitrary command which triggers an alert if it returns a non-zero value.
This may be the most useful probe because it lets the user do any check needed.

> Command to execute; commands with pipes are accepted.
    :command "STRING"

Example: `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`


service
-------
Check if a service is started on the system.

> Set the name of the service to test.
    :name "STRING"

Example: `(=> alert service :name "mysql-server")`


file-less-than
--------------
Check if a file has a size less than a specified limit.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in bytes before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`


curl-http-status
----------------
Make an HTTP request and return an error if the status code isn't 200.
Requires curl.

> Set the url to request.
    :url "STRING"

> Set the time to wait before aborting.
    :timeout INTEGER
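
A possible use of this probe, in the same spirit as the examples given
for the other probes. The URL is a placeholder, `alert` stands for a
notifier defined earlier, and reading :timeout as a number of seconds
is an assumption.

```lisp
;; Assumption: :timeout is expressed in seconds.
(=> alert curl-http-status :url "https://www.example.org/" :timeout 10)
```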


ssl-expiration
--------------------
Check if a remote SSL certificate expires in less than a specified
time. Requires openssl.

> Set the hostname for the request.
    :host "STRING"

> Set the expiration time limit in seconds.
    :seconds INTEGER

> Set the port for the request (OPTIONAL).
    :port INTEGER (defaults to 443)

> Use starttls (OPTIONAL).
    :starttls "STRING"

Example: `(=> alert ssl-expiration :host "domain.local" :seconds (* 7 24 60 60))`
Example: `(=> alert ssl-expiration :host "domain.local" :seconds 86400 :port 6697)`
Example: `(=> alert ssl-expiration :host "smtp.domain.local" :seconds 86400 :starttls "smtp" :port 25)`


write-to-file
--------------------
Write content to a file, creating it if it does not exist.

The purpose of this probe is to be used at the end of a reed-alert
script to update the modification time of a file. Using file-updated
on this file at the beginning of the script then lets you monitor
whether reed-alert finished correctly on its last run.

> Set the path of the file.
    :path "STRING"

> Set the content of the file (OPTIONAL).
    :text "STRING" (defaults to the current time in seconds)

Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt")`
Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt" :text "hello world")`
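
The pattern described above could be laid out like this in a
configuration file. This is only a sketch: `mail` is a placeholder
alert, and the 10 minute limit assumes reed-alert is started every 5
minutes.

```lisp
;; First check: alert if the previous run never reached the end.
;; :limit is in minutes.
(=> mail file-updated :path "/tmp/reed-alert.txt" :limit 10)

;; ... all the other checks go here ...

;; Last check: refresh the modification time of the file.
(=> mail write-to-file :path "/tmp/reed-alert.txt")
```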


The configuration file
======================

The configuration file is Common LISP code, so it's evaluated. It's
possible to write some logic within it.


Loops
-----
It's possible to write loops if you don't want to repeat code:

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
     do
       (=> mail ping :host host))

or another example

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
     do
       (=> mail service :name service))

and another example using rows from a file to check remote hosts

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
        while line
        do
          (=> mail ping :host line)))


Conditional
-----------
It is also possible to use conditionals. Two groups of conditionals
are particularly useful.


Dependency
~~~~~~~~~~
Sometimes it is a good idea to stop running some probes if an earlier
probe fails. For example, you may need to check a path through a
network, from the nearest machine to the remote target. If we can't
reach our local router, the probes that require the router to work
will trigger errors too, so we should skip them.

    (stop-if-error
      (=> mail ping :host "192.168.1.1" :desc "My local router")
      (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
      (=> mail ping :host "kernel.org"  :desc "Remote website"))

Note: stop-if-error is an alias for the **and** macro.


Escalation
~~~~~~~~~~
It can be a good idea to use different alerts depending on how
critical a check is. Sometimes, though, the criticality depends on the
value of the error and/or the delay between detecting a problem and
fixing it. You may want to receive a mail when something can be fixed
in your spare time, but notify other people if things aren't fixed
past some level.

    (escalation
      (=> mail-me disk-usage :path "/" :limit 70)
      (=> sms-me  disk-usage :path "/" :limit 90)
      (=> buzzer  disk-usage :path "/" :limit 98))

In this example we check the disk usage. I will get a mail through the
"mail-me" alert if the disk usage goes above 70%. Once it has gone that
far, reed-alert checks whether the disk usage is above 90%; if so, I'll
receive an SMS through the "sms-me" alert. Then, if it goes above 98%,
the "buzzer" alert will make some bad noises in the room to warn me
about this.

Note: escalation is an alias for the **or** macro.
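
Because both groups are aliases for standard Common LISP operators,
their behaviour can be tried outside reed-alert. The following sketch
uses a made-up `fake-check` function in place of real checks, on the
assumption that a check evaluates to true on success and to nil on
failure.

```lisp
;; FAKE-CHECK is a hypothetical stand-in for (=> alert probe ...):
;; it prints its name and returns OK (T for success, NIL for failure).
(defun fake-check (name ok)
  (format t "running ~a~%" name)
  ok)

;; Dependency: AND stops at the first failure.
(and (fake-check "router" t)
     (fake-check "isp-dns" nil)
     (fake-check "website" t))      ; not evaluated, isp-dns failed

;; Escalation: OR stops at the first success.
(or (fake-check "disk-below-70" nil)
    (fake-check "disk-below-90" t)
    (fake-check "disk-below-98" t)) ; not evaluated, disk-below-90 succeeded
```

Running this prints "running router", "running isp-dns",
"running disk-below-70" and "running disk-below-90": the last check
of each group is skipped.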


Extend with your own probes
===========================

It is likely that you want to write your own probes. While using the
command probe can be convenient, you may want to have a probe with
more parameters and better integration than the command probe.

There are two methods for adding probes:

- in the configuration file before using it
- in a separate lisp file that you load from the configuration file

If you want to reuse a probe across multiple configuration files or
servers, I would recommend a separate file. Otherwise, adding it at
the top of the configuration file can be convenient too.
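
For the separate file approach, a minimal sketch using the standard
`load` function could look like this. The path, the probe name
`my-custom-probe` and its parameter are all hypothetical.

```lisp
;; Hypothetical path: a file containing only create-probe forms.
(load "/etc/reed-alert/my-probes.lisp")

;; Probes defined in that file can now be used like the built-in ones.
(=> mail my-custom-probe :path "/var/db/example")
```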


Using a shell command
---------------------

A minimal understanding of Common LISP is needed for this. However,
the easiest way is to write a probe around a shell command, in which
case the declaration can be really simple.

We are going to write a probe that uses curl to fetch a page and then
greps the output for a pattern. The return code of grep will be the
return status of the probe: if grep finds the pattern it's a success,
if not it's a failure.

In the following code, the "create-probe" part is a macro that will
write most of the code for you. Then, we use "command-return-code"
function which will execute the shell command passed as a string (or
as a list) and return the correct values in case of success or
failure.

    (create-probe
     check-http-pattern
     ;; the pattern is single-quoted so that spaces in it survive the shell
     (command-return-code (format nil "curl ~a | grep -i '~a'"
                                  (getf params :url) (getf params :pattern))))

If you don't know LISP, the "format" function works like "printf",
using "~a" instead of "%s". This is the only thing you need to know to
reuse the previous code.

Then we can call it like this :

    (=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")
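
The same recipe works for other shell commands. As a further sketch,
here is a made-up probe named `check-tcp-port`; it assumes an `nc`
command that supports `-z` (connect without sending data) and `-w`
(timeout in seconds).

```lisp
;; Hypothetical probe: succeeds when a TCP connection to host:port opens.
(create-probe
 check-tcp-port
 (command-return-code (format nil "nc -z -w 5 '~a' ~a"
                              (getf params :host) (getf params :port))))

(=> notifier check-tcp-port :host "127.0.0.1" :port 5432)
```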


Using plain LISP
----------------

We have seen how to create new probes from a shell command, but you
may want to write one in LISP instead. This gives access to the full
features of the language and even to libraries, for example to check
values in a database. I recommend reading the "probes.lisp" file; it's
the best way to learn how to write a new probe. As an example, we will
look at the easiest probe included, check-file-exists:

    (create-probe
     check-file-exists
     (let ((result (probe-file (getf params :path))))
       (if result
           t
           (list nil "file not found"))))

Like before, we use the "create-probe" macro and give a name to the
probe. Then we have to write some code, in this case checking whether
the file exists. Finally, if it is a success we have to return **t**;
if it fails we return a list containing **nil** and a value or a
string. The second element of the list will replace %result% in the
notification command, so you can use something explicit, such as a
message concatenated with the returned value. Parameters should be
read with getf from the **params** variable, which allows a default
value to be used when a parameter is not defined in the configuration
file.
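
To illustrate the default value, here is a sketch of a probe that is
not shipped with reed-alert. The name `file-min-lines` and the :min
parameter are invented for the example. It succeeds when a file has at
least :min lines, and :min falls back to 1 when omitted.

```lisp
;; Hypothetical probe: succeeds when the file at :path has at least :min lines.
(create-probe
 file-min-lines
 (let ((path (getf params :path))
       (minimum (getf params :min 1)))  ; third argument: default when :min is absent
   (if (probe-file path)
       (with-open-file (stream path)
         (let ((lines (loop for line = (read-line stream nil)
                            while line
                            count line)))
           (if (>= lines minimum)
               t
               (list nil lines))))      ; the line count replaces %result%
       (list nil "file not found"))))

(=> notifier file-min-lines :path "/var/log/messages" :min 10)
```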