Description
===========
reed-alert is a small and simple monitoring tool for your server,
written in Common LISP.
reed-alert checks the status of various processes on a server and
triggers user-defined notifications.
Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.
Dependencies
============
reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
tested with both **sbcl** and **ecl** - which should be available for
most distributions.
(On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
on the partition where the binary is.)
To make reed-alert's deployment easier I avoid using external
libraries. reed-alert only requires a Common LISP interpreter and
its own files.
An attempt to use quicklisp libraries to write more sophisticated
checks like "does this URL contain a pattern?" was started and then
abandoned. Instead, if you need more elaborate checks, write a shell
command in the **command** probe.
Code-Readability
================
Although the code is very rough for now, I think it's already fairly
understandable by people who do need this kind of tool.
I will try to improve on the readability of the config file in future
commits. NOTE : declaration of notifiers is easier now.
Usage
=====
Install reed-alert
------------------
$ cd reed-alert
$ make
$ sudo make install
$ /usr/local/bin/reed-alert ~/monitoring/my_config.lisp
Special folder
--------------
reed-alert will create a folder at the following path, in order to
save the state of each probe between invocations.
~/.reed-alert/states/
If you delete it, you will lose the failure states of previous runs.
Reed-alert start automation
---------------------------
You can use cron to start reed-alert every n minutes (or whatever
interval you want). The frequency depends on what you check: if you
only want to check that the daily backup worked, running reed-alert
once a day is fine, but if you need to monitor a critical service then
every minute is more appropriate.
As always with cron jobs, be sure that either you call the interpreter
using its full path or that $PATH inside the crontab contains it.
A cron job running every five minutes using ecl would look like this:
*/5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )
Personal Configuration File
---------------------------
You may want to rename **example-simple.lisp** to **config.lisp** in
order to create your own configuration file.
The configuration is explained below.
The Notification System
=======================
When a check returns a failure, a previously defined notifier is
called. It is triggered only after reed-alert finds **3** failures in
a row for this check (no more, no less). This default can be changed
globally by modifying the *tries* variable, or per probe with the :try
parameter as explained later in this document. This prevents
reed-alert from spamming notifications for a long time (for a check
that keeps failing, like a disk space usage that can't be fixed
quickly) and from sending notifications about a check on the edge of
its limit, like a ping that fails from time to time or a load average
hovering around the limit.
reed-alert uses the notifier when a check reaches its try number and
again when the problem is fixed, so you know when it begins and when
it ends.
It is possible to be reminded about a failure every n tries by setting
the keyword :reminder to a number. This is useful if you want to be
reminded from time to time that a problem is not fixed, since alerts
such as mails can easily be overlooked or lost among many other
mails. :reminder is a per-check setting. For a global reminder
setting, set the *reminder* variable.
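As a rough sketch, the global and per-check settings could be combined
as follows. The notifier name, the address and the numbers are
placeholders, and it is an assumption that the two global variables
can simply be assigned with setf from the configuration file:

```lisp
;; Hypothetical notifier named "mail"; replace the address with your own.
(alert mail "echo '%state% %function% %params% %result%' | mail yourmail@foo.bar")

;; Assumption: the global defaults are plain variables set with setf.
(setf *tries* 5)      ; notify after 5 failures in a row instead of 3
(setf *reminder* 60)  ; remind every 60 tries while a check keeps failing

;; Per-check :try and :reminder values, as described in the text.
(=> mail ping :host "192.168.1.1" :try 2 :reminder 10)
```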
reed-alert keeps track of the failure count with one file per failing
probe in the "states" folder. To ensure unique filenames, the
following format is used (+ means concatenation):
alert-name + probe-name + hash of probe parameters
The notifier is a shell command with a name. The shell command can
contain variables from reed-alert.
+ %function% : the name of the probe
+ %date% : the current date with format YYYY/MM/DD hh:mm:ss
+ %params% : the parameters of the probe
+ %hostname% : the hostname of the server
+ %result% : the error returned (the value exceeding the limit, file not found)
+ %desc% : an arbitrary description naming a check, defaults to an empty string
+ %level% : the type of notification used
+ %os% : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline% : a newline character
+ %state% : "start" / "end" when the problem happens / is solved
Example Probe 1: 'Check For Load Average'
---------------------------------------
If you want to send a mail with a message like:
"On 2016/10/06 11:11:12 server.foo.com has encountered a problem
during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"
write the following at the top of the file and use **pretty-mail** in your checks:
(alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
%params% with a value of %result%' | mail yourmail@foo.bar")
Example Probe 2: 'Don't do anything'
------------------------------------
If you don't want anything to be done when an error occurs, use the following:
(alert nothing-to-send "")
Example Probe 3: 'Send SMS'
---------------------------
You may want to use an external service to send an SMS. This is
possible because notifiers rely on a shell command:
(alert sms "echo 'error on %hostname% : %function% %result%'
| curl -u login:pass http://api.sendsms.com/")
The Probes
==========
Probes are written in Common LISP. They are predefined checks.
The :desc Parameter
-------------------
The :desc parameter allows you to describe specifically what your check
does. It can be put in every probe.
:desc "STRING"
The :try Parameter
------------------
The :try parameter allows you to change how many failures in a row to
wait for before the alert is triggered. By default, it's triggered
after 3 failures. Sometimes, when using ping for example, you want to
be notified only when it fails for a few cycles and not at the first
failure.
:try INTEGER
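As an illustration, both parameters can be combined on a single check.
The notifier name, the host and the values below are placeholders, not
recommendations:

```lisp
;; "mail" is assumed to be a notifier declared earlier with (alert mail "...").
;; Notify only after 5 failed pings in a row, and label the check so
;; that %desc% is meaningful in the notification.
(=> mail ping :host "192.168.1.1" :try 5 :desc "My local router")
```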
Overview
--------
As of this commit, reed-alert ships with the following probes:
(1) number-of-processes
(2) pid-running
(3) disk-usage
(4) check-file-exists
(5) file-updated
(6) load-average-1
(7) load-average-5
(8) load-average-15
(9) ping
(10) command
(11) service
(12) file-less-than
(13) curl-http-status
(14) ssl-expiration
(15) write-to-file
number-of-processes
-------------------
Check if the actual number of processes of the system exceeds a specific limit.
> Set the limit that will trigger an alert when exceeded.
:limit INTEGER
Example: `(=> alert number-of-processes :limit 200)`
pid-running
-----------
Check if the PID number found in a .pid file is alive.
> Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".
:path "STRING"
Example: `(=> alert pid-running :path "/var/run/nginx.pid")`
disk-usage
----------
Check if the disk usage of a chosen partition exceeds a specific limit.
> Set the mountpoint to check.
:path "STRING"
> Set the limit, as a percentage of used space, that will trigger an alert when exceeded.
:limit INTEGER
Example: `(=> alert disk-usage :path "/tmp" :limit 50)`
check-file-exists
-----------------
Check if a file exists.
> Set the path of the file to check.
:path "STRING"
Example: `(=> alert check-file-exists :path "/var/postgresql/standby")`
file-updated
------------
Check if a file exists and has been updated since a defined time.
> Set the path of the file to check.
:path "STRING"
> Set the limit in minutes since the last modification time before triggering an alert.
:limit INTEGER
Example: `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`
load-average-1
--------------
Check if the load average during the last minute exceeds a specific limit.
> Set the limit not to exceed.
:limit INTEGER
Example: `(=> alert load-average-1 :limit 2)`
load-average-5
--------------
Check if the load average during the last five minutes exceeds a specific limit.
> Set the limit not to exceed.
:limit INTEGER
Example: `(=> alert load-average-5 :limit 2)`
load-average-15
---------------
Check if the load average during the last fifteen minutes exceeds a specific limit.
> Set the limit not to exceed.
:limit INTEGER
Example: `(=> alert load-average-15 :limit 2)`
ping
----
Check if a remote host answers 2 ICMP pings.
> Set the host to ping. Return an error if ping command returns non-zero.
:host "STRING" (can be IP or hostname)
Example: `(=> alert ping :host "8.8.8.8")`
command
-------
Execute an arbitrary command which triggers an alert if it returns a non-zero value.
This may be the most useful probe because it lets the user do any check needed.
> Set the command to execute. Commands with pipes are accepted.
:command "STRING"
Example: `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`
service
-------
Check if a service is started on the system.
> Set the name of the service to test.
:name "STRING"
Example: `(=> alert service :name "mysql-server")`
file-less-than
--------------
Check if a file has a size less than a specified limit.
> Set the path of the file to check.
:path "STRING"
> Set the limit in bytes before triggering an alert.
:limit INTEGER
Example: `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`
curl-http-status
----------------
Do an HTTP request and return an error if the status code isn't
200. Requires curl.
> Set the url to request.
:url "STRING"
> Set the time to wait before aborting.
:timeout INTEGER
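The other probes come with an example; a comparable one for
curl-http-status might look like this. The URL and the timeout value
are placeholders, and the timeout is assumed to be expressed in
seconds, which the text above does not state:

```lisp
;; Hypothetical check: alert if the page does not answer with HTTP 200.
;; Assumption: :timeout is a number of seconds.
(=> alert curl-http-status :url "https://www.example.org/" :timeout 10)
```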
ssl-expiration
--------------
Check if a remote SSL certificate expires in less than a specified
time. Requires openssl.
> Set the hostname for the request.
:host "STRING"
> Set the expiration time limit in seconds.
:seconds INTEGER
> Set the port for the request (OPTIONAL).
:port INTEGER (defaults to 443)
> Use starttls (OPTIONAL).
:starttls "STRING"
Example: `(=> alert ssl-expiration :host "domain.local" :seconds (* 7 24 60 60))`
Example: `(=> alert ssl-expiration :host "domain.local" :seconds 86400 :port 6697)`
Example: `(=> alert ssl-expiration :host "smtp.domain.local" :seconds 86400 :starttls "smtp" :port 25)`
write-to-file
-------------
Write content to a file, creating it if it does not exist.
The purpose of this probe is to be used at the end of a reed-alert
script to update the modification time of a file. Using file-updated
on this file at the beginning of the script then lets you monitor
whether reed-alert finished correctly on its last run.
> Set the path of the file.
:path "STRING"
> Set the content of the file (OPTIONAL).
:text "STRING" (defaults to the current time in seconds)
Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt")`
Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt" :text "hello world")`
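The pattern described above, with file-updated at the beginning and
write-to-file at the end, could be laid out like this. This is only a
sketch: the notifier, the marker path and the 10 minute limit are
assumptions chosen for a configuration that runs every 5 minutes:

```lisp
;; Hypothetical notifier; replace the address with your own.
(alert mail "echo '%function% %params% %result%' | mail yourmail@foo.bar")

;; Beginning of the script: fail if the marker file is missing or has
;; not been modified for more than 10 minutes, which suggests that the
;; previous run did not reach the end of the script.
(=> mail file-updated :path "/tmp/reed-alert.txt" :limit 10)

;; ... the other checks go here ...

;; End of the script: refresh the marker file.
(=> mail write-to-file :path "/tmp/reed-alert.txt")
```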
The configuration file
======================
The configuration file is Common LISP code, so it's evaluated. It's
possible to write some logic within it.
Loops
-----
It's possible to write loops if you don't want to repeat code:
(loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
      do
      (=> mail ping :host host))
or another example:
(loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
      do
      (=> mail service :name service))
and another example using rows from a file to check remote hosts:
(with-open-file (stream "hosts.txt")
  (loop for line = (read-line stream nil)
        while line
        do
        (=> mail ping :host line)))
Conditional
-----------
It is also possible to write conditionals. There are two very useful
conditional groups.
Dependency
~~~~~~~~~~
Sometimes it may be a good idea to skip some probes if another probe
fails. For example, you may need to check a path through a network,
from the nearest machine to the remote target. If we can't reach our
local router, probes requiring the router to work will trigger errors
too, so we should skip them.
(stop-if-error
 (=> mail ping :host "192.168.1.1" :desc "My local router")
 (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
 (=> mail ping :host "kernel.org" :desc "Remote website"))
Note : stop-if-error is an alias for the **and** macro.
Escalation
~~~~~~~~~~
It could be a good idea to use different alerts depending on how
critical a check is, but sometimes the critical level depends on the
value of the error and/or the delay between detecting it and fixing
it. You may want to receive a mail when things can be fixed in your
spare time, but alert other people if things get past a certain
level.
(escalation
 (=> mail-me disk-usage :path "/" :limit 70)
 (=> sms-me disk-usage :path "/" :limit 90)
 (=> buzzer disk-usage :path "/" :limit 98))
In this example, we check the disk usage. I will get a mail through
the "mail-me" alert if the disk usage goes above 70%. Once it goes
that far, reed-alert checks whether the disk usage is above 90%; if
so, I'll receive an SMS through the "sms-me" alert. Then, if it goes
above 98%, the "buzzer" alert will make some bad noises in the room to
warn me about this.
Note : escalation is an alias for the **or** macro.
Extend with your own probes
===========================
It is likely that you want to write your own probes. While using the
command probe can be convenient, you may want to have a probe with
more parameters and better integration than the command probe.
There are two methods for adding probes :
- in the configuration file before using it
- in a separate lisp file that you load from the configuration file
If you want to reuse your probes across multiple configuration files
or servers, I would recommend a separate file. Otherwise, adding them
at the top of the configuration file can be convenient too.
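For the second method, a minimal sketch using the standard Common LISP
load function could look like this. The file name "my-probes.lisp" and
the "mail" notifier are hypothetical, and a relative path is resolved
from the directory the interpreter runs in, so an absolute path may be
safer under cron:

```lisp
;; "my-probes.lisp" is a hypothetical file holding your create-probe
;; definitions. It must be loaded before the probes are used.
(load "my-probes.lisp")

;; Probes defined in that file can then be called like the built-in
;; ones. check-http-pattern is the custom probe written in the next
;; section, assumed here to live in my-probes.lisp.
(=> mail check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")
```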
Using a shell command
---------------------
A minimum of Common LISP comprehension is needed for this. But with
the easiest approach, writing a probe around a shell command, the
declaration can be really simple.
We are going to write a probe that uses curl to fetch a page and
then greps the output to look for a pattern. The return code of grep
will be the return status of the probe: if grep finds the pattern,
it's a success, if not it's a failure.
In the following code, "create-probe" is a macro that will
write most of the code for you. Then, we use the "command-return-code"
function, which executes the shell command passed as a string (or
as a list) and returns the correct values in case of success or
failure.
(create-probe
 check-http-pattern
 ;; Single quotes keep a URL or pattern containing spaces as one shell argument.
 (command-return-code (format nil "curl '~a' | grep -i '~a'"
                              (getf params :url) (getf params :pattern))))
If you don't know LISP, "format" function works like "printf", using
"~a" instead of "%s". This is the only required thing to know if you
want to reuse the previous code.
Then we can call it like this :
(=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")
Using plain LISP
----------------
We have seen previously how to create new probes from a shell command,
but one may want to do it in LISP, which allows using the full
features of the language and even some libraries, to check values in a
database for example. I recommend reading the "probes.lisp" file, it's
the best way to learn how to write a new probe. But as an example, we
will learn from the easiest probe included : check-file-exists
(create-probe
 check-file-exists
 (let ((result (probe-file (getf params :path))))
   (if result
       t
       (list nil "file not found"))))
Like before, we use the "create-probe" macro and give a name to the
probe. Then, we have to write some code, in the current case to check
if the file exists. Finally, if it is a success, we have to return
**t**; if it fails we return a list containing **nil** and a value or
a string. The second element of the list will replace %result% in the
notification command, so you can use something explicit, such as a
concatenation of a message with the return value. Parameters should be
read with getf from the **params** variable, which allows using a
default value in case a parameter is not defined in the configuration
file.
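To illustrate the default value, here is a sketch of a made-up probe
that is not part of reed-alert. The name file-more-than and its
behaviour are assumptions for the example; it relies only on
create-probe and params as shown above and on standard Common LISP
functions. getf takes an optional third argument which is returned
when the key is absent:

```lisp
;; Hypothetical probe: fail when a file is missing or smaller than
;; :limit bytes. :limit falls back to 1 when it is not given.
(create-probe
 file-more-than
 (let ((path (getf params :path))
       (limit (getf params :limit 1)))
   (if (probe-file path)
       (with-open-file (stream path :element-type '(unsigned-byte 8))
         (let ((size (file-length stream)))
           (if (>= size limit)
               t
               ;; the size becomes %result% in the notification
               (list nil size))))
       (list nil "file not found"))))
```

It could then be called with or without the limit, for example
`(=> notifier file-more-than :path "/var/log/messages")`.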