Description
===========

reed-alert is a small and simple monitoring tool for your server, written in Common LISP.

reed-alert checks the status of various processes on a server and triggers self-defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.

Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been tested with both **sbcl** and **ecl**, which should be available for most distributions. (On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed' on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external libraries. reed-alert only requires a Common LISP interpreter and its own files.

An attempt to use quicklisp libraries for more sophisticated checks like "does this URL contain a pattern?" was started and then abandoned. Instead, if you need a more elaborate check, write a shell command and run it with the **command** probe.

Code-Readability
================

Although the code is very rough for now, I think it's already fairly understandable by people who need this kind of tool.

I will try to improve the readability of the config file in future commits.

NOTE : declaration of notifiers is easier now.

Usage
=====

Start reed-alert
----------------

To start reed-alert

+ sbcl : **sbcl --script config_file.lisp**
+ ecl  : **ecl --shell config_file.lisp**

Older versions of ecl require -shell instead of --shell.

Reed-alert start automation
---------------------------

You can use cron to start reed-alert every n minutes (or whatever time range you want). The frequency depends on what you check: if you only want to check that the daily backup worked, running reed-alert once a day is fine, but if you need to monitor a critical service then every minute is more appropriate.
As always with cron jobs, be sure that either you call the interpreter using its full path or that $PATH inside the crontab contains it.

A cron job running every five minutes using ecl would look like this :

    */5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )

Personal Configuration File
---------------------------

You may want to rename **config.lisp.sample** to **config.lisp** in order to create your own configuration file.

The configuration is explained below.

The Notification System
=======================

When a check returns a failure, a previously defined notifier is called. This happens only once reed-alert has found exactly **3** failures in a row for this check. This is a default value that can be changed per probe with the :try parameter, as explained later in this document. The purpose is twofold: to prevent reed-alert from spamming notifications for a long time (when the number of failures gets very high, like a disk space usage that can't be fixed quickly), and to prevent notifications about a check that is on the edge of its limit, like a ping that mostly works but fails from time to time, or a load average hovering around the limit.

reed-alert calls the notifier when a probe reaches its try number and again when the problem is fixed, so you know when it begins and when it ends.

reed-alert keeps track of the count of failures with one file per failing probe in the "states" folder. To ensure unique filenames, the following format is used (+ means it's concatenated) :

    alert-name + probe-name + hash of probe parameters

The notifier is a shell command with a name. The shell command can contain variables from reed-alert.

+ %function%    : the name of the probe
+ %date%        : the current date with format YYYY/MM/DD hh:mm:ss
+ %params%      : the parameters of the probe
+ %hostname%    : the hostname of the server
+ %result%      : the error returned (the value exceeding the limit, file not found)
+ %description% : an arbitrary description naming a check
+ %level%       : the type of notification used
+ %os%          : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline%     : a newline character
+ %state%       : "start" / "end" when the problem happens / is solved

Example Probe 1: 'Check For Load Average'
---------------------------------------

If you want to send a mail with a message like:

    "On 2016/10/06 11:11:12 server.foo.com has encountered a problem during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"

write the following at the top of the file and use **pretty-mail** in your checks:

    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function% %params% with a value of %result%' | mail yourmail@foo.bar")

Example Probe 2: 'Don't do anything'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you don't want anything to be done when an error occurs, use the following :

    (alert nothing-to-send "")

Example Probe 3: 'Send SMS'
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to use an external service to send an SMS. This is entirely possible because notifiers are shell commands :

    (alert sms "echo 'error on %hostname% : %function% %result%' | curl -u login:pass http://api.sendsms.com/")

The Probes
==========

Probes are written in Common LISP. They are predefined checks.

The :desc Parameter
-------------------

The :desc parameter allows you to describe specifically what your check does. It can be put in every probe.

    :desc "STRING"

The :try Parameter
------------------

The :try parameter allows you to change how many failures to wait for before the alert is triggered. By default, it's triggered after 3 failures. Sometimes, when using ping for example, you want to be notified only after it has failed for a few cycles and not at the first failure.

    :try INTEGER

Overview
--------

As of this commit, reed-alert ships with the following probes:

 (1) number-of-processes
 (2) pid-running
 (3) disk-usage
 (4) file-exists
 (5) file-updated
 (6) load-average-1
 (7) load-average-5
 (8) load-average-15
 (9) ping
(10) command
(11) service
(12) file-less-than

number-of-processes
-------------------

Check if the actual number of processes of the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.

    :limit INTEGER

Example : `(=> alert number-of-processes :limit 200)`

pid-running
-----------

Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, the probe returns "file not found".

    :path "STRING"

Example : `(=> alert pid-running :path "/var/run/nginx.pid")`

disk-usage
----------

Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.

    :path "STRING"

> Set the limit (in percent) that will trigger an alert when exceeded.

    :limit INTEGER

Example : `(=> alert disk-usage :path "/tmp" :limit 50)`

file-exists
-----------

Check if a file exists.

> Set the path of the file to check.

    :path "STRING"

Example : `(=> alert file-exists :path "/var/postgresql/standby")`

file-updated
------------

Check if a file exists and has been updated since a defined time.

> Set the path of the file to check.

    :path "STRING"

> Set the limit in minutes since the last modification time before triggering an alert.

    :limit INTEGER

Example : `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`

load-average-1
--------------

Check if the load average during the last minute exceeds a specific limit.

> Set the limit not to exceed.

    :limit INTEGER

Example : `(=> alert load-average-1 :limit 2)`

load-average-5
--------------

Check if the load average during the last five minutes exceeds a specific limit.

> Set the limit not to exceed.

    :limit INTEGER

Example : `(=> alert load-average-5 :limit 2)`

load-average-15
---------------

Check if the load average during the last fifteen minutes exceeds a specific limit.

> Set the limit not to exceed.

    :limit INTEGER

Example : `(=> alert load-average-15 :limit 2)`

ping
----

Check if a remote host answers 2 ICMP pings.

> Set the host to ping. An error is returned if the ping command returns non-zero.

    :host "STRING" (can be IP or hostname)

Example : `(=> alert ping :host "8.8.8.8")`

command
-------

Execute an arbitrary command, which triggers an alert if it returns a non-zero value. This may be the most useful probe because it lets the user do any check needed.

> Command to execute. Commands with pipes are accepted.

    :command "STRING"

Example : `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`

service
-------

Check if a service is started on the system.

> Set the name of the service to test.

    :name "STRING"

Example : `(=> alert service :name "mysql-server")`

file-less-than
--------------

Check if a file has a size less than a specified limit.

> Set the path of the file to check.

    :path "STRING"

> Set the limit in bytes before triggering an alert.

    :limit INTEGER

Example : `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`

The configuration file
======================

The configuration file is Common LISP code, so it's evaluated. It's possible to write some logic within it.

Loops
-----

It's possible to write loops if you don't want to repeat code

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
          do (=> mail ping :host host))

or another example

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
          do (=> mail service :name service))

and another example using rows from a file to check remote hosts

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
            while line
            do (=> mail ping :host line)))

Conditional
-----------

It is also possible to write conditionals.
There are two very useful conditional groups.

Dependency
~~~~~~~~~~

Sometimes it may be a good idea to stop running some probes if an earlier probe fails.

Consider a case where you need to check a path through a network, from the nearest machine to the remote target. If we can't reach our local router, probes requiring the router to work will trigger errors, so we should skip them.

    (stop-if-error
      (=> mail ping :host "192.168.1.1" :desc "My local router")
      (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
      (=> mail ping :host "kernel.org"  :desc "Remote website"))

Note : stop-if-error is an alias for the **and** macro.

Escalation
~~~~~~~~~~

It could be a good idea to use different alerts depending on how critical a check is, but sometimes the critical level depends on the value of the error and/or the delay between detecting it and fixing it. You may want to receive a mail when things can be fixed in your spare time, but alert someone else if things get past a certain level.

    (escalation
      (=> mail-me disk-usage :path "/" :limit 70)
      (=> sms-me  disk-usage :path "/" :limit 90)
      (=> buzzer  disk-usage :path "/" :limit 98))

In this example, we check the disk usage. I will get a mail through the "mail-me" alert if the disk usage exceeds 70%. Once it goes that far, reed-alert checks if the disk usage exceeds 90%; if so, I'll receive an SMS through the "sms-me" alert. And then, if it exceeds 98%, the "buzzer" alert will make some bad noises in the room to warn me about this.

Note : escalation is an alias for the **or** macro.

Extend with your own probes
===========================

It is likely that you want to write your own probes. While using the command probe can be convenient, you may want to have a probe with more parameters and better integration than the command probe.

There are two methods for adding probes :

- in the configuration file before using it
- in a separate lisp file that you load from the configuration file

If you want to reuse a probe across multiple configuration files or servers, I would recommend a separate file. Otherwise, adding it at the top of the configuration file can be convenient too.

Using a shell command
---------------------

A minimum of Common LISP comprehension is needed for this. But with the easiest approach, writing a probe around a shell command, the declaration can be really simple.

We are going to write a probe that uses curl to fetch a page and then greps the output to look for a pattern. The return code of grep will be the return status of the probe: if grep finds the pattern, it's a success, if not it's a failure.

In the following code, the "create-probe" part is a macro that will write most of the code for you. Then, we use the "command-return-code" function, which executes the shell command passed as a string (or as a list) and returns the correct values in case of success or failure. The pattern is wrapped in single quotes so that it may contain spaces.

    (create-probe
     check-http-pattern
     (command-return-code (format nil "curl ~a | grep -i '~a'"
                                  (getf params :url)
                                  (getf params :pattern))))

If you don't know LISP, the "format" function works like "printf", using "~a" instead of "%s". This is the only thing you need to know if you want to reuse the previous code.

Then we can call it like this :

    (=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")

Using plain LISP
----------------

We have seen previously how to create new probes from a shell command, but one may want to do it in LISP, which allows using the full features of the language and even some libraries, to check values in a database for example. I recommend reading the "probes.lisp" file, it's the best way to learn how to write a new probe.
But as an example, we will learn from the easiest probe included : file-exists

    (create-probe
     file-exists
     (let ((result (probe-file (getf params :path))))
       (if result
           t
           (list nil "file not found"))))

Like before, we use the "create-probe" macro and give a name to the probe. Then, we have to write some code, in the current case, check if the file exists. Finally, if it is a success, we have to return **t**; if it fails, we return a list containing **nil** and a value or a string. The second element of the list will replace %result% in the notification command, so you can use something explicit, such as a concatenation of a message with the returned value.

Parameters should be read with getf from the **params** variable, which also lets you supply a default value in case the parameter is not defined in the configuration file.
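
The return convention and the getf default described above can be tried outside of reed-alert. The following is only a minimal sketch: "file-exists-body" is a hypothetical plain function standing in for the body of a probe (it does not use the "create-probe" macro), and the :message parameter is invented for the illustration and is not a reed-alert parameter. It assumes that :path is always given.

```lisp
;; Hypothetical stand-in for a probe body; not part of reed-alert.
;; PARAMS is a property list such as (:path "/some/file" :message "gone").
(defun file-exists-body (params)
  (let ((path (getf params :path))
        ;; The third argument of GETF is the default value, used when
        ;; :message (a made-up parameter) is absent from PARAMS.
        (message (getf params :message "file not found")))
    (if (probe-file path)
        t                       ; success: return T
        (list nil message))))   ; failure: NIL plus the value for %result%

;; The path below is assumed not to exist on the machine running this.
(print (file-exists-body '(:path "/nonexistent/reed-alert-example")))
;; prints (NIL "file not found")
(print (file-exists-body '(:path "/nonexistent/reed-alert-example"
                           :message "gone")))
;; prints (NIL "gone")
```

Saved to a file, this can be run with **sbcl --script** or **ecl --shell** like a configuration file. In a real probe the same body would be wrapped in "create-probe", where **params** is provided for you.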