Description
===========

reed-alert is a small and simple monitoring tool for your server, written in Common LISP.

reed-alert checks the status of various processes on a server and triggers user-defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.

Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been tested with both **sbcl** and **ecl**, which should be available for most distributions. (On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed' on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external libraries. reed-alert only requires a Common LISP interpreter and its own files.

Work on using quicklisp libraries to write more sophisticated checks, like "does this URL contain a pattern?", was started and then abandoned. Instead, if you need more elaborate checks, write a shell command in the **command** probe.

Code-Readability
================

Although the code is very rough for now, I think it's already fairly understandable by people who need this kind of tool.

I will try to improve the readability of the config file in future commits. NOTE: declaring notifiers is easier now.

Usage
=====

Install reed-alert
------------------

    $ cd reed-alert
    $ make
    $ sudo make install
    $ /usr/local/bin/reed-alert ~/monitoring/my_config.lisp

Special folder
--------------

reed-alert will create a folder at the following path, in order to save the probe states between invocations.

    ~/.reed-alert/states/

If you delete it, you will lose the failure states of previous runs.

Reed-alert start automation
---------------------------

You can use cron to start reed-alert every n minutes (or whatever time range you want).
The frequency depends on what you check: if you only want to check that the daily backup worked, running reed-alert once a day is fine, but if you need to monitor a critical service then every minute is more suitable.

As always with cron jobs, be sure that either you call the interpreter using its full path or that $PATH inside the crontab contains it.

A cron job running every five minutes using ecl would look like this:

    */5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )

Personal Configuration File
---------------------------

You may want to rename **example-simple.lisp** to **config.lisp** in order to create your own configuration file.

The configuration is explained below.

The Notification System
=======================

When a check returns a failure, a previously defined notifier is called. This happens only once reed-alert finds **3** failures in a row for this check (exactly 3, no more and no less). This default can be changed globally by modifying the `*tries*` variable, or per probe with the :try parameter, as explained later in this document.

This prevents reed-alert from spamming notifications for a long time (when the number of failures gets very high, like a disk space usage that can't be fixed for a long time), and from sending notifications about a check on the edge of its limit, like a ping that mostly works but fails from time to time, or a load average hovering around the limit.

reed-alert uses the notifier system when a check reaches its try number and when the problem is fixed, so you know when it begins and when it ends.

It is possible to be reminded about a failure every n tries by setting the keyword :reminder to a number. This is useful if you want to be reminded from time to time that a problem is not fixed, since alerts such as mails can easily be overlooked or lost in a large volume of mail. :reminder is a per-check setting. For a global reminder setting, set the `*reminder*` variable.
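As a rough sketch of the settings described above, the fragment below shows how :try and :reminder might look in a configuration file. The alert name `mail`, the host and the numbers are illustrative assumptions. Setting the global variables with `setf` is also an assumption about how they are exposed to the configuration file.

```lisp
;; Hypothetical configuration fragment; "mail" is assumed to be an
;; alert declared elsewhere in the same file.

;; Assumption: the global defaults can be changed with setf.
(setf *tries* 5)      ; notify after 5 consecutive failures instead of 3
(setf *reminder* 20)  ; remind every 20 tries while a problem persists

;; Per-probe override: notify after 2 consecutive failed pings,
;; then send a reminder every 10 tries until the host answers again.
(=> mail ping :host "192.168.1.1" :try 2 :reminder 10)
```

With values like these, a single lost ping produces no notification, while a host that stays down triggers one "start" notification, periodic reminders, and one "end" notification when it recovers.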
reed-alert keeps track of the failure count with one file per failing probe in the "states" folder. To ensure unique filenames, the following format is used (+ means concatenation):

    alert-name + probe-name + hash of probe parameters

The notifier is a shell command with a name. The shell command can contain variables from reed-alert.

+ %function% : the name of the probe
+ %date% : the current date with format YYYY/MM/DD hh:mm:ss
+ %params% : the parameters of the probe
+ %hostname% : the hostname of the server
+ %result% : the error returned (the value exceeding the limit, file not found)
+ %desc% : an arbitrary description naming a check, defaults to an empty string
+ %level% : the type of notification used
+ %os% : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline% : a newline character
+ %state% : "start" / "end" when the problem happens / is solved

Example Probe 1: 'Check For Load Average'
-----------------------------------------

If you want to send a mail with a message like:

"On 2016/10/06 11:11:12 server.foo.com has encountered a problem during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"

write the following at the top of the file and use **pretty-mail** in your checks:

    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function% %params% with a value of %result%' | mail yourmail@foo.bar")

Example Probe 2: 'Don't do anything'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you don't want anything to be done when an error occurs, use the following:

    (alert nothing-to-send "")

Example Probe 3: 'Send SMS'
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to use an external service to send an SMS. This is entirely possible, as we rely on a shell command:

    (alert sms "echo 'error on %hostname% : %function% %result%' | curl -u login:pass http://api.sendsms.com/")

The Probes
==========

Probes are written in Common LISP. They are predefined checks.
The :desc Parameter
-------------------

The :desc parameter allows you to describe specifically what your check does. It can be put in every probe.

    :desc "STRING"

The :try Parameter
------------------

The :try parameter allows you to change how many failures to wait for before the alert is triggered. By default, it's triggered after 3 failures. Sometimes, when using ping for example, you want to be notified only after it has failed for a few cycles and not at the first failure.

    :try INTEGER

Overview
--------

As of this commit, reed-alert ships with the following probes:

    (1)  number-of-processes
    (2)  pid-running
    (3)  disk-usage
    (4)  check-file-exists
    (5)  file-updated
    (6)  load-average-1
    (7)  load-average-5
    (8)  load-average-15
    (9)  ping
    (10) command
    (11) service
    (12) file-less-than
    (13) curl-http-status
    (14) ssl-expiration
    (15) write-to-file

number-of-processes
-------------------

Check if the actual number of processes on the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.

    :limit INTEGER

Example: `(=> alert number-of-processes :limit 200)`

pid-running
-----------

Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".

    :path "STRING"

Example: `(=> alert pid-running :path "/var/run/nginx.pid")`

disk-usage
----------

Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.

    :path "STRING"

> Set the limit (in percent) that will trigger an alert when exceeded.

    :limit INTEGER

Example: `(=> alert disk-usage :path "/tmp" :limit 50)`

check-file-exists
-----------------

Check if a file exists.

> Set the path of the file to check.

    :path "STRING"

Example: `(=> alert check-file-exists :path "/var/postgresql/standby")`

file-updated
------------

Check if a file exists and has been updated since a defined time.

> Set the path of the file to check.

    :path "STRING"

> Set the limit in minutes since the last modification time before triggering an alert.

    :limit INTEGER

Example: `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`

load-average-1
--------------

Check if the load average during the last minute exceeds a specific limit.

> Set the limit not to exceed.

    :limit INTEGER

Example: `(=> alert load-average-1 :limit 2)`

load-average-5
--------------

Check if the load average during the last five minutes exceeds a specific limit.

> Set the limit not to exceed.

    :limit INTEGER

Example: `(=> alert load-average-5 :limit 2)`

load-average-15
---------------

Check if the load average during the last fifteen minutes exceeds a specific limit.

> Set the limit not to exceed.

    :limit INTEGER

Example: `(=> alert load-average-15 :limit 2)`

ping
----

Check if a remote host answers 2 ICMP pings.

> Set the host to ping. Return an error if the ping command returns non-zero.

    :host "STRING" (can be IP or hostname)

Example: `(=> alert ping :host "8.8.8.8")`

command
-------

Execute an arbitrary command, which triggers an alert if it returns a non-zero value. This may be the most useful probe because it lets the user do any check needed.

> Command to execute, accepts commands with pipes.

    :command "STRING"

Example: `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`

service
-------

Check if a service is started on the system.

> Set the name of the service to test.

    :name "STRING"

Example: `(=> alert service :name "mysql-server")`

file-less-than
--------------

Check if a file has a size less than a specified limit.

> Set the path of the file to check.

    :path "STRING"

> Set the limit in bytes before triggering an alert.

    :limit INTEGER

Example: `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`

curl-http-status
----------------

Perform an HTTP request and return an error if the status code isn't 200. Requires curl.

> Set the url to request.

    :url "STRING"

> Set the time to wait before aborting.

    :timeout INTEGER

ssl-expiration
--------------

Check if a remote SSL certificate expires in less than a specified time. Requires openssl.

> Set the hostname for the request.

    :host "STRING"

> Set the expiration time limit in seconds.

    :seconds INTEGER

> Set the port for the request (OPTIONAL).

    :port INTEGER (defaults to 443)

> Use starttls (OPTIONAL).

    :starttls "STRING"

Example: `(=> alert ssl-expiration :host "domain.local" :seconds (* 7 24 60 60))`

Example: `(=> alert ssl-expiration :host "domain.local" :seconds 86400 :port 6697)`

Example: `(=> alert ssl-expiration :host "smtp.domain.local" :seconds 86400 :starttls "smtp" :port 25)`

write-to-file
-------------

Write content to a file, creating it if it does not exist. This probe is meant to be used at the end of a reed-alert script to update the modification time of a file. Using file-updated on this file at the beginning of the script then lets you monitor whether reed-alert finished correctly on its last run.

> Set the path of the file.

    :path "STRING"

> Set the content of the file (OPTIONAL).

    :text "STRING" (defaults to the current time in seconds)

Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt")`

Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt" :text "hello world")`

The configuration file
======================

The configuration file is Common LISP code, so it's evaluated. It's possible to write some logic within it.

Loops
-----

It's possible to write loops if you don't want to repeat code:

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
          do (=> mail ping :host host))

or another example:

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
          do (=> mail service :name service))

and another example, using rows from a file to check remote hosts:

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
            while line
            do (=> mail ping :host line)))

Conditional
-----------

It is also possible to write conditionals. There are two very useful groups of conditionals.
Dependency
~~~~~~~~~~

Sometimes it is a good idea to skip some probes if an earlier probe fails, for example when you need to check a path through a network, from the nearest machine to the remote target. If we can't reach our local router, probes requiring the router to work will trigger errors, so we should skip them.

    (stop-if-error
      (=> mail ping :host "192.168.1.1" :desc "My local router")
      (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
      (=> mail ping :host "kernel.org" :desc "Remote website"))

Note: stop-if-error is an alias for the Common LISP **and** operator, so evaluation stops at the first probe that fails.

Escalation
~~~~~~~~~~

It can be a good idea to use different alerts depending on how critical a check is, but sometimes the critical level depends on the value of the error and/or the delay between detecting the problem and fixing it. You may want to receive a mail when things can be fixed in your spare time, but alert other people if the problem goes past a certain level.

    (escalation
      (=> mail-me disk-usage :path "/" :limit 70)
      (=> sms-me  disk-usage :path "/" :limit 90)
      (=> buzzer  disk-usage :path "/" :limit 98))

In this example, we check the disk usage. I will get a mail through the "mail-me" alert if the disk usage goes above 70%. Once it goes that far, reed-alert will check if the disk usage is above 90%; if so, I'll receive an SMS through the "sms-me" alert. And then, if it goes above 98%, the "buzzer" alert will make some bad noises in the room to warn me about this.

Note: escalation is an alias for the Common LISP **or** operator, so evaluation stops at the first probe that succeeds.

Extend with your own probes
===========================

It is likely that you will want to write your own probes. While using the command probe can be convenient, you may want a probe with more parameters and better integration than the command probe.
There are two methods for adding probes:

- in the configuration file, before using it
- in a separate lisp file that you load from the configuration file

If you want to reuse a probe across multiple configuration files or servers, I would recommend a separate file. Otherwise, adding it at the top of the configuration file can be convenient too.

Using a shell command
---------------------

A minimal understanding of Common LISP is needed for this. But if you take the easiest route and write a probe around a shell command, the declaration can be really simple.

We are going to write a probe that uses curl to fetch a page and then greps the output for a pattern. The return code of grep will be the return status of the probe: if grep finds the pattern it's a success, if not it's a failure.

In the following code, "create-probe" is a macro that writes most of the code for you. Then we use the "command-return-code" function, which executes the shell command passed as a string (or as a list) and returns the correct values in case of success or failure.

    (create-probe
     check-http-pattern
     (command-return-code (format nil "curl ~a | grep -i '~a'"
                                  (getf params :url)
                                  (getf params :pattern))))

If you don't know LISP, the "format" function works like "printf", using "~a" instead of "%s". This is the only thing you need to know if you want to reuse the previous code. The pattern is wrapped in single quotes so that it can contain spaces.

Then we can call it like this:

    (=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")

Using plain LISP
----------------

We have seen previously how to create new probes from a shell command, but you may want to do it in LISP, which gives access to the full features of the language and even to libraries, to check values in a database for example. I recommend reading the "probes.lisp" file; it's the best way to learn how to write a new probe.
But as an example, we will learn from the easiest probe included: check-file-exists

    (create-probe
     check-file-exists
     (let ((result (probe-file (getf params :path))))
       (if result
           t
           (list nil "file not found"))))

Like before, we use the "create-probe" macro and give a name to the probe. Then we have to write some code, in this case to check if the file exists. Finally, if it is a success, we have to return **t**; if it fails, we return a list containing **nil** and a value or a string. The second element of the list will replace %result% in the notification command, so you can use something explicit, such as a message concatenated with the returned value.

Parameters should be retrieved with getf from the **params** variable, which allows a default value to be used when a parameter is not defined in the configuration file.
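To illustrate the return convention and the use of a default value with getf, here is a minimal sketch of a custom probe. The probe name `file-at-least` and its behaviour are hypothetical and not part of reed-alert. It assumes, as in the example above, that `create-probe` makes the **params** property list available to the body.

```lisp
;; Hypothetical probe: succeeds when the file at :path is at least
;; :limit bytes long. :path is required, :limit is optional.
(create-probe
 file-at-least
 (let* ((path  (getf params :path))
        ;; The third argument of getf is the default value,
        ;; used when :limit is absent from the configuration.
        (limit (getf params :limit 1))
        ;; size is nil when the file does not exist.
        (size  (with-open-file (stream path
                                       :if-does-not-exist nil
                                       :element-type '(unsigned-byte 8))
                 (and stream (file-length stream)))))
   (cond ((null size)     (list nil "file not found"))
         ((>= size limit) t)
         ;; On failure, the second element replaces %result%.
         (t               (list nil size)))))
```

It could then be called like any shipped probe, for instance `(=> mail file-at-least :path "/var/backups/db.dump" :limit 1024)`, where `mail` stands for an alert declared in your own configuration file.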