monitoring Archives - Damia Blog

Outgoing SMTP server monitoring

March 1, 2015 by damia

Mail service monitoring is not as easy as web monitoring, and outgoing SMTP is even harder to monitor. Here are some common service problems and how to monitor them:

Mail is not being delivered: This can happen because of connectivity problems, unavailable remote servers, a software problem on the mail server, etc. To monitor it, simply check how long the queue of pending mail is.
On Postfix we can use this script, which warns us when a threshold is reached:



# Keep the queue lines that contain an address, then count them
mailq | grep "@" > /tmp/queue.txt
count=`wc -l < /tmp/queue.txt`
if test "$count" -gt 100 ; then
        /usr/local/bin/mobilealert TOO_MANY_MAILS_ON_QUEUE
        mail -s "TOO_MANY_MAILS_ON_QUEUE_$count" admin@company.com < /tmp/queue.txt
fi
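Counting the lines that contain "@" counts sender and recipient lines alike, so a message with several recipients weighs more than one entry. A minimal sketch of an alternative follows, assuming the default Postfix mailq layout in which each queued message starts with a hexadecimal queue ID at the beginning of the line. The function name count_queued is hypothetical, and long queue IDs are not handled:

```shell
# Count queued messages from `mailq` output read on stdin.
# Assumption: each message starts with a hexadecimal queue ID in column 1,
# optionally followed by * (active) or ! (on hold).
count_queued() {
    # grep exits with 1 when nothing matches; the count 0 is still printed.
    grep -c -E '^[0-9A-F]{5,}[*!]?[[:space:]]' || true
}
```

Used as `mailq | count_queued`, the result could replace the line count in the threshold test above.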

Mail is being bounced: A bigger problem is mail being bounced by remote servers. We have to detect the bounces that are caused by our own side, not those caused by the remote side.
Examples of remote-side bounce causes:
-Account doesn’t exist.
-Quota exceeded.
-Domain not accepted on the remote server (misconfigured).
-Other.
Examples of bounce causes on our side:
-Our server IP is on a blacklist.
-Content is being rejected as spam or malware.
This is difficult to monitor because each server describes the error in its own words. Our approach is to count the bounce entries in the log after eliminating as many remote-side causes as possible.
For the Postfix log this can be done with this script:



grep -i " 550 " /var/log/mail.log |\
grep -v -i "mailbox unavailable" | grep -v -i "Invalid recipient" |\
grep -v -i "does not exist" | grep -v -i "invalid address" | grep -v -i quota | grep -v -i Unknown |\
grep -v -i "Address rejected" |\
grep -v -i "invalid user" |\
grep -v -i "Mailbox disabled" |\
grep -v -i "relay not permitted" |\
grep -v -i "Account disabled" |\
grep -v -i "Invalid local address" |\
grep -v "no mailbox" | grep -v "recipient rejected" > /tmp/bounced.txt

count=`wc -l < /tmp/bounced.txt`
if test "$count" -ne 0 ; then
        mail -s "BOUNCED_MAILS_$count" admin@company.com < /tmp/bounced.txt
fi
if test "$count" -gt 20 ; then
        /usr/local/bin/msg2mobile TOO_MANY_BOUNCES
fi

As explained, this approach is not perfect because local problems cannot always be told apart from remote ones. It is very important to adjust the thresholds to the server load.
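The chain of exclusions can also be written as a single list of patterns, which is easier to extend when a new remote-side message shows up. A minimal sketch, assuming the same log format as above. The function name local_bounces is hypothetical and the pattern list is shortened for illustration:

```shell
# Print the 550 rejections from stdin that match no known remote-side cause.
# The patterns are fixed strings (-F), matched case-insensitively (-i).
local_bounces() {
    grep ' 550 ' | grep -v -i -F \
        -e 'mailbox unavailable' \
        -e 'does not exist' \
        -e 'quota' \
        -e 'invalid user' \
        -e 'relay not permitted'
}
```

It could be used as `local_bounces < /var/log/mail.log > /tmp/bounced.txt`, keeping the rest of the script unchanged.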

How to monitor changes on the Amazon AWS firewall rules

February 1, 2015 by damia

If you have your cloud infrastructure on Amazon AWS, you may want to monitor, for audit and control purposes, when the firewall rules of any of your security groups change.

With this script you will get notified when any of the security groups are modified.

# AWS credentials
export EC2_KEYPAIR=GIJDUYE75JRHFJEJEBHFHEJE88E8ZGGG # name only, not the file name
export EC2_URL=https://ec2.eu-west-1.amazonaws.com
export EC2_PRIVATE_KEY=$HOME/.certs/pk-GIJDUYE75JRHFJEJEBHFHEJE88E8ZGGG.pem
export EC2_CERT=$HOME/.certs/cert-GIJDUYE75JRHFJEJEBHFHEJE88E8ZGGG.pem
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

# aws commands
ec2-describe-group > /home/ubuntu/.audi/group.txt
diff /home/ubuntu/.audi/group.txt /home/ubuntu/.audi/group-old.txt > /home/ubuntu/.audi/diff.txt
if test `cat /home/ubuntu/.audi/diff.txt|wc -l` != 0 ; then
        cat /home/ubuntu/.audi/diff.txt|mail -s FW_AMAZON alert@bpmalert.com
        /usr/local/bin/msg2mobile FW_AMAZON `cat /home/ubuntu/.audi/diff.txt |grep PERMISSION|awk '{print($4)}'|head -1`
fi
mv /home/ubuntu/.audi/group.txt /home/ubuntu/.audi/group-old.txt

All the magic is done with the ec2-describe-group command, storing the current state and watching for differences.
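The store-and-compare pattern does not depend on the command that produces the listing. A minimal sketch of it as a reusable function, assuming the current state arrives on standard input. The function name snapshot_changed is hypothetical:

```shell
# Compare stdin with the snapshot file given as $1, then store stdin as the
# new snapshot. Prints the diff and returns 1 when something changed.
snapshot_changed() {
    snapshot=$1
    current=$(mktemp) || return 2
    cat > "$current"
    # A missing snapshot is treated as empty, so the first run reports everything.
    [ -f "$snapshot" ] || : > "$snapshot"
    status=0
    diff "$snapshot" "$current" || status=1
    mv "$current" "$snapshot"
    return "$status"
}
```

For example, `ec2-describe-group | snapshot_changed /home/ubuntu/.audi/group-old.txt` would print the differences and return a non-zero status whenever a rule changed since the previous run.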

How to see how many entries per minute we have on a log

January 1, 2015 by damia

In this post we explain a tip for counting log entries per minute, so that the load generated by specific requests can be compared over time.

We have a web server log, for example:
179.158.24.43 - - [02/Nov/2013:17:46:51 +0100] "GET /blog/wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 214 "http://blog.dom.net/blog/" "Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12"
179.158.24.43 - - [02/Nov/2013:17:46:51 +0100] "GET /blog/wp-content/themes/the-bootstrap/js/bootstrap.min.js?ver=2.0.3 HTTP/1.1" 304 212 "http://blog.damia.net/damianetblog/" "Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12"
179.158.24.43 - - [02/Nov/2013:17:46:51 +0100] "GET /blog/wp-content/themes/the-bootstrap/js/the-bootstrap.min.js?ver=2.0.1 HTTP/1.1" 304 212 "http://blog.dom.net/damianetblog/" "Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12"
179.158.24.43 - - [02/Nov/2013:17:46:58 +0100] "GET /blog/index.php/2011/1/manire/ HTTP/1.1" 200 4916 "http://blog.dom.net/blog/" "Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12"

Let’s imagine we want to count how many entries have been logged in order to compare the load of the server. We then need to compare over the same interval of time; the solution we found is to look at the previous minute.

Let’s do it with this script:
tail -1000 /var/log/apache2/access.log | grep -c "`date --date='1 minute ago' +%d/%b/%Y:%H:%M`"

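This script tells us how many entries the log received during the previous minute. Because it only reads the last 1000 lines, the result is capped at 1000; raise that number on a busy server. We can add another grep to look for a specific request and see how many of them per minute we have.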
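To compare several minutes at once, the same idea can be applied to every entry in the log. A minimal sketch, assuming the log format shown above, where the fourth field holds the timestamp. The function name entries_per_minute is hypothetical:

```shell
# Print "minute count" pairs for an Apache access log read on stdin.
# Assumption: field 4 is the timestamp, e.g. [02/Nov/2013:17:46:51
entries_per_minute() {
    # Characters 2 to 18 of the field are dd/Mon/yyyy:HH:MM.
    # Log entries are already in time order, so uniq needs no sort first.
    awk '{ print substr($4, 2, 17) }' | uniq -c | awk '{ print $2, $1 }'
}
```

For example, `entries_per_minute < /var/log/apache2/access.log | tail -5` would show the counts for the last five minutes that have entries.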

External web monitoring pitfalls

December 1, 2014 by damia

Today’s society requires services to be available 24×7, and IT management requires knowledge of the state of the services offered. This need has created a lot of tools, products and services. The main approach is external web monitoring.

External web monitoring tries to simulate the actions of the clients: if the clients visit web pages, the monitoring software acts as a client and sends a request to the web server. There are some pitfalls in that approach; let’s review them:

-Where is the monitor located? Is it representative of the location of the clients? Could it detect WAN network problems? Normally this is solved with monitors distributed across continents, so that they are representative of availability across the Internet.

-Which URL is being monitored? It is very difficult to find a URL that is always representative. If you simply test a static web page, the monitor will not notice when the database is not working, the disk is full, or the application server is failing. It is even difficult for the monitor to know whether the content of the page is right or not. If the monitor detects that the page is failing, that is true, but there can be many times when the site is failing and the monitor does not notice.

The first approach:

# wget http://www.bpmalert.com/index.html

A better approach:

# wget "http://www.bpmalert.com/register.php?user=test&data=test"
But the web server can answer “200 OK” while the page itself says “Database error”.

Even so, fetching a single URL does not emulate the complexity of today’s web. A page is not a single URL: it has many components, and things like cookies, session IDs and AJAX are not emulated.
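One way to catch a “200 OK” answer that carries an error page is to inspect the body that comes back. A minimal sketch, assuming the application prints the text “Database error” when it fails and that a healthy page contains some known text. The function name body_is_healthy is hypothetical:

```shell
# Read a response body on stdin; succeed only if it contains the expected
# text ($1) and does not contain the error marker.
# Assumption: "Database error" is what the application prints on failure.
body_is_healthy() {
    body=$(cat)
    printf '%s\n' "$body" | grep -q -F -- "$1" || return 1
    if printf '%s\n' "$body" | grep -q -i -F 'Database error'; then
        return 1
    fi
    return 0
}
```

It could be fed with `wget -q -O - "http://www.bpmalert.com/register.php?user=test&data=test" | body_is_healthy "Registered"`, where “Registered” stands for whatever text the real page shows on success. This still checks only one URL, so the other pitfalls remain.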

-Simulating user flow: As explained, if we limit the monitoring to a single URL, we do not see the full flow of a web process. For example, a real customer will register, receive a mail, click a link and then make a purchase. There are very complex tools that use real browsers with a recorded user session and so on, but it is impossible to simulate a real user completely; only a real person can confirm that the site is working properly.

There are some tools that emulate user behaviour, but they are used more for load tests than for availability monitoring. Some of these tools are:

In conclusion, external monitoring of web servers is necessary but imperfect, and it must be complemented with other business monitors in order to know that everything is working properly.

Observing apache logs for Troubleshooting

November 1, 2014 by damia

Normally, when you are troubleshooting an Apache server, you are checking parameters that have no monitors for them.

With the “observation” approach, the best place to look is:

# tail -f /var/log/apache/access.log

Things to check:

200 OK requests are being served: This seems obvious, but here you can see whether there is normal traffic, detect which pages are accessed the most, etc.

50x server responses are not happening: Maybe some pages are returning server errors and you don’t know it, but someone somewhere is getting errors from your site. Even if there are only a few 50x errors, they should not be ignored: the error may be happening during a critical process (registering, updating). I suggest zero tolerance for 50x errors.

404 server responses are not happening: If they are, people are trying to access a non-existent page. A sysadmin will say this is not their responsibility. Sure, it is a content problem, but it is also a problem for your customers, so face it!

Let’s do it in a basic way with:
# tail -f /var/log/apache/access.log|grep -E '" (404|500|503|504) '
But there is a more advanced tool that reads the logs and gives you insight into the response codes: Apachetop. It gives you an overview; after that, for more details, you have the real logs, with tail and grep as your friends.
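For a quick overview without extra tools, the response codes can also be counted directly from the log. A minimal sketch, assuming the common or combined log format, where the ninth field is the status code. The function name status_summary is hypothetical:

```shell
# Print "status count" pairs for an Apache access log read on stdin.
# Assumption: field 9 is the status code, which holds for the common and
# combined log formats when the request line contains no extra spaces.
status_summary() {
    awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' | sort
}
```

For example, `tail -1000 /var/log/apache/access.log | status_summary` would show how the last 1000 requests are spread across response codes.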

In the next post we will describe other tools and techniques to check, test and monitor Apache web servers, but we wanted to start with this “observation” approach.