Setting Up Central Logging with syslog-ng OSE and Simplifying Application and System Maintenance

Introduction

This post describes how and why I set up central logging for the distributed cloud application that I am developing. It details the software I used, how I configured it, and the reasoning behind what I did and where I did it.

I wanted to add central logging to my application and associated systems because,

  • Accessing individual logs on distinct systems and in distinct files on those systems had too much overhead
  • I wanted my logs, as far as possible, to survive a catastrophic failure of the application and systems (of course there is always a possibility that the logging system itself fails)
  • I wanted to avoid touching my production systems too often
  • As I add more staff for development and operations, I want to avoid complicating my firewall rules by having to open up access to the individual systems from each admin’s system
  • I wanted to log “everything” to see if I could debug problems simply by reviewing the logs rather than having to repeat problematic transactions and trace through my code. Repeating problematic transactions and tracing through code to debug problems is close to impossible in my production systems anyway, as this is the back-end for a consumer mobile app with a potentially huge user base. Further, I cannot rely on my users to report problems, or, when they do, to report the correct set of steps to reproduce them.
    • So far the log everything strategy has worked very well. I’ve solved numerous problems by simply reviewing the logs and I have not had to trace through my code even once.
    • With the availability of cheap high performance systems today and with the distributed nature of the workloads, I have not encountered significant slowdowns in end user response times so far.
  • I wanted my central logging system to automatically send me an email when an error occurred rather than for me to discover the problem some other way and then go hunting for the logs
  • Down the road I want to process the logs more intelligently, and codify my common triage workflows, so that the system can automatically determine root causes and then alert me to the issue rather than its symptoms
    • E.g., if there is a problem with my database server, then my MySql logs will have errors in them, and my Java application may encounter many SQL Exceptions that are logged as well. In that case, instead of being inundated by alerts for every Java SQL exception, I’d rather get a single alert with the error log entry from MySql, listing the Java SQL exceptions as likely symptoms.
  • Logs are also a cheap way for me to track other metrics such as which features are used the most. I can get this information by processing the logs instead of writing more code to specifically track feature usage and increment counters and metrics etc. Processing the logs also gives me more freedom to retroactively track metrics that I may need later, instead of having to update my application and getting those metrics only from that point on. Data like which features are used the most is great for focusing Marketing, Product Management, and Engineering resources.
  • A central logging system also allows me to correlate logs from multiple systems by, say, timestamp and username (assuming I am able to embed usernames into the log entries of complex transactions that span multiple systems). This way I can see a full end-to-end trace of what happened on behalf of a user, tell where in the chain of function calls the problem occurred, and see how transaction arguments morphed as data moved across systems.
  • I suspect that comparing the logs output by my test systems and production systems can also quickly tell me whether certain code paths are being tested enough. Of course there is COTS software to report code coverage, but I could potentially avoid running yet another software application if I can just use my logs.

Synchronizing time for Log Analysis

  • In order to debug problems in multi-access / multi-threaded systems like web and mobile applications, it’s very important that your systems have a consistent view of time. When you are debugging problems across multiple systems, e.g., even simply between a web and a database server, you need to be able to string log entries together to identify the code execution path from transaction start to failure.
  • Synchronizing time is very easy with NTP, so there is no reason to skip this step

Setup Steps

Machines and OSes used

My systems are running in AWS EC2 and are currently t1.micro or t2.micro instances running Ubuntu 14.04 LTS

Synchronizing time across my systems using NTP

  • I’m currently of the opinion that, for my purposes, I can simply set the timezone of all my instances to US Pacific Time irrespective of where the instances really are. This way I don’t have to do any timezone conversion on my log entry timestamps when I am investigating an issue, and all externally reported times just have to be converted to a single timezone for debugging purposes.
  • Setting the common timezone of all my instances
$ sudo dpkg-reconfigure tzdata
  • Setting up NTP
    • Installation
$ sudo apt-get install ntp
    • Configuration (optional)
      • The default NTP configuration file already has NTP pool servers configured, and it’s not necessary to change them
      • Just in case, you can edit the file /etc/ntp.conf and add the lines
server 0.us.pool.ntp.org
server 1.us.pool.ntp.org
server 2.us.pool.ntp.org
server 3.us.pool.ntp.org
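    • Verification
      • Once the ntp daemon has been running for a few minutes, you can confirm that it is synchronizing with its pool servers by running
$ ntpq -p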
    • Reference
      • NTP configuration and the optimal pool servers to use are documented here

Creating a common log format across my application and systems

I wanted a common log format across all my application code and systems to make debugging easier. This way,

  • I can count on all log entries having all the data fields / information that I deem necessary for debugging and tracking
  • The code that I write to process my logs is simpler
  • Understanding / consuming the log entries is easier as they all follow a standard format

Using Log4J2 for application logging

  • Early on I standardized on using Java for all my application code because I’m good at Java and it has such widespread support that I can code every piece of my application back-end in it, be it web services and REST APIs, or components that are straight up Java server applications. There is also this amazing application logging package named Log4J2! I’ve been using Log4J2 for many years and absolutely love it.
  • Adding Log4J2 to your Java application or web project and configuring it is very easy and described in detail here
  • I’ll just say that I used a RollingFileAppender, which writes logs to a well-named file in a standard sub-directory under /var/log/. Having just one type of OS across my systems helps standardization, of course.
  • Also, I don’t need to keep old log files around on the systems since all log entries are being sent to my central logging system. I can just delete the files once they get too big and start new ones.
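  • I won’t reproduce my exact configuration, but a minimal log4j2.xml along the following lines (the file names, rollover policy, and level are illustrative assumptions, not my actual values) produces space-delimited entries with a quoted message, which is the shape the syslog-ng parser later in this post expects
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <!-- Hypothetical file names; the appender rolls the file over by size -->
    <RollingFile name="AppLog"
                 fileName="/var/log/tomcat7/myapp.log"
                 filePattern="/var/log/tomcat7/myapp-%i.log">
      <!-- The dd MMM yyyy date prefix also matches the multi-line-prefix
           regular expression used in the syslog-ng client configuration below -->
      <PatternLayout pattern='%d{dd MMM yyyy HH:mm:ss,SSS} %p %c %l %t "%m"%n'/>
      <Policies>
        <SizeBasedTriggeringPolicy size="100 MB"/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="debug">
      <AppenderRef ref="AppLog"/>
    </Root>
  </Loggers>
</Configuration>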

Updating the default java.util.logging format used by Tomcat

  • I primarily have five kinds of components – Servlets that run in Tomcat instances, MySql, Redis, syslog-ng, and straight up Java server applications that are run periodically by cron jobs.
  • I’m trying to minimize the number of diverse types of components so that standardization and maintenance are easier. The net effort required to develop, deploy, and maintain a component has to be minimal, so something that is harder to develop in-house but easier to maintain and standardize can win out over COTS software.
  • So far I’ve standardized the logging format of my Java Servlets, Java server applications, and Tomcat itself.
    • My Java Servlets and Java server applications use Log4J2, and that’s covered above
    • Tomcat itself uses the java.util.logging package. Updating the format of the Tomcat logs was easy.
    • Tomcat has three types of log files by default.
      • catalina.*, localhost.*, and localhost_access_log.*
    • The logs that I am interested in are in the localhost.* log files
    • These capture uncaught exceptions generated within my deployed applications
    • I had to do two things
      • Update the format
      • Create and update a well named hard link to the localhost.* log file
        • The format of the localhost.* log filenames is localhost.<date>.log and Tomcat creates a new file daily
        • I didn’t want to mess with this, but syslog-ng OSE requires a constant filename to track and gather logs from
        • Hence, I created a well-named hard link to the latest version of this file and update that hard link every hour via a cron job
          • It had to be a hard link because the modified date on a soft link does not change when the underlying file changes, so syslog-ng would not know that new logs had been written to the log file
        • The cron job that runs every hour executes the command
          • sudo ln -f $(ls -t localhost.* | head -1) tomcat_localhost.log
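          • For reference, the full root crontab entry might look like the following (the /var/log/tomcat7 path is an assumption; the command must run in that directory because it uses relative filenames)
0 * * * * cd /var/log/tomcat7 && ln -f $(ls -t localhost.* | head -1) tomcat_localhost.log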

Deciding the log entries of interest that I want to be proactively notified about

  • Early on in my application design and development, I decided that I would try to log everything, as explained in the introduction to this post. Hence, I decided that I would have only three severities of logs – DEBUG, WARN, and ERROR. Actually, I started out with two – DEBUG and ERROR, with ERRORs obviously being the ones that I want to be notified about. However, I soon realized that there are certain business-related events that I also need notifications about, e.g., a user creating a new account or canceling their account. These are not errors, but my marketing and sales departments would be interested in this information, hence I added the WARN severity to cover such things.
  • My log entries are very descriptive in that they include every dynamic part of a particular action. Typically these are the arguments for the action like a username, datetime, entity ID, etc. Of course you have to be careful not to accidentally log things like passwords and authentication tokens in plain text, or at all actually.
  • With this being my standard format for a log entry, the only thing I need to change is the severity in order to decide whether it’s something to be notified about or not. And for notifications I have – DEBUG (don’t notify), WARN (notify the business contact), ERROR (notify the engineering contact).
  • These rules have worked for me so far but maybe in the future they will need more complexity. The main thing is to be rigorous so that I don’t compromise consistency and ease of maintenance.
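  • Purely as an illustration of this convention (the class, methods, and messages below are hypothetical, not my actual code), application code ends up looking like this
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical service class illustrating the three-severity convention
public class AccountService {
    private static final Logger logger = LogManager.getLogger(AccountService.class);

    public void createAccount(String username, String plan) {
        // WARN: a business event; the business contact gets notified
        logger.warn("New account created, username: {}, plan: {}", username, plan);
    }

    public void deleteAccount(String username) {
        // DEBUG: log everything, including every dynamic argument --
        // but never passwords or authentication tokens
        logger.debug("Deleting account, username: {}", username);
        try {
            // ... actual deletion logic elided ...
        } catch (RuntimeException e) {
            // ERROR: the engineering contact gets notified
            logger.error("Failed to delete account, username: {}", username, e);
            throw e;
        }
    }
}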

Setting up a storage system for my logs

  • Given that I want to log everything, I was obviously going to have a tremendous number of log entries. Hence, I needed a robust storage system.
  • I decided to go with MySql because it’s robust, scalable, and easy for me to manage. I looked at using NoSql systems, but I found that their management and maintenance requires more time than I had. Maybe down the road I will use them.
  • Another thing is that I don’t expect to store all my logs in MySql forever. I figure that I only need a few days’ worth of logs in MySql, in case I need to debug a problem. The log processing that I have done so far works off recent logs, and these are easy to store in and retrieve from MySql. To save older logs, I’m planning to implement a system whereby my logs are saved simultaneously in MySql and in a longer-term data storage system like the cloud storage sold by AWS, Microsoft Azure, or Google. This way, I will not have to move logs from MySql to secondary storage; I’ll just delete older logs from MySql while still having copies in cloud storage.
  • Setting up MySql on Ubuntu is very simple and I’m not going to cover it here
  • Also, syslog-ng OSE just requires my database name, username, and password, and then it creates / alters / drops all the tables based on my syslog-ng configuration. This cuts my overhead tremendously.
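  • For example, using the table and columns that syslog-ng creates later in this post, pruning can then be as simple as a scheduled SQL statement along these lines (the seven-day retention window is an assumption, not my actual setting)
DELETE FROM logs.logs WHERE servers_receive_date_time < NOW() - INTERVAL 7 DAY;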

Setting up a system to alert (email) me when log entries of interest are generated

  • The first and simplest form of log processing that I wanted to do, as described above, was to get emails when an ERROR or WARN severity log entry is created.
  • Since my logs are in MySql, it was very easy to write a small Java program that retrieves new ERROR or WARN log entries (created since the last run of the program) and then emails them to me.
  • I run this Java program via a cron job every minute, and of course this program also uses Log4J2, so its own logs end up in syslog-ng OSE too. Now, if there is a problem with this program, I won’t get any emails, but at least the logs will be in MySql for me to review.
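  • A minimal sketch of such a program (the JDBC URL, credentials, watermark file location, and email delivery below are placeholders standing in for my actual implementation)
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Sketch of the alerting job: fetch ERROR/WARN rows newer than the last run
// and email them; the watermark (time of the last row seen) lives in a file
public class LogAlerter {
    private static final Path WATERMARK = Paths.get("/var/lib/log-alerter/last_run");

    public static void main(String[] args) throws Exception {
        Timestamp lastRun = readWatermark();
        Timestamp newWatermark = lastRun;
        StringBuilder body = new StringBuilder();

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://<mysql-host>/logs", "<username>", "<password>");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT servers_receive_date_time, severity, class_name, message " +
                 "FROM logs WHERE servers_receive_date_time > ? " +
                 "AND severity IN ('ERROR', 'WARN') ORDER BY servers_receive_date_time")) {
            ps.setTimestamp(1, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    newWatermark = rs.getTimestamp("servers_receive_date_time");
                    body.append(rs.getString("severity")).append(' ')
                        .append(rs.getString("class_name")).append(' ')
                        .append(rs.getString("message")).append('\n');
                }
            }
        }
        if (body.length() > 0) {
            sendEmail(body.toString());
        }
        writeWatermark(newWatermark);
    }

    private static Timestamp readWatermark() throws IOException {
        if (!Files.exists(WATERMARK)) {
            return new Timestamp(0L); // first run: start from the epoch
        }
        String millis = new String(Files.readAllBytes(WATERMARK), StandardCharsets.UTF_8).trim();
        return new Timestamp(Long.parseLong(millis));
    }

    private static void writeWatermark(Timestamp t) throws IOException {
        Files.write(WATERMARK, Long.toString(t.getTime()).getBytes(StandardCharsets.UTF_8));
    }

    private static void sendEmail(String body) {
        // Placeholder: send via JavaMail or hand off to a local MTA
        System.out.println("ALERT:\n" + body);
    }
}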

Setting up the core logging system – syslog-ng OSE

  • I looked at many different options for a central logging system (both paid and free) – there are a ridiculously large number of products for logging out there. I was of course partial to free products.
  • I ended up testing two products
    • The so-called ELK stack consisting of ElasticSearch, LogStash, and Kibana, and
    • On the recommendation of my friend and fellow CTO, Sebastien Soudan, the syslog-ng Open Source Edition (OSE)
  • The ELK stack proved too time-consuming for me at the time. Maybe now the documentation and error checking have improved. It has a lot of great features, and both ElasticSearch and LogStash are very fast and scalable. Ultimately, though, the configuration and debugging proved quite challenging, and having three distinct moving parts didn’t help. I also ran into a bug that was fixed in the next GA release, and there was no timeline for that release.
  • So I then tested syslog-ng OSE. My experience was fantastic to say the least!
  • syslog-ng is clearly a tried and tested, robust, and scalable logging system. It has many features to process logs from varied sources, write them to varied destinations (like my MySql database), and route and process log entries in complex ways. It also has a feature that is simple but very important to me – it handles multiline log entries gracefully.
  • There are two components to set up for syslog-ng OSE – server and client
  • The server aggregates logs over the network, processes them, and sends them on for storage to my MySql database
  • I also run the alerting program (described above) on this system to email me when ERROR and WARN log entries are generated
  • The clients run on every other system; they collect log entries from one or more log files on the client systems and send them over the network to the syslog-ng server using the standard and robust syslog protocol
  • Since I am using Log4J2, I could have used a syslog appender to have Log4J2 send the logs directly to my syslog-ng server. However, the Log4J2 logging call is a blocking call, which means that if the network or my server is down or slow, my application will also block. Hence, I prefer for my application to log to a file and for syslog-ng to pick the entries up from the file.
  • The neat thing (and a thing that cuts my maintenance headache) is that the client and server are both the same program just configured as either client or server
  • I found that the overhead of syslog-ng is negligible in terms of CPU, memory, and disk space. The latency from a log being written to the original log file, transported over the network to the syslog-ng server, and from there over the network once again to the database, is also negligible – less than a second end-to-end.
  • Another really neat feature is that I can configure the client and server to communicate securely. syslog-ng OSE supports TLS, hence my clients and server authenticate each other, and all my log entries are encrypted as they are transported over the network.
  • So here is my setup

Documentation

The Ubuntu packages for syslog-ng are available from the packages.madhouse-project.org repositories used in the installation steps below.

Installing syslog-ng on a system

  • Add the GPG-key to APT
$ curl http://packages.madhouse-project.org/debian/archive-key.txt | sudo apt-key add -
  • Add the repositories to /etc/apt/sources.list
deb http://packages.madhouse-project.org/ubuntu trusty syslog-ng-3.5 syslog-ng-incubator-3.5
deb-src http://packages.madhouse-project.org/ubuntu trusty syslog-ng-3.5 syslog-ng-incubator-3.5
  • Update the APT cache
$ sudo apt-get update
  • Install the syslog-ng package
$ sudo apt-get install syslog-ng-core
$ sudo apt-get install syslog-ng

Install libdbd-mysql on the syslog-ng server machine

  • syslog-ng requires this driver to communicate with the MySql database
$ sudo apt-get install libdbd-mysql

MySql Configuration

  • Login to MySql
$ mysql -u <username> -p
  • Enter your database password at the prompt
  • Create a database named logs
CREATE DATABASE logs;
  • Grant your syslog-ng user access to the logs database
GRANT ALL ON logs.* TO <username>@'<IP Address of syslog-ng Server>' IDENTIFIED BY '<database password>';
  • Open MySql port 3306 from <IP Address of syslog-ng Server> in your firewall guarding the MySql server
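  • With the ufw firewall on Ubuntu, for example, such a rule might look like
$ sudo ufw allow from <IP Address of syslog-ng Server> to any port 3306 proto tcp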

syslog-ng Server configuration

  • The syslog-ng configuration is saved in file /etc/syslog-ng/syslog-ng.conf
  • Keep the default configuration and add the following
  • Source
source s_network_ietf_syslog {
  syslog(ip(0.0.0.0) port(5000) transport("tcp"));
};
  • This sets up syslog-ng to listen for syslog messages on TCP port 5000. Your syslog-ng clients will communicate with the server on this port.
  • Parser
parser log4j_logs_parser{
  csv-parser(
    columns(LOG4J.DAY_OF_MONTH, LOG4J.MONTH_NAME, LOG4J.YEAR, LOG4J.TIME, LOG4J.SEVERITY, LOG4J.CLASS_NAME, LOG4J.LOCATION_IN_SOURCE_CODE, LOG4J.THREAD, LOG4J.LOG_MESSAGE)
    delimiters(" ")
    flags(escape-backslash, greedy)
    quote-pairs('""')
  );
};
  • We use syslog-ng's CSV parser to parse the log entries into fields for storage in the database
  • This makes it easier to process the log entries down the line
  • Earlier in the Log4J2 configuration I used spaces judiciously between fields to make sure this parser works correctly.
    • The line columns(…) names the fields in my log messages. Names like LOG4J.DAY_OF_MONTH and LOG4J.CLASS_NAME are made up by me so that I can refer to each of these parsed fields later in my configuration. The syslog-ng CSV parser will parse each log entry based on the specified delimiter and then match the parts of the parsed log entry to the column names that I have specified. If I specify fewer column names than the parser finds fields, I will lose the excess data from the log entry.
  • I also surrounded certain fields like the message within double-quotes ("") to ensure that the CSV parser treats data within double-quotes as a single field even if it has spaces (my delimiter). This is configured by the quote-pairs('""') setting.
  • The escape-backslash flag tells the parser that a backslash is used to escape characters, such as quotes, within log entry fields
  • And finally, the greedy flag is a catch-all. If by some misconfiguration or chance a log entry has more fields than my columns, then everything after the last field of the parsed log entry is added to the last column of the parser. This way I won’t lose any log entry data.
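  • For example, a hypothetical log entry such as
12 Mar 2015 10:15:30,123 ERROR com.example.OrderService com.example.OrderService.load(OrderService.java:42) main "Failed to load order, orderId: 7"
  • would be split on spaces into LOG4J.DAY_OF_MONTH=12, LOG4J.MONTH_NAME=Mar, LOG4J.YEAR=2015, LOG4J.TIME=10:15:30,123, LOG4J.SEVERITY=ERROR, and so on, with the quoted text preserved as the single LOG4J.LOG_MESSAGE field despite containing spaces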
  • Destination
destination d_logs_mysql_server {
  sql(
    type(mysql)
    host("<IP address or DNS name of the MySql server>")
    username("<database username>")
    password("<database password>")
    database("logs")
    table("logs")
    columns("log_date_time datetime", "sender", "senders_send_date_time datetime", "servers_receive_date_time datetime", "severity", "class_name", "location_in_source_code", "thread", "message")
    values("${LOG4J.ISO8601_DATETIME}", "${HOST}", "${S_UNIXTIME}", "${R_UNIXTIME}", "${LOG4J.SEVERITY}", "${LOG4J.CLASS_NAME}", "${LOG4J.LOCATION_IN_SOURCE_CODE}", "${LOG4J.THREAD}", "${LOG4J.LOG_MESSAGE}")
  );
};
  • The destination for the log entries is my MySql database
  • You’ll see that the database refers to the database I created above under the MySql configuration
  • The table instructs syslog-ng to create a table, also named logs, if it does not exist in the database logs
  • The columns instructs syslog-ng to create columns having the specified names. syslog-ng does a good job of inferring data types for the columns.
  • The values refers to fields from my parser above or from default fields that every syslog message has. Remember that in my configuration syslog-ng is transmitting log entries from the clients to the server as syslog messages. Thus ${HOST} refers to the sender’s hostname contained in every syslog message. Similarly, ${S_UNIXTIME} refers to the datetime that the sender sent the syslog message.
  • The neat thing is that if I change this configuration, then syslog-ng will automatically alter the database table as necessary.
  • Log Statement
log {
  source(s_network_ietf_syslog);
  parser(log4j_logs_parser);
  destination(d_logs_mysql_server);
};
  • The log statement finally ties everything together
    • It instructs syslog-ng to receive log messages from the clients, parse the log messages, and then send them on to the MySql destination

syslog-ng Client configuration

  • Once again, the syslog-ng configuration is saved in file /etc/syslog-ng/syslog-ng.conf
  • Keep the default configuration and add the following
  • Source
source s_tmhapp_log_file{
  file (
    "/var/log/tomcat7/<my_log_file_name>"
    flags(no-parse)
    multi-line-mode(regexp)
    multi-line-prefix("[0-9]{2} [a-zA-Z]{3} [0-9]{4}")
  );
};
  • All my clients write logs to files, and these files are the source of log entries on my clients
  • The no-parse flag instructs syslog-ng that the log entries inside the files are not in standard syslog format and thus should not be parsed as such. This is because the log entries are following my custom format that I previously set in Log4J2.
  • syslog-ng will thus treat the entire log entry as the payload (or message) of a syslog packet. It sends one syslog packet from the client to the server for each log entry.
  • The multi-line-prefix instructs syslog-ng how to handle multi-line log entries. syslog-ng applies the specified regular expression to incoming text from the log files to identify the start (this prefix) of a new log entry, and then treats all text (including newlines) until the next instance of the prefix as part of a single log entry. Without this, syslog-ng treats a newline character as the end of a log entry.
  • The multi-line-mode(regexp) instructs syslog-ng that the specified prefix is a regular expression and should be processed as such.
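  • For example, in a hypothetical entry where a Java stack trace follows the message
12 Mar 2015 10:15:31,001 ERROR com.example.OrderService com.example.OrderService.load(OrderService.java:42) main "Request failed"
java.sql.SQLException: Connection refused
    at com.example.OrderService.load(OrderService.java:42)
  • the stack trace lines are folded into the same log entry, because they do not begin with the two-digit day, month name, four-digit year prefix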
  • Destination
destination d_network_ietf_syslog{
  syslog(
    "<IP address or DNS name of the server>"
    transport("tcp")
    port(5000)
    localport(5000)
  );
};
  • The destination specifies where the server is and on which port
  • I also chose to fix the outgoing port for communications to lock down my firewall rules further
  • Log Statement
log {
  source(s_tmhapp_log_file);
  destination(d_network_ietf_syslog);
};
  • Finally, as in the server, the log statement ties everything together
  • Note that the log entries are parsed on the server

Debugging syslog-ng

  • Start syslog-ng in the foreground with debug and verbose output (-Fdv) using the following command; the strace wrapper is optional and additionally records system call traces to /tmp/trace.txt
$ sudo strace -o /tmp/trace.txt -s256 -ff -ttT syslog-ng -f /etc/syslog-ng/syslog-ng.conf -Fdv
  • syslog-ng sends all its debug output to the console and I have found this more than enough to troubleshoot my setup

Starting and Stopping syslog-ng for Production

  • The standard commands apply
$ sudo service syslog-ng {start|stop|restart}

Optional – Mutual TLS Encryption between syslog-ng clients and the server

  • We will use the same private key and SSL certificate for both the clients and the server
    • This is because syslog-ng must be able to read the key and certificate unattended, so they are stored on the machines without passphrases
    • Therefore, if a malicious user breaks into these machines, they’ll get the key material anyway
    • So the main protections are to restrict access to the machines and to have the data transmission encrypted
  • Create the private key without a passphrase
$ openssl genrsa -out <private key name>.pem 2048
  • Create the SSL certificate
$ openssl req -new -x509 -key <private key name>.pem -out <ssl certificate name>.pem -days 7300
  • Fill out the required certificate data as prompted
  • Copy the SSL certificate to each client’s /etc/syslog-ng/ca.d/ directory
  • Copy the SSL certificate to the server’s /etc/syslog-ng/ca.d/ directory
  • In the /etc/syslog-ng/ca.d/ directory of each client and the server
    • Issue the following command on the SSL certificate
$ openssl x509 -noout -hash -in <ssl certificate name>.pem
  • The result is a hash (e.g., 6d2962a8) – a series of alphanumeric characters derived from the Distinguished Name of the SSL certificate.
  • Issue the following command to create a symbolic link to the SSL certificate, named with the hash returned by the previous command plus the .0 suffix (substitute your own hash for 6d2962a8).
$ sudo ln -s <ssl certificate name>.pem 6d2962a8.0
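  • Alternatively, if the c_rehash utility that ships with OpenSSL is available, it creates these hash-named links for every certificate in a directory
$ sudo c_rehash /etc/syslog-ng/ca.d/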
  • Copy the SSL certificate to each client’s /etc/syslog-ng/cert.d/ directory
  • Copy the SSL certificate to the server’s /etc/syslog-ng/cert.d/ directory
  • Copy the private key to each client’s /etc/syslog-ng/key.d/ directory
  • Copy the private key to the server’s /etc/syslog-ng/key.d/ directory
  • On the syslog-ng clients, configure TLS in the syslog destination to the Server
    • Change transport("tcp") to transport("tls") in the syslog destination
    • And, add
tls(
  ca-dir("/etc/syslog-ng/ca.d")
  key-file("/etc/syslog-ng/key.d/<private key name>.pem")
  cert-file("/etc/syslog-ng/cert.d/<ssl certificate name>.pem")
)
  • to the syslog destination
  • On the syslog-ng server, configure TLS in the syslog source of the Server
    • Change transport("tcp") to transport("tls") in the syslog source
    • Add
tls(
  ca-dir("/etc/syslog-ng/ca.d")
  key-file("/etc/syslog-ng/key.d/<private key name>.pem")
  cert-file("/etc/syslog-ng/cert.d/<ssl certificate name>.pem")
)
  • to the syslog source
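  • Putting it together, the client destination from earlier then looks like
destination d_network_ietf_syslog{
  syslog(
    "<IP address or DNS name of the server>"
    transport("tls")
    port(5000)
    localport(5000)
    tls(
      ca-dir("/etc/syslog-ng/ca.d")
      key-file("/etc/syslog-ng/key.d/<private key name>.pem")
      cert-file("/etc/syslog-ng/cert.d/<ssl certificate name>.pem")
    )
  );
};
  • The server’s syslog source is modified analogously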
