This is an explanation of the strategies we use for logging at Single Music that we have developed over the past couple years. We use careful configuration of logging levels, aggressive alerting, and a short retention period to be smarter about what we're logging and so we can identify and diagnose issues quickly.
Most structured logging frameworks allow users to log messages at different "levels." The level is printed at the head of the message and serves to divide the messages into high-level categories. Since our backend is comprised primarily of Java-based
spring-boot apps, we use the SLF4J API with the Logback implementation. The SL4J API provides several log levels, but at Single we only use four: Debug, Info, Warn, and Error—in order of increasing severity, respectively. Here's an explanation of how we divide our log messgaes into the four categories.
This is our most verbose log level. It is togglable at runtime and disabled by default. The purpose of our
DEBUG-level statements is to be able to see the main code execution path through certain important processes. We don't have log statements in every branch in the code, but major "decisions" made by the business logic are logged. As the
DEBUG name implies, this serves somewhat as alternative to actual debugging, and allows us to essentially trace what's going on in the code if things are going wrong.
This is the main "something happened" log level. It's on by default and is essentially just noise. To be honest, sometimes I like to watch the "Live Tail" of all our logs (we aggregate our logs using Logz.io) and the
INFO level logs serve to assure us that things are up and running normally. A lot of what we log here is little notes when asynchronous messages are sent/received by the services.
WARN level, we make note of exceptional cases that aren’t serious enough to require immediate action. This includes odd states that we don’t expect to be in, users trying to do invalid things, and technical debt issues that need to be addressed at some point but aren't pressing. We track the number of
WARN-level messages in a dashboard and investigate if the number is getting high.
ERROR level, our highest severity, we log exceptional cases that require immediate or close to immediate investigation. These are scenarios that result in user-facing errors, failed batch or background transactions that need to be retried/replayed/fixed, and scenarios where data is corrupted or lost. Although we try to log as much context as possible in all our log statements, these
ERROR logs in particular contain the most context about the state of the system at the time of the error and of the error itself. If we're logging a thrown Exception we typically log the entire stack trace as well. Since
ERROR logs are the most important and often require some kind of administator action, we use "error codes" to further categorize them into more specific areas. For exampe, if something fails during our SoundScan reporting process in the morning, the log message will contain the code
[SOUNDSCAN REPORTING FAILURE] in it. This allows us to create more specific alerts for particularly business-crticial scenarios.
Consistency is Key
Our division of logs into four categorizes in just one example. It's less important exactly how your logs are divided into categorizes, but most important that once you decide on a categorization that you are consistent in all logs across all applications in your system.
At Single, we have many alerts configured based upon our log messages. The most significant of which is that we alert on each and every
ERROR-level log statement that is generated by the applications. We do this for a couple of reasons. First, it provides motivation for bugs to get fixed in a timely manner. It makes bugs as annoying as possible! Nobody likes to have their phone blown up with Slack notifications. Second, it forces the issue of proper error handling and trapping of exceptional cases. It forces us to think hard about the relative severity of runtime failures.
As mentioned above, we have special alerts configured in addition to the
ERROR-level ones that are specific to certain critical business processes. The alerts are configured to look for certain "error code" strings in the log messages. We've elected to use strings for the codes instead of numeric codes, because it conveys the meaning directly in the message and removes the need for us to memorize codes or reference a table to see what they mean.
Short Retention Period
We are currently operating with a three-day retention period on all aggregated logs. We never rely on logs to answer the question of "what happened?" during any business-critical transactions. The state of decisions made in critical pathways is captured in a combination of emitted events and data persisted in durable stores (mysql in our case).
This might sound similar to the kind of strategy we're using for
DEBUG-level logs, and it is! There's kind of a fuzzy line between what should be logs and what should be data. I think it comes down to whether or not the information needs to be retrieved later, needs to be queried often, or needs to be counted or combined with other information to make business decisions. Similarly to the strategy of alerting on every application error log, Having a very short retention period also forces us to make decisions early when building functionality about what information in the logs should actually be in the database—because after three days, if it's not stored somewhere durable it's gone!
As an example, say we receive an order event webhook from Shopify and the code decides to skip processing because the order doesn't contain a line item that our system is tracking. What's the best way to save a historical record of that decision? We can write a log message like: "skipping order #1234 in X shop because no line items contain tagged products". Or, we can write a record to a "processed order" database table. Suppose that down the road we get a support request asking about that particular order number asking: "What happened? Why didn't the user get an email?" If we choose the logging option we can only answer that question definitively if the log statements are still retained. If we choose the database option we can always answer that question. As an additional option, we could publish an event to a topic like
order.skipped to signify that the order has been skipped. This would allow us to bolt on functionality to the system based on that decision.
Logs are Important
All of this is is to say mostly that logs are an important part of a running software system and a key component of the observability of that system. I don't know that our strategy is the best, but I think it's important to have a strategy and think critically about what you're logging. Well-structured and organized logs enable a solid alerting strategy that'll make sure you're not having any silent failures. Keeping a short retention period ensures that you aren't relying on logs to answer questions that should be answered by persistent data.