Peter Zaitsev posted Percona’s answer to the question, “why use the slow query log instead of the PERFORMANCE_SCHEMA tables to gather information about query execution and timing?” At SolarWinds, we don’t currently use the PERFORMANCE_SCHEMA for this either. Instead, we capture and time queries by sniffing the TCP traffic to the server, and decoding the MySQL protocol. We extract rich information from the protocol, including the query text, timings, errors, warnings, query flags, and more.
In this post I’ll explain why we do this and how it works relative to other options.
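To make that concrete, here's a minimal sketch in Python of what decoding the client side of the conversation involves. It's an illustration of the protocol, not our agent's code, and it assumes the classic uncompressed, unencrypted protocol: every packet begins with a 3-byte little-endian payload length and a 1-byte sequence number, and a plain COM_QUERY command is a 0x03 byte followed by the SQL text.

```python
# A minimal sketch of decoding one client packet from captured bytes, assuming
# the classic protocol with no compression, TLS, or query attributes. This is
# an illustration of the idea, not our agent's implementation.
COM_QUERY = 0x03  # simple text-protocol query command

def decode_client_packet(data: bytes):
    """Return the SQL text if `data` holds a complete COM_QUERY packet, else None."""
    if len(data) < 5:
        return None
    length = int.from_bytes(data[:3], "little")  # 3-byte payload length
    # data[3] is the sequence number; the payload follows the 4-byte header
    payload = data[4:4 + length]
    if len(payload) == length and payload[0] == COM_QUERY:
        return payload[1:].decode("utf-8", errors="replace")
    return None

# The wire bytes a client sends for "select 1":
print(decode_client_packet(b"\x09\x00\x00\x00\x03select 1"))  # -> select 1
```

Timing a query is then a matter of remembering when the request was seen on a connection and measuring the delay until the server's first response packet on that same connection.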
Query Capture Methods
What options exist for measuring queries the server executes? Here's a short list:
- query logs
- PERFORMANCE_SCHEMA
- packet inspection
- proxies
- tools that poll commands such as SHOW PROCESSLIST
- a plugin for the audit log
- application-side instrumentation in drivers
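For contrast with packet inspection, here's a hedged sketch of two of these options, PERFORMANCE_SCHEMA's statement digest table and polling SHOW PROCESSLIST, using the pymysql driver; the host, account, and password are placeholders.

```python
# Illustrative only: sample two in-server sources of query information.
# Assumes the pymysql driver and a monitoring account with the needed privileges.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
with conn.cursor() as cur:
    # Polling: a snapshot of what's running right now (misses fast queries).
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        print(row)

    # PERFORMANCE_SCHEMA (5.6+ with statement digests enabled): cumulative
    # per-digest counters rather than individual query events; timer values
    # are in picoseconds.
    cur.execute(
        "SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT "
        "FROM performance_schema.events_statements_summary_by_digest "
        "ORDER BY SUM_TIMER_WAIT DESC LIMIT 10"
    )
    for digest_text, count_star, sum_timer_wait in cur.fetchall():
        print(count_star, sum_timer_wait, digest_text)
```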
Why We Use Packet Inspection
Why do we capture network traffic and decode the conversations between clients and servers? To begin with, I will echo Peter's points about compatibility. The reality is that although PERFORMANCE_SCHEMA is nice, it's not available in most server installations, and our product needs to work on older versions of the server. Until PERFORMANCE_SCHEMA is widely deployed, it's not good enough for us or our customers.
Compatibility is also part of the reason we don't use logs. As you can read in Peter's post, Percona's tools rely on a patched/forked/enhanced version of the server, depending on what terminology you'd like to use. The patches add valuable information to the query log, but there are also several crucial enhancements to the logging functionality that help mitigate the performance overhead of logging. Without these, using the slow query log becomes impractical and potentially even dangerous. Although Percona Server and MariaDB are great products, the vast majority of users prefer Oracle's official version of the MySQL server. That's why, for maximum compatibility across many server versions, we cannot rely on the slow query log.
On the other hand, if we were to use the query logs on Percona Server or MariaDB, we would face a different problem. The features that make query logging safer and more feasible do so in part by not logging a lot of queries. We want to observe as much server activity as possible, not just 1% or so of the queries.
As an aside (a topic for a different blog post), the slow query log and PERFORMANCE_SCHEMA provide information we're not interested in, and don't provide some of the data we want. We get the data we want in less intrusive, more accurate ways. I will speak in detail about this at Percona Live in April (and share source code so you can reproduce our work). The detail we can extract about queries is amazing, and the overhead is very low. In other words, it might seem like much of the useful information is only available internally to the server and can't be recovered from the protocol, but that turns out not to be true in the general case. Stay tuned for more on this.
Performance Concerns
For performance reasons, TCP packet inspection and decoding are much better than log parsing. I created and worked for years on Percona Toolkit, including the slow-query-log parser in pt-query-digest, the tool Percona uses to extract query metrics from logs, so I know a great deal about log parsing and its pros and cons. When I founded SolarWinds® Database Performance Monitor (DPM), I could have written new and more efficient tools to parse slow query logs, but I had already decided that was not the way to go:
- It adds high overhead on the server. We do not want to intrude upon the systems we observe, and by Peter's own admission, logging all queries to the slow query log can easily approach 40% overhead in synthetic benchmarks. I would add that this is not unrealistic in real applications. The overhead is both internal to the server (locking during log writes, for example) and external (increased I/O, CPU, and memory consumption).
- Log files aren't just expensive to write, they're expensive to read and parse. I spent a huge amount of time over the years trying to optimize pt-query-digest's log parsing routines, but never got it to levels of CPU and memory consumption that satisfied me. Ask anyone who does this a lot and they'll agree: you can log 5 minutes' worth of queries on a busy server, and pt-query-digest will take gigabytes of memory and run at 100% CPU utilization for a long time trying to parse it. That means that even if there's lots of spare memory and CPU for it to use, it will not be able to keep up with the log on a busy server! Part of this is due to the practically unparsable log format, which has not been improved by all the patching that's been added to it. Part of it is also inherent to the way pt-query-digest does things (a lot of regular expressions), which could be improved by using other methods. (A toy sketch of this kind of parsing follows this list.)
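Here's that toy sketch: a crude slow-log aggregator in the spirit of (but vastly simpler than) what pt-query-digest does. The attributes extracted and the fingerprinting are deliberate simplifications; real logs contain multi-line statements, administrator commands, and version-dependent header formats that this ignores.

```python
# Toy slow-query-log aggregator: regex out Query_time from the "# ..." headers
# and group by a crude fingerprint. Nothing like pt-query-digest's real code.
import re
from collections import defaultdict

HEADER_RE = re.compile(r"# Query_time: (\d+\.\d+)\s+Lock_time: (\d+\.\d+)")

def fingerprint(sql: str) -> str:
    """Collapse whitespace and replace literals so similar queries group together."""
    sql = re.sub(r"\s+", " ", sql.strip().lower())
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return sql

def aggregate(path: str):
    totals = defaultdict(lambda: {"count": 0, "time": 0.0})
    query_time = None
    with open(path, errors="replace") as log:
        for line in log:
            m = HEADER_RE.match(line)
            if m:
                query_time = float(m.group(1))
            elif query_time is not None and not line.startswith(("#", "SET timestamp", "use ")):
                key = fingerprint(line)
                totals[key]["count"] += 1
                totals[key]["time"] += query_time
                query_time = None
    return totals
```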
Configuration, Setup, and Disruptiveness
There's another important point to mention about analyzing log files: the setup for it is non-trivial. There are a lot of moving pieces to put into place, even in a service that provides as much canned functionality as possible for you. If you're setting up and using something like Box Anemometer, a lot more work is necessary. These are not "bad" things per se, but we want to lower the barriers to using our service as much as possible, so any configuration or setup is something we want to avoid.
Beyond ease of adoption, there are risks to consider too. In addition to performance overhead, query logging can cause problems if things aren't done correctly. When I was at Percona I studied the causes of production downtime and wrote two white papers on the results. Full disks were the leading cause of downtime in the leading category of causes, and I have seen servers get into trouble many times because the slow query log filled up the disk. Tools to avoid this, such as logrotate, can cause their own problems in many cases. In my experience, there are just too many ways it can go wrong.
By contrast, DPM's setup is as simple as curl-ing an install script and giving it your API token. Everything else is automatic: our agents discover everything in the environment and self-configure. It's important to note that they don't interfere with or alter anything, such as the server's configuration. There's no need to configure the server, no delicate dances with enabling and disabling the logs periodically, no rotating and emptying the logs, and so on. Packet inspection lets us avoid all of this and simply eavesdrop on the system without interacting with it. This works on 100 servers as easily as it works on 1; it's so easy it's practically magical, and takes less than 15 seconds. We've worked hard to make our installation process very smooth. If we had opted instead to consume query logs, we wouldn't have been able to make things so simple and unintrusive.
Flexibility
Packet inspection gives us a lot of flexibility and represents a smooth upgrade path to measuring more things in the future. For example, packet capture can be done on the database server as we currently do; we could use port mirroring to capture traffic; or we could capture traffic from the application servers. All of these offer advantages and make it possible to use our service in more ways and in a greater variety of environments. Log analysis, by contrast, is much less flexible, and PERFORMANCE_SCHEMA is also constrained in some ways (for example, it cannot be observed passively from a remote system, since it is part of the server internals).
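As a sketch of that flexibility, here's roughly what a capture loop looks like; the same code works wherever the packets are visible, whether on the database host, a box receiving mirrored traffic, or an application server. It assumes the scapy library, the default MySQL port, and root (or CAP_NET_RAW) privileges, and it is not our agent's implementation.

```python
# Illustrative capture loop, assuming scapy and the default MySQL port 3306.
from scapy.all import Raw, TCP, sniff

def handle(pkt):
    # Client->server traffic only; the payload bytes would be handed to a
    # protocol decoder like the COM_QUERY sketch earlier in this post.
    if pkt.haslayer(TCP) and pkt.haslayer(Raw) and pkt[TCP].dport == 3306:
        payload = bytes(pkt[Raw].load)
        print(f"{len(payload)} bytes of MySQL client traffic")

# The BPF filter runs in the kernel, so uninteresting packets are cheap to skip.
sniff(filter="tcp port 3306", prn=handle, store=False)
```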
Shortcomings of Packet Inspection
That's all great, but what are the drawbacks? I will be very frank and say that there are a few.
- We can only observe queries sent over the network. We don't see queries sent through Unix sockets or via replication, and we can't see queries run from stored procedures. We understand the binary protocol and the compressed protocol, and we support prepared statements, but if a statement is prepared before our agent starts observing the traffic, we won't know what statement was executed (see the sketch after this list for why). In my experience these are not a big deal, and our customers thus far agree. Note that we will certainly support the PERFORMANCE_SCHEMA in the future, so these limitations aren't permanent.
- The agents need to run as root. Our customers tell us this isn’t a big deal, since they already run Chef/Puppet as root, and other services are similarly run with root privileges. Even Percona’s blog post about installing their agents mentions that they should run as root. So far, no one has even raised this as a concern.
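To make the prepared-statement caveat concrete, here is a hedged sketch of what a sniffer has to track. On the wire, COM_STMT_EXECUTE carries only a numeric statement id, so the SQL text is known only if the earlier COM_STMT_PREPARE and the server's reply (which assigns that id) were also observed. The command byte values come from the MySQL client/server protocol; the surrounding code is purely illustrative.

```python
# Illustrative bookkeeping for prepared statements seen on the wire.
COM_STMT_PREPARE = 0x16  # client sends SQL text to be prepared
COM_STMT_EXECUTE = 0x17  # client executes by statement id only

prepared = {}  # statement id -> SQL text, built up as traffic is observed
pending = {}   # SQL text awaiting the id assigned by the server's response

def on_client_command(payload: bytes):
    """Handle one client->server command payload (packet header already stripped)."""
    cmd = payload[0]
    if cmd == COM_STMT_PREPARE:
        # The SQL is visible here, but the id arrives only in the server's reply.
        pending["sql"] = payload[1:].decode("utf-8", errors="replace")
    elif cmd == COM_STMT_EXECUTE:
        stmt_id = int.from_bytes(payload[1:5], "little")
        # If the PREPARE happened before we started listening, this lookup fails
        # and all we know is that statement <id> executed, not what it was.
        return prepared.get(stmt_id, f"<unknown prepared statement {stmt_id}>")

def on_prepare_response(payload: bytes):
    """First server response to COM_STMT_PREPARE: 0x00 status byte, then a 4-byte id."""
    if payload and payload[0] == 0x00 and "sql" in pending:
        stmt_id = int.from_bytes(payload[1:5], "little")
        prepared[stmt_id] = pending.pop("sql")
```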