Seeing the Big Picture: Give Me All of the Data
Collecting operational and analytics information can become an addictive habit, especially on an interesting and active network. However, this information can quickly overwhelm an aggregation system once the constant fire hose of data begins. Assuming the goal is to collect and use this information, it becomes clear that a proper strategy is required. That strategy should comprise a number of elements, including consideration of needs and requirements before beginning a project of this scope.
In practice, data gathering requirements will likely be defined either in parallel or, as with many things in networking, adjusted on demand, much like building the airplane while it is in flight. Course correction should be an oft-used tool in any technologist's toolbox. New peaks and valleys, pitfalls, and advantages should feed into the constant evaluation that occurs in any dynamic environment. Critical parts of this strategy apply to nearly all endeavors of this kind. But even before that, the reasoning and use cases should be identified.
A few of the more important questions that need to be answered are:
- What data is available?
- What do we expect to do with the data?
- How can we access the data?
- Who can access what aspects of the data types?
- Where does the data live?
- What is the retention policy on each data type?
- What is the storage model of the data? Is it encrypted at rest? Is it encrypted in transit?
- How is the data ingested?
Starting with these questions can dramatically simplify and smooth the execution process of each part of the project. The answers to these questions may change, too. There is no fault in course correction, as mentioned above. It is part of the continuous re-evaluation process that often marks a successful plan.
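One way to keep the answers to these questions from evaporating after the kickoff meeting is to capture them per data type in a small, reviewable structure. The sketch below is a minimal, hypothetical example; the field names and sample entries are illustrative assumptions, not recommendations.

```python
# Hypothetical per-data-type policy record capturing answers to the
# planning questions above. All sample values are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataTypePolicy:
    name: str                      # what data is available?
    source: str                    # where does it live / how is it ingested?
    consumers: list = field(default_factory=list)  # who can access it?
    retention_days: int = 30       # retention policy
    encrypted_at_rest: bool = False
    encrypted_in_transit: bool = False

policies = [
    DataTypePolicy("syslog", "UDP 514 from routers",
                   consumers=["noc", "security"], retention_days=90),
    DataTypePolicy("netflow", "flow exporters",
                   consumers=["noc"], encrypted_in_transit=True),
]

for p in policies:
    print(p.name, p.retention_days)
```

A record like this makes it obvious when a question was never answered for a given data type, and it is easy to diff when the inevitable course corrections arrive.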
Given these questions, let’s walk through a workflow to understand the reasoning for them and illuminate the usefulness of a solid monitoring and analytics plan. “What data is available?” will drive a huge number of questions and their answers. Let’s assume that the goal is to consume network flow information, system and host log data, polled SNMP time series data, and latency information. Clearly this is a large set of very diverse information, all of which should be readily available. The first mistake most engineers make is diving into the weeds of what tools to use straight away. This is a solved problem, and frankly it is far less relevant to the overall project than the rest of the questions. Use the tools that you understand, can afford to operate (both fiscally and operationally), and that provide the interfaces that you need. Set that detail aside, as answers to some of the other questions may decide it for you.
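Whatever tools are chosen, each of these diverse sources eventually has to be normalized into structured records. As a hedged illustration, here is a sketch of parsing a classic BSD-style (RFC 3164) syslog line with only the standard library; the sample line and field names are assumptions for the example, not output from any particular router.

```python
# Sketch: normalize one of the diverse inputs above (router syslog)
# into a structured record. The regex covers the classic BSD-style
# format only; a production collector would handle more variants.
import re

SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>"                              # priority value
    r"(?P<timestamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) "   # e.g. "Oct 11 22:14:15"
    r"(?P<host>\S+) "                               # originating host
    r"(?P<message>.*)$"                             # free-form message
)

def parse_syslog(line: str) -> dict:
    """Return a dict of fields, raising ValueError on an unparseable line."""
    m = SYSLOG_RE.match(line)
    if not m:
        raise ValueError(f"unrecognized syslog line: {line!r}")
    return m.groupdict()

record = parse_syslog(
    "<34>Oct 11 22:14:15 rtr1 %LINK-3-UPDOWN: Interface Gi0/1, "
    "changed state to down"
)
print(record["host"])  # rtr1
```

The same pattern (parse, normalize, then hand off to storage) applies to flow records and polled SNMP values; only the parser changes.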
How will we store the data? Time series data is easy: it typically goes into an RRD (round-robin database). Will there be a need for complex queries against text such as NetFlow records and syslog? If so, an indexing tool may be needed, and there are many commercial and open source options. Keep in mind that this is one of the more nuanced parts of the plan, because the answers here may change the answers to the other questions, specifically retention, access, and storage location. Data storage is the hidden bane of an analytics system. Disk isn't expensive, but it is hard to size correctly and on budget. Whatever disk space is required, always, always, always add headroom. It will be needed later; otherwise, the retention policy will have to be adjusted.
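The headroom advice above is simple arithmetic, and it is worth doing explicitly per data type. The sketch below is illustrative; the ingest rate, retention period, and 50% headroom factor are assumptions for the example, not recommendations.

```python
# Hypothetical sizing sketch: raw storage need for a retention policy,
# padded with headroom. All figures are illustrative assumptions.

def required_storage_gb(daily_ingest_gb: float, retention_days: int,
                        headroom_factor: float = 1.5) -> float:
    """Return provisioned disk (GB) for one data type, including headroom."""
    raw = daily_ingest_gb * retention_days
    return raw * headroom_factor

# Example: 20 GB/day of syslog kept for 90 days, with 50% headroom.
print(required_storage_gb(20, 90))  # 2700.0
```

Running the numbers this way also makes the trade-off visible: if the result exceeds the budget, either the retention period or the headroom has to give, and that decision should be recorded rather than discovered when the disk fills.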
Encryption comes into play here as well. Typical security practice is to encrypt in flight and at rest, but in many cases this isn't feasible (think router syslog). Encryption at rest also incurs a fairly heavy cost, both one-time (CPU cycles to encrypt) and perpetual (decryption for access). In many cases, encryption simply cannot be justified. Exceptions should be documented and the risks formally accepted by management, so that there is a recorded decision trail in the event that sensitive information is leaked or exfiltrated.
With all of this data, what is the real end goal? Simple: a baseline. Nearly all monitoring and measurement systems provide, at their most elemental level, a baseline. Knowing how something normally operates is fundamental to successfully managing any resource, and networks are no exception. With stored statistical information, it becomes significantly easier to identify issues. Functionally, any data collected will likely be useful at some point, provided it is available and referenced. A solid plan for how that statistical data is handled is the foundation of ensuring those deliverables are met.
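At its simplest, using a baseline means comparing a new sample against stored history. The following is a minimal sketch using only the standard library; the utilization samples and the 3-sigma threshold are illustrative assumptions, and a real system would baseline per interface, per time of day, and so on.

```python
# Minimal baseline sketch: flag samples that deviate sharply from
# stored history. Sample values and threshold are illustrative.
import statistics

def is_anomalous(history: list, sample: float, sigmas: float = 3.0) -> bool:
    """Flag a sample more than `sigmas` standard deviations from the mean."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(sample - mean) > sigmas * stdev

# Hypothetical 5-minute interface utilization samples (Mbps).
baseline = [100, 104, 98, 101, 99, 103, 97, 102]
print(is_anomalous(baseline, 102))  # within the normal range
print(is_anomalous(baseline, 250))  # far outside it
```

Even this crude approach demonstrates the point of the whole exercise: without the stored history, there is nothing to compare the 250 Mbps sample against, and no way to know whether it is routine or a problem.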