"Benefits of High-Frequency Monitoring"
Open Space Session Notes from Devopsdays 2012
Jun 28, 2012
Bergmans Mechatronics LLC
July 7, 2012
This Open Space session was convened to explore DevOps use cases for high-frequency server monitoring, and specifically for high-rate data visualization.
I started the session with a live demonstration of collectdViewer, a system for plotting collectd data in a browser. The demo showed how collectdViewer could display CPU, memory and interface data from a collectd daemon on an EC2 instance at a rate of 2x per second.
After the demo, there was a discussion about where this type of system could be of practical use. Key points of the discussion included:
- collectdViewer is probably most applicable for debugging and tuning small numbers of servers or for monitoring servers during crash recovery
- because the application, in its current form, can display performance data from only a single server, it is impractical for monitoring systems with large numbers (e.g., hundreds) of servers
The issue of data processing and alerting using collectdViewer was also raised during the discussion. Since a RabbitMQ message broker forms the core of the system, as shown below, the system could readily be expanded to include modules which would provide these functions.
collectdViewer System Overview
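As a rough illustration of the kind of alerting module discussed in the session, the following Python sketch raises an alert when a metric stays above a threshold for several consecutive samples. It is not part of collectdViewer. The `Sample` fields, the `drain` and `find_alerts` functions, the host and metric names, and the threshold are all hypothetical, and a standard-library `queue.Queue` stands in for a subscription to the RabbitMQ broker.

```python
import queue
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement; the field names are assumptions, not collectd's format."""
    host: str
    metric: str
    value: float
    timestamp: float  # seconds


def drain(q):
    """Collect every sample currently waiting in the queue.

    The queue is a stand-in for a consumer attached to the message broker.
    """
    samples = []
    while True:
        try:
            samples.append(q.get_nowait())
        except queue.Empty:
            return samples


def find_alerts(samples, threshold, consecutive=3):
    """Return (host, metric, timestamp) once per streak of `consecutive`
    samples above `threshold`, to avoid alerting on a single noisy reading."""
    streak = {}  # (host, metric) -> length of the current over-threshold run
    alerts = []
    for s in samples:
        key = (s.host, s.metric)
        if s.value > threshold:
            streak[key] = streak.get(key, 0) + 1
            if streak[key] == consecutive:
                alerts.append((s.host, s.metric, s.timestamp))
        else:
            streak[key] = 0
    return alerts


if __name__ == "__main__":
    q = queue.Queue()
    # Six samples taken 0.5 s apart, matching the demo rate of 2x per second.
    for i, value in enumerate([50, 95, 96, 97, 98, 40]):
        q.put(Sample("web-1", "cpu_percent", value, i * 0.5))
    print(find_alerts(drain(q), threshold=90))
```

Requiring several consecutive readings trades a little response time for fewer false alarms, which matters more as the sampling rate goes up.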
Two features that were suggested to be added to collectdViewer were:
- The capability to remotely stop a collectd daemon when it is not needed. This would help to reduce CPU load on the target instance when monitoring is not required and would also reduce incoming traffic on the collectdviewer.com server
- Enable developers to access the raw data stream created by the RabbitMQ message broker of collectdViewer.
Additional information about this system is available at collectdviewer.com
Netflix API Monitoring
Adrian Cockcroft (@adrianco) of Netflix also took the floor during the session to speak about the Netflix system to monitor dependencies within its API. This system features circuit breaker functions for high resiliency and a WebSocket-enabled dashboard with an update rate of 1x per second.
After the Devopsdays meeting, Adrian kindly provided the following links to information about Netflix's efforts to increase the reliability and performance of the firm's API:
- "Making the Netflix API More Resilient" by Ben Schmaus
A blog article describing how a circuit-breaker mechanism enables graceful degradation of the API when a failure occurs. Included in the article is a short video of the Netflix API Dependencies Monitor dashboard in action.
- "Fault Tolerance in a High Volume, Distributed System" by Ben Christensen
A blog article on how the Netflix API is designed to be fault-tolerant through the implementation of multiple failure response mechanisms, including circuit breakers.
- "Performance and Fault Tolerance for the Netflix API" by Ben Christensen
A slide deck expanding on the material presented in Mr. Christensen's above-referenced blog post. This slide deck also describes a method to enhance API performance through the use of a server-side agent that operates on behalf of a client to reduce calls to the API.
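For readers unfamiliar with the circuit-breaker pattern mentioned in these articles, here is a minimal Python sketch of the general idea. It is not Netflix's implementation. The class name, the parameters `max_failures` and `reset_after`, and the fallback convention are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker: after repeated failures, stop calling the
    dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures that trip the breaker
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: fail fast, skip the dependency
            # Otherwise the wait has elapsed: allow one trial call below.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return fallback()
        # Success: close the circuit and clear the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: cached_ratings)`, where both names are hypothetical. While the circuit is open, the fallback is returned immediately, which is what allows the API to degrade gracefully instead of waiting on a failing dependency.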
I offer the following brief observations on high-frequency monitoring based on discussions during this and other Open Space sessions and hallway conversations at Devopsdays 2012, as well as some recent thinking about this subject. These observations are intended only as a snapshot of my current thoughts and not as a complete overview of high-frequency monitoring. Any comments or feedback about these observations are welcome.
High-Frequency Monitoring Can Enable Short Response Times to Anomalies
Increasing the monitoring rate (i.e., the frequency at which performance information is acquired) is beneficial because it can reduce the time that operators or automatic control functions need to react to anomalies. Two caveats apply. Increasing the monitoring rate can also result in: i) in certain circumstances, an increase in the noise level of a particular measurement; and ii) an increase in system resource utilization (e.g., CPU, network).
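The trade-off can be made concrete with some back-of-the-envelope arithmetic, sketched in Python below. The function names and example figures are hypothetical. The sketch assumes that an anomaly persists once it starts, that it must be seen in a given number of consecutive samples before anyone reacts, and that measurement noise is independent from sample to sample.

```python
import math


def worst_case_detection_delay(interval_s, confirmations=1):
    """Longest time from the start of a persistent anomaly until it has been
    seen in `confirmations` consecutive samples taken every `interval_s`
    seconds. The worst case is an anomaly that begins just after a sample."""
    return interval_s * confirmations


def noise_after_averaging(sigma, n_samples):
    """Standard deviation left after averaging `n_samples` readings whose
    noise is independent with standard deviation `sigma`."""
    return sigma / math.sqrt(n_samples)


if __name__ == "__main__":
    for interval in (10.0, 0.5):
        delay = worst_case_detection_delay(interval, confirmations=3)
        print(f"interval {interval} s, worst-case delay {delay} s")
    print("noise after averaging 16 samples:", noise_after_averaging(4.0, 16))
```

Under these assumptions, sampling every 0.5 s instead of every 10 s cuts the worst-case delay for three confirmations from 30 s to 1.5 s. If the faster samples turn out to be noisier, averaging can recover some stability, at the cost of giving back part of that response time.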
Data Visualization Not Always Required
If automatic alerting and logic functions are used to process and respond appropriately to the acquired performance data, then visualization of the data in real time, while gratifying and interesting to watch, is probably not required under normal circumstances. Analogously, a car driver doesn't really need to watch the speedometer when cruise control has been enabled on the open highway.
Utility of High-Frequency Visualization for Addressing Anomalies
As discussed during the Open Space session, high-frequency data visualization could be beneficial during anomalous circumstances (e.g., system tuning or crash recovery) when on-the-fly decision-making by system operators might be required.
This benefit is easier to picture by considering how a car speedometer is used in anomalous situations. For example, when navigating around unfamiliar corners, knowing the instantaneous speed of a car (and not the speed 10+ seconds ago) allows the driver to make decisions on-the-fly as to how to operate their vehicle in a safe, yet expedient manner.
Advantages of Browser-Based Data Visualization
Data visualization software can be either browser-based or written as a native, platform-specific application. Two notable advantages of browser-based data visualization are: i) access to the data from a variety of desktop and mobile platforms using a single version of the client software; and ii) the relative ease of providing updated client software to users.
John Bergmans founded Bergmans Mechatronics LLC in 2003. His primary area of interest is WebSocket application research & development and training. He also develops custom data acquisition and control system software for clients in the software, industrial, medical, and defense sectors.