
System and Application health - Is there a data collection challenge at the DC?



My champion-hacker friend Sumanth and I spent a little time a few weeks ago digging into whether there is a data collection challenge for system and application health metrics at a typical small data center. Here is the little that we discovered…

But the self-questioning was preceded by a phase in which I came to know about open-source monitoring tools like Ganglia and discovered how widely they are deployed. Ganglia in particular is a very active development project and uses multicast for data transmission. It caches data at every node within a cluster, so the collection station has to communicate with only one (any) node in the cluster. I asked myself what the rationale behind this design could be. As I looked more deeply into the world of Ganglia, I discovered an amazing attention to detail - optimising for data and time without compromising accuracy or extensibility. For those wishing to understand Ganglia, this book is a must-read; the chapter on case studies makes for truly fascinating reading. This paper also provides a very good intro.

And then there is SNMP. SNMP was built for monitoring; one small deficiency is that not all metrics are available through it. For this blog I decided to keep SNMP aside and do the analysis without it. I present the numbers first and my inferences later.

Usecase

Let me take the example of a data center for a small eCommerce startup, say, called “WebTraveller”. Now, a little detail about this company -

IT Resources

Now, how many IT resources will WebTraveller require? Since I am the de facto CIO of WebTraveller and this is my first CIO job, I decide to start with a nice whole number - say, 100 servers (okay, I hear you, all on cloud). Here is the split-up of what these 100 servers are going to do…

| # | Role | Servers |
|---|------|---------|
| 1 | Servers running WebTraveller's Ruby-based dynamic web application | 25 |
| 2 | Servers serving WebTraveller's static pages and load balancing (httpd/nginx) | 15 |
| 3 | Servers running WebTraveller's Java interface with its data providers | 25 |
| 4 | WebTraveller database servers | 20 |
| 5 | IT analytics | 10 |
| 6 | IT management/monitoring | 5 |
| - | Total servers | 100 |

(How far off the mark am I in these assumptions? If it's horrific, please let me know and I promise to redo this blog.)

Quantum of data to collect

Being the CIO, I want to understand how my IT is coping. So I need data: data on server utilisation, database metrics, web-server metrics and so on. The industry calls these metrics KPIs - Key Performance Indicators. So KPIs it will be. How many KPIs do I need to collect for each type of IT resource?

| # | KPI type | Approx. KPIs per instance | Instances (from the table above) | Total KPIs to collect |
|---|----------|---------------------------|----------------------------------|-----------------------|
| 1 | Operating-system-level KPIs: CPU, RAM, open sockets, HDD usage, network card stats, etc. | 5 | 100 | 500 |
| 2 | Ruby web-app KPIs | 10 | 25 | 250 |
| 3 | The Ruby web app runs on the Rails server; KPIs that indicate Rails health | 10 | 25 | 250 |
| 4 | Java web-app KPIs | 10 | 25 | 250 |
| 5 | The Java web apps use a JVM and an app server (JBoss/GlassFish/Tomcat); KPIs that indicate Java platform health | 10 | 25 | 250 |
| 6 | httpd or nginx KPIs | 10 | 15 | 150 |
| 7 | Database server KPIs | 20 | 20 | 400 |
| 8 | KPIs from the analytics system (say, running Hadoop) | 10 | 10 | 100 |
| - | Total | - | - | 2150 |

So, the approximate total number of KPIs to collect is 2150, which is an average of about 21 KPIs from each of WebTraveller's 100 servers. Now, how frequently do we want to collect this data? As the CIO of WebTraveller I want my IT to be really AGILE - which means I don't want to miss any data (especially in its initial days!). And I also want to keep it SIMPLE. So I ask my monitoring team to collect all these KPIs every minute.

The Developer’s View

'Mr. Bean' is a developer on WebTraveller's IT team. Mr. Bean has his task cut out: he has to develop the monitoring app that collects 2150 metrics every minute by polling. Being a seasoned developer, he knows that to collect so many KPIs he needs to write a 'multi-threaded' application. So Bean decides to do some estimation: how many threads will his application need to capture 2150 KPIs every minute?

First of all, what are the different methods available to capture these KPIs from a remote server? The necessary few are SSH, JMX, JDBC and RPC/RMI.

Mr. Bean calculates the response-time for various collection methods -

| # | Collection method | Observation | Mean time to collect one instance's KPI set | Servers covered per minute by one thread |
|---|-------------------|-------------|---------------------------------------------|------------------------------------------|
| 1 | SSH | SSH involves two kinds of time: (1) connection establishment and teardown, (2) running multiple commands on the remote shell, collating the output and retrieving it | 15 seconds | 60/15 => 4 servers |
| 2 | JMX | Multiple JMX attributes can be retrieved at once, but here again there are the two phases of connection and retrieval | 15 seconds | 60/15 => 4 servers |
| 3 | JDBC | A single JDBC session can fetch a lot of metrics | 20 seconds (assumed) to retrieve all 20 database-server KPIs of one instance | 60/20 => 3 servers |
| 4 | RPC or RMI | I am not sure whether multiple variables can be retrieved in a single session; assuming it is possible... | 15 seconds | 60/15 => 4 servers |
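
To make the two JMX phases concrete, here is a minimal sketch of a one-shot probe; the host, port and choice of platform MBean attributes are illustrative assumptions rather than WebTraveller specifics -

```java
import javax.management.AttributeList;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxKpiProbe {
    public static void main(String[] args) throws Exception {
        // Phase 1: connection establishment (a large share of the ~15 s estimate above)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://java-app-01:9010/jmxrmi"); // hypothetical host:port
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Phase 2: retrieval - several attributes come back in a single round trip
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            AttributeList memoryKpis = mbsc.getAttributes(memory,
                    new String[] {"HeapMemoryUsage", "NonHeapMemoryUsage"});

            ObjectName threading = new ObjectName("java.lang:type=Threading");
            AttributeList threadKpis = mbsc.getAttributes(threading,
                    new String[] {"ThreadCount", "PeakThreadCount"});

            System.out.println(memoryKpis);
            System.out.println(threadKpis);
        } finally {
            // Phase 1 in reverse: connection teardown
            connector.close();
        }
    }
}
```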

With this understanding, Mr. Bean decides which collection technology to use for each class of KPIs (see the table below). He also wants to know how many threads his application may end up running. Mr. Bean knows that ideally he would do asynchronous collection for each of these - that is, start a request on Thread-A and retrieve the data on Thread-B when it arrives; there are many libraries that provide such asynchronous capabilities for SSH, RPC, RMI, JMX, JDBC and the rest. However, asynchronous communication does not lead to a conservative number of threads - a thread gets spawned whenever data arrives. For the most conservative thread count, a select-and-poll based method is more appropriate. The big deficiency of the select-and-poll approach, however, is that data collection within time boundaries becomes tougher: there is no guarantee that the mean times above will always hold good, and the data that arrives is distributed wildly on the temporal scale.
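
As a rough illustration of the conservative-thread style, here is a small sketch (with hypothetical host names and a stubbed-out collection call) of a fixed pool in which each thread walks its own batch of roughly four servers once a minute - the same batching arithmetic that drives the table below -

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SelectAndPollCollector {

    // Stub for "run the remote commands / read the socket and collate one server's KPI set"
    static void collectKpis(String host) {
        System.out.println("collected KPI set from " + host);
    }

    public static void main(String[] args) {
        // One batch of ~4 servers per thread, since each server costs ~15 s to poll
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("os-host-01", "os-host-02", "os-host-03", "os-host-04"),
                Arrays.asList("os-host-05", "os-host-06", "os-host-07", "os-host-08"));

        // In WebTraveller's case the pool would grow to ~64 threads (see the table below);
        // here it is simply sized to the number of batches in this toy example.
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(batches.size());

        for (List<String> batch : batches) {
            // Every minute, one thread polls its batch sequentially within that minute
            pool.scheduleAtFixedRate(
                    () -> batch.forEach(SelectAndPollCollector::collectKpis),
                    0, 1, TimeUnit.MINUTES);
        }
    }
}
```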

So, Mr. Bean calculates the number of threads that his application will end up with if he takes either of the two approaches -

| # | KPI type | Collection technology | Instances (from the table above) | Threads for select-and-poll | Threads for asynchronous approach |
|---|----------|-----------------------|----------------------------------|-----------------------------|-----------------------------------|
| 1 | Operating-system-level KPIs: CPU, RAM, open sockets, HDD usage, network card stats, etc. | SSH | 100 | (100 servers / 4 servers per minute per thread) => 25 threads | 100 |
| 2 | Ruby web-app KPIs | RPC or RMI | 25 | 25/4 => 6.25 (assuming fractional threads were possible!) | 25 |
| 3 | Rails server KPIs (the Ruby web app runs on Rails) | RPC or RMI | 25 | 25/4 => 6.25 | 25 |
| 4 | Java web-app KPIs | JMX | 25 | 25/4 => 6.25 | 25 |
| 5 | JVM and app-server (JBoss/GlassFish/Tomcat) KPIs | JMX | 25 | 25/4 => 6.25 | 25 |
| 6 | httpd or nginx KPIs | SSH | 15 | 15/4 => 4 | 15 |
| 7 | Database server KPIs | JDBC | 20 | 20/3 => 7 | 20 |
| 8 | Analytics system KPIs (say, Hadoop) | SSH | 10 | 10/4 => 3 | 10 |
| - | Total | - | - | 64 threads | 245 threads |

So the number of threads needed to gather information from WebTraveller's 100-server deployment lies roughly between 60 and 250. The following factors are pertinent -

Is there a Data Collection ‘Challenge’?

The numbers say that, on average, 1 to 2 threads/sockets are required to collect data from each instance. This does not sound like much in WebTraveller's case, but one needs to pay attention to the following details -

With a little more scale

The situation changes considerably if we consider a data center with 3000 servers. Even with a linear extrapolation (2150 KPIs × 30), that would involve collecting about 65,000 data points every minute - and in excess of 10,000 threads and sockets.

With Ganglia

In WebTraveller's case, Mr. Bean could potentially do one other thing: he could use Ganglia to collect the data. Each of the 5 functional groups (from the first table above) in WebTraveller's data center could be configured as a separate Ganglia cluster. This leads to metrics being multicast and cached within each cluster, so the collection station needs to pull from just one node per cluster - five connections in place of the 60-250 polling threads estimated above.
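
As a sketch of what that per-group configuration might look like - the cluster name is hypothetical and the multicast address and port are simply Ganglia's customary defaults, not a tested WebTraveller setup - the gmond.conf on each database server could carry something like:

```
/* One Ganglia cluster per functional group, e.g. the database servers */
cluster {
  name  = "webtraveller-db"       /* hypothetical cluster name */
  owner = "WebTraveller IT"
}

/* Every node multicasts its metrics and caches what it hears from its peers */
udp_send_channel {
  mcast_join = 239.2.11.71        /* Ganglia's customary default multicast group */
  port       = 8649
  ttl        = 1
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port       = 8649
}

/* The collection station pulls the whole cluster's state from any one node */
tcp_accept_channel {
  port = 8649
}
```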

