
A Few Days with Apache Cassandra



A few years ago I was a product developer at a big software (but non-database) company. We were writing v2 of a product after a fairly successful development round of v1. For everything OLTP, we used the wonderful open-source database - Postgres. But by v2 we had new, high-volume data like NetFlow coming in, which would have severely tested Postgres's scalability and read/write performance. We had some data warehousing and OLAP requirements too. A hard look at our queries told us that column-stores would be a great fit. Looking back, the options for a new product to store and query massive data volumes boiled down to these few -

The fact was, there were no open-source, reliable, horizontally scalable column-stores or parallel DBMS to consider.

Times have improved. We now have Cassandra, HBase, Hypertable and others (MongoDB, CouchDB and the like are document stores with less emphasis on data modeling; the context here is schema-full data with rich data-type support).

So I decided to try to understand Cassandra. I wanted to answer a simple question - if I were to re-live the product development scenario described above, would I choose Cassandra? In this article I describe my experiment with Cassandra, choosing a very specific use-case to illustrate what I found - monitoring JVM metrics in a small data center.

A Simple Usecase

Data Volumes

Fine-grained Data
Coarse-grained Data
Adding it all up!

Number of data points collected PER DAY -
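As a rough illustration of how the per-day volume adds up, here is a back-of-envelope calculation. The JVM count, instrumented-method count and sample interval below are hypothetical assumptions for the sketch, not figures from the prototype:

```java
public class DataVolume {
    // Hypothetical sizing assumptions, NOT figures from the prototype:
    static final int JVMS = 10;              // JVMs monitored in the small data center
    static final int METHODS_PER_JVM = 200;  // instrumented methods per JVM
    static final int SAMPLE_SECONDS = 60;    // one fine-grained sample per minute

    /** Fine-grained method-metric data points collected per day. */
    static long pointsPerDay() {
        long samplesPerDay = 24L * 3600 / SAMPLE_SECONDS;   // 1440 samples/day
        return JVMS * METHODS_PER_JVM * samplesPerDay;      // 2,880,000 points/day
    }

    public static void main(String[] args) {
        System.out.println(pointsPerDay() + " method data points/day");
    }
}
```

Even a small data center reaches millions of rows per day at this granularity, which is why the raw tables get rolled up.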

There are a couple of VERY IMPORTANT things to realize before going further -

Small-data problem? It's just a prototype!

Before we start data modeling…

Data Access methods in Cassandra

Predominantly, there are three ways to interact with Cassandra - Hector, Astyanax and CQL. Cassandra exposes a Thrift API, and Hector and Astyanax use that API to talk to the DBMS. CQL3 proposes a new SQL-like API. This slide deck, by the main committer of this piece, Eric Evans, compares CQL3 performance vis-a-vis the Thrift API. Take your pick! In this prototype, I use CQL3.

SuperColumns

Recent articles and blogs suggest that supercolumns are a bad design and will go away in future releases of Cassandra. So I use composite keys, not supercolumns, to model the data.

Denormalization and Data Modeling by Queries

One of the central ideas in column-stores is to model data according to the queries expected. Another is to denormalize - that is, store multiple replicas of the data if required. Both ideas have strong theoretical backing. Let me state just two -
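A minimal sketch of what "model by queries, denormalize on write" means in practice: a single measurement fans out to every table that serves a query. The table and column names follow the schema later in this post; the statement building here is plain string formatting for illustration, not the driver API, and the method is mine, not from the repo:

```java
import java.util.ArrayList;
import java.util.List;

public class DenormalizedWrite {
    /** One measurement is written to each query-specific table. */
    static List<String> statementsFor(int jvmId, String date, int dayTime,
                                      int methodId, long invocations, float responseTime) {
        List<String> stmts = new ArrayList<>();
        // Raw trend query table: one row per sample.
        stmts.add(String.format(
            "INSERT INTO JvmMethodMetricsRaw (jvm_id, date, day_time, method_id, invocations, response_time) "
            + "VALUES (%d, '%s', %d, %d, %d, %f);",
            jvmId, date, dayTime, methodId, invocations, responseTime));
        // Hourly roll-up table: the same data, keyed the way the hourly query reads it.
        int hour = dayTime / 3600;
        stmts.add(String.format(
            "INSERT INTO JvmMethodMetricsHourly (jvm_id, hour, method_id, invocations, response_time) "
            + "VALUES (%d, %d, %d, %d, %f);",
            jvmId, hour, methodId, invocations, responseTime));
        return stmts;
    }
}
```

The extra writes are cheap in Cassandra; the payoff is that every read hits exactly one table laid out for it.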

Code Itself

The JBoss7-based implementation of this prototype can be found in my github repository. You will find a couple of MBeans - JvmMethodMetricsDAO and JvmMethodIdNameDAO - which have the persist() and find() methods. The procedure to use it is -

  1. Build the artifact using Maven - 'mvn clean install' at the top-level directory
  2. Deploy the jim-ear.ear in JBoss's standalone/deployments
  3. Start jconsole and you should be able to see these MBeans in the jconsole UI

Data Modeling

Here are a few of the broad guidelines I set and followed -

Keyspace Configuration
For JVM Method metrics
CREATE KEYSPACE JvmMethodMetrics    WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};
For JVM wide statistics
CREATE KEYSPACE JvmMetrics          WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};
Column Families in JvmMethodMetrics KEYSPACE
Raw Trend Query Tables
CREATE TABLE JvmMethodIdNameMap (
    jvm_id int,
    method_id int,
    method_name varchar,
    PRIMARY KEY (jvm_id, method_id)   // method_id clusters, so one JVM keeps many methods
);

CREATE INDEX jvm_method_name ON JvmMethodIdNameMap (method_name);

CREATE TABLE JvmMethodMetricsRaw (
    jvm_id int,
    date varchar,
    day_time int,
    method_id int,
    invocations bigint,
    response_time float,
    PRIMARY KEY (jvm_id, date, day_time, method_id)   // clustering columns keep each sample distinct
);

CREATE INDEX jvm_method_id ON JvmMethodMetricsRaw (method_id);
Trend Query Roll-up Tables
CREATE TABLE JvmMethodMetricsHourly (
    jvm_id int,
    hour int,
    method_id int,
    invocations bigint,
    response_time float,
    PRIMARY KEY (jvm_id, hour, method_id)
);

CREATE TABLE JvmMethodMetricsDaily (
    jvm_id int,
    day int,
    method_id int,
    invocations bigint,
    response_time float,
    PRIMARY KEY (jvm_id, day, method_id)
);

CREATE TABLE JvmMethodMetricsWeekly (
    jvm_id int,
    week int,
    method_id int,
    invocations bigint,
    response_time float,
    PRIMARY KEY (jvm_id, week, method_id)
);

CREATE TABLE JvmMethodMetricsMonthly (
    jvm_id int,
    month int,
    method_id int,
    invocations bigint,
    response_time float,
    PRIMARY KEY (jvm_id, month, method_id)
);
TopN Query Tables

Data in these tables is kept sorted by maximum (response-time/invocations) to minimum

CREATE TABLE JvmMethodTopNHourly (
    jvm_id int,
    hour int,
    method_id_type varchar,      // Example: 100_RT => for method 100 response-time, 103_INV => for method 103 invocation count
    response_time_map map<text, float>,
    invocation_count_map map<text, bigint>,
    PRIMARY KEY (jvm_id, hour)
);

CREATE TABLE JvmMethodTopNDaily (
    jvm_id int,
    day int,
    method_id_type varchar,
    response_time_map map<text, float>,
    invocation_count_map map<text, bigint>,
    PRIMARY KEY (jvm_id, day)
);

CREATE TABLE JvmMethodTopNWeekly (
    jvm_id int,
    week int,
    method_id_type varchar,
    response_time_map map<text, float>,
    invocation_count_map map<text, bigint>,
    PRIMARY KEY (jvm_id, week)
);

CREATE TABLE JvmMethodTopNMonthly (
    jvm_id int,
    month int,
    method_id_type varchar,
    response_time_map map<text, float>,
    invocation_count_map map<text, bigint>,
    PRIMARY KEY (jvm_id, month)
);
Column Families in JvmMetrics KEYSPACE
CREATE TABLE JvmMetricsRaw (
    jvm_id int,
    date varchar,
    day_time int,
    total_live_threads int,

    mem_heap set<bigint>,             // 3 data points - committed, max, used
    mem_nonheap set<bigint>,

    ds_freepool map<int, bigint>,     // key is datasource_id, free pool of
    ds_usetime map<int, bigint>,      // threads, avg query time over 1 min

    PRIMARY KEY (jvm_id, date, day_time)   // day_time clusters the samples within a day
);

Query Code

The CQL3 Java driver packs a QueryBuilder utility that offers some basic features. Refer to the QueryBuilder JavaDocs for more info. I was able to build simple 'select' queries with different 'where' clauses for time and IDs without much effort. I would recommend extending Cassandra's QueryBuilder in your DAO layer to provide model-specific functionality and to catch errors. The prototype offers an Entity/DAO model which will be easily understood by those familiar with JPA/Hibernate. (However, I am not a fan of the many ORM frameworks coming up for Cassandra - knowledge of 'entity' modeling is critical for the performance problems Cassandra proposes to handle. Using a Cassandra ORM framework would mean less knowledge of the data model and, consequently, less performant queries. Stay away from them!)

Read/Write Performance

After modeling and unit testing, I ran the application on my laptop (MacBook Pro, 2.9GHz/8GB RAM). Since my laptop is not an ideal performance-test environment (I have multiple applications running and did no tuning of Cassandra or JBoss), I see no point in publishing the numbers or charts. However, I was able to write literally millions of records per minute and read them back. Since I also run MySQL on my laptop, one thing I can vouch for is that Cassandra's write performance is definitely far ahead of what I would have expected from my out-of-the-box MySQL.

Conclusion

Cassandra has come a long way from the 0.8 days. I did not come across any bugs working on my prototype. CQL3 and data modeling were a breeze. And there is a plethora of resources on this topic on the web. I would certainly recommend Cassandra for those looking to get a quick hang of NoSQL and column-stores. If you are planning to use Cassandra as part of your application and have done the due diligence on the performance side, then let me assure you - programming with Cassandra should not take any more time than using an ORM framework like JPA/Hibernate. And if you are like me, wanting to write a prototype, then you should be able to wrap it all up, from zero to running, in a single working week. Ping me if you run into any issues using my code, understanding my blog, or anything else. Thanks for reading!

Reading Recommendations

