Recently I implemented a Flume sink that uses Voldemort as its data store. It is now hosted on GitHub, so please feel free to fork it and experiment with it. But before that, I’ll give you some background on Flume and Voldemort so that you have a good picture of both.
What is Flume?
Flume is a distributed system that collects your logs at their source and aggregates them where you want to process them. The primary use case is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as HDFS.
Flume is an open source project from Cloudera, licensed under the Apache License v2.0. Cloudera provides enterprise support for the Apache Hadoop project and its subprojects such as Pig, Hive, HBase and ZooKeeper. Their distribution is named “Cloudera’s Distribution for Hadoop (CDH)” and it includes several components from Cloudera, including Flume, Sqoop and Hue.
Before proceeding with the rest of the article, I recommend that you visit www.cloudera.com to get a better picture of Flume and its architecture.
Flume is not limited to collecting logs from distributed systems; it can also handle other use cases such as:
- Collecting readings from an array of sensors
- Collecting impressions from custom apps for an ad network
- Collecting readings from network devices in order to monitor their performance.
Flume aims to preserve reliability, scalability, manageability and extensibility while serving a large number of clients with a high quality of service.
Flume is easy to extend
Flume was awesome to me because it is very easy to extend. The source and sink architecture is flexible and the APIs are simple.
What is a Source?
A source is a data provider that produces event data and sends it downstream. Flume supports reading from many data sources such as the standard console (stdin), syslog, text files and even tailing a file. Sources are the origin of events.
What is a Sink?
Sinks are the destinations of events. Events travel downstream from sources to sinks. There are many possible destinations for events: local disk, HDFS, the console, or forwarding across the network. You use the sink abstraction as an interface for forwarding events to any of these destinations.
The beauty of Flume is that you can implement your own sink or source and plug it into the core very easily. Plugging a new source or sink into the Flume core is just a matter of editing some configuration files.
Because of this cool feature, I started thinking about adding Voldemort as a Flume event sink.
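To make the sink idea a bit more concrete, here is a minimal sketch of what a custom destination boils down to. Note that `SimpleEventSink` and `InMemorySink` are hypothetical names I made up for illustration; this is not Flume's actual interface, just the open/append/close lifecycle reduced to its essentials.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a sink abstraction; NOT Flume's real interface.
interface SimpleEventSink {
    void open();                    // acquire resources (connections, files, ...)
    void append(String eventBody);  // deliver one event to the destination
    void close();                   // release resources
}

// A custom destination only has to implement the three lifecycle methods.
class InMemorySink implements SimpleEventSink {
    private final List<String> received = new ArrayList<>();
    private boolean open;

    @Override
    public void open() {
        open = true;
    }

    @Override
    public void append(String eventBody) {
        if (!open) {
            throw new IllegalStateException("sink is not open");
        }
        received.add(eventBody);
    }

    @Override
    public void close() {
        open = false;
    }

    // Exposes what has been delivered so far, for inspection.
    List<String> received() {
        return received;
    }
}
```

A Voldemort-backed sink would follow the same shape, with `append` writing to the store instead of a list.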
What is Voldemort?
I’ve blogged about Voldemort in my previous post, where you can read more details. But for the sake of clarity, I’ll introduce Voldemort again here.
Voldemort is an open source distributed key-value store implemented by the SNA team at LinkedIn. It is an implementation of the design described in Amazon’s Dynamo paper, which has been a great milestone for key-value storage systems.
One of the most lovable features of Voldemort is its rich documentation compared to competing key-value stores. It is also backed by a strong community of developers, so getting help and support is easy.
Voldemort is a distributed, fault-tolerant and highly scalable key-value store, so it is well suited to enterprise-level operations.
Analysing logs in enterprise applications is a challenging task. Logs are produced by multiple computers, so the amount of data to analyse is comparatively large. Until the logs are processed, they need to be stored in a reliable, fault-tolerant and scalable storage medium. Voldemort fits this situation well.
Voldemort sink for Flume
Now we get down to business. I think the prelude was somewhat lengthy :)
The Voldemort sink for Flume stores events received from different Flume sources in Voldemort. Initially this looks like a pretty straightforward task, but several issues required more consideration.
Design considerations for Voldemort sink
Several design considerations came up when I started designing the sink. I must especially thank Roshan Sumbaly from LinkedIn for his ideas. Below are the design decisions I had to consider.
Key generation strategy and granularity of the key space.
I already mentioned that Voldemort is a key-value storage system. So any value being stored there must have a unique key assigned to it.
The same applies to this Flume sink. When the sink receives an event from a source, we have to store that event in Voldemort. But before that, a unique key must be generated and assigned to the event value. There are several ways to generate a unique ID for a key, such as random numbers, UUID generators and time stamps.
But the biggest problem we face is retrieval: when reading the stored log entries (or events) back from Voldemort, how can we specify exactly which entry we need? A stored value is retrieved by performing a GET operation with its key. Since the sink does not keep track of the keys it generates, it is impossible to retrieve the stored entries.
Using time stamps as keys seems to be a solution to this problem. If we use “yyyyMMddhhmmS” as the time stamp format, an example key might look like this:
201010031030345, that is 2010 (year), 10 (month), 03 (day), 10 (hour), 30 (minute), 345 (milliseconds)
But even with this strategy we face a problem. Voldemort doesn’t support range queries, so to retrieve an entry from Voldemort we must know the exact time stamp at which the entry was inserted, because that time stamp is the key for the entry.
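The exact-match problem is easy to see in a small sketch. Here I assume the format string is interpreted by `java.text.SimpleDateFormat`, I fix the time zone to UTC so the result is reproducible, and I use a plain `HashMap` as a stand-in for the Voldemort store; `TimestampKeyDemo` and its methods are names made up for this example.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

class TimestampKeyDemo {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Formats a timestamp with the pattern from the text (assumption: SimpleDateFormat semantics).
    static String keyFor(Date timestamp) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddhhmmS", Locale.ROOT);
        format.setTimeZone(UTC);
        return format.format(timestamp);
    }

    // Builds a UTC timestamp; month is 1-based here, unlike Calendar's 0-based months.
    static Date utc(int year, int month, int day, int hour, int minute, int millis) {
        Calendar calendar = Calendar.getInstance(UTC, Locale.ROOT);
        calendar.clear();
        calendar.set(year, month - 1, day, hour, minute, 0);
        calendar.set(Calendar.MILLISECOND, millis);
        return calendar.getTime();
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>(); // stand-in for a Voldemort store
        store.put(keyFor(utc(2010, 10, 3, 10, 30, 345)), "some log line");

        // Same millisecond: the entry is found.
        System.out.println(store.get(keyFor(utc(2010, 10, 3, 10, 30, 345))));
        // One millisecond later: a different key, so the lookup misses.
        System.out.println(store.get(keyFor(utc(2010, 10, 3, 10, 30, 346))));
    }
}
```

One caveat about the pattern itself: in `SimpleDateFormat`, `hh` is the 12-hour clock, so 10:30 and 22:30 produce the same hour digits. If you want unambiguous keys, `HH` (24-hour clock) is the safer choice.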
At this stage Roshan suggested that bucketing the keys by time stamp would be a better solution. That means we append log entries/events under the same key. Entries can be collected at several granularity levels: per day, per hour or per minute. The granularity is configurable at the sink’s construction time, so users have to specify the granularity level as “DAY”, “HOUR” or “MINUTE”.
Say for example the current time stamp is 2010/10/03 10:30. The generated key under the different granularity levels will be as follows.
- If “DAY” is specified => key will be 20101003
- If “HOUR” is specified => key will be 2010100310
- If “MINUTE” is specified => key will be 201010031030
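The key generation above can be sketched in a few lines. This is only an illustration, not the sink's actual code: the `Granularity` enum and its `keyFor` method are names I made up, and I assume a 24-hour clock (`HH`) so that morning and evening hours get different keys.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps a granularity level to a bucket key.
enum Granularity {
    DAY("yyyyMMdd"),
    HOUR("yyyyMMddHH"),
    MINUTE("yyyyMMddHHmm");

    private final DateTimeFormatter formatter;

    Granularity(String pattern) {
        this.formatter = DateTimeFormatter.ofPattern(pattern);
    }

    // Truncates the timestamp to this granularity by formatting only the leading fields.
    String keyFor(LocalDateTime timestamp) {
        return timestamp.format(formatter);
    }
}

class BucketKeyDemo {
    public static void main(String[] args) {
        LocalDateTime timestamp = LocalDateTime.of(2010, 10, 3, 10, 30);
        for (Granularity granularity : Granularity.values()) {
            System.out.println(granularity + " => " + granularity.keyFor(timestamp));
        }
    }
}
```

Every event that arrives within the same day, hour or minute gets the same key, which is what makes retrieval possible without tracking individual keys.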
When “DAY” is specified, all the log entries are appended to the key “20101003”. When “HOUR” is specified, all entries are appended to “2010100310”, and so on. The key changes once the day, hour or minute rolls over. Entries are delimited with a pipe “|” character, so once you retrieve the value it is just a matter of splitting the String back into the original log lines.
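Here is a rough sketch of the append and split steps. To keep it self-contained I use an in-memory `HashMap` in place of a real Voldemort client, and `BucketedLogStore` is a hypothetical class name, so treat it as an illustration of the idea rather than the sink's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// In-memory stand-in for a Voldemort store (assumption: the real sink uses a Voldemort client).
class BucketedLogStore {
    private static final String DELIMITER = "|";
    private final Map<String, String> store = new HashMap<>();

    // Appends an entry to whatever is already stored under the bucket key.
    void append(String bucketKey, String logLine) {
        store.merge(bucketKey, logLine, (existing, added) -> existing + DELIMITER + added);
    }

    // Reads a bucket back and splits it into the original log lines.
    List<String> read(String bucketKey) {
        String value = store.get(bucketKey);
        if (value == null) {
            return new ArrayList<>();
        }
        // "|" is a regex metacharacter, so it has to be quoted before splitting.
        return Arrays.asList(value.split(Pattern.quote(DELIMITER)));
    }
}
```

Keep in mind that a log line which itself contains a “|” would be split into several pieces with this scheme, so such lines would need escaping or a different delimiter.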
Using JSON as the key serialization format
JSON is a more efficient serialization format than plain Strings, so it’s recommended to configure the Voldemort store with JSON as the key serialization format. The value can be in any preferred format.
Installing and configuring Voldemort sink for Flume
You can find the detailed installation guide at GitHub.
Coming up next…
I’m also working on the Voldemort source and plan to release it soon. Before that, I’d like to hear your feedback on this, so feel free to fork the project and hack on it!
Happy hacking! :)