A brief introduction to Apache Mahout


What is Mahout?

Mahout is an open source machine learning library from Apache. It is a collection of algorithms that fall under the machine learning, or collective intelligence, category. Mahout is just a Java library; it is not a server or a tool that gives you a GUI.
A nice thing about Mahout is that it is scalable. If the amount of data to be processed is very large, or too large to fit on a single machine, Mahout is a strong candidate. Most of Mahout’s algorithms have been implemented so that they can run on top of Apache Hadoop.

Mahout’s capabilities

Mahout consists of a set of algorithms capable of performing the following machine learning tasks:

1. Collaborative filtering
2. Clustering
3. Classification

Collaborative filtering with Mahout

Mahout’s main focus is on providing collaborative filtering, or recommendation, services. Given a set of users and items, Mahout can provide recommendations to the current user of the system. It requires user preferences in the form of (userID, itemID, preference) tuples, where the preference is a scalar value expressing the user’s taste for a particular item.
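
To make the input format concrete, here is a minimal sketch of loading such preferences with Mahout’s Taste API. The file name and the sample data are assumptions for illustration; the CSV layout (one userID,itemID,preference triple per line) is what FileDataModel expects.

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class PreferenceData {
      public static void main(String[] args) throws Exception {
        // prefs.csv (hypothetical sample), one preference per line:
        //   1,101,4.5   -> user 1 rated item 101 with 4.5
        //   1,102,3.0
        //   2,101,2.0
        DataModel model = new FileDataModel(new File("prefs.csv"));
        System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");
      }
    }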

Mahout supports three ways of generating recommendations:

  • User-based: Recommend items by finding similar users (see the sketch after this list). This is often harder to scale because of the dynamic nature of users.
  • Item-based: Calculate the similarity between items and make recommendations. Items usually don’t change much, so this often can be computed offline.
  • Slope-One: A very fast and simple item-based recommendation approach, applicable when users have given ratings (and not just boolean preferences).
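
As a concrete illustration of the user-based approach, here is a minimal sketch built on the Taste classes that ship with Mahout. The preference file name, the neighborhood size of 10, and the user ID are assumptions chosen for illustration.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv")); // hypothetical file
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users when predicting preferences.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1 (assumed to exist in prefs.csv).
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

Swapping in GenericItemBasedRecommender with an ItemSimilarity gives the item-based variant, and older Mahout releases ship a SlopeOneRecommender that wraps the Slope-One approach around the same DataModel.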

Mahout in production environment

Mahout can be operated in two modes: local mode and distributed mode.

Local mode

In local mode, Mahout can serve recommendations over up to 100 million preference records in real time, although this requires a machine with about 4GB of Java heap. Mahout also provides a servlet, so a recommender can be deployed inside a servlet container and queried over HTTP. This way, Mahout’s capabilities are not limited to the Java platform.
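
For this kind of local, real-time serving, the item-based approach from the list above is a natural fit, since item-item similarities change slowly and can be reused across requests. A minimal sketch, again with an assumed file name and user ID:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class ItemBasedExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv")); // hypothetical file
        // Log-likelihood similarity also works with boolean (e.g. view/click) preferences.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
        // Top 3 recommendations for user 1 (assumed to exist in prefs.csv).
        System.out.println(recommender.recommend(1, 3));
      }
    }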

Distributed mode

If you expect to go beyond 100 million records, it is better to consider distributed mode, which runs as a MapReduce job on a Hadoop cluster. Running in distributed mode is a batch process, and it takes time.
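
As a rough sketch, the distributed item-based recommender can be launched from Java via Hadoop’s ToolRunner; RecommenderJob is Mahout’s Hadoop job for this, while the HDFS paths and the similarity flag value shown here are assumptions that depend on your cluster and Mahout version (the same job is more commonly started with hadoop jar and the Mahout job JAR).

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class DistributedRecommendation {
      public static void main(String[] args) throws Exception {
        // Equivalent to running RecommenderJob from the command line with hadoop jar.
        ToolRunner.run(new RecommenderJob(), new String[] {
            "--input", "/user/hadoop/prefs.csv",         // HDFS input: userID,itemID,preference (assumed path)
            "--output", "/user/hadoop/recommendations",  // HDFS output directory (assumed path)
            "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD"
        });
      }
    }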