Web Advent 2010 / Big Data

Big data, data science, analytics. These are some of the hottest buzzwords in tech right now. Five years ago, bragging rights went to the geek with the most users; these days, whoever has the biggest data wins.

There are a number of approaches to dealing with vast quantities of data, but one of the best known is Apache Hadoop. Hadoop is a toolkit for managing large data sets, based originally on the Google whitepapers about MapReduce and the Google File System. For Socorro, the Mozilla crash reporting system, we use HBase, a non-relational (NoSQL) database built on the Hadoop ecosystem.

The Hadoop world is largely a Java world, since all the tools are written in Java. However, if you feel the same way about Java as Sean Coates, you should not lose hope. You, too, can use PHP to work with Hadoop.

Let’s start by understanding MapReduce. This is a framework for distributed processing of large datasets.

A MapReduce job consists of two pieces of code:

A Mapper
The Mapper transforms input key-value pairs into intermediate key-value pairs.
A Reducer
The Reducer receives the intermediate pairs from the Mappers, grouped by key, and collates them into the final results.

More parts are needed to make this work:

  • An Input reader generates splits of data for each Mapper to work through.
  • A Partition function takes the output of Mappers and chooses a destination Reducer.
  • An Output writer takes the output of the Reducers and writes it to the Hadoop Distributed File System (HDFS).
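
To make the flow concrete, here is a rough sketch using word counting, the example we'll build below. (The input line is made up; the shape of the data is the point.)

input split:   "to be or not to be"
mapper emits:  (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1)
shuffle/sort:  (be, [1, 1]) (not, [1]) (or, [1]) (to, [1, 1])
reducer emits: (be, 2) (not, 1) (or, 1) (to, 2)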

In summary, the Mapper and Reducer are the core functionality of a MapReduce job. Now, let’s get set up to write a Mapper and Reducer against Hadoop with PHP.

Setting up Hadoop is a non-trivial task; luckily, a number of VMs are available to help. For this example, I am using the Training VM from Cloudera. (You’ll need VMware Player for Windows or Linux, or VMware Fusion for OS X to run this VM.)

Once you’ve started the VM, open up a terminal window. (This VM is Ubuntu based.)

The VM you have just installed comes with a sample data set: the complete works of Shakespeare. Run the following commands to extract the files and put them into HDFS so that we can work with them:

cd ~/git/data
tar vzxf shakespeare.tar.gz
hadoop fs -put input /user/training/input

You can confirm this worked by viewing the files in the input directory on HDFS:

hadoop fs -ls /user/training

Next, we need to create the mapper and reducer. To demonstrate these, we’ll reproduce what is often referred to as the canonical MapReduce example: word count.

You can find the Java version of this code in the Cloudera Hadoop Tutorial.

As you can see (if you know Java), the mapper reads words from its input and, for each word it encounters, emits the word as a key with the value 1 to indicate that the word has been seen once. The reducer takes the output from the mappers and aggregates it to produce a set of words and their counts.

The easiest way to communicate from PHP to Hadoop and back again is using the Hadoop Streaming API. This expects mappers and reducers to use standard input and output as a pipe for communication.

This is how we write the word count mapper in PHP, which we’ll name mapper.php:

#!/usr/bin/php
<?php

// Hadoop Streaming feeds our input split to us on standard input.
$input = fopen("php://stdin", "r");

while (($line = fgets($input)) !== false) {
    $line = strtolower($line);
    // Split on runs of non-word characters, discarding the empty strings
    // that punctuation and the trailing newline would otherwise produce.
    $words = preg_split('/\W+/', $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Emit a tab-delimited key-value pair: the word, and a count of 1.
        echo "$word\t1\n";
    }
}

fclose($input);

We open standard input for reading a line at a time, split that line into an array of words using a regular expression (discarding the empty strings that runs of punctuation would otherwise produce), and emit each word encountered followed by a 1. (I delimited these with tabs, but you may use whatever you like.)
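
For instance, given the line "Now is the winter of our discontent", the mapper would emit:

now	1
is	1
the	1
winter	1
of	1
our	1
discontent	1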

Now, here’s the reducer (reducer.php):

#!/usr/bin/php
<?php

$input = fopen("php://stdin", "r");
$counts = array();

// Each input line is a tab-delimited word/count pair emitted by a mapper.
while (($line = fgets($input)) !== false) {
    list($word, $count) = explode("\t", trim($line));
    if (!isset($counts[$word])) {
        $counts[$word] = 0; // avoid an undefined-index notice on a new word
    }
    $counts[$word] += (int) $count;
}

fclose($input);

// Write the final tally to standard output.
foreach ($counts as $word => $count) {
    echo "$word $count\n";
}

Again, we read a line at a time from standard input, and summarize the results in an array. Finally, we write out the array to standard output.
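
Holding every word in an array works for Shakespeare-sized data, but it won't scale to key sets that don't fit in memory. Because Hadoop Streaming sorts the mapper output by key before your reducer sees it, a reducer can instead emit each count as soon as the key changes. Here's a minimal sketch of that variant (an alternative to, not part of, the reducer above):

#!/usr/bin/php
<?php

// A streaming-friendly reducer: relies on Hadoop having sorted the mapper
// output by key, so we only ever hold one key's running total in memory.
$input = fopen("php://stdin", "r");
$current = null;
$count = 0;

while (($line = fgets($input)) !== false) {
    list($word, $n) = explode("\t", trim($line));
    if ($word !== $current) {
        if ($current !== null) {
            echo "$current $count\n"; // the key changed; flush the previous one
        }
        $current = $word;
        $count = 0;
    }
    $count += (int) $n;
}

if ($current !== null) {
    echo "$current $count\n"; // flush the final key
}

fclose($input);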

Copy these scripts to your VM, and once you have saved them, make them executable:

chmod a+x mapper.php
chmod a+x reducer.php
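
Because the scripts communicate over plain standard input and output, you can smoke-test them without Hadoop at all. This shell pipeline mimics the map, shuffle/sort, and reduce phases locally (sample.txt here is any text file you have handy):

cat sample.txt | ./mapper.php | sort | ./reducer.php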

You can run this example code in the VM using the following command. (The -file options ship your scripts out to the nodes that run the job; without them, Hadoop won't be able to find your mapper and reducer.)

hadoop \
jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+320.jar \
-mapper mapper.php \
-reducer reducer.php \
-file mapper.php \
-file reducer.php \
-input input \
-output wordcount-php-output

In the output from the command you will see a URL where you can trace the execution of your MapReduce job in a web browser as it runs. Once the job has finished running, you can view the output in the location you specified:

hadoop fs \
-ls /user/training/wordcount-php-output

You should see something like:

Found 2 items
drwxr-xr-x   - training supergroup          0 2010-12-14 15:40 /user/training/wordcount-php-output/_logs
-rw-r--r--   1 training supergroup     279706 2010-12-14 15:40 /user/training/wordcount-php-output/part-00000

You can view the output, too:

hadoop fs \
-cat /user/training/wordcount-php-output/part-00000

An excerpt from the output should look like this:

yeoman 13
yeomen 1
yerk 2
yes 211
yest 1
yesterday 25
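
If you tweak the scripts and want to run the job again, remove the old output directory first, since Hadoop refuses to overwrite an existing one:

hadoop fs -rmr /user/training/wordcount-php-output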

This is a pretty trivial example, but once you have this set up and running, it’s easy to extend this to whatever you need to do. Some examples of the kinds of things you can use it for are inverted index construction, machine learning algorithms, and graph traversal. The data you can transform is limited only by your imagination, and, of course, the size of your Hadoop cluster. That’s a topic for another day.
