Tuesday, August 5, 2008

Number of Reducers in Hadoop Distributed Mode

So I may be a complete idiot here and may be missing something completely... But I couldn't figure out for the life of me why my MapReduce program was only returning partial files in pseudo-distributed mode. After adding some log4j messages to my code and fishing through the log files for each individual task, I noticed that with 2 reducers running in distributed mode, I only got a small portion of the file I was expecting.

Then I tried decreasing the number of reducers to just 1 and, magically, the framework started collecting everything into 1 file instead of overwriting the file that was already there and leaving only a small portion of the output. Now onto more complicated things...
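For anyone hitting the same thing, here is a rough sketch of why more than one reducer can leave each output file with only part of the data. It is plain Java with no Hadoop dependency, and it assumes the job uses Hadoop's default hash partitioning, where a key goes to reduce task (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The class name, the helper methods and the sample keys are made up for illustration; this is not the code from my job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: simulates how keys are spread across reduce tasks.
public class ReducerPartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner (assumed to be in use).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys under a part-file-style name, one entry per reduce task.
    static Map<String, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<String, List<String>> parts = new TreeMap<>();
        for (int i = 0; i < numReduceTasks; i++) {
            parts.put(String.format(Locale.ROOT, "part-%05d", i), new ArrayList<>());
        }
        for (String key : keys) {
            int p = partitionFor(key, numReduceTasks);
            parts.get(String.format(Locale.ROOT, "part-%05d", p)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Sample keys are invented for the demonstration.
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "date", "elderberry", "fig");
        for (int n : new int[] {1, 2}) {
            System.out.println("reducers = " + n);
            assign(keys, n).forEach((file, ks) -> System.out.println("  " + file + " -> " + ks));
        }
    }
}
```

With 1 reducer every key lands in the same partition, so a single file holds everything. With 2 reducers each task only ever sees its own share of the keys, so if both tasks end up writing to the same place, whichever finishes last wins and you are left with a fragment. In a real job the reducer count is set on the job configuration, for example with JobConf.setNumReduceTasks(1) in the org.apache.hadoop.mapred API.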

Monday, June 30, 2008

Simple AND Elegant

Simple AND Elegant. Two adjectives you use all the time to describe algorithms like MapReduce, but you know you have stumbled upon something special when you can also say the same about a framework that implements that algorithm. I've been following Hadoop, the open source implementation of Google's MapReduce algorithm, for some time now, but I didn't have the urge to really look into how it works and what it can do until I read a blog called Hadoop by a New York Times developer, Derek Gottfrid. The power of the framework combined with Amazon Web Services is easy to see in what Derek Gottfrid did for the NY Times Archive.

So welcome to Distributed Maze, my foray into the distributed computing world and a blog to help others who want to do the same. Sure, the name isn't creative, but at least it's original as of today's date! Hope it helps someone do some amazing things one day.