Tuesday, August 5, 2008

Number of Reducers in Hadoop Distributed Mode

So I may be a complete idiot here and may be missing something obvious... But for the life of me I couldn't figure out why my MapReduce program was returning only partial files in pseudo-distributed mode. After adding some log4j messages to my code and fishing through the log files for each individual task, I noticed that with 2 reducers running in distributed mode, I got only a small portion of the file I was expecting.

Then I tried decreasing the number of reducers to 1 and, magically, the framework started collecting everything into one file instead of rewriting the file that was already there and leaving me with only a small portion of it. Now on to more complicated things...
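
One plausible explanation, offered as a sketch rather than something verified against this job: each reducer receives only the keys its partition number is assigned, and writes them to its own output file (part-00000, part-00001, ...). The code below is plain Java with no Hadoop dependency. It applies the same formula as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, to String keys. The class name, method names, and sample keys are made up for illustration, and real Hadoop key types such as Text compute their own hash codes, so the actual split in a job will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
public class ReducerPartitionSketch {

    // Same formula as Hadoop's default HashPartitioner, applied to a String key.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys under the output file name of the reducer that would receive them.
    static Map<String, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<String, List<String>> files = new TreeMap<String, List<String>>();
        for (int i = 0; i < numReduceTasks; i++) {
            files.put(String.format("part-%05d", i), new ArrayList<String>());
        }
        for (String key : keys) {
            int partition = partitionFor(key, numReduceTasks);
            files.get(String.format("part-%05d", partition)).add(key);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d");

        // Two reducers: each output file holds only part of the keys.
        System.out.println(assign(keys, 2));
        // prints {part-00000=[b, d], part-00001=[a, c]}

        // One reducer: everything lands in a single file.
        System.out.println(assign(keys, 1));
        // prints {part-00000=[a, b, c, d]}
    }
}
```

If this is what was going on, any single output file from the 2-reducer run would look like "a small portion" of the expected result, and the full result would be the union of all the part files. In a real job the reducer count is set with setNumReduceTasks(int) on the job configuration.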
