Tuesday, October 23, 2012

Book review - Hadoop - The Definitive Guide (3rd edition) by Tom White


I still spend most of my professional time developing enterprise Java applications that depend heavily on mainstream RDBMSes (Oracle, MySQL, etc.) and are designed around the relational paradigm and data modelling. My exposure to systems and technologies like Hadoop or other NoSQL databases comes from self-study and free-time development at home. So I decided to give myself a more thorough 'second' chance by diving deep into Hadoop and MapReduce with Tom White's book. Compared to the amount of information the book provides, my previous experience was really minimal.

The author and his style.

Tom White is an active committer on the Apache Hadoop project and has deep knowledge of the codebase and the system's internals. So when it comes to the author's knowledge and writing skills, this book scores high in my ranking system. This is not a golden rule, though: there are cases where active developers on a project have written books of a similar style and the results were not as 'concrete'. In this particular case we have an experienced developer who really tries hard to explain (to an inexperienced audience as well) the logic behind Hadoop and all the related parts of the puzzle (other technologies) that you may happen to use along with it. The Hadoop puzzle is big for sure...

Chapter by chapter.

The first three chapters are a well-balanced introduction to Hadoop and MapReduce. I would say Chapter 1 is quite a good read for interview preparation, while Chapters 2 and 3 set the stage for programming with MapReduce logic and for interacting with a minimal Hadoop installation. The code samples are in Java, so Java developers will feel at home. When it came to installing Hadoop, I followed the author's instructions in the initial chapters and the Appendix. I managed to get my own small test cluster up and running by following the instructions with only one exception: I used a slightly older version of Hadoop than the latest one currently available. So I had a running Hadoop instance for version 0.20, while I could not get version 0.23 to work 100% (just a note for fellow readers). My host OS is Mac OS X 10.8, with Java 6 as my default JDK.

Chapter 4 covers HDFS features in detail, along with how data integrity, compression and serialization issues are addressed. I made some notes about Avro and how it can be used outside of Hadoop (if needed). That was one of the many bonus points of this book still to come: while going through the core topics and chapters, you are also introduced to satellite technologies and tools.

Chapters 5, 6, 7 and 8 are very specific about how to program advanced MapReduce jobs and how to run and tune them. In Chapter 6 the author does a good job of explaining once again, in more detail, the logic behind the map and reduce functions of a job and how the underlying system actually executes them.
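For readers who have not met the map and reduce functions before, here is a minimal sketch of the idea. It is my own illustration in plain Java and not code from the book: it does not use the Hadoop API at all, and the class name WordCountSketch and its methods are made up for this example. The map step emits (word, 1) pairs, a grouping step stands in for what a real cluster does when it shuffles data between nodes, and the reduce step sums the values for each key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the map/shuffle/reduce flow.
// It runs in a single JVM and does NOT use any Hadoop classes.
class WordCountSketch {

    // Map step: turn one input line into a list of (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: collapse all values collected for one key into a sum.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (the "shuffle"),
    // then reduce each group. A TreeMap keeps the keys sorted.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop map reduce", "map reduce map");
        System.out.println(run(lines)); // prints {hadoop=1, map=3, reduce=2}
    }
}
```

In a real Hadoop job the map and reduce functions live in separate classes and the framework takes care of splitting the input, grouping by key and distributing the work, which is the part the book explains in detail.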

Chapters 9 and 10 were quite complex and full of details; they cover areas like setting up a more production-like Hadoop cluster and tuning it. In some cases the information is not easy to follow, especially if you have just started with Hadoop. I guess these chapters work best as a reference for more experienced users and administrators, who are likely to focus on these specific areas more than a traditional software developer would.

Chapters 11 to 15 cover in great detail the technologies and tools that have emerged from the Hadoop ecosystem: Pig, Hive, HBase, ZooKeeper and Sqoop. The information provided is more than advanced, and readers with no prior experience will have lots of things to read and cross-check. These tools are widely used among Hadoop users and installations, so the author spends a large part of the book introducing them and then going through some of their main features in great detail. I think there are additional books available for some of them that might go into even greater detail, but the information already provided was more than enough, at least for me.

Chapter 16 was one of the most interesting parts to read. After several chapters and tons of information about Hadoop and its related technologies, the author gives a nice summary of some famous uses of Hadoop in the industry. The exciting thing is that the chapter goes beyond 'listing' companies that use Hadoop with a few lines of description. In many cases you are given the actual logic and an overview of the jobs, code and configuration developed for each case, so it is an excellent way to understand how all the information in the previous chapters is implemented and used in a real-world scenario.

Overall 

As the name suggests, this book can be considered a 'Definitive Guide'. It offers a more than thorough start in the Hadoop universe for inexperienced developers, while acting as a reference for more experienced developers and administrators already working with Hadoop or one of the 'satellite' technologies. It is definitely not a one-night or one-week bedtime read. Some of the topics are advanced, and in some cases the setup or the specific configuration of components may not work as expected, but the book is not to blame for that, if you ask me. There are two main topics that every reader should expect to have to understand. The first is the logic behind MapReduce and how to program with it; the second is mastering the technicalities of Hadoop as a system. In general, neither is an easy thing to do. So if you are about to start working with Hadoop in your project, or are considering some of the related technologies, this book is a great start.

You may find it on Amazon.
