Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
I first encountered Hadoop in the fall of 2008 when I was working on an internet
crawl and analysis project at Verisign. My team was making discoveries similar to those
that Doug Cutting and others at Nutch had made several years earlier regarding how
to efficiently store and manage terabytes of crawled and analyzed data. At the time, we
were getting by with our home-grown distributed system, but when a new data
stream arrived and we needed to join it with our crawl data, our existing system
couldn't meet the required timelines....
Over the past few years, there has been a fundamental shift in data storage, management,
and processing. Companies are storing more data from more sources in more
formats than ever before. This isn’t just about being a “data packrat” but rather building
products, features, and intelligence predicated on knowing more about the world
(where the world can be users, searches, machine logs, or whatever is relevant to an organization).
We propose a set of open-source software modules for structured Perceptron training, prediction, and evaluation within the Hadoop framework. Apache Hadoop is a freely available environment for running distributed applications on a computer cluster. The software is designed within the MapReduce paradigm. Thanks to distributed computing, the proposed software substantially reduces execution times when handling very large datasets. The distributed Perceptron training algorithm preserves the convergence properties of the serial Perceptron, and therefore guarantees the same accuracy. ...
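The abstract does not show the modules themselves, but a common way to distribute Perceptron training under MapReduce while preserving convergence is iterative parameter mixing: each mapper runs a Perceptron pass over its data shard, and a reducer averages the resulting weight vectors before the next epoch. The sketch below simulates that map/reduce cycle locally in pure Python; all function names and the toy margin-separated dataset are illustrative assumptions, not the paper's actual code.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_epoch(w, shard):
    # "Map" step: one Perceptron pass over a single data shard.
    w = list(w)
    for x, y in shard:
        if y * dot(w, x) <= 0:                      # misclassified point
            w = [wi + y * xi for wi, xi in zip(w, x)]  # standard update
    return w

def mix(weight_vectors):
    # "Reduce" step: average the per-shard weight vectors (parameter mixing).
    return [sum(col) / len(weight_vectors) for col in zip(*weight_vectors)]

def distributed_train(shards, dim, epochs=10):
    w = [0.0] * dim
    for _ in range(epochs):
        w = mix([perceptron_epoch(w, s) for s in shards])  # map, then reduce
    return w

# Toy linearly separable data: label is the sign of the first feature,
# with a margin of 0.2 so the Perceptron converges quickly.
random.seed(0)
data = []
for _ in range(40):
    x0 = random.choice([-1, 1]) * random.uniform(0.2, 1.0)
    x = [x0, random.uniform(-1, 1)]
    data.append((x, 1 if x0 > 0 else -1))

shards = [data[:20], data[20:]]                     # two simulated mappers
w = distributed_train(shards, dim=2)
acc = sum((1 if dot(w, x) > 0 else -1) == y for x, y in data) / len(data)
```

Averaging after every epoch (rather than only once at the end) is what lets the distributed variant match the serial Perceptron's convergence behaviour, at the cost of one reduce round per epoch.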