

Mahout is Not So Difficult!
    How it works
    Coding a basic recommender (Getting ready, How to do it, See also)

Using Sequence Files - When and Why?
    Introduction
    Creating sequence files from the command line (Getting ready, How to do it)
    Generating sequence files from code (Getting ready, How to do it)
    Reading sequence files from code (Getting ready, How to do it, How it works)

Canopy clustering recipes
    Command-line-based Canopy clustering with parameters (Getting ready, How to do it, How it works)
    Coding your own cluster distance evaluation (Getting ready, How to do it, How it works, See also)

Genetic algorithm recipes
    Using the genetic algorithm from Java code (Getting ready, How to do it, How it works)

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals.

However, Packt Publishing cannot guarantee the accuracy of this information.

Cover Work: Nilesh R. Mohite

About the Author

Piero Giacomelli started playing with computers when he received his first PC, a Commodore. Despite his love for computers, he graduated in Mathematics, then entered the professional software industry and started using Java. He has been involved in many software projects using Java.

He is not only a great fan of JBoss and Apache technologies, but also uses Microsoft technologies without moral issues. He has worked in many different industrial sectors, such as aerospace, ISP, textile and plastic manufacturing, and e-health association, both as a software developer and as an IT manager. He is married with two kids, and in his spare time he regresses to his childhood to play with toys and his kids.

Acknowledgments

I would like to thank my family for supporting me during the exciting yet stressful months in which I wrote this book. Thanks to my wife Michela, who forces me to become a better person every day, and my mother, Milena, who did the same before marriage.

Also thanks to Lia and Roberto who greatly helped us every time we needed their help. A special acknowledgment to the entire Packt Publishing editorial team. Rozario, Amey Varangaonkar, Angel Jathanna, and Abhijit Suvarna, as they have been very patient with me even when they had no reason for being so kind. While I was writing this book, I also moved to a new job, so this is the right place to thank Giuliano Bedeschi.

About the Reviewers

Nicolas Gapaillard is a passionate freelance Java architect, who is aware of the innovative projects in Java and the open source world.

This business unit aimed to develop open source software that revolves around security, such as certificate management, encrypted document storage, and authentication mechanisms. This gives him the freedom to manage his time in order to work on and study innovative projects.

At that time, only Mahout provided some out-of-the-box algorithms that addressed these problems. I want to especially thank the author of this book, who has worked very hard to write the book with a concern for quality.

I would also like to thank Packt Publishing, who trusted me to contribute to the review of the book, managed the process very carefully, and permitted me to synchronize the review with the writing of the book. I would like to thank the other reviewers, who have helped with the writing of the book and the quality of the content, and also thank my wife, who let me have some free time to work on the review.

Vignesh Prajapati is working as a Big Data scientist at Pingax. He has expertise in algorithm development for data ETL and generating recommendations, predictions, and behavioral targeting over e-commerce, historical Google Analytics, and other datasets.

He has also written several articles on R, Hadoop, and machine learning for producing intelligent Big Data applications. Apart from this book, he has worked with Packt Publishing on two other books.

I would like to thank Packt Publishing for this wonderful opportunity, and my family, friends, and the Packt Publishing team who have motivated and supported me to contribute to open source technologies.

His research interests involve spectral graph theory, machine vision, and pattern recognition for biomedical image recognition, and building real-time distributed frameworks for biosurveillance with his advisor Dr. Chakra Chennubhotla. He is also a contributor to Apache Mahout and other open source projects.


Being a father these days is becoming a very tough task; Enrico and Davide, I hope you will appreciate my efforts on this task.

To give you an idea of what is going on, we refer to a study by Qmee that shows what usually happens on the Internet in 60 seconds.

These are the biggest websites, but even for national or international websites it is common to have millions of records, collected for logging purposes.

To manage such large volumes of information, new frameworks have been written that distribute computational tasks across different machines. Hadoop is the Apache solution for coding algorithms whose computational tasks can be divided between various hardware infrastructures.

When one deals with billions of data records to be analyzed, in most cases the purpose is to extract information and find new relations between data. Traditionally, data mining algorithms were developed for this purpose. However, run in the traditional sequential way, these data mining tasks cannot be computed in a reasonable time on very large datasets.

Mahout is the data mining framework created to be used together with Hadoop, applying data mining algorithms to very large datasets through the MapReduce paradigm encapsulated by Hadoop. Mahout thus offers the coder a ready-to-use framework for data mining tasks, with the Hadoop infrastructure as a low-level interface.

This book will present you with some real-world examples on how to use Mahout for mining data and will present you with the various approaches to data mining. The key idea is to present you with a clean, non-theoretical approach to the ways one can use Mahout for classifying data, for clustering them, and for creating forecasts.

The book is code-oriented, so we will not go deep into the theoretical background at every step, but we will point the willing reader to reference materials that cover the specific topics in depth. Some of the challenges we faced while presenting this book are:

From my experience, Mahout has a very steep learning curve. This is mainly because an algorithm written for the MapReduce methodology is completely different from its sequential counterpart.

The data mining algorithms themselves are not easy to understand and require skills that, in most cases, a developer does not necessarily have.

So we tried to propose a code-oriented approach that allows the reader to grasp the meaning and the purpose of every piece of code suggested without needing a very deep understanding of what is going on behind the scenes.
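To make the difference between the sequential and the MapReduce way of thinking concrete, here is a minimal sketch in plain Java that counts words both ways. It does not use Hadoop or Mahout at all: the class name `MapReduceSketch` and the methods `map`, `shuffle`, and `reduce` are names assumed for this illustration only, and everything runs in a single process. On a real cluster the map and reduce calls would run on different machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics the map/shuffle/reduce phases in one JVM, without Hadoop.
public class MapReduceSketch {

    // Sequential approach: a single loop walks every record and updates one shared map.
    public static Map<String, Integer> countSequential(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Map phase: each line is handled independently and emits (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: each key is reduced on its own, here by summing its values.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Chains the three phases; a framework would distribute them instead.
    public static Map<String, Integer> countMapReduce(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("mahout runs on hadoop", "hadoop runs map and reduce");
        System.out.println(countSequential(lines));
        System.out.println(countMapReduce(lines));
    }
}
```

Both methods return the same counts. In the second version, no map call depends on another line and no reduce call depends on another key, which is what lets a framework spread the work over many machines.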

The result of this approach should be judged by you, and we hope that you find as much pleasure in reading it as we had in writing it.

A recommendation algorithm will be coded so that all the pieces involved in a data mining operation, such as the presence of Hadoop, the JARs to be included, and so on, will be clear to the reader without any previous knowledge of the environment. Sequence files are a key concept when using Hadoop and Mahout. In most cases, Mahout cannot directly process the datasets in the form in which they are supplied.

So before getting into the code of the algorithms, we need to describe how to handle these particular files. How to convert document terms into vectors of numbers that count their occurrences will also be fully described. Both of them will show you how to analyze some common datasets to obtain forecasts of future values.
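The idea of turning document terms into vectors of occurrence counts can be sketched in a few lines of plain Java. This is only an illustration of the concept under simple assumptions (lowercase terms, splitting on anything that is not a letter or digit, a dictionary sorted alphabetically); it is not Mahout's own vectorization code, and the class name `TermCountSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: builds term-count vectors without any Mahout or Hadoop classes.
public class TermCountSketch {

    // Split a document into lowercase terms, dropping punctuation and empty tokens.
    public static List<String> tokenize(String document) {
        List<String> terms = new ArrayList<>();
        for (String token : document.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Build a sorted dictionary so that every term gets a fixed position in the vector.
    public static List<String> buildDictionary(List<String> documents) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String document : documents) {
            vocabulary.addAll(tokenize(document));
        }
        return new ArrayList<>(vocabulary);
    }

    // Position i of the result holds how often dictionary term i occurs in the document.
    public static int[] toCountVector(String document, List<String> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String term : tokenize(document)) {
            int index = dictionary.indexOf(term);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "data mining with mahout",
                "mining data, more data, with hadoop");
        List<String> dictionary = buildDictionary(documents);
        System.out.println(dictionary);
        for (String document : documents) {
            System.out.println(Arrays.toString(toCountVector(document, dictionary)));
        }
    }
}
```

Every document ends up as a vector of the same length, one position per dictionary term, which is the numeric form that clustering and classification algorithms work on. The linear `indexOf` lookup is fine for a toy dictionary; a map from term to index would be the natural choice for a larger one.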

Chapter 6, Canopy Clustering in Mahout, begins the description of the most used algorithms in the Mahout framework: those for cluster analysis and classification of Big Data.

In this chapter the methodology for using canopy cluster analysis to aggregate data vectors around common centroids will be described with real-world examples. Chapter 7, Spectral Clustering in Mahout, continues with the analysis of the clustering algorithms available in Mahout.

This chapter describes the ways to use spectral clustering, which is very efficient in classifying information linked together in the form of graphs.

Chapter 8, K-means Clustering, describes the use of K-means clustering, both sequential and MapReduce, to classify text documents into topics. We will explain the use of this algorithm from the command line as well as from Java code. It allows you to forecast the items that should be sold together, based on the previous purchases made by customers.

The Latent Dirichlet algorithm will also be presented for text classification. Chapter 10, Implementing the Genetic Algorithm in Mahout, describes the use of the Genetic algorithm in Mahout to solve the Travelling Salesman problem and to extract rules. We will see how to use different versions of Mahout to use these algorithms.

What you need for this book

We will cover all the software needed for this book in the first chapter. All the examples in the book have been coded using Ubuntu.

Who this book is for

Apache Mahout Cookbook is ideal for developers who want to have a fresh and fast introduction to Mahout. No previous knowledge of Mahout is required, and even skilled developers or system administrators will benefit from the various recipes presented in this book.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In the same way you could also use Eclipse to access the svn repository and compile everything using the Maven Eclipse plugin."

Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Click on the Add button and after a few seconds, you should be able to see all the Mahout jars added."

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.


Apache Mahout Cookbook

Apache Hadoop was created to handle heavy computational tasks on very large datasets. Mahout gained recognition for providing data mining classification algorithms that can be used with such datasets. The book gives an insight into how to write different data mining algorithms for the Hadoop environment and how to choose the one best suited to the task at hand.


