Programming Elastic MapReduce: Using AWS Services to Build an End-to-End Application

3.2 avg rating (5 ratings by Goodreads)

9781449363628: Programming Elastic MapReduce: Using AWS Services to Build an End-to-End Application

Although you don’t need a large computing infrastructure to process massive amounts of data with Apache Hadoop, it can still be difficult to get started. This practical guide shows you how to quickly launch data analysis projects in the cloud by using Amazon Elastic MapReduce (EMR), the hosted Hadoop framework in Amazon Web Services (AWS).

Authors Kevin Schmidt and Christopher Phillips demonstrate best practices for using EMR and various AWS and Apache technologies by walking you through the construction of a sample MapReduce log analysis application. Using code samples and example configurations, you’ll learn how to assemble the building blocks necessary to solve your biggest data analysis problems.

  • Get an overview of the AWS and Apache software tools used in large-scale data analysis
  • Go through the process of executing a Job Flow with a simple log analyzer
  • Discover useful MapReduce patterns for filtering and analyzing data sets
  • Use Apache Hive and Pig instead of Java to build a MapReduce Job Flow
  • Learn the basics for using Amazon EMR to run machine learning algorithms
  • Develop a project cost model for using Amazon EMR and other AWS tools

From the Author:

Q&A with Chris Phillips, co-author of "Programming Elastic MapReduce"

Q. What makes “Programming Elastic MapReduce” important right now?
A. Big Data and Hadoop are hot technologies right now, and many companies are exploring how they can use them to benefit their business and their customers. However, the upfront investment in a large Hadoop cluster, together with allocating space for racks of servers in a traditional data center, can be a significant barrier to entry for organizations that want to explore the technology and learn how it can benefit their business. Amazon Elastic MapReduce removes this barrier: organizations can explore the technology without the upfront costs and pay only for the resources they use.
Netflix and Airbnb are among the well-known organizations that use Amazon Elastic MapReduce heavily.
 
Q. What do you hope that readers of your book will walk away with?
A. Our hope in writing Programming Elastic MapReduce is to show readers how easy it is to build an application in Amazon EMR, and that they can start building today without assembling clusters of servers or finding the space and resources to manage a Hadoop cluster. Readers will learn about the many language and technology options available for building an Amazon EMR application, and can go from a development laptop to a running cloud-based cluster in minutes.
 
Q. What's the most exciting and important thing happening in this space currently?
A. Data science is a rapidly growing field in which business intelligence, statistics, and computer science are coming together to help businesses solve new problems. According to Gartner, the market will require 100,000+ data scientists by 2020. Companies like Kaggle.com now run data science competitions to source some of the best and brightest data scientists to help companies solve their data analysis problems. We are just starting to see businesses leverage this technology in examples like Netflix's recommendation engine. The power of this technology is only now starting to be realized, with tremendous growth expected in the future. Our book gives developers and programmers interested in this field a way to learn the technology and a platform for starting projects with low upfront costs.
 
Q. Can you provide a few tips on how to get started with Elastic MapReduce?
A. 1. Move your data to AWS: Before you can start processing data with Amazon Elastic MapReduce, you will need to move your data to Amazon S3. s3cmd and the AWS Command Line Interface are two easy-to-use command-line utilities that can be run inside AWS or on individual servers in your data center to transfer data to S3, where Amazon EMR can process it. For very large data sets, organizations should explore the AWS Import/Export service, which lets them send their data to Amazon on physical storage.
 
2. Pick the right problem to solve: When people first learn about Hadoop or Elastic MapReduce, they tend to think of it as similar to database technology. Elastic MapReduce is more like a batch processing system. It can ingest a large amount of data and process it faster and more efficiently than a traditional database. However, the way EMR processes this data is similar to a table scan, where all of the data is read and analyzed. EMR cannot perform as efficiently as a traditional database when retrieving a small number of rows from a large dataset. Additional technologies like Amazon Redshift and HBase can be used with Amazon EMR to get the benefits of both a traditional database and Hadoop.
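The difference between a table scan and a database-style lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not EMR code: the records, the `request_id` key, and all function names are invented for the example.

```python
# Illustrative only: the records and function names are invented for this sketch.

# Hypothetical log records as (request_id, status_code); every tenth request fails.
records = [(i, 500 if i % 10 == 0 else 200) for i in range(1000)]


def count_errors_by_scan(rows):
    """Batch-style question: every row has to be read once to answer it."""
    rows_read = 0
    errors = 0
    for _request_id, status in rows:
        rows_read += 1
        if status >= 500:
            errors += 1
    return errors, rows_read


def find_by_scan(rows, wanted_id):
    """Point lookup without an index: read rows until the match turns up."""
    rows_read = 0
    for request_id, status in rows:
        rows_read += 1
        if request_id == wanted_id:
            return status, rows_read
    return None, rows_read


def find_by_index(index, wanted_id):
    """Point lookup with a key index, as a database would do: one probe."""
    return index.get(wanted_id), 1


index = dict(records)  # request_id -> status_code

print(count_errors_by_scan(records))  # (errors found, rows read)
print(find_by_scan(records, 999))     # (status, rows read)
print(find_by_index(index, 999))      # (status, probes)
```

Counting all errors needs the full scan either way, which is the kind of question a batch system suits. Fetching one record by key is where the scan does far more work than the indexed lookup.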
 
3. Save money using spot instances: Amazon EMR's latest console, released in November 2013, allows a user to resize a cluster quickly. A cost-effective way of processing data in EMR is to start a cluster with, or add to a running cluster, a number of task nodes that use spot instances. Spot instances let you name the price you are willing to pay for additional capacity, and prices are typically far below Amazon's on-demand prices.
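As a rough sketch of why spot task nodes help, the hourly cost of a cluster can be compared with and without them. The prices below are made-up placeholders in cents per node-hour, not real AWS prices, and the calculation ignores every other charge on the bill.

```python
# Hypothetical prices in cents per node-hour; real prices vary by instance type,
# region, and time, so look them up before relying on numbers like these.
ON_DEMAND_CENTS = 25
SPOT_CENTS = 5


def hourly_cost_cents(on_demand_nodes, spot_nodes,
                      on_demand_price=ON_DEMAND_CENTS, spot_price=SPOT_CENTS):
    """Hourly instance cost of a cluster mixing on-demand and spot nodes."""
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Ten nodes, all on-demand.
all_on_demand = hourly_cost_cents(on_demand_nodes=10, spot_nodes=0)

# Same ten nodes, but eight of them are task nodes bought as spot instances.
with_spot_tasks = hourly_cost_cents(on_demand_nodes=2, spot_nodes=8)

print(all_on_demand, with_spot_tasks, all_on_demand - with_spot_tasks)
```

The sketch keeps two nodes on-demand on the assumption that some capacity should not depend on the spot market, since spot capacity can be taken away when the market price rises above your bid.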
 
4. Set up persistent and transient Amazon EMR clusters: An Amazon EMR cluster can be set up to terminate once it completes all the steps in the Job Flow. This type of cluster is considered a transient cluster, since it lives only as long as the Job Flow it needs to complete. Alternatively, a cluster can be set up to keep running and wait for additional steps; this is a persistent cluster. There are pros and cons to both cluster types, and the right choice will depend on your application. However, a few rules of thumb may help you select the one that's right for you.
 
Transient clusters can be used to save money on Amazon EMR costs. If your data flow is sporadic, it may make sense to queue up a batch of data in S3 and only start an EMR cluster once a week, day, or hour, depending on your need. This way you avoid paying for the time your cluster would otherwise sit idle waiting for work to arrive. You can use Amazon CloudWatch to monitor your cluster to see whether your data and workloads would benefit from using transient EMR clusters. AWS Data Pipeline can help you build workflows that trigger EMR cluster creation when the right conditions exist to process data.
 
A persistent EMR cluster can be the right choice for your organization if the results of your data analysis are time critical or the data flow is consistent enough to necessitate constant data analysis processing. Your application and data processing will have lower processing overhead without the need to regularly build up and tear down EMR clusters.
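The trade-off between the two cluster types can be put into a back-of-the-envelope model. The sketch below is an assumption-laden illustration, not a pricing calculator: the price is a placeholder in cents per node-hour, the startup overhead is a guess, and billing granularity is ignored.

```python
def persistent_daily_cost(nodes, price_per_node_hour):
    """A persistent cluster is paid for around the clock, busy or idle."""
    return nodes * 24 * price_per_node_hour


def transient_daily_cost(nodes, price_per_node_hour,
                         runs_per_day, hours_per_run, startup_hours):
    """A transient cluster is paid for only while it runs, but every run
    also pays the (assumed) time spent building the cluster up."""
    return nodes * runs_per_day * (hours_per_run + startup_hours) * price_per_node_hour


# Hypothetical numbers: 4 nodes at 25 cents per node-hour, 15 minutes of startup.
sporadic = transient_daily_cost(4, 25, runs_per_day=2, hours_per_run=1, startup_hours=0.25)
constant = transient_daily_cost(4, 25, runs_per_day=24, hours_per_run=1, startup_hours=0.25)
always_on = persistent_daily_cost(4, 25)

print(sporadic, constant, always_on)
```

With two short runs a day the transient cluster costs a fraction of the always-on one. With work arriving every hour the repeated startup overhead makes it the more expensive option, which matches the rule of thumb above.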
 
5. Experiment with EMR cluster node types: Throughout the book, we typically use the smallest instance types and the fewest instances possible in an Amazon EMR cluster. This helps reduce the costs associated with learning Amazon EMR. However, your application will need much more than this when running in a production setting with real-world demands. Some applications will be more memory intensive, CPU intensive, or even disk read and write intensive. To find out what is right for your application, experiment with different instance types and numbers of instances on a small subset of your data to learn what size of EMR cluster meets your data processing time and AWS cost requirements.
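One way to organize such an experiment is to record the runtime of each trial on the sample and then pick the cheapest configuration that still meets your deadline on the full data set. The sketch below assumes, simplistically, that runtime grows linearly with data size. The trial labels, prices, and runtimes are invented placeholders, not measurements.

```python
from typing import List, NamedTuple, Optional


class Trial(NamedTuple):
    label: str                      # hypothetical configuration name
    nodes: int
    price_cents_per_node_hour: int  # placeholder price, not a real AWS price
    sample_runtime_hours: float     # measured on the small data subset


def full_runtime_hours(trial: Trial, scale_factor: float) -> float:
    """Assumes runtime scales linearly with data size (a simplification)."""
    return trial.sample_runtime_hours * scale_factor


def full_cost_cents(trial: Trial, scale_factor: float) -> float:
    return trial.nodes * trial.price_cents_per_node_hour * full_runtime_hours(trial, scale_factor)


def pick_cheapest(trials: List[Trial], scale_factor: float,
                  deadline_hours: float) -> Optional[Trial]:
    """Cheapest configuration whose projected runtime meets the deadline."""
    feasible = [t for t in trials if full_runtime_hours(t, scale_factor) <= deadline_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda t: full_cost_cents(t, scale_factor))


# Made-up trial results on a sample that is one tenth of the full data set.
trials = [
    Trial("small x 2", 2, 10, 0.5),
    Trial("small x 8", 8, 10, 0.25),
    Trial("large x 4", 4, 50, 0.125),
]

print(pick_cheapest(trials, scale_factor=10, deadline_hours=4.0))
print(pick_cheapest(trials, scale_factor=10, deadline_hours=2.0))
```

Tightening the deadline changes the answer, which is why both the processing time requirement and the cost requirement need to be stated before choosing a cluster size.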

About the Author:

Kevin J. Schmidt is a senior manager at Dell SecureWorks, Inc., an industry-leading MSSP, which is part of Dell. He is responsible for the design and development of a major part of the company’s SIEM platform. This includes data acquisition, correlation, and analysis of log data. Prior to SecureWorks, Kevin worked for Reflex Security, where he worked on an IPS engine and anti-virus software. Before that, he was a lead developer and architect at GuardedNet, Inc., which built one of the industry’s first SIEM platforms.

He is also a commissioned officer in the United States Navy Reserve (USNR). He has over 19 years of experience in software development and design, 11 of which have been in the network security space. He holds a Bachelor of Science in Computer Science.

Kevin has spent time designing cloud services components at Dell, including virtualized components to run in Dell’s own vCloud. These components are used to protect customers who use Dell’s cloud infrastructure. Additionally, he has been working with Hadoop, machine learning, and other technology in the cloud.

Kevin is co-author of Essential SNMP, second edition (O’Reilly and Associates, ISBN: 978-0-596-00840-6) and also Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management (Syngress, ISBN: 978-1-597-49635-3).

Christopher Phillips is a manager and senior software developer at Dell SecureWorks, Inc., an industry-leading MSSP, which is part of Dell. He is responsible for the design and development of the company’s Threat Intelligence service platform. He is also responsible for a team that integrates log and event information from many third-party providers, allowing customers to have all of their core security information delivered to and analyzed by the Dell SecureWorks systems and security professionals.

Prior to Dell SecureWorks, Chris worked for McKesson and Allscripts, where he worked with clients on HIPAA compliance, security, and healthcare systems integration. He has over 18 years of experience in software development and design. He holds a Bachelor of Science in Computer Science and an MBA.

Chris has spent time designing and developing virtualization and cloud Infrastructure as a Service strategies at Dell to help our security services scale globally. Additionally, he has been working with Hadoop, the Pig scripting language, and Amazon Elastic MapReduce to develop strategies to gain insights and analyze Big Data issues in the cloud.

Chris is co-author of Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management (Syngress, ISBN: 978-1-597-49635-3).

Top Search Results from the AbeBooks Marketplace


1.

Schmidt
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Quantity Available: 5
International Edition
Seller:
bookscollection
(Delhi, DELHI, India)

Book Description Book Condition: Brand New. International Edition. We do not ship to APO, FPO, or PO Box addresses. Not a loose-leaf version; no solution manual, CD, or access card. Cover image and ISBN may differ from the US edition, but the contents are the same as the US edition. Printed in English. Quick delivery by USPS/UPS/DHL/FedEx/Aramex; customer satisfaction guaranteed. We may ship the books from Asian regions for inventory purposes. Bookseller Inventory # ABEAVS*##6002

Buy New
US$ 14.02

Shipping: US$ 4.12
From India to U.S.A.

2.

Kevin Schmidt
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Quantity Available: 6
International Edition
Seller:
Unique Bookseller
(Delhi, India)

Book Description Book Condition: Brand New. Black & White or color International Edition. ISBN and front cover may be different, but contents are the same as the US edition. Book printed in English. Territorial restrictions may be printed on the book. Delivered within 3-5 business days by DHL/FedEx/Aramex; a tracking number will be uploaded to your order page within 24-48 hours. Kindly provide a daytime phone number to ensure smooth delivery. No shipping to PO Box, APO, or FPO addresses. 100% customer satisfaction guaranteed! Bookseller Inventory # UBS09306

Buy New
US$ 20.85

Shipping: FREE
From India to U.S.A.

3.

Kevin Schmidt, Christopher Phillips
Published by O'Reilly Media, Inc, USA, United States (2013)
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Paperback Quantity Available: 1
Seller:
The Book Depository US
(London, United Kingdom)

Book Description O'Reilly Media, Inc, USA, United States, 2013. Paperback. Book Condition: New. Language: English. Brand New Book. Although you don't need a large computing infrastructure to process massive amounts of data with Apache Hadoop, it can still be difficult to get started. This practical guide shows you how to quickly launch data analysis projects in the cloud by using Amazon Elastic MapReduce (EMR), the hosted Hadoop framework in Amazon Web Services (AWS). Authors Kevin Schmidt and Christopher Phillips demonstrate best practices for using EMR and various AWS and Apache technologies by walking you through the construction of a sample MapReduce log analysis application. Using code samples and example configurations, you'll learn how to assemble the building blocks necessary to solve your biggest data analysis problems. Get an overview of the AWS and Apache software tools used in large-scale data analysis; go through the process of executing a Job Flow with a simple log analyzer; discover useful MapReduce patterns for filtering and analyzing data sets; use Apache Hive and Pig instead of Java to build a MapReduce Job Flow; learn the basics for using Amazon EMR to run machine learning algorithms; develop a project cost model for using Amazon EMR and other AWS tools. Bookseller Inventory # AAH9781449363628

Buy New
US$ 23.50

Shipping: FREE
From United Kingdom to U.S.A.

4.

Kevin Schmidt, Christopher Phillips
Published by O'Reilly Media, Inc, USA, United States (2013)
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Paperback Quantity Available: 1
Seller:
The Book Depository
(London, United Kingdom)

Book Description O'Reilly Media, Inc, USA, United States, 2013. Paperback. Book Condition: New. Language: English. Brand New Book. Although you don't need a large computing infrastructure to process massive amounts of data with Apache Hadoop, it can still be difficult to get started. This practical guide shows you how to quickly launch data analysis projects in the cloud by using Amazon Elastic MapReduce (EMR), the hosted Hadoop framework in Amazon Web Services (AWS). Authors Kevin Schmidt and Christopher Phillips demonstrate best practices for using EMR and various AWS and Apache technologies by walking you through the construction of a sample MapReduce log analysis application. Using code samples and example configurations, you'll learn how to assemble the building blocks necessary to solve your biggest data analysis problems. Get an overview of the AWS and Apache software tools used in large-scale data analysis; go through the process of executing a Job Flow with a simple log analyzer; discover useful MapReduce patterns for filtering and analyzing data sets; use Apache Hive and Pig instead of Java to build a MapReduce Job Flow; learn the basics for using Amazon EMR to run machine learning algorithms; develop a project cost model for using Amazon EMR and other AWS tools. Bookseller Inventory # AAH9781449363628

Buy New
US$ 24.26

Shipping: FREE
From United Kingdom to U.S.A.

5.

Schmidt, Kevin; Phillips, Christopher
Published by O'Reilly Media
ISBN 10: 1449363628 ISBN 13: 9781449363628
New PAPERBACK Quantity Available: > 20
Seller:
Mediaoutlet12345
(Springfield, VA, U.S.A.)

Book Description O'Reilly Media. PAPERBACK. Book Condition: New. 1449363628 *BRAND NEW* Ships Same Day or Next! Bookseller Inventory # SWATI2132557186

Buy New
US$ 21.95

Shipping: US$ 3.99
Within U.S.A.

6.

Kevin Schmidt; Christopher Phillips
Published by O'Reilly Media (2013)
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Paperback First Edition Quantity Available: 1
Seller:
Irish Booksellers
(Rumford, ME, U.S.A.)

Book Description O'Reilly Media, 2013. Paperback. Book Condition: New. book. Bookseller Inventory # M1449363628

Buy New
US$ 27.97

Shipping: FREE
Within U.S.A.

7.

Schmidt, Kevin
Published by O'Reilly Media 12/29/2013 (2013)
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Paperback or Softback Quantity Available: 10
Seller:
BargainBookStores
(Grand Rapids, MI, U.S.A.)

Book Description O'Reilly Media 12/29/2013, 2013. Paperback or Softback. Book Condition: New. Programming Elastic MapReduce: Using AWS Services to Build an End-to-End Application. Book. Bookseller Inventory # BBS-9781449363628

Buy New
US$ 29.40

Shipping: FREE
Within U.S.A.

8.

Kevin Schmidt; Christopher Phillips
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Quantity Available: 1
Seller:
Speedy Hen LLC
(Sunrise, FL, U.S.A.)

Book Description Book Condition: New. Bookseller Inventory # ST1449363628

Buy New
US$ 29.43

Shipping: FREE
Within U.S.A.

9.

Schmidt, Kevin
Published by O'Reilly (2014)
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Quantity Available: 2
Seller:
Books2Anywhere
(Fairford, GLOS, United Kingdom)

Book Description O'Reilly, 2014. PAP. Book Condition: New. New Book. Shipped from UK in 4 to 14 days. Established seller since 2000. Bookseller Inventory # WO-9781449363628

Buy New
US$ 18.10

Shipping: US$ 11.88
From United Kingdom to U.S.A.

10.

Kevin Schmidt
Published by O'Reilly Media
ISBN 10: 1449363628 ISBN 13: 9781449363628
New Paperback Quantity Available: 1
Seller:
THE SAINT BOOKSTORE
(Southport, United Kingdom)

Book Description O'Reilly Media. Paperback. Book Condition: New. New copy - Usually dispatched within 2 working days. Bookseller Inventory # B9781449363628

Buy New
US$ 23.19

Shipping: US$ 9.16
From United Kingdom to U.S.A.