Additional Physical Format: Print version: Data Analytics with Hadoop. [Place of publication not identified] : O'Reilly Media, Incorporated, 2015
Material Type: Document, Internet resource
Document Type: Internet Resource, Computer File
All Authors / Contributors: Benjamin Bengfort; Jenny Kim
ISBN: 9781491913765; 1491913762; 9781491913758; 1491913754
Description: 1 online resource : illustrations
Contents: Copyright; Table of Contents; Preface; What to Expect from This Book; Who This Book Is For; How to Read This Book; Overview of Chapters; Programming and Code Examples; GitHub Repository; Executing Distributed Jobs; Permissions and Citation; Feedback and How to Contact Us; Safari® Books Online; How to Contact Us; Acknowledgments; Part I. Introduction to Distributed Computing; Chapter 1. The Age of the Data Product; What Is a Data Product?; Building Data Products at Scale with Hadoop; Leveraging Large Datasets; Hadoop for Data Products; The Data Science Pipeline and the Hadoop Ecosystem; Big Data Workflows; Conclusion; Chapter 2. An Operating System for Big Data; Basic Concepts; Hadoop Architecture; A Hadoop Cluster; HDFS; YARN; Working with a Distributed File System; Basic File System Operations; File Permissions in HDFS; Other HDFS Interfaces; Working with Distributed Computation; MapReduce: A Functional Programming Model; MapReduce: Implemented on a Cluster; Beyond a Map and Reduce: Job Chaining; Submitting a MapReduce Job to YARN; Conclusion; Chapter 3. A Framework for Python and Hadoop Streaming; Hadoop Streaming; Computing on CSV Data with Streaming; Executing Streaming Jobs; A Framework for MapReduce with Python; Counting Bigrams; Other Frameworks; Advanced MapReduce; Combiners; Partitioners; Job Chaining; Conclusion; Chapter 4. In-Memory Computing with Spark; Spark Basics; The Spark Stack; Resilient Distributed Datasets; Programming with RDDs; Interactive Spark Using PySpark; Writing Spark Applications; Visualizing Airline Delays with Spark; Conclusion; Chapter 5. Distributed Analysis and Patterns; Computing with Keys; Compound Keys; Keyspace Patterns; Pairs versus Stripes; Design Patterns; Summarization; Indexing; Filtering; Toward Last-Mile Analytics; Fitting a Model; Validating Models; Conclusion; Part II. Workflows and Tools for Big Data Science; Chapter 6. Data Mining and Warehousing; Structured Data Queries with Hive; The Hive Command-Line Interface (CLI); Hive Query Language (HQL); Data Analysis with Hive; HBase; NoSQL and Column-Oriented Databases; Real-Time Analytics with HBase; Conclusion; Chapter 7. Data Ingestion; Importing Relational Data with Sqoop; Importing from MySQL to HDFS; Importing from MySQL to Hive; Importing from MySQL to HBase; Ingesting Streaming Data with Flume; Flume Data Flows; Ingesting Product Impression Data with Flume; Conclusion; Chapter 8. Analytics with Higher-Level APIs; Pig; Pig Latin; Data Types; Relational Operators; User-Defined Functions; Wrapping Up; Spark's Higher-Level APIs; Spark SQL; DataFrames; Conclusion; Chapter 9. Machine Learning; Scalable Machine Learning with Spark; Collaborative Filtering; Classification; Clustering; Conclusion; Chapter 10. Summary: Doing Distributed Data Science; Data Product Lifecycle; Data Lakes; Data Ingestion; Computational Data Stores; Machine Learning Lifecycle; Conclusion
Responsibility: Benjamin Bengfort and Jenny Kim.