Conquering Big Data with High Performance Computing 1st edition by Ritu Arora – Ebook PDF Instant Download/Delivery: 3319337424, 9783319337425
Full download of Conquering Big Data with High Performance Computing 1st edition available after payment
Product details:
ISBN 10: 3319337424
ISBN 13: 9783319337425
Author: Ritu Arora
This book provides an overview of the resources and research projects that are bringing Big Data and High Performance Computing (HPC) on converging tracks. It demystifies Big Data and HPC for the reader by covering the primary resources, middleware, applications, and tools that enable the usage of HPC platforms for Big Data management and processing. Through interesting use-cases from traditional and non-traditional HPC domains, the book highlights the most critical challenges related to Big Data processing and management, and shows ways to mitigate them using HPC resources. Unlike most books on Big Data, it covers a variety of alternatives to Hadoop, and explains the differences between HPC platforms and Hadoop. Written by professionals and researchers in a range of departments and fields, this book is designed for anyone studying Big Data and its future directions. Those studying HPC will also find the content valuable.
Conquering Big Data with High Performance Computing 1st Edition Table of Contents:
1 An Introduction to Big Data, High Performance Computing, High-Throughput Computing, and Hadoop
1.1 Big Data
1.2 High Performance Computing (HPC)
1.2.1 HPC Platform
1.2.2 Serial and Parallel Processing on HPC Platform
1.3 High-Throughput Computing (HTC)
1.4 Hadoop
1.4.1 Hadoop-Related Technologies
1.4.2 Some Limitations of Hadoop and Hadoop-Related Technologies
1.5 Convergence of Big Data, HPC, HTC, and Hadoop
1.6 HPC and Big Data Processing in Cloud and at Open-Science Data Centers
1.7 Conclusion
References
2 Using High Performance Computing for Conquering Big Data
2.1 Introduction
2.2 The Big Data Life Cycle
2.3 Technologies and Hardware Platforms for Managing the Big Data Life Cycle
2.4 Managing Big Data Life Cycle on HPC Platforms at Open-Science Data Centers
2.4.1 TACC Resources and Usage Policies
2.4.2 End-to-End Big Data Life Cycle on TACC Resources
2.5 Use Case: Optimization of Nuclear Fusion Devices
2.5.1 Optimization
2.5.2 Computation on HPC
2.5.3 Visualization Using GPUs
2.5.4 Permanent Storage of Valuable Data
2.6 Conclusions
References
3 Data Movement in Data-Intensive High Performance Computing
3.1 Introduction
3.2 Node-Level Data Movement
3.2.1 Case Study: ADAMANT
3.2.2 Case Study: Energy Cost of Data Movement
3.3 System-Level Data Movement
3.3.1 Case Study: Graphs
3.3.2 Case Study: MapReduce
3.4 Center-Level Data Movement
3.4.1 Case Study: Spider
3.4.2 Case Study: Gordon and Oasis
3.5 About the Authors
References
4 Using Managed High Performance Computing Systems for High-Throughput Computing
4.1 Introduction
4.2 What Are We Trying to Do?
4.2.1 Deductive Computation
4.2.2 Inductive Computation
4.2.2.1 High-Throughput Computing
4.3 Hurdles to Using HPC Systems for HTC
4.3.1 Runtime Limits
4.3.2 Jobs-in-Queue Limits
4.3.3 Dynamic Job Submission Restrictions
4.3.4 Solutions from Resource Managers and Big Data Research
4.3.5 A Better Solution for Managed HPC Systems
4.4 Launcher
4.4.1 How Launcher Works
4.4.2 Guided Example: A Simple Launcher Bundle
4.4.2.1 Step 1: Create a Job File
4.4.2.2 Step 2: Build a SLURM Batch Script
4.4.3 Using Various Scheduling Methods
4.4.3.1 Dynamic Scheduling
4.4.3.2 Static Scheduling
4.4.4 Launcher with Intel® Xeon Phi™ Coprocessors
4.4.4.1 Offload
4.4.4.2 Independent Workloads for Host and Coprocessor
4.4.4.3 Symmetric Execution on Host and Phi
4.4.5 Use Case: Molecular Docking and Virtual Screening
4.5 Conclusion
References
5 Accelerating Big Data Processing on Modern HPC Clusters
5.1 Introduction
5.2 Overview of Apache Hadoop and Spark
5.2.1 Overview of Apache Hadoop Distributed File System
5.2.2 Overview of Apache Hadoop MapReduce
5.2.3 Overview of Apache Spark
5.3 Overview of High-Performance Interconnects and Storage Architecture on Modern HPC Clusters
5.3.1 Overview of High-Performance Interconnects and Protocols
5.3.1.1 Overview of High Speed Ethernet
5.3.1.2 Overview of InfiniBand
5.3.2 Overview of High-Performance Storage
5.4 Challenges in Accelerating Big Data Processing on Modern HPC Clusters
5.5 Case Studies of Accelerating Big Data Processing on Modern HPC Clusters
5.5.1 Accelerating HDFS with RDMA
5.5.2 Accelerating HDFS with Heterogeneous Storage
5.5.3 Accelerating HDFS with Lustre Through Key-Value Store-Based Burst Buffer System
5.5.4 Accelerating Hadoop MapReduce with RDMA
5.5.5 Accelerating MapReduce with Lustre
5.5.6 Accelerating Apache Spark with RDMA
5.6 High-Performance Big Data (HiBD) Project
5.7 Conclusion
References
6 dispel4py: Agility and Scalability for Data-Intensive Methods Using HPC
6.1 Introduction
6.2 Motivation
6.2.1 Supporting Domain Specialists
6.2.2 Supporting Data Scientists
6.2.3 Supporting Data-Intensive Engineers
6.2.4 Communication Between Experts
6.3 Background and Related Work
6.4 Semantics, Examples and Tutorial
6.5 dispel4py Tools
6.5.1 Registry
6.5.2 Provenance Management
6.5.3 Diagnosis Tool
6.6 Engineering Effective Mappings
6.6.1 Apache Storm
6.6.2 MPI
6.6.3 Multiprocessing
6.6.4 Spark
6.6.5 Sequential Mode
6.7 Performance
6.7.1 Experiments
6.7.2 Experimental Results
6.7.2.1 Scalability Experiments
6.7.2.2 Performance Experiments
6.7.3 Analysis of Measurements
6.8 Summary and Future Work
References
7 Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters
7.1 Introduction
7.2 Related Work
7.3 Design and Implementation
7.4 Case Study: PTF Application
7.4.1 PTF Application
7.4.2 Execution Time Analysis
7.4.3 Data Dependency Performance Analysis
7.4.3.1 Analysis of Saved Objects
7.4.3.2 Analysis of Galactic Latitude
7.5 Case Study: Job Log Analysis
7.5.1 Job Logs
7.5.2 Test Setup
7.5.3 Job Log Analysis
7.5.4 Clustering Analysis
7.6 Conclusion
References
8 Big Data Behind Big Data
8.1 Background and Goals of the Project
8.1.1 The Many Faces of Data
8.1.2 Data Variety and Location
8.1.3 The Different Consumers of the Data
8.2 What Big Data Did We Have?
8.2.1 Collected Data
8.2.2 Data-in-Flight
8.2.3 Data-at-Rest
8.2.4 Data-in-Growth
8.2.5 Event Data
8.2.6 Data Types to Collect
8.3 The Old Method Prompts a New Solution
8.3.1 Environmental Data
8.3.2 Host Based Data
8.3.3 Refinement of the Goal
8.4 Out with the Old, in with the New Design
8.4.1 Elastic
8.4.2 Data Collection
8.4.2.1 Collectd
8.4.2.2 Custom Scripts
8.4.2.3 Filebeats
8.4.3 Data Transport Components
8.4.3.1 RabbitMQ®
8.4.3.2 Logstash
8.4.4 Data Storage
8.4.4.1 Elasticsearch
8.4.5 Visualization and Analysis
8.4.5.1 Kibana
8.4.6 Future Growth and Enhancements
8.5 Data Collected
8.5.1 Environmental
8.5.2 Computational
8.5.3 Event
8.6 The Analytics of It All: It Just Works!
8.7 Conclusion
References
9 Empowering R with High Performance Computing Resources for Big Data Analytics
9.1 Introduction
9.1.1 Introduction of R
9.1.2 Motivation of Empowering R with HPC
9.2 Opportunities in High Performance Computing to Empower R
9.2.1 Parallel Computation Within a Single Compute Node
9.2.2 Multi-Node Parallelism Support
9.3 Support for Parallelism in R
9.3.1 Support for Parallel Execution Within a Single Node in R
9.3.2 Support for Parallel Execution Over Multiple Nodes with MPI
9.3.3 Packages Utilizing Other Distributed Systems
9.4 Parallel Performance Comparison of Selected Packages
9.4.1 Performance of Using Intel® Xeon Phi Coprocessor
9.4.1.1 Testing Workloads
9.4.1.2 System Specification
9.4.1.3 Results and Discussion
9.4.2 Comparison of Parallel Packages in R
9.5 Use Case Examples
9.5.1 Enabling JAGS (Just Another Gibbs Sampler) on Multiple Nodes
9.5.2 Exemplar Application Using Coprocessors
9.6 Discussions and Conclusion
References
10 Big Data Techniques as a Solution to Theory Problems
10.1 Introduction
10.2 General Formulation of Big Data Solution Method
10.2.1 General Formulation of Class of Models and Solution Method
10.2.2 Computational Steps to Big Data Solution Method
10.2.3 Virtues of Equidistributed Sequences
10.3 Optimal Tax Application
10.4 Other Applications
10.5 Conclusion
References
11 High-Frequency Financial Statistics Through High-Performance Computing
11.1 Introduction
11.2 Large Portfolio Allocation for High-Frequency Financial Data
11.2.1 Background
11.2.2 Our Methods
11.3 Parallelism Considerations
11.3.1 Parallel R
11.3.2 Intel® MKL
11.3.3 Offloading to Phi Coprocessor
11.3.4 Our Computing Environment
11.4 Numerical Studies
11.4.1 Portfolio Optimization with High-Frequency Data
11.4.1.1 LASSO Approximation for Risk Minimization Problem
11.4.1.2 Parallelization
11.4.2 Bayesian Large-Scale Multiple Testing for Time Series Data
11.4.2.1 Hidden Markov Model and Multiple Hypothesis Testing
11.4.2.2 Parallelization
11.5 Discussion and Conclusions
References
12 Large-Scale Multi-Modal Data Exploration with Human in the Loop
12.1 Background
12.2 Details of Implementation Models
12.2.1 Developing Top-Down Knowledge Hypotheses From Visual Analysis of Multi-Modal Data Streams
12.2.1.1 Color-Based Representation of Temporal Events
12.2.1.2 Generating Logical Conjunctions
12.2.1.3 Developing Hypotheses From Visual Analysis
12.2.2 Complementing Hypotheses with Bottom-Up Quantitative Measures
12.2.2.1 Clustering in 2D Event Space
12.2.2.2 Apriori-Like Pattern Searching
12.2.2.3 Integrating Bottom-Up Machine Computation and Top-Down Domain Knowledge
12.2.3 Large-Scale Multi-Modal Data Analytics with Iterative MapReduce Tasks
12.2.3.1 Parallelization Choices
12.2.3.2 Parallel Temporal Pattern Mining Using Twister MapReduce Tasks
12.3 Preliminary Results
12.4 Conclusion and Future Work
References
13 Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection
13.1 Introduction
13.2 Challenges in Using Existing Solutions
13.3 New Solution for Large-Scale Image Comparison
13.3.1 Pre-processing Stage
13.3.2 Processing Stage
13.3.3 Post-processing Stage
13.4 Test Collection from the Institute of Classical Archaeology (ICA)
13.5 Testing the Solution on Stampede: Early Results and Current Limitations
13.6 Future Work
13.7 Conclusion
References
14 Big Data Processing in the eDiscovery Domain
14.1 Introduction to eDiscovery
14.2 Big Data Challenges in eDiscovery
14.3 Key Techniques Used to Process Big Data in eDiscovery
14.3.1 Culling to Reduce Dataset Size
14.3.2 Metadata Extraction
14.3.3 Dataset Partitioning and Archival
14.3.4 Sampling and Workload Profiling
14.3.5 Multi-Pass (Iterative) Processing and Interactive Analysis
14.3.6 Search and Review Methods
14.3.7 Visual Analytics
14.3.8 Software Refactoring and Parallelization
14.4 Limitations of Existing eDiscovery Solutions
14.5 Using HPC for eDiscovery
14.5.1 Data Collection and Data Ingestion
14.5.2 Pre-processing
14.5.3 Processing
14.5.4 Review and Analysis
14.5.5 Archival
14.6 Accelerating the Rate of eDiscovery Using HPC: A Case Study
14.7 Conclusions and Future Direction
References
15 Databases and High Performance Computing
15.1 Introduction
15.2 Databases on Supercomputing Resources
15.2.1 Relational Databases
15.2.2 NoSQL or Non-relational and Hadoop Databases
15.2.3 Graph Databases
15.2.4 Scientific and Specialized Databases
15.3 Installing a Database on a Supercomputing Resource
15.4 Accessing a Database on Supercomputing Resources
15.5 Optimizing Database Access on Supercomputing Resources
15.6 Examples of Applications Using Databases on Supercomputing Resources
15.7 Conclusion
References
16 Conquering Big Data Through the Usage of the Wrangler Supercomputer
16.1 Introduction
16.1.1 Wrangler System Overview
16.1.2 A New User Community for Supercomputers
16.2 First Use-Case: Evolution of Monogamy
16.3 Second Use-Case: Save Money, Save Energy with Supercomputers
16.4 Third Use-Case: Human Origins in Fossil Data
16.5 Fourth Use-Case: Dark Energy of a Million Galaxies
16.6 Conclusion
References