Programming Massively Parallel Processors: A Hands-on Approach, 3rd Edition by David B. Kirk and Wen-mei W. Hwu
Product details:
ISBN-10: 0128119861
ISBN-13: 9780128119860
Authors: David B. Kirk, Wen-mei W. Hwu
Programming Massively Parallel Processors: A Hands-on Approach, Third Edition shows students and professionals alike the basic concepts of parallel programming and GPU architecture, exploring in detail various techniques for constructing parallel programs.
Case studies demonstrate the development process, detailing computational thinking and ending with effective and efficient parallel programs. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth.
For this new edition, the authors have updated their coverage of CUDA, including newer libraries such as cuDNN; moved content that has become less important to appendices; added two new chapters on parallel patterns; and updated case studies to reflect current industry practices.
- Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
- Utilizes CUDA version 7.5, NVIDIA’s software development tool created specifically for massively parallel environments
- Contains new and updated case studies
- Includes coverage of newer libraries, such as cuDNN for deep learning
Table of contents for Programming Massively Parallel Processors: A Hands-on Approach, 3rd Edition:
Chapter 1. Introduction
Abstract
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Challenges in Parallel Programming
1.6 Parallel Programming Languages and Models
1.7 Overarching Goals
1.8 Organization of the Book
References
Chapter 2. Data parallel computing
Abstract
2.1 Data Parallelism
2.2 CUDA C Program Structure
2.3 A Vector Addition Kernel
2.4 Device Global Memory and Data Transfer
2.5 Kernel Functions and Threading
2.6 Kernel Launch
2.7 Summary
References
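To give a flavor of what Chapter 2 builds toward, here is a minimal CUDA C sketch of a vector addition kernel with its host-side launch, in the spirit of Sections 2.3 through 2.6. The names (vecAddKernel, vecAdd) are illustrative, not the book's exact code, and error checking is omitted for brevity.

    #include <cuda_runtime.h>

    // Illustrative vector addition: one thread computes one output element.
    __global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) C[i] = A[i] + B[i];                  // guard the partial last block
    }

    void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
        size_t size = n * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
        vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);  // ceil(n/256) blocks of 256
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }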
Chapter 3. Scalable parallel execution
Abstract
3.1 CUDA Thread Organization
3.2 Mapping Threads to Multidimensional Data
3.3 Image Blur: A More Complex Kernel
3.4 Synchronization and Transparent Scalability
3.5 Resource Assignment
3.6 Querying Device Properties
3.7 Thread Scheduling and Latency Tolerance
3.8 Summary
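As a taste of Section 3.2's mapping of threads to multidimensional data, here is a hypothetical 2D kernel that converts an RGB image to grayscale: one thread handles one pixel, and the guard handles images whose dimensions are not multiples of the block size. The kernel name is illustrative; the weights are the standard luma coefficients.

    // Illustrative 2D thread-to-pixel mapping: one thread per pixel.
    __global__ void rgbToGray(const unsigned char *rgb, unsigned char *gray,
                              int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height) {              // guard against extra threads
            int i = row * width + col;                  // linearized pixel index
            unsigned char r = rgb[3*i], g = rgb[3*i+1], b = rgb[3*i+2];
            gray[i] = (unsigned char)(0.299f*r + 0.587f*g + 0.114f*b);
        }
    }

    // Launched with a 2D grid sized to cover the image, e.g.:
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   rgbToGray<<<grid, block>>>(d_rgb, d_gray, width, height);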
Chapter 4. Memory and data locality
Abstract
4.1 Importance of Memory Access Efficiency
4.2 Matrix Multiplication
4.3 CUDA Memory Types
4.4 Tiling for Reduced Memory Traffic
4.5 A Tiled Matrix Multiplication Kernel
4.6 Boundary Checks
4.7 Memory as a Limiting Factor to Parallelism
4.8 Summary
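Chapter 4's centerpiece is the tiled matrix multiplication kernel. The following sketch, assuming square n x n matrices and a 16 x 16 tile, illustrates the pattern of Sections 4.4 through 4.6: each block stages tiles of A and B into shared memory, synchronizes, and accumulates partial dot products, with boundary checks for sizes that are not multiples of the tile width. It is a sketch of the technique, not the book's exact listing.

    #define TILE_WIDTH 16

    // Illustrative tiled matrix multiplication: C = A * B for n x n matrices.
    __global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < (n + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
            // Each thread stages one element of the A and B tiles, with boundary checks.
            int aCol = t * TILE_WIDTH + threadIdx.x;
            int bRow = t * TILE_WIDTH + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
            __syncthreads();                            // tile fully loaded
            for (int k = 0; k < TILE_WIDTH; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                            // done reading before next load
        }
        if (row < n && col < n) C[row * n + col] = sum;
    }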
Chapter 5. Performance considerations
Abstract
5.1 Global Memory Bandwidth
5.2 More on Memory Parallelism
5.3 Warps and SIMD Hardware
5.4 Dynamic Partitioning of Resources
5.5 Thread Granularity
5.6 Summary
References
Chapter 6. Numerical considerations
Abstract
6.1 Floating-Point Data Representation
6.2 Representable Numbers
6.3 Special Bit Patterns and Precision in IEEE Format
6.4 Arithmetic Accuracy and Rounding
6.5 Algorithm Considerations
6.6 Linear Solvers and Numerical Stability
6.7 Summary
References
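The numerical issues Chapter 6 covers show up even in tiny host-side programs. This sketch demonstrates that single-precision addition is not associative, which is why the summation order of a parallel reduction can change its result:

    #include <stdio.h>

    int main(void) {
        float big = 1.0e8f, small = 1.0f;
        // Near 1e8 a 24-bit significand cannot represent a change of 1.0,
        // so big + small rounds back to big and the small term is lost.
        float s = big + small;      // rounds to 1.0e8f
        float left = s - big;       // 0.0f
        float d = big - big;        // 0.0f
        float right = d + small;    // 1.0f: reordering recovered the small term
        printf("left = %f, right = %f\n", left, right);
        return 0;
    }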
Chapter 7. Parallel patterns: convolution: An introduction to stencil computation
Abstract
7.1 Background
7.2 1D Parallel Convolution—A Basic Algorithm
7.3 Constant Memory and Caching
7.4 Tiled 1D Convolution with Halo Cells
7.5 A Simpler Tiled 1D Convolution—General Caching
7.6 Tiled 2D Convolution With Halo Cells
7.7 Summary
7.8 Exercises
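For Chapter 7, here is a basic 1D convolution kernel in the spirit of Sections 7.2 and 7.3, with the mask in constant memory so reads are served by the constant cache; the tiled halo-cell versions of Sections 7.4 through 7.6 refine this pattern. Names and the MAX_MASK_WIDTH bound are illustrative.

    #define MAX_MASK_WIDTH 11
    __constant__ float M[MAX_MASK_WIDTH];   // convolution mask in constant memory

    // Illustrative basic 1D convolution: ghost cells outside the array count as 0.
    __global__ void conv1D(const float *N, float *P, int maskWidth, int width) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= width) return;
        float val = 0.0f;
        int start = i - maskWidth / 2;      // leftmost input element under the mask
        for (int j = 0; j < maskWidth; ++j) {
            int k = start + j;
            if (k >= 0 && k < width)
                val += N[k] * M[j];
        }
        P[i] = val;
    }

    // The host installs the mask with
    //   cudaMemcpyToSymbol(M, h_mask, maskWidth * sizeof(float));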
Chapter 8. Parallel patterns: prefix sum: An introduction to work efficiency in parallel algorithms
Abstract
8.1 Background
8.2 A Simple Parallel Scan
8.3 Speed and Work Efficiency
8.4 A More Work-Efficient Parallel Scan
8.5 An Even More Work-Efficient Parallel Scan
8.6 Hierarchical Parallel Scan for Arbitrary-Length Inputs
8.7 Single-Pass Scan for Memory Access Efficiency
8.8 Summary
8.9 Exercises
References
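Section 8.2's simple parallel scan can be sketched as a single-block, Kogge-Stone-style inclusive scan in shared memory; the double __syncthreads() inside the loop keeps reads and writes to the same locations from racing. The later sections improve its work efficiency and extend it to arbitrary-length inputs. This is an illustrative sketch, not the book's listing.

    #define SECTION_SIZE 256

    // Illustrative single-block inclusive scan (Kogge-Stone style).
    __global__ void inclusiveScan(const float *X, float *Y, int n) {
        __shared__ float XY[SECTION_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            float temp = 0.0f;
            if (threadIdx.x >= stride)
                temp = XY[threadIdx.x - stride];        // read partner's partial sum
            __syncthreads();                            // all reads done before any write
            if (threadIdx.x >= stride)
                XY[threadIdx.x] += temp;
        }
        if (i < n) Y[i] = XY[threadIdx.x];
    }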
Chapter 9. Parallel patterns—parallel histogram computation: An introduction to atomic operations and privatization
Abstract
9.1 Background
9.2 Use of Atomic Operations
9.3 Block versus Interleaved Partitioning
9.4 Latency versus Throughput of Atomic Operations
9.5 Atomic Operation in Cache Memory
9.6 Privatization
9.7 Aggregation
9.8 Summary
Reference
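The histogram pattern of Chapter 9 combines atomic operations (Section 9.2), interleaved partitioning (9.3), and privatization (9.6). A sketch under the assumption of 256 byte-valued bins:

    #define NUM_BINS 256

    // Illustrative privatized histogram: each block accumulates into a private
    // shared-memory copy, then merges it into the global bins with atomics.
    __global__ void histogram(const unsigned char *data, unsigned int *bins, int n) {
        __shared__ unsigned int localBins[NUM_BINS];
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            localBins[b] = 0;                           // zero the private copy
        __syncthreads();
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)               // interleaved (coalesced) partitioning
            atomicAdd(&localBins[data[i]], 1u);         // fast shared-memory atomics
        __syncthreads();
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&bins[b], localBins[b]);          // one global atomic per bin per block
    }

Because most atomic traffic lands in shared memory, contention on the global bins drops to one update per bin per block.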
Chapter 10. Parallel patterns: sparse matrix computation: An introduction to data compression and regularization
Abstract
10.1 Background
10.2 Parallel SpMV Using CSR
10.3 Padding and Transposition
10.4 Using a Hybrid Approach to Regulate Padding
10.5 Sorting and Partitioning for Regularization
10.6 Summary
References
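Section 10.2's starting point is a CSR-based SpMV with one thread per row, sketched below; the load imbalance it suffers across rows of different lengths is what motivates the padding, hybrid, and sorting schemes of Sections 10.3 through 10.5. Parameter names are illustrative.

    // Illustrative CSR sparse matrix-vector multiplication: one thread per row.
    __global__ void spmvCSR(const float *vals, const int *colIdx, const int *rowPtr,
                            const float *x, float *y, int numRows) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < numRows) {
            float dot = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)  // this row's nonzeros
                dot += vals[j] * x[colIdx[j]];
            y[row] = dot;
        }
    }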
Chapter 11. Parallel patterns: merge sort: An introduction to tiling with dynamic input data identification
Abstract
11.1 Background
11.2 A Sequential Merge Algorithm
11.3 A Parallelization Approach
11.4 Co-Rank Function Implementation
11.5 A Basic Parallel Merge Kernel
11.6 A Tiled Merge Kernel
11.7 A Circular-Buffer Merge Kernel
11.8 Summary
Reference
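Chapter 11 starts from the ordinary sequential merge of Section 11.2, sketched here; the co-rank function of Section 11.4 is what lets each thread compute, from its output range alone, which slices of A and B to merge in parallel.

    // Illustrative sequential merge of two sorted arrays A (length m) and B (length n).
    void sequentialMerge(const int *A, int m, const int *B, int n, int *C) {
        int i = 0, j = 0, k = 0;
        while (i < m && j < n)
            C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];  // take the smaller head element
        while (i < m) C[k++] = A[i++];                  // drain whichever input remains
        while (j < n) C[k++] = B[j++];
    }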
Chapter 12. Parallel patterns: graph search
Abstract
12.1 Background
12.2 Breadth-First Search
12.3 A Sequential BFS Function
12.4 A Parallel BFS Function
12.5 Optimizations
12.6 Summary
References
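As context for Chapter 12, here is a sketch of a sequential, queue-based BFS like the one Section 12.3 describes, over a CSR-style adjacency structure; the parallel version of Section 12.4 processes each frontier level with a kernel instead of this while loop. Names are illustrative.

    #include <stdlib.h>

    // Illustrative sequential BFS: edges[ptr[v] .. ptr[v+1]) lists v's neighbors;
    // label[] records each vertex's BFS level (-1 = not yet visited).
    void bfs(int source, const int *ptr, const int *edges, int *label, int numVertices) {
        int *frontier = (int *)malloc(numVertices * sizeof(int));
        int head = 0, tail = 0;
        for (int v = 0; v < numVertices; ++v) label[v] = -1;
        label[source] = 0;
        frontier[tail++] = source;
        while (head < tail) {
            int v = frontier[head++];
            for (int e = ptr[v]; e < ptr[v + 1]; ++e) {
                int w = edges[e];
                if (label[w] == -1) {                   // first visit: one level deeper
                    label[w] = label[v] + 1;
                    frontier[tail++] = w;
                }
            }
        }
        free(frontier);
    }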
Chapter 13. CUDA dynamic parallelism
Abstract
13.1 Background
13.2 Dynamic Parallelism Overview
13.3 A Simple Example
13.4 Memory Data Visibility
13.5 Configurations and Memory Management
13.6 Synchronization, Streams, and Events
13.7 A More Complex Example
13.8 A Recursive Example
13.9 Summary
References
A13.1 Code Appendix
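Dynamic parallelism (Chapter 13) lets a kernel launch other kernels from the device. A minimal sketch of the idea in Section 13.3, assuming compute capability 3.5 or later and compilation with nvcc -rdc=true; the kernel names are illustrative.

    // Illustrative dynamic parallelism: the parent grid launches a child grid.
    __global__ void childKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parentKernel(float *data, int n) {
        if (threadIdx.x == 0 && blockIdx.x == 0)        // one thread launches the child
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }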
Chapter 14. Application case study—non-Cartesian magnetic resonance imaging: An introduction to statistical estimation methods
Abstract
14.1 Background
14.2 Iterative Reconstruction
14.3 Computing FHD
14.4 Final Evaluation
References
Chapter 15. Application case study—molecular visualization and analysis
Abstract
15.1 Background
15.2 A Simple Kernel Implementation
15.3 Thread Granularity Adjustment
15.4 Memory Coalescing
15.5 Summary
References
Chapter 16. Application case study—machine learning
Abstract
16.1 Background
16.2 Convolutional Neural Networks
16.3 Convolutional Layer: A Basic CUDA Implementation of Forward Propagation
16.4 Reduction of Convolutional Layer to Matrix Multiplication
16.5 cuDNN Library
References
Chapter 17. Parallel programming and computational thinking
Abstract
17.1 Goals of Parallel Computing
17.2 Problem Decomposition
17.3 Algorithm Selection
17.4 Computational Thinking
17.5 Single Program, Multiple Data, Shared Memory and Locality
17.6 Strategies for Computational Thinking
17.7 A Hypothetical Example: Sodium Map of the Brain
17.8 Summary
References
Chapter 18. Programming a heterogeneous computing cluster
Abstract
18.1 Background
18.2 A Running Example
18.3 Message Passing Interface Basics
18.4 Message Passing Interface Point-to-Point Communication
18.5 Overlapping Computation and Communication
18.6 Message Passing Interface Collective Communication
18.7 CUDA-Aware Message Passing Interface
18.8 Summary
Reference
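Chapter 18 moves from a single GPU to a cluster. Here is a minimal sketch of the point-to-point communication of Section 18.4, assuming exactly two ranks; with a CUDA-aware MPI (Section 18.7), the buffer could be a device pointer and the staging through host memory disappears.

    /* Illustrative MPI point-to-point exchange: rank 0 sends a buffer to rank 1.
       Compile with mpicc; run with mpirun -np 2. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        float buf[1024] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received 1024 floats\n");
        }
        MPI_Finalize();
        return 0;
    }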
Chapter 19. Parallel programming with OpenACC
Abstract
19.1 The OpenACC Execution Model
19.2 OpenACC Directive Format
19.3 OpenACC by Example
19.4 Comparing OpenACC and CUDA
19.5 Interoperability with CUDA and Libraries
19.6 The Future of OpenACC
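In contrast to CUDA kernels, OpenACC (Chapter 19) annotates ordinary loops with directives and leaves the offloading to the compiler. A minimal sketch of a vector addition in the style of Section 19.3, assuming an OpenACC-capable compiler such as nvc -acc:

    // Illustrative OpenACC offload: the directive asks the compiler to parallelize
    // the loop on the accelerator and to manage the listed data movement.
    void vecAdd(const float *restrict a, const float *restrict b,
                float *restrict c, int n) {
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }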
Chapter 20. More on CUDA and graphics processing unit computing
Abstract
20.1 Model of Host/Device Interaction
20.2 Kernel Execution Control
20.3 Memory Bandwidth and Compute Throughput
20.4 Programming Environment
20.5 Future Outlook
References
Chapter 21. Conclusion and outlook
Abstract
21.1 Goals Revisited
21.2 Future Outlook