PySpark in Action: Data Analysis with Python and PySpark, 1st edition, by Jonathan Rioux – Ebook (PDF)

Product details:
ISBN 10: 1638350663
ISBN 13: 9781638350668
Author: Jonathan Rioux
Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:
Manage your data as it scales across multiple machines
Scale up your data programs with full confidence
Read and write data to and from a variety of sources and formats
Deal with messy data with PySpark's data manipulation functionality
Discover new data sets and perform exploratory data analysis
Build automated data pipelines that transform, summarize, and get insights from data
Troubleshoot common PySpark errors

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you've learned and rapidly start implementing PySpark in your data systems. No previous knowledge of Spark is required.

About the technology:
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark's core engine with a Python-based API. It helps simplify Spark's steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.

About the book:
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You'll learn how to scale your processing capabilities across multiple machines while ingesting data from any source, whether that's Hadoop clusters, cloud data storage, or local data files. Once you've covered the fundamentals, you'll explore the full versatility of PySpark by building machine learning pipelines and blending Python, pandas, and PySpark code.

What's inside:
Organizing your PySpark code
Managing your data, no matter the size
Scaling up your data programs with full confidence
Troubleshooting common data pipeline problems
Creating reliable long-running jobs

About the reader:
Written for data scientists and data engineers comfortable with Python.

About the author:
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Table of Contents:
1 Introduction
PART 1 GET ACQUAINTED: FIRST STEPS IN PYSPARK
2 Your first data program in PySpark
3 Submitting and scaling your first PySpark program
4 Analyzing tabular data with pyspark.sql
5 Data frame gymnastics: Joining and grouping
PART 2 GET PROFICIENT: TRANSLATE YOUR IDEAS INTO CODE
6 Multidimensional data frames: Using PySpark with JSON data
7 Bilingual PySpark: Blending Python and SQL code
8 Extending PySpark with Python: RDD and UDFs
9 Big data is just a lot of small data: Using pandas UDFs
10 Your data under a different lens: Window functions
11 Faster PySpark: Understanding Spark's query planning
PART 3 GET CONFIDENT: USING MACHINE LEARNING WITH PYSPARK
12 Setting the stage: Preparing features for machine learning
13 Robust machine learning with ML Pipelines
14 Building custom ML transformers and estimators
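The description above highlights building machine learning pipelines (Part 3, chapters 12-14). Below is a minimal illustrative sketch of a PySpark ML pipeline, not an excerpt from the book; the training frame, the feature columns x1 and x2, and the label values are made up for the example.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# A made-up training frame: two numeric features and a binary label.
train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0), (5.0, 1.0, 1.0), (0.5, 0.5, 0.0)],
    ["x1", "x2", "label"],
)

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")  # feature preparation stage
lr = LogisticRegression(featuresCol="features", labelCol="label")          # estimator stage

model = Pipeline(stages=[assembler, lr]).fit(train)  # one fitted, reusable pipeline
model.transform(train).select("x1", "x2", "label", "prediction").show()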
PySpark in Action: Data Analysis with Python and PySpark, 1st edition – Table of contents:
1. Introduction
1.1 What is PySpark?
1.1.1 Taking it from the start: What is Spark?
1.1.2 PySpark = Spark + Python
1.1.3 Why PySpark?
1.2 Your very own factory: How PySpark works
1.2.1 Some physical planning with the cluster manager
1.2.2 A factory made efficient through a lazy leader
1.3 What will you learn in this book?
1.4 What do I need to get started?
Summary
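Chapter 1 describes how Spark acts as a "factory made efficient through a lazy leader": transformations only build a plan, and nothing runs until an action is requested. A minimal sketch of that behavior, assuming a local PySpark installation; the data is generated on the fly and is not from the book.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # control how chatty Spark is (see section 2.1.2)

df = spark.range(1_000_000)  # a one-column (id) data frame; nothing is computed yet
doubled = df.select((F.col("id") * 2).alias("id_doubled"))  # still only a plan
doubled.show(5)  # an action: Spark now optimizes the plan and runs the job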
Part 1: Get acquainted – First steps in PySpark
2. Your first data program in PySpark
2.1 Setting up the PySpark shell
2.1.1 The SparkSession entry point
2.1.2 Configuring how chatty Spark is: The log level
2.2 Mapping our program
2.3 Ingest and explore: Setting the stage for data transformation
2.3.1 Reading data into a DataFrame with spark.read
2.3.2 From structure to content: Exploring our DataFrame with show()
2.4 Simple column transformations: Moving from a sentence to a list of words
2.4.1 Selecting specific columns using select()
2.4.2 Transforming columns: Splitting a string into a list of words
2.4.3 Renaming columns: alias and withColumnRenamed
2.4.4 Reshaping your data: Exploding a list into rows
2.4.5 Working with words: Changing case and removing punctuation
2.5 Filtering rows
Summary
Additional exercises
Exercise 2.2
Exercise 2.3
Exercise 2.4
Exercise 2.5
Exercise 2.6
Exercise 2.7
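Chapter 2 builds a first program that turns lines of text into a cleaned-up column of words. A minimal sketch of the techniques listed above (select, split, explode, lower, regexp_extract, filter), assuming a hypothetical plain-text file named book.txt; it is not the book's own code.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("word-tokens").getOrCreate()

lines = spark.read.text("book.txt")  # placeholder path; the text ends up in a column named "value"

words = (
    lines.select(F.split(F.col("value"), " ").alias("line"))             # sentence -> list of words
    .select(F.explode(F.col("line")).alias("word"))                      # one word per row
    .select(F.lower(F.col("word")).alias("word_lower"))                  # normalize case
    .select(F.regexp_extract("word_lower", "[a-z']+", 0).alias("word"))  # strip punctuation
    .where(F.col("word") != "")                                          # drop empty tokens
)
words.show(10)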
3. Submitting and scaling your first PySpark program
3.1 Grouping records: Counting word frequencies
3.2 Ordering the results on the screen using orderBy
3.3 Writing data from a DataFrame
3.4 Putting it all together: Counting
3.4.1 Simplifying your dependencies with PySpark’s import conventions
3.4.2 Simplifying our program via method chaining
3.5 Using spark-submit to launch your program in batch mode
3.6 What didn’t happen in this chapter
3.7 Scaling up our word frequency program
Summary
Additional exercises
Exercise 3.3
Exercise 3.4
Exercise 3.5
Exercise 3.6
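Chapter 3 groups the words, orders the counts, writes the result, and submits the program in batch mode. A minimal, self-contained sketch of those steps; the three-row words frame stands in for chapter 2's output and is made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-counts").getOrCreate()

# A made-up one-word-per-row frame standing in for chapter 2's output.
words = spark.createDataFrame([("spark",), ("data",), ("spark",)], ["word"])

results = words.groupBy("word").count().orderBy("count", ascending=False)
results.show()

# Persist the counts as CSV; running `spark-submit word_counts.py` launches
# the same script in batch mode instead of the interactive shell.
results.write.mode("overwrite").csv("word_counts_csv", header=True)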
4. Analyzing tabular data with pyspark.sql
4.1 What is tabular data?
4.1.1 How does PySpark represent tabular data?
4.2 PySpark for analyzing and processing tabular data
4.3 Reading and assessing delimited data in PySpark
4.3.1 A first pass at the SparkReader specialized for CSV files
4.3.2 Customizing the SparkReader object to read CSV data files
4.3.3 Exploring the shape of our data universe
4.4 The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing
4.4.1 Knowing what we want: Selecting columns
4.4.2 Keeping what we need: Deleting columns
4.4.3 Creating what’s not there: New columns with withColumn()
4.4.4 Tidying our DataFrame: Renaming and reordering columns
4.4.5 Diagnosing a DataFrame with describe() and summary()
Summary
Additional exercises
Exercise 4.3
Exercise 4.4
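Chapter 4 reads delimited data and covers the basic column manipulations. A minimal sketch of the operations listed above, assuming a hypothetical shows.csv file with title and duration_min columns; it is not code from the book.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("tabular-data").getOrCreate()

# "shows.csv" and its columns (title, duration_min) are placeholders.
shows = spark.read.csv("shows.csv", sep=",", header=True, inferSchema=True)

cleaned = (
    shows.select("title", "duration_min")                   # keep only what we need
    .withColumn("duration_hr", F.col("duration_min") / 60)  # create a column that isn't there
    .withColumnRenamed("title", "show_title")               # tidy the names
)
cleaned.describe().show()  # count, mean, stddev, min, max
cleaned.summary().show()   # the same plus percentiles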
5. Data frame gymnastics: Joining and grouping
5.1 From many to one: Joining data
5.1.1 What’s what in the world of joins
5.1.2 Knowing our left from our right
5.1.3 The rules to a successful join: The predicates
5.1.4 How do you do it: The join() method
5.1.5 Naming conventions in the joining world
5.2 Summarizing the data via groupBy and GroupedData
5.2.1 A simple groupBy blueprint
5.2.2 A column is a column: Using agg() with custom column definitions
5.3 Taking care of null values: Drop and fill
5.3.1 Dropping it like it’s hot: Using dropna() to remove records with null values
5.3.2 Filling values to our heart’s content using fillna()
5.4 What was our question again? Our end-to-end program
Summary
Additional exercises
Exercise 5.4
Exercise 5.5
Exercise 5.6
Exercise 5.7
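Chapter 5 joins data frames, summarizes them with groupBy/agg, and takes care of null values. A minimal sketch along those lines, using two made-up tables (sales and products) rather than the book's data.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("join-group").getOrCreate()

# Two made-up tables: sales records and a product lookup.
sales = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, 5.0)], ["product_id", "amount"]
)
products = spark.createDataFrame(
    [(1, "apples"), (2, "bananas")], ["product_id", "name"]
)

summary = (
    sales.join(products, on="product_id", how="left")  # left join on the predicate column
    .fillna({"name": "unknown"})                        # fill missing product names
    .dropna(subset=["amount"])                          # drop records with no amount
    .groupBy("name")
    .agg(F.sum("amount").alias("total_amount"))         # custom aggregate column via agg()
)
summary.show()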
Part 2: Get proficient – Translate your ideas into code
6. Multidimensional data frames: Using PySpark with JSON data
6.1 Reading JSON data: Getting ready for the schemapocalypse
6.1.1 Starting small: JSON data as a limited Python dictionary
6.1.2 Going bigger: Reading JSON data in PySpark
6.2 Breaking the second dimension with complex data types
6.2.1 When you have more than one value: The array
6.2.2 The map type: Keys and values within a column
6.3 The struct: Nesting columns within columns
6.3.1 Navigating structs as if they were nested columns
6.4 Building and using the DataFrame schema
6.4.1 Using Spark types as the base blocks of a schema
6.4.2 Reading a JSON document with a strict schema in place
6.4.3 Going full circle: Specifying your schemas in JSON
6.5 Putting it all together: Reducing duplicate data with complex data types
6.5.1 Getting to the “just right” DataFrame: explode and collect
6.5.2 Building your own hierarchies: Struct as a function
Summary
Additional exercises
Exercise 6.4
Exercise 6.5
Exercise 6.6
Exercise 6.7
Exercise 6.8
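Chapter 6 covers reading JSON with an explicit schema and working with arrays and structs. A minimal sketch of those ideas, assuming a hypothetical shows.json file whose schema and fields are invented for the example.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.appName("json-schema").getOrCreate()

# A made-up schema: a string, an array of strings, and a nested struct.
schema = T.StructType([
    T.StructField("show", T.StringType()),
    T.StructField("genres", T.ArrayType(T.StringType())),
    T.StructField("ratings", T.StructType([T.StructField("average", T.DoubleType())])),
])

# "shows.json" is a placeholder; FAILFAST makes Spark error out on records
# that do not match the schema instead of silently nulling them.
shows = spark.read.json("shows.json", schema=schema, mode="FAILFAST")

shows.select(
    "show",
    F.explode("genres").alias("genre"),            # array -> one row per element
    F.col("ratings.average").alias("avg_rating"),  # dot notation reaches into the struct
).show()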
7. Bilingual PySpark: Blending Python and SQL code
7.1 Banking on what we know: pyspark.sql vs. plain SQL
7.2 Preparing a DataFrame for SQL
7.2.1 Promoting a DataFrame to a Spark table
7.2.2 Using the Spark catalog
7.3 SQL and PySpark
7.4 Using SQL-like syntax within DataFrame methods
7.4.1 Get the rows and columns you want: select and where
7.4.2 Grouping similar records together: groupBy and orderBy
7.4.3 Filtering after grouping using having
7.4.4 Creating new tables/views using the CREATE keyword
7.4.5 Adding data to our table using UNION and JOIN
7.4.6 Organizing your SQL code better through subqueries and common table expressions
7.4.7 A quick summary of PySpark vs. SQL syntax
7.5 Simplifying our code: Blending SQL and Python
7.5.1 Using Python to increase resiliency and simplify the data reading stage
7.5.2 Using SQL-style expressions in PySpark
7.6 Conclusion
Summary
Additional exercises
Exercise 7.2
Exercise 7.3
Exercise 7.4
Exercise 7.5
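Chapter 7 blends SQL and Python by promoting a data frame to a Spark view and querying it. A minimal sketch of that workflow, using a made-up elements table rather than the book's data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-blend").getOrCreate()

elements = spark.createDataFrame(
    [("H", 1, "gas"), ("Fe", 26, "solid"), ("Hg", 80, "liquid")],
    ["symbol", "atomic_number", "phase"],
)

# Promote the data frame to a view so plain SQL can find it in the catalog.
elements.createOrReplaceTempView("elements")

spark.sql(
    "SELECT phase, count(*) AS n FROM elements GROUP BY phase HAVING count(*) >= 1"
).show()

# The same blend in the other direction: SQL-style expressions inside a DataFrame method.
elements.selectExpr("symbol", "atomic_number * 2 AS doubled").show()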


