Data Lake for Enterprises, 1st Edition, by Tomcy John, Pankaj Misra, Thomas Benjamin

Product details:
ISBN 10: 1787282651
ISBN 13: 9781787282650
Authors: Tomcy John, Pankaj Misra, Thomas Benjamin
The term “Data Lake” has recently emerged as a prominent concept in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can apply to redefine or transform the way they operate. The Lambda Architecture is also emerging as one of the most prominent patterns in the big data landscape, as it not only helps derive useful information from historical data but also correlates real-time data, enabling businesses to make critical decisions. This book brings these two important aspects together: the data lake and the Lambda Architecture.
This book is divided into three main sections. The first introduces you to the concept of data lakes, explains their importance in enterprises, and gets you up to speed with the Lambda Architecture. The second section delves into the principal components of building a data lake using the Lambda Architecture, introducing you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together: it shows you how an enterprise data lake can be implemented, along with several real-world use cases, and how peripheral components can be added to the lake to make it more efficient.
By the end of this book, you will be able to choose the right big data technologies and apply the Lambda architectural pattern to build your enterprise data lake.
Data Lake for Enterprises, 1st Edition – Table of Contents:
- Questions
- Introduction to Data
- Exploring data
- What is Enterprise Data?
- Enterprise Data Management
- Big data concepts
- Big data and 4Vs
- Relevance of data
- Quality of data
- Where does this data live in an enterprise?
- Intranet (within enterprise)
- Internet (external to enterprise)
- Business applications hosted in cloud
- Third-party cloud solutions
- Social data (structured and unstructured)
- Data stores or persistent stores (RDBMS or NoSQL)
- Traditional data warehouse
- File stores
- Enterprise’s current state
- Enterprise digital transformation
- Enterprises embarking on this journey
- Some examples
- Data lake use case enlightenment
- Summary
- Comprehensive Concepts of a Data Lake
- What is a Data Lake?
- Relevance to enterprises
- How does a Data Lake help enterprises?
- Data Lake benefits
- How does a Data Lake work?
- Differences between Data Lake and Data Warehouse
- Approaches to building a Data Lake
- Lambda Architecture-driven Data Lake
- Data ingestion layer – ingest for processing and storage
- Batch layer – batch processing of ingested data
- Speed layer – near real time data processing
- Data storage layer – store all data
- Serving layer – data delivery and exports
- Data acquisition layer – get data from source systems
- Messaging layer – guaranteed data delivery
- Exploring the Data Ingestion Layer
- Exploring the Lambda layer
- Batch layer
- Speed layer
- Serving layer
- Data push
- Data pull
- Data storage layer
- Batch process layer
- Speed layer
- Serving layer
- Relational data stores
- Distributed data stores
- Summary
- Lambda Architecture as a Pattern for Data Lake
- What is Lambda Architecture?
- History of Lambda Architecture
- Principles of Lambda Architecture
- Fault-tolerant principle
- Immutable Data principle
- Re-computation principle
- Components of a Lambda Architecture
- Batch layer
- Speed layer
- CAP Theorem
- Eventual consistency
- Serving layer
- Complete working of a Lambda Architecture
- Advantages of Lambda Architecture
- Disadvantages of Lambda Architectures
- Technology overview for Lambda Architecture
- Applied Lambda
- Enterprise-level log analysis
- Capturing and analyzing sensor data
- Real-time mailing platform statistics
- Real-time sports analysis
- Recommendation engines
- Analyzing security threats
- Multi-channel consumer behaviour
- Working examples of Lambda Architecture
- Kappa architecture
- Summary
- Applied Lambda for Data Lake
- Knowing Hadoop distributions
- Selection factors for a big data stack for enterprises
- Technical capabilities
- Ease of deployment and maintenance
- Integration readiness
- Batch layer for data processing
- The NameNode server
- The secondary NameNode server
- Yet Another Resource Negotiator (YARN)
- Data storage nodes (DataNode)
- Speed layer
- Flume for data acquisition
- Source for event sourcing
- Interceptors for event interception
- Channels for event flow
- Sink as an event destination
- Spark Streaming
- DStreams
- DataFrames
- Checkpointing
- Apache Flink
- Serving layer
- Data repository layer
- Relational databases
- Big data tables/views
- Data services with data indexes
- NoSQL databases
- Data access layer
- Data exports
- Data publishing
- Summary
- Data Acquisition of Batch Data using Apache Sqoop
- Context in Data Lake – data acquisition
- Data acquisition layer
- Data acquisition of batch data – technology mapping
- Why Apache Sqoop
- History of Sqoop
- Advantages of Sqoop
- Disadvantages of Sqoop
- Working of Sqoop
- Sqoop 2 architecture
- Sqoop 1 versus Sqoop 2
- Ease of use
- Ease of extension
- Security
- When to use Sqoop 1 and Sqoop 2
- Functioning of Sqoop
- Data import using Sqoop
- Data export using Sqoop
- Sqoop connectors
- Types of Sqoop connectors
- Sqoop support for HDFS
- Sqoop working example
- Installation and Configuration
- Step 1 – Installing and verifying Java
- Step 2 – Installing and verifying Hadoop
- Step 3 – Installing and verifying Hue
- Step 4 – Installing and verifying Sqoop
- Step 5 – Installing and verifying PostgreSQL (RDBMS)
- Step 6 – Installing and verifying HBase (NoSQL)
- Configure data source (ingestion)
- Sqoop configuration (database drivers)
- Configuring HDFS as destination
- Sqoop Import
- Import complete database
- Import selected tables
- Import selected columns from a table
- Import into HBase
- Sqoop Export
- Sqoop Job
- Job command
- Create job
- List job
- Run job
- Sqoop 2
- Sqoop in purview of SCV use case
- When to use Sqoop
- When not to use Sqoop
- Real-time Sqooping: a possibility?
- Other options
- Native big data connectors
- Talend
- Pentaho’s Kettle (PDI – Pentaho Data Integration)
- Summary
- Data Acquisition of Stream Data using Apache Flume
- Context in Data Lake – data acquisition
- What is Stream Data?
- Batch and stream data
- Data acquisition of stream data – technology mapping
- What is Flume?
- Sqoop and Flume
- Why Flume?
- History of Flume
- Advantages of Flume
- Disadvantages of Flume
- Flume architecture principles
- The Flume Architecture
- Distributed pipeline – Flume architecture
- Fan Out – Flume architecture
- Fan In – Flume architecture
- Three tier design – Flume architecture
- Advanced Flume architecture
- Flume reliability level
- Flume event – Stream Data
- Flume agent
- Flume agent configurations
- Flume source
- Custom source
- Flume channel
- Custom channel
- Flume sink
- Custom sink
- Flume configuration
- Flume transaction management
- Other Flume components
- Channel processor
- Interceptor
- Channel selector
- Sink groups
- Sink processor
- Event serializers
- Context routing
- Flume working example
- Installation and Configuration
- Step 1 – Installing and verifying Flume
- Step 2 – Configuring Flume
- Step 3 – Starting Flume
- Flume in purview of SCV use case
- Kafka Installation
- Example 1 – RDBMS to Kafka
- Example 2 – Spool messages to Kafka
- Example 3 – Interceptors
- Example 4 – Memory channel, file channel, and Kafka channel
- When to use Flume
- When not to use Flume
- Other options
- Apache Flink
- Apache NiFi
- Summary
- Messaging Layer using Apache Kafka
- Context in Data Lake – messaging layer
- Messaging layer
- Messaging layer – technology mapping
- What is Apache Kafka?
- Why Apache Kafka
- History of Kafka
- Advantages of Kafka
- Disadvantages of Kafka
- Kafka architecture
- Core architecture principles of Kafka
- Data stream life cycle
- Working of Kafka
- Kafka message
- Kafka producer
- Persistence of data in Kafka using topics
- Partitions – Kafka topic division
- Kafka message broker
- Kafka consumer
- Consumer groups
- Other Kafka components
- Zookeeper
- MirrorMaker
- Kafka programming interface
- Kafka core APIs
- Kafka REST interface
- Producer and consumer reliability
- Kafka security
- Kafka as message-oriented middleware
- Scale-out architecture with Kafka
- Kafka Connect
- Kafka working example
- Installation
- Producer – putting messages into Kafka
- Kafka Connect
- Consumer – getting messages from Kafka
- Setting up multi-broker cluster
- Kafka in the purview of an SCV use case
- When to use Kafka
- When not to use Kafka
- Other options
- RabbitMQ
- ZeroMQ
- Apache ActiveMQ
- Summary
- Data Processing using Apache Flink
- Context in a Data Lake – Data Ingestion Layer
- Data Ingestion Layer
- Data Ingestion Layer – technology mapping
- What is Apache Flink?
- Why Apache Flink?
- History of Flink
- Advantages of Flink
- Disadvantages of Flink
- Working of Flink
- Flink architecture
- Client
- Job Manager
- Task Manager
- Flink execution model
- Core architecture principles of Flink
- Flink Component Stack
- Checkpointing in Flink
- Savepoints in Flink
- Streaming window options in Flink
- Time window
- Count window
- Tumbling window configuration
- Sliding window configuration
- Memory management
- Flink APIs
- DataStream API
- Flink DataStream API example
- Streaming connectors
- DataSet API
- Flink DataSet API example
- Table API
- Flink domain specific libraries
- Gelly – Flink Graph API
- FlinkML
- FlinkCEP
- Flink working example
- Installation
- Example – data processing with Flink
- Data generation
- Step 1 – Preparing streams
- Step 2 – Consuming streams via Flink
- Step 3 – Streaming data into HDFS
- Flink in purview of SCV use cases
- User Log Data Generation
- Flume Setup
- Flink Processors
- When to use Flink
- When not to use Flink
- Other options
- Apache Spark
- Apache Storm
- Apache Tez
- Summary
- Data Store Using Apache Hadoop
- Context for Data Lake – Data Storage and Lambda Batch Layer
- Data Storage and the Lambda Batch Layer
- Data Storage and Lambda Batch Layer – technology mapping
- What is Apache Hadoop?
- Why Hadoop?
- History of Hadoop
- Advantages of Hadoop
- Disadvantages of Hadoop
- Working of Hadoop
- Hadoop core architecture principles
- Hadoop architecture
- Hadoop architecture 1.x
- Hadoop architecture 2.x
- Hadoop architecture components
- HDFS
- YARN
- MapReduce
- Hadoop ecosystem
- Hadoop architecture in detail
- Hadoop ecosystem
- Data access/processing components
- Apache Pig
- Apache Hive
- Data storage components
- Apache HBase
- Monitoring, management and orchestration components
- Apache ZooKeeper
- Apache Oozie
- Apache Ambari
- Data integration components
- Apache Sqoop
- Apache Flume
- Hadoop distributions
- HDFS and formats
- Hadoop for near real-time applications
- Hadoop deployment modes
- Hadoop working examples
- Installation
- Data preparation
- Hive installation
- Example – Bulk Data Load
- File Data Load
- RDBMS Data Load
- Example – MapReduce processing
- Text Data as Hive Tables
- Avro Data as Hive Table
- Hadoop in purview of SCV use case
- Initial directory setup
- Data loads
- Data visualization with Hive tables
- When not to use Hadoop
- Other Hadoop Processing Options
- Summary
- Indexed Data Store using Elasticsearch
- Context in Data Lake – Data Storage and Lambda Speed Layer
- Data Storage and Lambda Speed Layer
- Data Storage and Lambda Speed Layer – technology mapping
- What is Elasticsearch?
- Why Elasticsearch
- History of Elasticsearch
- Advantages of Elasticsearch
- Disadvantages of Elasticsearch
- Working of Elasticsearch
- Elasticsearch core architecture principles
- Elasticsearch terminologies
- Document in Elasticsearch
- Index in Elasticsearch
- What is an Inverted Index?
- Shard in Elasticsearch
- Nodes in Elasticsearch
- Cluster in Elasticsearch
- Elastic Stack
- Elastic Stack – Kibana
- Elastic Stack – Elasticsearch
- Elastic Stack – Logstash
- Elastic Stack – Beats
- Elastic Stack – X-Pack
- Elastic Cloud
- Apache Lucene
- How Lucene works
- Elasticsearch DSL (Query DSL)
- Important queries in Query DSL
- Nodes in Elasticsearch
- Elasticsearch – master node
- Elasticsearch – data node
- Elasticsearch – client node
- Elasticsearch and relational database
- Elasticsearch ecosystem
- Elasticsearch analyzers
- Built-in analyzers
- Custom analyzers
- Elasticsearch plugins
- Elasticsearch deployment options
- Clients for Elasticsearch
- Elasticsearch for fast streaming layer
- Elasticsearch as a data source
- Elasticsearch for content indexing
- Elasticsearch and Hadoop
- Elasticsearch working example
- Installation
- Creating and Deleting Indexes
- Indexing Documents
- Getting Indexed Document
- Searching Documents
- Updating Documents
- Deleting a document
- Elasticsearch in purview of SCV use case
- Data preparation
- Initial Cleanup
- Data Generation
- Customer data import into Hive using Sqoop
- Data acquisition via Flume into Kafka channel
- Data ingestion via Flink to HDFS and Elasticsearch
- Packaging via POM file
- Avro schema definitions
- Schema deserialization class
- Writing to HDFS as Parquet files
- Writing into Elasticsearch
- Command line arguments
- Flink deployment
- Parquet data visualization as Hive tables
- Data indexing from Hive
- Query data from ES (customer, address, and contacts)
- When to use Elasticsearch
- When not to use Elasticsearch
- Other options
- Apache Solr
- Summary
- Data Lake Components Working Together
- Where we stand with Data Lake
- Core architecture principles of Data Lake
- Challenges faced by enterprise Data Lake
- Expectations from Data Lake
- Data Lake for other activities
- Knowing more about data storage
- Zones in Data Storage
- Data Schema and Model
- Storage options
- Apache HCatalog (Hive Metastore)
- Compression methodologies
- Data partitioning
- Knowing more about data processing
- Data validation and cleansing
- Machine learning
- Scheduler/Workflow
- Apache Oozie
- Database setup and configuration
- Build from Source
- Oozie Workflows
- Oozie coordinator
- Complex event processing
- Thoughts on data security
- Apache Knox
- Apache Ranger
- Apache Sentry
- Thoughts on data encryption
- Hadoop key management server
- Metadata management and governance
- Metadata
- Data governance
- Data lineage
- How can we achieve this?
- Apache Atlas
- WhereHows
- Thoughts on data auditing
- Thoughts on data traceability
- Knowing more about Serving Layer
- Principles of Serving Layer
- Service Types
- GraphQL
- Data Lake with REST API
- Business services
- Serving Layer components
- Data Services
- Elasticsearch & HBase
- Apache Hive & Impala
- RDBMS
- Data exports
- Polyglot data access
- Example: serving layer
- Summary
- Data Lake Use Case Suggestions
- Establishing cybersecurity practices in an enterprise
- Know the customers dealing with your enterprise
- Bring efficiency in warehouse management
- Developing a brand and marketing of the enterprise
- Achieve a higher degree of personalization with customers
- Bringing IoT data analysis at your fingertips
- More practical and useful data archival
- Complement the existing data warehouse infrastructure
- Achieving telecom security and regulatory compliance
- Summary