Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition by Daniel Larose – Ebook PDF Instant Download/Delivery: 9781118975251, 1118975251
Full download of Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition available after payment
Product details:
• ISBN-10: 1118975251
• ISBN-13: 9781118975251
• Author: Daniel Larose
Discovering Knowledge in Data: An Introduction to Data Mining
The field of data mining lies at the confluence of predictive analytics, statistical analysis, and business intelligence. Due to the ever-increasing complexity and size of data sets and the wide range of applications in computer science, business, and health care, the process of discovering knowledge in data is more relevant than ever before.
This book provides the tools needed to thrive in today’s big data world. The author demonstrates how to leverage a company’s existing databases to increase profits and market share, and carefully explains the most current data science methods and techniques. The reader will “learn data mining by doing data mining.” With new chapters on preparing to model the data, imputation of missing data, and multivariate statistical analysis, Discovering Knowledge in Data, Second Edition remains the eminent reference on data mining.
Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition Table of Contents:
CHAPTER 1 AN INTRODUCTION TO DATA MINING
1.1 WHAT IS DATA MINING?
1.2 WANTED: DATA MINERS
1.3 THE NEED FOR HUMAN DIRECTION OF DATA MINING
1.4 THE CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING
Figure 1.1 CRISP-DM is an iterative, adaptive process.
1.4.1 CRISP-DM: The Six Phases
1.5 FALLACIES OF DATA MINING
1.6 WHAT TASKS CAN DATA MINING ACCOMPLISH?
1.6.1 Description
1.6.2 Estimation
Figure 1.2 Regression estimates lie on the regression line.
1.6.3 Prediction
1.6.4 Classification
TABLE 1.1 Excerpt from data set for classifying income
Figure 1.3 Which drug should be prescribed for which type of patient?
1.6.5 Clustering
TABLE 1.2 The 66 clusters used by the PRIZM segmentation system
1.6.6 Association
REFERENCES
EXERCISES
CHAPTER 2 DATA PREPROCESSING
2.1 WHY DO WE NEED TO PREPROCESS THE DATA?
2.2 DATA CLEANING
TABLE 2.1 Can you find any problems in this tiny data set?
2.3 HANDLING MISSING DATA
Figure 2.1 Some of our field values are missing.
Figure 2.2 Replacing missing field values with user-defined constants.
Figure 2.3 Replacing missing field values with means or modes.
Figure 2.4 Replacing missing field values with random draws from the distribution of the variable.
2.4 IDENTIFYING MISCLASSIFICATIONS
TABLE 2.2 Notice anything strange about this frequency distribution?
2.5 GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Figure 2.5 Histogram of vehicle weights: can you find the outlier?
Figure 2.6 Scatter plot of mpg against Weightlbs shows two outliers.
2.6 MEASURES OF CENTER AND SPREAD
Figure 2.7 Statistical summary of customer service calls.
TABLE 2.3 The two portfolios have the same mean, median, and mode, but are clearly different
2.7 DATA TRANSFORMATION
2.8 MIN-MAX NORMALIZATION
Figure 2.8 Summary statistics for weight.
2.9 Z-SCORE STANDARDIZATION
2.10 DECIMAL SCALING
2.11 TRANSFORMATIONS TO ACHIEVE NORMALITY
Figure 2.9 Standard normal Z distribution.
Figure 2.10 Original data.
Figure 2.11 Z-Standardized data are still right-skewed, not normally distributed.
Figure 2.12 Right-skewed data have positive skewness.
Figure 2.13 Left-skewed data have negative skewness.
Figure 2.14 Statistics for calculating skewness.
Figure 2.15 Square root transformation somewhat reduces skewness.
Figure 2.16 Natural log transformation reduces skewness even further.
Figure 2.17 Statistics for calculating skewness.
Figure 2.18 The transformation inverse_sqrt(weight) has eliminated the skewness, but is still not normal.
Figure 2.19 Statistics for inverse_sqrt(weight).
Figure 2.20 Normal probability plot of inverse_sqrt(weight) indicates nonnormality.
Figure 2.21 Normal probability plot of normally distributed data.
2.12 NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
2.13 FLAG VARIABLES
2.14 TRANSFORMING CATEGORICAL VARIABLES INTO NUMERICAL VARIABLES
2.15 BINNING NUMERICAL VARIABLES
Figure 2.22 Illustration of binning methods.
2.16 RECLASSIFYING CATEGORICAL VARIABLES
2.17 ADDING AN INDEX FIELD
2.18 REMOVING VARIABLES THAT ARE NOT USEFUL
2.19 VARIABLES THAT SHOULD PROBABLY NOT BE REMOVED
2.20 REMOVAL OF DUPLICATE RECORDS
2.21 A WORD ABOUT ID FIELDS
THE R ZONE
Getting Started with R
How to Handle Missing Data: Example Using the Cars Data Set
REFERENCES
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 3 EXPLORATORY DATA ANALYSIS
3.1 HYPOTHESIS TESTING VERSUS EXPLORATORY DATA ANALYSIS
3.2 GETTING TO KNOW THE DATA SET
Figure 3.1 Field values of the first 10 records in the churn data set.
Figure 3.2 Summarization and visualization of the churn data set.
3.3 EXPLORING CATEGORICAL VARIABLES
Figure 3.3 Churners and non-churners.
Figure 3.4 Comparison bar chart of churn proportions, by International Plan participation.
Figure 3.5 Comparison bar chart of churn proportions, by International Plan participation, with equal bar length.
TABLE 3.1 Contingency table of International Plan with Churn
TABLE 3.2 Contingency table with column percentages
Figure 3.6 The clustered bar chart is the graphical counterpart of the contingency table.
Figure 3.7 Comparative pie chart associated with Table 3.2.
TABLE 3.3 Contingency table with row percentages
Figure 3.8 Clustered bar chart associated with Table 3.3.
Figure 3.9 Comparative pie chart associated with Table 3.3.
Figure 3.10 Those without the Voice Mail Plan are more likely to churn.
TABLE 3.4 Contingency table with column percentages for the Voice Mail Plan
Figure 3.11 Multilayer clustered bar chart.
Figure 3.12 Statistics for multilayer clustered bar chart.
Figure 3.13 Directed web graph supports earlier findings.
3.4 EXPLORING NUMERIC VARIABLES
Figure 3.14 Histogram of customer service calls with no overlay.
Figure 3.15 Histogram of customer service calls, with churn overlay.
Figure 3.16 “Normalized” histogram of customer service calls, with churn overlay.
Figure 3.17 (a) Nonnormalized histogram of day minutes; (b) normalized histogram of day minutes.
Figure 3.18 (a) Nonnormalized histogram of evening minutes; (b) normalized histogram of evening minutes.
Figure 3.19 (a) Nonnormalized histogram of night minutes; (b) normalized histogram of night minutes.
Figure 3.20 (a) Nonnormalized histogram of International Calls; (b) normalized histogram of International Calls.
Figure 3.21 t-test is significant for difference in mean international calls for churners and non-churners.
3.5 EXPLORING MULTIVARIATE RELATIONSHIPS
Figure 3.22 Customers with both high day minutes and high evening minutes are at greater risk of churning.
Figure 3.23 There is an interaction effect between customer service calls and day minutes, with respect to churn.
3.6 SELECTING INTERESTING SUBSETS OF THE DATA FOR FURTHER INVESTIGATION
Figure 3.24 Very high proportion of churners for high customer service calls and low day minutes.
Figure 3.25 Much lower proportion of churners for high customer service calls and high day minutes.
3.7 USING EDA TO UNCOVER ANOMALOUS FIELDS
Figure 3.26 Only three area codes for all records.
Figure 3.27 Anomaly: three area codes distributed randomly across all 50 states.
3.8 BINNING BASED ON PREDICTIVE VALUE
TABLE 3.5 Binning customer service calls shows difference in churn rates
Figure 3.28 Binning evening minutes helps to tease out a signal from the noise.
TABLE 3.6 Bin values for Evening Minutes
TABLE 3.7 We have uncovered significant differences in churn rates among the three categories
3.9 DERIVING NEW VARIABLES: FLAG VARIABLES
TABLE 3.8 Contingency table for VoiceMailMessages_Flag
Figure 3.29 Use the equation of the line to separate the records, via a flag variable.
TABLE 3.9 Contingency table for HighDayEveMins_Flag
A NOTE ABOUT CRISP-DM FOR DATA MINERS: BE STRUCTURED BUT FLEXIBLE
3.10 DERIVING NEW VARIABLES: NUMERICAL VARIABLES
Figure 3.30 (a) Nonnormalized histogram of CSCInternational_Z; (b) normalized histogram of CSCInternational_Z.
3.11 USING EDA TO INVESTIGATE CORRELATED PREDICTOR VARIABLES
STRATEGY FOR HANDLING CORRELATED PREDICTOR VARIABLES AT THE EDA STAGE
Figure 3.31 Matrix plot of Day Minutes, Day Calls, and Day Charge.
Figure 3.32 Correlations and p-values.
Figure 3.33 Minitab regression output for Day Charge vs. Day Minutes.
Figure 3.34 Account length is positively correlated with day calls.
3.12 SUMMARY
THE R ZONE
REFERENCE
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 4 UNIVARIATE STATISTICAL ANALYSIS
4.1 DATA MINING TASKS IN DISCOVERING KNOWLEDGE IN DATA
TABLE 4.1 Data mining tasks in Discovering Knowledge in Data
4.2 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION
4.3 STATISTICAL INFERENCE
TABLE 4.2 Use observed sample statistics to estimate unknown population parameters
4.4 HOW CONFIDENT ARE WE IN OUR ESTIMATES?
4.5 CONFIDENCE INTERVAL ESTIMATION OF THE MEAN
Figure 4.1 Summary statistics of customer service calls.
Figure 4.2 Summary statistics of customer service calls for those with both the International Plan and VoiceMail Plan and with more than 200 day minutes.
4.6 HOW TO REDUCE THE MARGIN OF ERROR
4.7 CONFIDENCE INTERVAL ESTIMATION OF THE PROPORTION
4.8 HYPOTHESIS TESTING FOR THE MEAN
TABLE 4.3 Four possible outcomes of the criminal trial hypothesis test
TABLE 4.4 How to calculate p-value
4.9 ASSESSING THE STRENGTH OF EVIDENCE AGAINST THE NULL HYPOTHESIS
TABLE 4.5 Strength of evidence against H0 for various p-values
4.10 USING CONFIDENCE INTERVALS TO PERFORM HYPOTHESIS TESTS
TABLE 4.6 Confidence levels and levels of significance for equivalent confidence intervals and hypothesis tests
Figure 4.3 Reject values of μ0 that would fall outside the equivalent confidence interval.
Figure 4.4 Placing the hypothesized values of μ0 on the number line in relation to the confidence interval informs us immediately of the conclusion.
TABLE 4.7 Conclusions for three hypothesis tests using the confidence interval
4.11 HYPOTHESIS TESTING FOR THE PROPORTION
TABLE 4.8 Hypotheses and p-values for hypothesis tests about π
THE R ZONE
REFERENCE
EXERCISES
CHAPTER 5 MULTIVARIATE STATISTICS
5.1 TWO-SAMPLE t-TEST FOR DIFFERENCE IN MEANS
TABLE 5.1 Summary statistics for customer service calls, training data set and test data set
5.2 TWO-SAMPLE Z-TEST FOR DIFFERENCE IN PROPORTIONS
5.3 TEST FOR HOMOGENEITY OF PROPORTIONS
TABLE 5.2 Observed frequencies
TABLE 5.3 Expected frequencies
TABLE 5.4 Calculating the test statistic χ²_data
5.4 CHI-SQUARE TEST FOR GOODNESS OF FIT OF MULTINOMIAL DATA
TABLE 5.5 Calculating the test statistic χ²_data
5.5 ANALYSIS OF VARIANCE
TABLE 5.6 Sample ages for Groups A, B, and C
Figure 5.1 Dotplot of groups A, B, and C shows considerable overlap.
TABLE 5.7 Sample ages for Groups D, E, and F
Figure 5.2 Dotplot of Groups D, E, and F shows little overlap.
TABLE 5.8 ANOVA table
Figure 5.3 ANOVA results for H0: μA = μB = μC.
Figure 5.4 ANOVA results for H0: μD = μE = μF.
5.6 REGRESSION ANALYSIS
TABLE 5.9 Excerpt from cereals data set: eight fields, first six cereals
Figure 5.5 Scatter plot of nutritional rating versus sugar content for 76 cereals.
Figure 5.6 Regression results for using sugars to estimate rating.
5.7 HYPOTHESIS TESTING IN REGRESSION
5.8 MEASURING THE QUALITY OF A REGRESSION MODEL
5.9 DANGERS OF EXTRAPOLATION
Figure 5.7 Dangers of extrapolation.
5.10 CONFIDENCE INTERVALS FOR THE MEAN VALUE OF y GIVEN x
5.11 PREDICTION INTERVALS FOR A RANDOMLY CHOSEN VALUE OF y GIVEN x
5.12 MULTIPLE REGRESSION
Figure 5.8 Multiple regression results.
5.13 VERIFYING MODEL ASSUMPTIONS
Figure 5.9 Plots for verifying regression model assumptions. Note the outlier.
Figure 5.10 Plots for verifying regression model assumptions, after outlier omitted.
THE R ZONE
REFERENCE
EXERCISES
TABLE 5.10 Summary statistics for duration of customer service calls
TABLE 5.11 Observed frequencies for marital status
TABLE 5.12 Purchase amounts for three payment methods
HANDS-ON ANALYSIS
CHAPTER 6 PREPARING TO MODEL THE DATA
6.1 SUPERVISED VERSUS UNSUPERVISED METHODS
6.2 STATISTICAL METHODOLOGY AND DATA MINING METHODOLOGY
6.3 CROSS-VALIDATION
TABLE 6.1 Suggested hypothesis tests for validating different types of target variables
METHODOLOGY FOR BUILDING AND EVALUATING A DATA MODEL
6.4 OVERFITTING
Figure 6.1 The optimal level of model complexity is at the minimum error rate on the test set.
6.5 BIAS–VARIANCE TRADE-OFF
Figure 6.2 Low complexity separator with high error rate.
Figure 6.3 High complexity separator with low error rate.
Figure 6.4 With more data: low complexity separator need not change much; high complexity separator needs much revision.
6.6 BALANCING THE TRAINING DATA SET
6.7 ESTABLISHING BASELINE PERFORMANCE
THE R ZONE
REFERENCE
EXERCISES
CHAPTER 7 k-NEAREST NEIGHBOR ALGORITHM
7.1 CLASSIFICATION TASK
TABLE 7.1 Excerpt from data set for classifying income
7.2 k-NEAREST NEIGHBOR ALGORITHM
Figure 7.1 Scatter plot of sodium/potassium ratio against age, with drug overlay.
Figure 7.2 Close-up of three nearest neighbors to new patient 2.
Figure 7.3 Close-up of three nearest neighbors to new patient 3.
7.3 DISTANCE FUNCTION
Figure 7.4 Euclidean distance.
TABLE 7.2 Variable values for age and gender
7.4 COMBINATION FUNCTION
7.4.1 Simple Unweighted Voting
7.4.2 Weighted Voting
TABLE 7.3 Age and Na/K ratios for records from Figure 7.1
7.5 QUANTIFYING ATTRIBUTE RELEVANCE: STRETCHING THE AXES
7.6 DATABASE CONSIDERATIONS
7.7 k-NEAREST NEIGHBOR ALGORITHM FOR ESTIMATION AND PREDICTION
TABLE 7.4 k = 3 nearest neighbors of the new record
7.8 CHOOSING k
7.9 APPLICATION OF k-NEAREST NEIGHBOR ALGORITHM USING IBM SPSS MODELER
TABLE 7.5 Find the k-nearest neighbor for record #10
Figure 7.5 Modeler k-nearest neighbor results.
THE R ZONE
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 8 DECISION TREES
8.1 WHAT IS A DECISION TREE?
Figure 8.1 Simple decision tree.
TABLE 8.1 Sample of records that cannot lead to pure leaf node
8.2 REQUIREMENTS FOR USING DECISION TREES
8.3 CLASSIFICATION AND REGRESSION TREES
TABLE 8.2 Training set of records for classifying credit risk
TABLE 8.3 Candidate splits for t = root node
TABLE 8.4 Values of the components of the optimality measure Φ(s|t) for each candidate split, for the root node
Figure 8.2 CART decision tree after initial split.
TABLE 8.5 Values of the components of the optimality measure Φ(s|t) for each candidate split, for decision node A
Figure 8.3 CART decision tree after decision node A split.
Figure 8.4 CART decision tree, fully grown form.
Figure 8.5 Modeler’s CART decision tree.
8.4 C4.5 ALGORITHM
TABLE 8.6 Candidate splits at root node for C4.5 algorithm
TABLE 8.7 Information gain for each candidate split at the root node
Figure 8.6 C4.5 concurs with CART in choosing assets for the initial partition.
TABLE 8.8 Records available at decision node A for classifying credit risk
TABLE 8.9 Candidate splits at decision node A
Figure 8.7 C4.5 Decision tree: fully grown form.
8.5 DECISION RULES
TABLE 8.10 Decision rules generated from decision tree in Figure 8.7
8.6 COMPARISON OF THE C5.0 AND CART ALGORITHMS APPLIED TO REAL DATA
Figure 8.8 CART decision tree for the adult data set.
Figure 8.9 C5.0 decision tree for the adult data set.
THE R ZONE
REFERENCES
EXERCISES
TABLE 8.11 Decision tree data
HANDS-ON ANALYSIS
CHAPTER 9 NEURAL NETWORKS
Figure 9.1 Real neuron and artificial neuron model.
9.1 INPUT AND OUTPUT ENCODING
Figure 9.2 Simple neural network.
9.2 NEURAL NETWORKS FOR ESTIMATION AND PREDICTION
9.3 SIMPLE EXAMPLE OF A NEURAL NETWORK
TABLE 9.1 Data inputs and initial values for neural network weights
9.4 SIGMOID ACTIVATION FUNCTION
Figure 9.3 Graph of the sigmoid function y = f(x) = 1/(1 + e^(−x)).
9.5 BACK-PROPAGATION
9.5.1 Gradient Descent Method
Figure 9.4 Using the slope of SSE with respect to w1 to find weight adjustment direction.
9.5.2 Back-Propagation Rules
9.5.3 Example of Back-Propagation
9.6 TERMINATION CRITERIA
9.7 LEARNING RATE
Figure 9.5 Large η may cause algorithm to overshoot global minimum.
9.8 MOMENTUM TERM
Figure 9.6 Small momentum α may cause algorithm to undershoot global minimum.
Figure 9.7 Large momentum α may cause algorithm to overshoot global minimum.
9.9 SENSITIVITY ANALYSIS
9.10 APPLICATION OF NEURAL NETWORK MODELING
Figure 9.8 Neural network for the adult data set generated by Insightful Miner.
Figure 9.9 Some of the neural network weights for the income example.
Figure 9.10 Most important variables: results from sensitivity analysis.
THE R ZONE
REFERENCES
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 10 HIERARCHICAL AND k-MEANS CLUSTERING
10.1 THE CLUSTERING TASK
Figure 10.1 Clusters should have small within-cluster variation compared to the between-cluster variation.
10.2 HIERARCHICAL CLUSTERING METHODS
10.3 SINGLE-LINKAGE CLUSTERING
Figure 10.2 Single-linkage agglomerative clustering on the sample data set.
10.4 COMPLETE-LINKAGE CLUSTERING
Figure 10.3 Complete-linkage agglomerative clustering on the sample data set.
10.5 k-MEANS CLUSTERING
10.6 EXAMPLE OF k-MEANS CLUSTERING AT WORK
TABLE 10.1 Data points for k-means example
Figure 10.4 How will k-means partition these data into k = 2 clusters?
TABLE 10.2 Finding the nearest cluster center for each record (first pass)
Figure 10.5 Clusters and centroids Δ after first pass through k-means algorithm.
TABLE 10.3 Finding the nearest cluster center for each record (second pass)
Figure 10.6 Clusters and centroids Δ after second pass through k-means algorithm.
TABLE 10.4 Finding the nearest cluster center for each record (third pass)
10.7 BEHAVIOR OF MSB, MSE, AND PSEUDO-F AS THE k-MEANS ALGORITHM PROCEEDS
10.8 APPLICATION OF k-MEANS CLUSTERING USING SAS ENTERPRISE MINER
Figure 10.7 Enterprise Miner profile of International Plan adopters across clusters.
Figure 10.8 VoiceMail Plan adopters and nonadopters are mutually exclusive.
TABLE 10.5 Comparison of variable means across clusters shows little variation
Figure 10.9 Distribution of customer service calls is similar across clusters.
10.9 USING CLUSTER MEMBERSHIP TO PREDICT CHURN
Figure 10.10 Churn behavior across clusters for International Plan adopters and nonadopters.
Figure 10.11 Churn behavior across clusters for VoiceMail Plan adopters and nonadopters.
THE R ZONE
REFERENCES
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 11 KOHONEN NETWORKS
11.1 SELF-ORGANIZING MAPS
Figure 11.1 Topology of a simple self-organizing map for clustering records by age and income.
11.2 KOHONEN NETWORKS
11.2.1 Kohonen Networks Algorithm
11.3 EXAMPLE OF A KOHONEN NETWORK STUDY
Figure 11.2 Example: topology of the 2 × 2 Kohonen network.
TABLE 11.1 Four clusters uncovered by the Kohonen network
11.4 CLUSTER VALIDITY
11.5 APPLICATION OF CLUSTERING USING KOHONEN NETWORKS
Figure 11.3 Topology of 3 × 3 Kohonen network used for clustering the churn data set.
Figure 11.4 Modeler uncovered six clusters.
11.6 INTERPRETING THE CLUSTERS
Figure 11.5 International Plan adopters reside exclusively in Clusters 12 and 22.
Figure 11.6 Similar clusters are closer to each other.
Figure 11.7 How the variables are distributed among the clusters.
Figure 11.8 Assessing whether the means across clusters are significantly different.
11.6.1 Cluster Profiles
Figure 11.9 Proportions of churners among the clusters.
11.7 USING CLUSTER MEMBERSHIP AS INPUT TO DOWNSTREAM DATA MINING MODELS
Figure 11.10 Output of CART decision tree for data set enriched by cluster membership.
THE R ZONE
REFERENCES
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 12 ASSOCIATION RULES
12.1 AFFINITY ANALYSIS AND MARKET BASKET ANALYSIS
TABLE 12.1 Transactions made at the roadside vegetable stand
12.1.1 Data Representation for Market Basket Analysis
TABLE 12.2 Transactional data format for the roadside vegetable stand data
TABLE 12.3 Tabular data format for the roadside vegetable stand data
12.2 SUPPORT, CONFIDENCE, FREQUENT ITEMSETS, AND THE A PRIORI PROPERTY
MINING ASSOCIATION RULES
A PRIORI PROPERTY
12.3 HOW DOES THE A PRIORI ALGORITHM WORK?
12.3.1 Generating Frequent Itemsets
TABLE 12.4 Candidate 2-itemsets
12.3.2 Generating Association Rules
GENERATING ASSOCIATION RULES
TABLE 12.5 Candidate association rules for vegetable stand data: two antecedents
TABLE 12.6 Candidate association rules for vegetable stand data: one antecedent
TABLE 12.7 Final list of association rules for vegetable stand data: ranked by support × confidence, minimum confidence 80%
Figure 12.1 Association rules for vegetable stand data, generated by Modeler.
12.4 EXTENSION FROM FLAG DATA TO GENERAL CATEGORICAL DATA
Figure 12.2 Association rules for categorical attributes found by the a priori algorithm.
12.5 INFORMATION-THEORETIC APPROACH: GENERALIZED RULE INDUCTION METHOD
12.5.1 J-Measure
12.6 ASSOCIATION RULES ARE EASY TO DO BADLY
Figure 12.3 An association rule that is worse than useless.
Figure 12.4 This association rule is useful, because the posterior probability (0.60029) is much greater than the prior probability (0.3316).
12.7 HOW CAN WE MEASURE THE USEFULNESS OF ASSOCIATION RULES?
12.8 DO ASSOCIATION RULES REPRESENT SUPERVISED OR UNSUPERVISED LEARNING?
12.9 LOCAL PATTERNS VERSUS GLOBAL MODELS
Figure 12.5 Profitable pattern: VoiceMail Plan adopters less likely to churn.
THE R ZONE
REFERENCES
EXERCISES
TABLE 12.8 Weather data set for association rule mining
HANDS-ON ANALYSIS
CHAPTER 13 IMPUTATION OF MISSING DATA
13.1 NEED FOR IMPUTATION OF MISSING DATA
13.2 IMPUTATION OF MISSING DATA: CONTINUOUS VARIABLES
Figure 13.1 Multiple regression results for imputation of missing potassium values. (The predicted values section of this output is for Almond Delight only.)
13.3 STANDARD ERROR OF THE IMPUTATION
13.4 IMPUTATION OF MISSING DATA: CATEGORICAL VARIABLES
Figure 13.2 CART model for imputing the missing value of maritalstatus.
13.5 HANDLING PATTERNS IN MISSINGNESS
THE R ZONE
REFERENCE
EXERCISES
HANDS-ON ANALYSIS
CHAPTER 14 MODEL EVALUATION TECHNIQUES
14.1 MODEL EVALUATION TECHNIQUES FOR THE DESCRIPTION TASK
14.2 MODEL EVALUATION TECHNIQUES FOR THE ESTIMATION AND PREDICTION TASKS
Figure 14.1 Regression results, with MSE and s indicated.
14.3 MODEL EVALUATION TECHNIQUES FOR THE CLASSIFICATION TASK
14.4 ERROR RATE, FALSE POSITIVES, AND FALSE NEGATIVES
TABLE 14.1 General form of the contingency table of correct and incorrect classifications
TABLE 14.2 Contingency table for the C5.0 model
14.5 SENSITIVITY AND SPECIFICITY
14.6 MISCLASSIFICATION COST ADJUSTMENT TO REFLECT REAL-WORLD CONCERNS
TABLE 14.3 Contingency table after misclassification cost adjustment
14.7 DECISION COST/BENEFIT ANALYSIS
TABLE 14.4 Cost/benefit table for each combination of correct/incorrect decision
14.8 LIFT CHARTS AND GAINS CHARTS
Figure 14.2 Lift chart for model 1: strong lift early, then falls away rapidly.
Figure 14.3 Gains chart for model 1.
Figure 14.4 Combined lift chart for models 1 and 2.
14.9 INTERWEAVING MODEL EVALUATION WITH MODEL BUILDING
14.10 CONFLUENCE OF RESULTS: APPLYING A SUITE OF MODELS
TABLE 14.5 Most important variables for classifying income, as identified by CART, C5.0, and the neural network algorithm
THE R ZONE
REFERENCE
EXERCISES
HANDS-ON ANALYSIS
Back Matter
APPENDIX DATA SUMMARIZATION AND VISUALIZATION
PART 1 SUMMARIZATION 1: BUILDING BLOCKS OF DATA ANALYSIS
TABLE A.1 Characteristics of 10 loan applicants
PART 2 VISUALIZATION: GRAPHS AND TABLES FOR SUMMARIZING AND ORGANIZING DATA
2.1 Categorical Variables
TABLE A.2 Frequency distribution and relative frequency distribution
Figure A.1 Bar chart for marital status.
Figure A.2 Pie chart of marital status.
2.2 Quantitative Variables
TABLE A.3 Frequency distribution and relative frequency distribution of income
TABLE A.4 Cumulative frequency distribution and cumulative relative frequency distribution of income
Figure A.3 Histogram of income.
Figure A.4 Stem-and-leaf display of income.
Figure A.5 Dotplot of income.
Figure A.6 Symmetric and skewed curves.
PART 3 SUMMARIZATION 2: MEASURES OF CENTER, VARIABILITY, AND POSITION
Figure A.7 Boxplot of left-skewed data.
PART 4 SUMMARIZATION AND VISUALIZATION OF BIVARIATE RELATIONSHIPS
TABLE A.5 Contingency table for mortgage versus risk
Figure A.8 Clustered bar chart for risk, clustered by mortgage.
Figure A.9 Individual value plot of income versus risk.
Figure A.10 Some possible relationships between x and y.
INDEX