Data Mining with Rattle and R

Publisher: Springer
Author: Graham Williams
Pages: 396
Publication date: 2011-8-4
Price: GBP 49.99
Binding: Paperback
ISBN: 9781441998897

Tags:

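R, Data mining, Rattle, Programming, Mining, Computer science, Computer technology, Methodology, R language, Machine learning, Statistical learning, Data analysis, Business intelligence, Data science, Visualization, Predictive modeling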

Description

Data mining is the art and science of intelligent data analysis. By building knowledge from information, data mining adds considerable value to the ever-increasing stores of electronic data that abound today. In performing data mining, many decisions need to be made regarding the choice of methodology, the choice of data, the choice of tools, and the choice of algorithms.

Throughout this book the reader is introduced to the basic concepts and some of the more popular algorithms of data mining. With a focus on the hands-on, end-to-end process for data mining, Williams guides the reader through various capabilities of the easy-to-use, free, and open source Rattle Data Mining Software built on the sophisticated R Statistical Software. The focus on doing data mining rather than just reading about data mining is refreshing.

The book covers data understanding, data preparation, data refinement, model building, model evaluation, and practical deployment. The reader will learn to rapidly deliver a data mining project using software easily installed for free from the Internet. Coupling Rattle with R delivers a very sophisticated data mining environment with all the power, and more, of the many commercial offerings.

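Data Mining in Practice: A Modern Approach with Python

Introduction: In an age of information explosion, data has become a core asset driving decisions and innovation. Extracting valuable insight from massive volumes of raw data is no longer a skill exclusive to data scientists; it is an essential capability for professionals in every field. This book aims to be a comprehensive, practical guide to modern data mining techniques, focusing on Python, currently the most popular and most powerful programming language in industry, and on its rich ecosystem. It is not a lengthy operating manual for a particular software tool but an in-depth tutorial centered on methodology, algorithmic understanding, and practical application. We set aside overly academic theoretical derivations and instead emphasize how to apply complex statistical models and machine learning algorithms effectively to real-world datasets in order to solve real business problems.

Outline and core value: The book is clearly structured and progressive, starting from the basics of data handling and moving on to advanced predictive modeling and the interpretation of results. The focus is the complete data lifecycle: data acquisition, cleaning, exploratory analysis, modeling, evaluation, and deployment.

Part 1: Foundations of data mining: the Python environment and data preparation

This part lays a solid foundation for the more complex analysis that follows. We introduce the key libraries needed to build an efficient Python data mining environment, such as NumPy for efficient numerical computation and Pandas for powerful data-structure manipulation and processing.

Overview of the Python ecosystem: beyond installation, what matters is understanding the role each library (such as SciPy, Scikit-learn, and Matplotlib) plays in the data science workflow.

The art of data import and cleaning: real-world data is full of blemishes. We explain in detail how to handle missing values (choosing and implementing imputation strategies), detect and treat outliers (statistical and model-based methods), convert data types, and deal with common problems in unstructured data.

Data reshaping and feature engineering basics: the structure of the data directly affects model performance. This chapter explores operations such as pivoting, merging, and stacking, and introduces basic feature engineering concepts, including the appropriate use of one-hot encoding and feature extraction from date and time variables.

Part 2: Exploratory data analysis (EDA) and data visualization

Before committing expensive computing resources to modeling, it is essential to understand the data thoroughly. EDA is the key step for discovering underlying patterns in the data, identifying potential problems, and guiding subsequent model selection.

Statistical summaries and distributions: descriptive statistics with Pandas and SciPy to understand central tendency, dispersion, and shape, with emphasis on recognizing normal and skewed distributions and the effect of outliers on the analysis.

Visualization techniques: we make full use of Matplotlib and Seaborn to go beyond simple bar charts and scatter plots, covering how to build effective box plots, violin plots, and correlation heatmaps, and how to use faceting to compare the characteristics of different subgroups.

Exploring relationships between variables: how to use visualization to quickly find correlations among features and between features and the target variable, giving an intuitive basis for feature selection.

Part 3: Classical and modern machine learning algorithms explained and applied

This is the core of the book. We cover methods ranging from basic supervised learning to complex unsupervised learning, and make sure readers can not only call the functions but also understand the decision boundaries and assumptions behind the algorithms.

Supervised learning, regression: extensions of linear regression (ridge regression, Lasso), with emphasis on how regularization addresses multicollinearity, and a detailed explanation of how to choose among evaluation metrics ($R^2$, RMSE, MAE).

Supervised learning, classification in depth:
Logistic regression: the role of odds and the sigmoid function, and how to interpret the coefficients.
Decision trees and ensemble learning: how decision trees are built (information gain and Gini impurity), with a focus on ensemble methods, including random forests and gradient boosting machines (GBM). We compare the core differences and performance trade-offs of AdaBoost and XGBoost/LightGBM.

Unsupervised learning and pattern discovery:
Clustering: the limitations of K-Means and how to assess clustering quality with the silhouette score, plus hierarchical clustering and DBSCAN for clusters of different shapes.
Dimensionality reduction: the principles and use cases of principal component analysis (PCA), and the role of manifold learning (such as t-SNE) in data visualization.

The art of model evaluation and selection: more than accuracy. We cover cross-validation strategies, how to read a confusion matrix, and the practical meaning of ROC curves and AUC values, and discuss how to handle class imbalance (for example with SMOTE).

Part 4: Advanced topics and practical deployment

This part moves from pure model building to the considerations of a real production environment, emphasizing model reliability, interpretability, and maintainability.

Model interpretability (XAI): with black-box models now widespread, understanding why a model makes a particular prediction is essential. We introduce tools such as LIME and SHAP values to help analyze the contribution of individual features in complex models.

Time series basics: for time-dependent data, the components of ARIMA models and how to use the Prophet library for fast, accurate baseline forecasts.

Model persistence and pipelines: how to save trained models with Python's `pickle` or `joblib`. Going further, we use Scikit-learn's `Pipeline` object to wrap data preprocessing and model training into a single reusable workflow, a key step toward production deployment.

Features: the book's main feature is its hands-on approach. Every key concept comes with detailed, executable Python code examples, all tested in Jupyter Notebook. We use real, varied datasets (not heavily sanitized textbook data) so that readers build intuition and critical thinking for complex data challenges while mastering the techniques. By working through this book, you will master not just the syntax of algorithms but a complete, practice-oriented data mining methodology.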
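
The description's point that model evaluation is about more than accuracy can be illustrated with a small sketch. The code below is not taken from the book. It uses only the Python standard library, and the function names (`confusion_counts`, `summarize`) and the toy labels are hypothetical, chosen for illustration. It tallies a binary confusion matrix and shows how a classifier that always predicts the majority class can score high accuracy while finding no positive cases at all.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


def summarize(actual, predicted, positive=1):
    """Derive accuracy, precision and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Guard against division by zero when nothing is predicted
        # (precision) or nothing is actually positive (recall).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced data: 95 negatives and 5 positives.
actual = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority (negative) class.
always_negative = [0] * 100

report = summarize(actual, always_negative)
print(report)
```

On this toy data the accuracy looks strong, yet recall is zero because every positive case is missed. This is the kind of gap a confusion matrix exposes and a single accuracy figure hides.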

About the Author

Dr Graham Williams is Senior Director of Analytics with the Australian Taxation Office, and previously Principal Computer Scientist for Data Mining with CSIRO. He is also Visiting Professor and Senior International Scientist with the Shenzhen Institutes of Advanced Analytics of the Chinese Academy of Sciences, Adjunct Professor, Data Mining, Fraud Prevention, Security, University of Canberra, and Adjunct Professor, Australian National University. Graham regularly teaches data mining courses and is author of the freely available, open source data mining system, Rattle. He has been involved in many data mining projects for clients from government and industry over his long career. His research developments included ensemble learning (1980s) and hot spots discovery (1990s). He is actively involved in the international artificial intelligence and data mining research communities, particularly as chair of the Pacific Asia Knowledge Discovery and Data Mining conference series and founder and co-chair of the Australasian Data Mining conference series. Graham has edited a number of books and authored many academic and industry papers and reports. His current focus is on making data mining technology readily accessible, ensuring research, innovation and discovery are repeatable and available, and encouraging the free and open sharing of knowledge.

Table of Contents

Contents
Preface
I Explorations
1 Introduction
1.1 Data Mining Beginnings
1.2 The Data Mining Team
1.3 Agile Data Mining
1.4 The Data Mining Process
1.5 A Typical Journey
1.6 Insights for Data Mining
1.7 Documenting Data Mining
1.8 Tools for Data Mining: R
1.9 Tools for Data Mining: Rattle
1.10 Why R and Rattle?
1.11 Privacy
1.12 Resources
2 Getting Started
2.1 Starting R
2.2 Quitting Rattle and R
2.3 First Contact
2.4 Loading a Dataset
2.5 Building a Model
2.6 Understanding Our Data
2.7 Evaluating the Model: Confusion Matrix
2.8 Interacting with Rattle
2.9 Interacting with R
2.10 Summary
2.11 Command Summary
3 Working with Data
3.1 Data Nomenclature
3.2 Sourcing Data for Mining
3.3 Data Quality
3.4 Data Matching
3.5 Data Warehousing
3.6 Interacting with Data Using R
3.7 Documenting the Data
3.8 Summary
3.9 Command Summary
4 Loading Data
4.1 CSV Data
4.2 ARFF Data
4.3 ODBC Sourced Data
4.4 R Dataset: Other Data Sources
4.5 R Data
4.6 Library
4.7 Data Options
4.8 Command Summary
5 Exploring Data
5.1 Summarising Data
5.1.1 Basic Summaries
5.1.2 Detailed Numeric Summaries
5.1.3 Distribution
5.1.4 Skewness
5.1.5 Kurtosis
5.1.6 Missing Values
5.2 Visualising Distributions
5.2.1 Box Plot
5.2.2 Histogram
5.2.3 Cumulative Distribution Plot
5.2.4 Benford's Law
5.2.5 Bar Plot
5.2.6 Dot Plot
5.2.7 Mosaic Plot
5.2.8 Pairs and Scatter Plots
5.2.9 Plots with Groups
5.3 Correlation Analysis
5.3.1 Correlation Plot
5.3.2 Missing Value Correlations
5.3.3 Hierarchical Correlation
5.4 Command Summary
6 Interactive Graphics
6.1 Latticist
6.2 GGobi
6.3 Command Summary
7 Transforming Data
7.1 Data Issues
7.2 Transforming Data
7.3 Rescaling Data
7.4 Imputation
7.5 Recoding
7.6 Cleanup
7.7 Command Summary
II Building Models
8 Descriptive and Predictive Analytics
8.1 Model Nomenclature
8.2 A Framework for Modelling
8.3 Descriptive Analytics
8.4 Predictive Analytics
8.5 Model Builders
9 Cluster Analysis
9.1 Knowledge Representation
9.2 Search Heuristic
9.3 Measures
9.4 Tutorial Example
9.5 Discussion
9.6 Command Summary
10 Association Analysis
10.1 Knowledge Representation
10.2 Search Heuristic
10.3 Measures
10.4 Tutorial Example
10.5 Command Summary
11 Decision Trees
11.1 Knowledge Representation
11.2 Algorithm
11.3 Measures
11.4 Tutorial Example
11.5 Tuning Parameters
11.6 Discussion
11.7 Summary
11.8 Command Summary
12 Random Forests
12.1 Overview
12.2 Knowledge Representation
12.3 Algorithm
12.4 Tutorial Example
12.5 Tuning Parameters
12.6 Discussion
12.7 Summary
12.8 Command Summary
13 Boosting
13.1 Knowledge Representation
13.2 Algorithm
13.3 Tutorial Example
13.4 Tuning Parameters
13.5 Discussion
13.6 Summary
13.7 Command Summary
14 Support Vector Machines
14.1 Knowledge Representation
14.2 Algorithm
14.3 Tutorial Example
14.4 Tuning Parameters
14.5 Command Summary
III Delivering Performance
15 Model Performance Evaluation
15.1 The Evaluate Tab: Evaluation Datasets
15.2 Measure of Performance
15.3 Confusion Matrix
15.4 Risk Charts
15.5 ROC Charts
15.6 Other Charts
15.7 Scoring
16 Deployment
16.1 Deploying an R Model
16.2 Converting to PMML
16.3 Command Summary
IV Appendices
A Installing Rattle
B Sample Datasets
B.1 Weather
B.1.1 Obtaining Data
B.1.2 Data Preprocessing
B.1.3 Data Cleaning
B.1.4 Missing Values
B.1.5 Data Transforms
B.1.6 Using the Data
B.2 Audit
B.2.1 The Adult Survey Dataset
B.2.2 From Survey to Audit
B.2.3 Generating Targets
B.2.4 Finalising the Data
B.2.5 Using the Data
B.3 Command Summary
References
Index