Fundamentals of Big Data Applications
Contents
Part: Fundamentals
Chapter 1 Overview of Big Data
1.1 Data and Big Data
1.1.1 The Rapid Growth of Data
1.1.2 Big Data
1.1.3 Paradigms of Science
1.2 Where Big Data Comes From
1.3 Application Scenarios of Big Data
1.4 How Big Data Changes the Way We Think
1.5 Data Mining and Machine Learning
1.6 The Basic Workflow of a Data Science Project
1.7 Data Security and Big Data Ethics
1.7.1 Data Security
1.7.2 Big Data Ethics
1.8 Big Data Issues at the National Level
1.8.1 Data Sovereignty
1.8.2 Big Data and National Governance
1.8.3 Big Data Reshaping the Global Landscape
1.8.4 China's National Big Data Strategy
1.9 Cloud Computing
1.9.1 Characteristics of Cloud Computing
1.9.2 Typical Service Models of Cloud Computing
1.9.3 Deployment Environments for Cloud Computing Services
1.9.4 The Relationship Between Cloud Computing and Big Data
1.10 The Internet of Things
1.11 The Digital Economy
1.11.1 Big Data and the Digital Economy
1.11.2 Further Advancing the Development of China's Digital Economy
Chapter Summary
Exercises
Chapter 2 Python and Common Libraries
2.1 Introduction to Python
2.1.1 The Birth of Python
2.1.2 The Python Community
2.1.3 Python Versions
2.1.4 Why Use Python for Data Analysis
2.2 Installing and Running Python
2.2.1 Introduction to Anaconda and Its Installation
2.2.2 Running Python
2.2.3 Summary
2.3 Python Language Basics
2.3.1 Data Structures
2.3.2 Code Structure
2.3.3 Summary
2.4 Common Python Libraries for Data Analysis
2.4.1 Introduction to NumPy
2.4.2 Introduction to pandas
2.4.3 Summary
Chapter Summary
Exercises
Part: Data Analysis
Chapter 3 Data Acquisition
3.1 Data Sources
3.2 Web Data Crawling
3.2.1 Overview of Web Crawlers
3.2.2 Basics of Web Page Access
3.2.3 Crawling Web Page Data
3.2.4 Parsing Web Page Content
3.2.5 Common "Crawling vs. Anti-Crawling" Attack and Defense Strategies
3.3 Web Data Collectors
3.3.1 Common Collectors
3.3.2 A Data Collection Case Study with Bazhuayu (八爪鱼)
3.4 Acquiring Data with Selenium
3.4.1 Installing Selenium
3.4.2 Getting Page Elements with Selenium
3.4.3 Selenium Application: Acquiring Second-Hand Housing Data from Lianjia
Chapter Summary
Exercises
Chapter 4 Data Storage
4.1 Files
4.2 Traditional Database Technology
4.2.1 Database Management Systems
4.2.2 Conceptual Models of Databases
4.2.3 Relational Databases
4.2.4 Structured Query Language (SQL)
4.2.5 MySQL Database Management
4.2.6 Basic Database Operations with MySQL monitor
4.2.7 Basic Database Operations with HeidiSQL
4.3 NoSQL Databases
4.3.1 Background of NoSQL's Development
4.3.2 Types of NoSQL Databases
Chapter Summary
Exercises
Chapter 5 Data Preprocessing
5.1 Data Quality Problems
5.1.1 "Dirty" Data in the Real World
5.1.2 Causes of Data Quality Problems
5.1.3 Data Quality Auditing
5.2 Data Preprocessing Techniques
5.2.1 Data Cleaning
5.2.2 Data Integration
5.2.3 Data Transformation
5.2.4 Data Reduction
5.3 A Preprocessing Case Study
Chapter Summary
Exercises
Chapter 6 Data Visualization
6.1 Overview of Data Visualization
6.1.1 What Is Data Visualization
6.1.2 Common Data Visualization Tools
6.1.3 Python Visualization Libraries
6.2 Data Visualization with Matplotlib
6.2.1 Matplotlib Plotting Basics
6.2.2 Common Plots in Matplotlib
6.2.3 Drawing 3D Graphics with mplot3d
6.3 Data Visualization with pandas
6.3.1 pandas Plotting Basics
6.3.2 Common Plots in pandas
6.4 Data Visualization with seaborn
6.4.1 seaborn Plotting Basics
6.4.2 Common Plots in seaborn
6.5 Data Visualization with pyecharts
6.5.1 pyecharts Plotting Basics
6.5.2 Common Plots in pyecharts
Chapter Summary
Exercises
Chapter 7 Data Analysis Methods
7.1 Mathematical Foundations of Data Analysis Methods
7.1.1 Understanding Differentiation of Composite Functions
7.1.2 Understanding Partial Derivatives of Multivariable Functions
7.1.3 Understanding the Least Squares Method
7.1.4 Understanding Gradients
7.1.5 Understanding Probability
7.1.6 Understanding Conditional Probability
7.1.7 Understanding Bayes' Formula
7.2 Regression
7.2.1 Basic Concepts and Methods of Regression
7.2.2 Performance Metrics for Regression Prediction
7.2.3 Linear Regression
7.3 Classification
7.3.1 Basic Methods of Classification
7.3.2 Performance Metrics for Classification Tasks
7.3.3 Logistic Regression
7.3.4 Support Vector Machines
7.3.5 Decision Tree Theory
7.3.6 Naive Bayes
7.3.7 The k-Nearest Neighbors (k-NN) Algorithm
7.4 Clustering
7.4.1 Clustering Algorithms
7.4.2 The K-means Clustering Algorithm
7.4.3 A K-means Clustering Case Study
7.5 Text Analysis
7.5.1 Basic Steps of Text Analysis
7.5.2 Basic Concepts of Text Analysis
7.5.3 A Text Analysis Case Study
Chapter Summary
Exercises
Part: Big Data Platforms
Chapter 8 Linux Operating System Basics
8.1 Introduction to the Linux Operating System
8.1.1 Operating Systems
8.1.2 The Linux Operating System
8.1.3 Why Big Data Platforms Are Built on Linux
8.2 Basic Linux Commands
8.2.1 Directory and File Operation Commands
8.2.2 Text Filtering and Processing
8.2.3 Shell Input and Output Commands
8.2.4 Process Management Commands
8.2.5 Everyday Commands
Chapter Summary
Exercises
Chapter 9 Big Data Management Platforms
9.1 Application Scenarios
9.2 Development History
9.3 Technology Architecture
9.3.1 Data Collection Layer
9.3.2 Data Storage Layer
9.3.3 Resource Management Layer
9.3.4 Compute Engine Layer
9.3.5 Data Analysis Layer
9.3.6 Data Visualization Layer
9.3.7 The Technology Stack of Big Data Management Platforms
Chapter Summary
Exercises
Chapter 10 Distributed Storage
10.1 Introduction to HDFS
10.2 Basic Architecture of HDFS
10.3 Accessing HDFS via the Shell
Chapter Summary
Exercises
Chapter 11 Distributed Processing
11.1 The Idea of Distributed Computing
11.2 MapReduce
11.2.1 Introduction to MapReduce
11.2.2 The MapReduce Programming Model
11.2.3 A MapReduce Program Example
11.3 Spark
11.3.1 Introduction to Spark
11.3.2 The Spark Programming Model
11.3.3 A Spark Program Example
11.4 Advantages of Spark over Hadoop
Chapter Summary
Exercises
References
Appendix A Installing Linux in a Virtual Machine
A.1 Overview of Virtual Machine Technology
A.2 Installing Virtual Machine Hosting Software
A.3 Installing Linux in the Virtual Machine
Appendix B Installing Hadoop and Spark
B.1 Basic Cluster Configuration
B.2 Installing Hadoop
B.3 Installing Spark
