1. Data Analysis and Visualization

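Before choosing a model, it usually helps to visualize the data and look for visible relationships, for example whether two variables are linearly related. This section lists a few quick visualization recipes for finding such relationships.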
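The snippets in these notes operate on a pandas DataFrame named `dataset`, but the notes do not show how it is created. As one possible setup (an assumption, not the original author's data source), the iris data bundled with scikit-learn can be turned into a DataFrame with four numeric columns followed by one class column, which matches the column layout used later in the training section.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for the real dataset.
iris = load_iris(as_frame=True)
dataset = iris.frame.copy()

# Replace the integer target with readable class names in a final 'class' column.
dataset["class"] = iris.target_names[iris.target.to_numpy()]
dataset = dataset.drop(columns="target")

print(dataset.shape)                    # rows, columns
print(dataset.describe())               # summary statistics per numeric column
print(dataset.groupby("class").size())  # number of rows per class
```

Checking the shape, the summary statistics, and the class counts first makes the plots that follow easier to interpret.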

1.1. Univariate

1.1.1. Box plot

import matplotlib.pyplot as plt

# dataset is a pandas DataFrame; draw one box plot per column
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

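The figure below shows the resulting box plots. In each plot the green line marks the median, the box spans the 25th to 75th percentiles, the whiskers end at the most extreme values that are not treated as outliers, and circles mark the outliers.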

[Figure 001: box plots, one per column of the dataset]
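To make the parts of a box plot concrete, the following sketch computes the same quantities by hand. It assumes matplotlib's default whisker rule (whiskers end at the furthest data points within 1.5 times the interquartile range of the box); `box_stats` is a hypothetical helper name, not part of pandas or matplotlib.

```python
import pandas as pd


def box_stats(values):
    """Return the quantities a default box plot draws for one column."""
    s = pd.Series(values, dtype=float).dropna()
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Assumed default rule: points beyond 1.5 * IQR from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = s[(s >= lower_fence) & (s <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": s[(s < lower_fence) | (s > upper_fence)].tolist(),
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this toy column the value 50 falls above the upper fence, so it would be drawn as a circle, while the upper whisker stops at 9.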

1.1.2. Histogram

...
# histograms
dataset.hist()
plt.show()

The resulting histograms, one per numeric column:

[Figure 002: histograms of the dataset columns]
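A histogram splits the value range into equal-width bins and counts how many samples fall into each. The sketch below reproduces that counting with `numpy.histogram` on made-up values, using 4 bins for readability (`DataFrame.hist` defaults to 10).

```python
import numpy as np

# Made-up sample values, chosen only to keep the bin arithmetic easy to follow.
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])

# Four equal-width bins between the minimum (1.0) and the maximum (9.0).
counts, edges = np.histogram(values, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```

An empty bin between a dense region and a lone value, as in the third bin here, is a typical hint that the column has a long tail or an outlier.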

1.2. Multivariate

1.2.1. Pairwise distributions between variables

from pandas.plotting import scatter_matrix
...
# scatter plot matrix
scatter_matrix(dataset)
plt.show()

The resulting scatter plot matrix:

[Figure 003: scatter plot matrix of the dataset columns]

1.2.2. Correlation between variables

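Display the correlation values and draw them as a heatmap. The closer the absolute value is to 1, the stronger the correlation; the sign gives its direction.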

import matplotlib.pyplot as plt
import seaborn as sns
# dataset is dataframe
correlation = dataset.corr()
# display(correlation)
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, vmax=1, cmap="RdBu_r")
plt.show()

[Figure 004: annotated correlation heatmap]
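With many columns a heatmap gets crowded, so it can help to list the column pairs ordered by the strength of their correlation. The following is a minimal sketch on synthetic data; the column names `x`, `pos`, `neg`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "pos": 2 * x + rng.normal(scale=0.1, size=200),  # strongly positive with x
    "neg": -x + rng.normal(scale=0.1, size=200),     # strongly negative with x
    "noise": rng.normal(size=200),                   # unrelated to x
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value: a correlation near -1 is as strong as one near +1.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting on the absolute value matters: sorting on the raw value would push strongly negative pairs to the bottom of the list, next to the uncorrelated ones.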

1.2.3. Comparing two variables

# Visualize the correlation between pH and fixed acidity

# Create a new DataFrame containing only the pH and fixed acidity columns
fixedAcidity_pH = dataset[['pH', 'fixed acidity']]

# Initialize a JointGrid with the DataFrame, using seaborn
# (height replaces the size argument of older seaborn releases)
gridA = sns.JointGrid(x="fixed acidity", y="pH", data=fixedAcidity_pH, height=6)

# Draw a regression plot in the joint axes
gridA = gridA.plot_joint(sns.regplot, scatter_kws={"s": 10})

# Draw the marginal distributions in the same grid
# (histplot with kde=True stands in for the deprecated distplot)
gridA = gridA.plot_marginals(sns.histplot, kde=True)

[Figure 005: joint plot of fixed acidity against pH with regression line and marginal distributions]
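The regression line in such a joint plot is a least-squares straight-line fit, and the same relationship can be put into numbers. The sketch below fits the line with `numpy.polyfit` and computes the Pearson correlation. The values are synthetic stand-ins generated for illustration, not real wine measurements, and the slope and intercept used to generate them are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in data: pH decreasing with fixed acidity, plus noise.
fixed_acidity = rng.uniform(4.0, 16.0, size=300)
ph = 3.9 - 0.06 * fixed_acidity + rng.normal(scale=0.05, size=300)
demo = pd.DataFrame({"fixed acidity": fixed_acidity, "pH": ph})

# Least-squares fit of pH = slope * fixed_acidity + intercept.
slope, intercept = np.polyfit(
    demo["fixed acidity"].to_numpy(), demo["pH"].to_numpy(), deg=1
)

# Pearson correlation coefficient between the two columns.
r = demo["fixed acidity"].corr(demo["pH"])

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A negative slope together with a correlation close to -1 says the same thing as a downward-sloping regression line with points clustered tightly around it.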

2. Model selection

sklearn provides many models. On a given dataset, different models produce different results after training, so it is worth trying several.

Commonly used models and how to import them:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
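One way to see how differently these models behave on the same data is to score each with k-fold cross-validation. The sketch below assumes the iris data bundled with scikit-learn as a stand-in dataset; the short names, the fold count, and the `max_iter` setting are illustrative choices, not values from the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real dataset.
X, y = load_iris(return_X_y=True)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=1),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

# The same folds are used for every model so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the mean and the spread across folds gives a steadier picture than a single train/test split, especially on small datasets.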

3. Model training

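Once a model has been chosen, it has to be trained. First split the dataset into a training set and a validation set, then fit the model and report its accuracy.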

from sklearn.model_selection import train_test_split
...
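# Extract X and y from the dataset: X holds the features, y the class labels
X_array = dataset.values[:, :4]
y_array = dataset.values[:, 4]
# Hold out 20% for validation, i.e. a 1:4 validation-to-training split
X_train, X_test, y_train, y_test = train_test_split(X_array, y_array, test_size=0.2, shuffle=True)
# SVC is imported from sklearn.svm (see section 2)
model = SVC()
model.fit(X_train, y_train)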
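Because the split is shuffled, every run produces a different validation set and a slightly different score. A minimal self-contained sketch of a reproducible variant follows, assuming the bundled iris data as a stand-in; `random_state=7` is an arbitrary seed, and `stratify=y` is an optional addition that keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: iris stands in for the real dataset.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% validation, 80% training
    shuffle=True,
    stratify=y,        # keep class proportions the same in both parts
    random_state=7,    # arbitrary fixed seed for reproducibility
)

model = SVC()
model.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))            # samples per class in the validation set
print(model.score(X_test, y_test))    # mean accuracy on the validation set
```

Fixing the seed makes results comparable between runs; for a final estimate it is still worth checking that the score does not depend heavily on one particular seed.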

4. Scoring, prediction, and saving the model

# Mean accuracy on the validation set
print(model.score(X_test, y_test))
# Predicted class for each validation sample
prediction = model.predict(X_test)
import joblib

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# some time later...

# load the model from disk
loaded_model = joblib.load(filename)
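A quick way to confirm that saving and loading worked is to check that the reloaded model gives the same predictions as the original. The sketch below is self-contained and assumes the bundled iris data as a stand-in; it writes to a temporary directory so that no file is left behind, and it also shows that a classifier's `score` is the same number as `accuracy_score` computed from the predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: iris stands in for the real dataset; the seed is arbitrary.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)
model = SVC().fit(X_train, y_train)

# For classifiers, model.score is the accuracy of model.predict.
prediction = model.predict(X_test)
manual_accuracy = accuracy_score(y_test, prediction)

# Round-trip the model through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

same_predictions = np.array_equal(loaded_model.predict(X_test), prediction)
print(manual_accuracy, same_predictions)
```

A saved model should be loaded with the same scikit-learn version that produced it, and only from files you trust, since loading a pickle-based file can execute arbitrary code.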

5. References

Your First Machine Learning Project in Python Step-By-Step

How to Use Data Science to Understand What Makes Wine Taste Good

Save and Load Machine Learning Models in Python with scikit-learn
