Machine Learning for Developers

Normalization or standardization

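This technique rescales each feature so that it has the mean and standard deviation of a standard normal distribution, that is, a mean of 0 and a standard deviation of 1. The rescaling only changes the location and spread of the data; it does not change the shape of its distribution.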

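These properties are obtained by calculating the so-called z scores of the dataset samples with the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Here $x$ is a sample value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.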
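Before turning to a library, it may help to see the z score calculation done by hand. The following is a minimal sketch using NumPy only: it subtracts the mean from every sample and divides by the standard deviation. The values in `x` are hypothetical and were chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()          # mean of the samples
sigma = x.std()        # population standard deviation (ddof=0)
z = (x - mu) / sigma   # z score of every sample

print("mean:", mu, "std:", sigma)
print("z scores:", z)

# After the transformation the mean is 0 and the standard deviation is 1
print("z mean:", z.mean(), "z std:", z.std())
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses, so the two approaches should agree on the same data.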

Let's visualize and practice this new concept with the help of scikit-learn, reading a file from the MPG dataset, which contains city-cycle fuel consumption in miles per gallon, based on the following features: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.

from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt

# Load the MPG dataset and list the available columns
df = pd.read_csv("data/mpg.csv")
print(df.columns)

# Fit the scaler on the two columns of interest, then standardize them
partialcolumns = df[['acceleration', 'mpg']]
std_scale = preprocessing.StandardScaler().fit(partialcolumns)
df_std = std_scale.transform(partialcolumns)

# Plot the original data (grey triangles) and the standardized data together
plt.figure(figsize=(10, 8))
plt.scatter(partialcolumns['acceleration'], partialcolumns['mpg'], color="grey", marker='^')
plt.scatter(df_std[:, 0], df_std[:, 1])
plt.show()

The following picture allows us to compare the non-normalized and normalized data representations:

Depiction of the original dataset, and its normalized counterpart.
It's very important to denormalize the resulting data at evaluation time, so that the results remain representative of the original data. This matters especially when the model is applied to regression, because the predicted values are not useful until they are scaled back to the original units.
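One way to denormalize with scikit-learn is the `inverse_transform` method of a fitted `StandardScaler`. The sketch below assumes that the regression target was standardized with its own scaler. The values in `y` are hypothetical, and `pred_std` is only a stand-in for a model's output, not the prediction of a real model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mpg-like target values, for illustration only
y = np.array([[18.0], [15.0], [26.0], [32.0], [24.0]])

# Fit a scaler on the target and standardize it
y_scaler = StandardScaler().fit(y)
y_std = y_scaler.transform(y)

# Stand-in for model predictions, expressed in standardized units
pred_std = y_std + 0.1

# Map the predictions back to the original units
pred = y_scaler.inverse_transform(pred_std)

# The same operation written out by hand: z * sigma + mu
manual = pred_std * y_scaler.scale_ + y_scaler.mean_

print(pred.ravel())
print(np.allclose(pred, manual))
```

Keeping the fitted scaler object around is the important design choice here: the mean and standard deviation learned from the training data are needed again whenever predictions have to be reported in the original units.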