Machine Learning for Developers

One hot encoding

Numerical or categorical information can usually be represented by integers, one for each option or discrete result. In some situations, however, a set of bins that flag the current option is preferred. This form of data representation is called one hot encoding. It transforms an input value into a binary array that contains only zeros, except at the position indicated by the value of the variable, which is set to one.

In the simple case of integers, the list [1, 3, 2, 4] is represented in one hot encoding as follows (positions are counted from zero here, with five possible values, 0 to 4):

[[0 1 0 0 0]
[0 0 0 1 0]
[0 0 1 0 0]
[0 0 0 0 1]]
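
As a quick check, the matrix above can be reproduced with NumPy's identity matrix. This is only a minimal sketch: the names values and num_classes are illustrative, and it assumes zero-based values with five possible classes, which is what the matrix implies.

```python
import numpy as np

values = [1, 3, 2, 4]   # zero-based category indices
num_classes = 5         # assumption: values range over 0..4

# Row i of the identity matrix is the one hot vector for category i,
# so indexing np.eye with the values selects one row per input element.
one_hot = np.eye(num_classes, dtype=int)[values]
print(one_hot)
```

Each row contains a single 1, placed in the column given by the corresponding input value.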

Let's write a simple one hot encoder for integer arrays, in order to better understand the concept:

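import numpy as np

def get_one_hot(input_vector):
    # One slot per possible value, assuming values are integers from 1 to max
    length = max(input_vector)
    result = []
    for i in input_vector:
        newval = np.zeros(length)
        # Values are 1-based, array indexes are 0-based
        newval[i - 1] = 1
        result.append(newval)
    return result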

In this example, we first define the get_one_hot function, which takes a list or array of integers as input and returns a list of NumPy arrays, one per input element.

We take the elements of the array one by one, and for each element we generate a zero array with length equal to the maximum value of the input, in order to have space for all possible values. Then we set a 1 at the index position indicated by the current value (we subtract 1 because we go from 1-based values to 0-based indexes). Note that this assumes the values are positive integers starting at 1.

Let's try the function we just wrote:

get_one_hot([1,5,2,4,3])

#Out:
[array([ 1., 0., 0., 0., 0.]),
array([ 0., 0., 0., 0., 1.]),
array([ 0., 1., 0., 0., 0.]),
array([ 0., 0., 0., 1., 0.]),
array([ 0., 0., 1., 0., 0.])]
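
The same encoding can also be built without an explicit Python loop. The following is a minimal sketch rather than the book's implementation: get_one_hot_vectorized is a hypothetical name, and it keeps the same assumption that the input holds positive, 1-based integers. It also shows how argmax can recover the original values.

```python
import numpy as np

def get_one_hot_vectorized(input_vector):
    """Hypothetical vectorized variant for positive, 1-based integers."""
    values = np.asarray(input_vector)
    # One row per input element, one column per possible value 1..max
    result = np.zeros((values.size, values.max()))
    # Set result[row, value - 1] = 1 for every row at once
    result[np.arange(values.size), values - 1] = 1
    return result

encoded = get_one_hot_vectorized([1, 5, 2, 4, 3])

# argmax gives the 0-based position of the 1 in each row;
# adding 1 undoes the shift and recovers the original values.
decoded = encoded.argmax(axis=1) + 1
print(decoded)  # [1 5 2 4 3]
```

Returning a single 2D array instead of a list of 1D arrays is a design choice made here for convenience, since most numerical libraries expect a matrix as input.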