Training Systems Using Python Statistical Modeling

Preprocessing the data

Now let's open up the Jupyter Notebook and get started on our first program, using the methods that we discussed in the previous section:

  1. First, we import the libraries that we need, along with the function that loads the iris dataset from the scikit-learn library, using the following code:
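A minimal sketch of these imports, assuming that pandas and scikit-learn are installed. The `pd` alias is a common convention rather than something the text specifies:

```python
# pandas provides the DataFrame used later in this section.
import pandas as pd

# load_iris is scikit-learn's loader for the bundled iris dataset.
from sklearn.datasets import load_iris

# Nothing is loaded yet; this only confirms that both imports resolved.
print(pd.__name__, load_iris.__name__)
```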

  2. After importing the required libraries, we create an object called iris_obj that holds the iris dataset. Then, we use its data attribute to preview the dataset, which results in the following output:
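As a sketch of this step, assuming the `load_iris` loader from scikit-learn, the object can be created and its data previewed like this. Printing only the first five rows is a choice made here to keep the output short:

```python
from sklearn.datasets import load_iris

# load_iris() returns a dictionary-like object whose entries
# (data, target, feature_names, ...) are also available as attributes.
iris_obj = load_iris()

# data is an attribute, not a method, so there are no parentheses.
print(type(iris_obj.data))  # <class 'numpy.ndarray'>
print(iris_obj.data.shape)  # (150, 4)
print(iris_obj.data[:5])    # first five flowers, one row per flower
```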

Notice that it's a NumPy array. It contains the measurements that we want, and each of its columns corresponds to a feature.

  3. We will now see what those feature names are in the following output:
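One way to list the feature names, sketched with the `feature_names` attribute that scikit-learn attaches to the loaded object:

```python
from sklearn.datasets import load_iris

iris_obj = load_iris()

# One name per column of iris_obj.data, in column order.
print(iris_obj.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```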

As you can see here, the first column shows the sepal length, the next column shows the sepal width, the third column shows the petal length, and the final column shows the petal width.

  4. Now, there is a fifth column that is not displayed here; it's referred to as the target column. This is stored in a separate array, which we will now look at as follows:
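A short sketch of inspecting the target array. The `np.bincount` call is an addition made here to show how many flowers carry each numeric code:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_obj = load_iris()

# target holds one integer code per flower, aligned with the rows of data.
print(iris_obj.target)
print(iris_obj.target.shape)         # (150,)

# Count how many flowers carry each code (0, 1 and 2).
print(np.bincount(iris_obj.target))  # [50 50 50]
```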

This displays the target column in an array.

  5. Now, if we want to see the names of the labels used in the target array, we can use the following code:
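The label names are available through the `target_names` attribute. In this sketch, the `code_to_name` dictionary is a hypothetical helper added only to make the code-to-species correspondence explicit:

```python
from sklearn.datasets import load_iris

iris_obj = load_iris()

# The position of each name matches the integer code used in target.
print(iris_obj.target_names)  # ['setosa' 'versicolor' 'virginica']

# Hypothetical helper: pair each numeric code with its species name.
code_to_name = {code: str(name) for code, name in enumerate(iris_obj.target_names)}
print(code_to_name)  # {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
```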

As you can see, the target column consists of data with three different labels. The flowers come from either the setosa, the versicolor, or the virginica species.

  6. Our next step is to take this dataset and turn it into a pandas DataFrame, using the following code:
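A minimal sketch of one way to build the DataFrame, assuming that the feature names serve as column headers and that the target is added as a fifth column named `species` (the column name used later in this section). It is not necessarily the exact construction used in the original listing:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_obj = load_iris()

# Four measurement columns, labelled with the dataset's feature names.
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)

# Fifth column: the numeric species code for each flower.
iris["species"] = iris_obj.target

print(iris.head())
```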

This results in the following output:

As you can see, we have successfully loaded the data into a DataFrame.

  7. We can see that the species column still identifies the species using numeric values. So, we will replace the numbers in this final column with strings that name each species, using the following code block:
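The replacement can be sketched with the `Series.replace` method in pandas. The `species_names` dictionary is a name introduced here for illustration, and the result is assigned back to the column rather than modified in place:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_obj = load_iris()
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)
iris["species"] = iris_obj.target

# Map each numeric code to its species name.
species_names = {0: "setosa", 1: "versicolor", 2: "virginica"}
iris["species"] = iris["species"].replace(species_names)

print(iris.head())
```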

The following screenshot shows the result:

As you can see, the species column now has the actual species names, which makes it much easier to work with the data.

Now, for this dataset, the fact that each flower comes from a different species suggests that we may want to group the data when we're doing statistical summaries. Therefore, we can try grouping by species.

  8. So, we will now group the dataset using the species column as the key, and then print out the details of each group to make sure that everything is working. We will use the following lines of code to do so:
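A sketch of the grouping step using `DataFrame.groupby`. The variable name `iris_grps` and the decision to print only the first three rows of each group are assumptions made here for brevity. The DataFrame is rebuilt so that the example runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_obj = load_iris()
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)
# Indexing target_names with the integer codes yields one name per flower.
iris["species"] = iris_obj.target_names[iris_obj.target]

# Split the rows into one group per species.
iris_grps = iris.groupby("species")

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
for name, group in iris_grps:
    print(name)
    print(group.iloc[:, 0:4].head(3))  # measurement columns only
```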

This results in the following output:

Now that the data has been loaded and set up, we will use it to perform some basic statistical operations in the next section.