Practical Data Science Cookbook (Second Edition)

How to do it...

Generally speaking, the data for the United States is broken up into six groups:

Top 10 percent income share
Top 5 percent income share
Top 1 percent income share
Top 0.5 percent income share
Top 0.1 percent income share
Average income share

These groups reflect aggregations for data points collected in those specific bins. An easy and quick first analysis is to simply plot these percentages of income shares over time for each of the top income groups. Since plotting several time series is going to be a common task, let's once again create a helper function that wraps matplotlib and generates a line chart for each time series that is passed to it:

In [24]: def linechart(series, **kwargs):
    ...:     fig = plt.figure()
    ...:     ax = plt.subplot(111)
    ...:     for line in series:
    ...:         line = list(line)
    ...:         xvals = [v[0] for v in line]
    ...:         yvals = [v[1] for v in line]
    ...:         ax.plot(xvals, yvals)
    ...:     if 'ylabel' in kwargs:
    ...:         ax.set_ylabel(kwargs['ylabel'])
    ...:     if 'title' in kwargs:
    ...:         plt.title(kwargs['title'])
    ...:     if 'labels' in kwargs:
    ...:         ax.legend(kwargs.get('labels'))
    ...:     return fig
    ...:

This function is very simple. It creates a matplotlib.pyplot figure along with an axis subplot. For each line in the series, it takes the x axis values from the first item of each tuple (remember that the first item in our time series tuples is Year) and the y axis values from the second item. It splits these into separate lists and then plots them on the figure. Finally, any options we want for our chart, such as labels or legends, can simply be passed as keyword arguments, and our function will handle them for us! The following steps will walk you through this recipe of application-oriented analysis.
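
The line = list(line) step matters because a time series may arrive as a generator, which can only be traversed once. The following is a minimal sketch of that behavior, using an invented toy_series generator with made-up (year, value) pairs rather than the real dataset:

```python
def toy_series():
    """Yield (year, value) tuples; the numbers are invented for illustration."""
    yield (2008, 45.0)
    yield (2009, 44.0)
    yield (2010, 46.0)

# Materialize the generator first: it can only be traversed once,
# so a second comprehension over it would otherwise see no items.
line = list(toy_series())
xvals = [v[0] for v in line]   # years go on the x axis
yvals = [v[1] for v in line]   # values go on the y axis

# Without list(), the first pass exhausts the generator.
gen = toy_series()
first_pass = [v[0] for v in gen]
second_pass = [v[1] for v in gen]  # empty: nothing left to iterate

print(xvals)
print(yvals)
print(second_pass)
```

This is why linechart converts each line to a list before building xvals and yvals.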

In order to generate our chart, we simply need to call our timeseries function on the columns we would like and pass the results to the linechart function. This simple task is now repeatable, and we'll use it a few times for the next few charts:

In [25]: def percent_income_share(source):
    ...:     """
    ...:     Create Income Share chart
    ...:     """
    ...:     columns = (
    ...:         "Top 10% income share",
    ...:         "Top 5% income share",
    ...:         "Top 1% income share",
    ...:         "Top 0.5% income share",
    ...:         "Top 0.1% income share",
    ...:     )
    ...:     source = list(dataset(source))
    ...:     return linechart([timeseries(source, col) for col in columns],
    ...:                      labels=columns,
    ...:                      ylabel="Percentage")
    ...:

Note that I wrapped the generation of this chart in a function as well; this way, we can modify the chart as needed, and the function wraps the configuration and generation of the chart itself. The function identifies the columns for the line series and then fetches the dataset. For each column, it creates a time series and then passes these time series to our linechart function with our configuration options.
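
The source = list(dataset(source)) line deserves a closer look: the rows are reused once per column, so they need to be held in a list rather than in a one-shot iterator. The sketch below illustrates this with hypothetical stand-ins (toy_dataset, toy_timeseries, and an inline CSV with invented numbers). The real dataset and timeseries helpers are defined earlier in the chapter and may differ in detail.

```python
import csv
import io

# Invented sample rows; the real recipe reads from a data file instead.
SAMPLE = """Year,Top 10% income share,Top 1% income share
2008,45.0,17.0
2009,44.0,16.0
2010,46.0,17.5
"""

def toy_dataset(text):
    """Hypothetical stand-in for dataset(): yield one dict per CSV row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row

def toy_timeseries(rows, column):
    """Hypothetical stand-in for timeseries(): yield (year, value) tuples."""
    for row in rows:
        yield (int(row["Year"]), float(row[column]))

columns = ("Top 10% income share", "Top 1% income share")

# Materialize once so every column can iterate over the same rows.
rows = list(toy_dataset(SAMPLE))
series = [list(toy_timeseries(rows, col)) for col in columns]

# Passing the generator directly leaves nothing for the second column.
lazy = toy_dataset(SAMPLE)
lazy_series = [list(toy_timeseries(lazy, col)) for col in columns]

print(series[1])
print(lazy_series[1])
```

In the lazy variant, the first column consumes every row, so the second column comes back empty.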

To generate the plot, we pass the data file to the percent_income_share function:

In [26]: percent_income_share(data_file)
Out[26]:

The following screenshot shows the result, and you will use a similar pattern in the rest of this chapter to use the functions to create the needed plots:

This graph tells us that the raw percentages for the income groups tend to move in the same direction. When one group's income increases, the other groups' incomes also increase. This seems like a good sanity check as folks who are in the top 0.1 percent income bracket are also in the top 10 percent income bracket, and they contribute a lot to the overall mean for each bin. There is also a clear, persistent difference between each of the lines.

Looking at the raw percentages is useful, but we may also want to consider how the percentages have changed over time, relative to what the average percentage has been for that income group. In order to do this, we can calculate the means of each group's percentages and then divide all of the group's values by the mean we just calculated.

Since mean normalization is another common function that we might want to perform on a range of datasets, we will once again create a function that will accept a time series as input and return a new time series whose values are divided by the mean:

In [27]: def normalize(data):
    ...:     """
    ...:     Normalizes the data set. Expects a timeseries input
    ...:     """
    ...:     data = list(data)
    ...:     norm = np.array(list(d[1] for d in data), dtype="f8")
    ...:     mean = norm.mean()
    ...:     norm /= mean
    ...:     return zip((d[0] for d in data), norm)
    ...:
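
To make the effect of mean normalization concrete, here is a small worked sketch in plain Python. The three values are invented, and normalize_plain is a hypothetical, NumPy-free stand-in for the normalize function above, not a replacement for it. After normalization, a value of 1.0 means the group was exactly at its own average in that year.

```python
from statistics import fmean

def normalize_plain(data):
    """Divide each value in a (year, value) series by the series mean."""
    data = list(data)  # allow generators as input
    mean = fmean(value for _, value in data)
    return [(year, value / mean) for year, value in data]

# Invented values with a mean of 40.0.
series = [(2008, 30.0), (2009, 40.0), (2010, 50.0)]
normalized = normalize_plain(series)

for year, value in normalized:
    print(year, value)
```

Because every series is rescaled around 1.0, groups with very different raw percentages can be compared on the same axis.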

We can now easily write another function that takes these columns and computes the mean normalized time series:

In [28]: def mean_normalized_percent_income_share(source):
    ...:     columns = (
    ...:         "Top 10% income share",
    ...:         "Top 5% income share",
    ...:         "Top 1% income share",
    ...:         "Top 0.5% income share",
    ...:         "Top 0.1% income share",
    ...:     )
    ...:     source = list(dataset(source))
    ...:     return linechart([normalize(timeseries(source, col)) for col in columns],
    ...:                      labels=columns,
    ...:                      ylabel="Percentage")
    ...:
    ...: mean_normalized_percent_income_share(data_file)
    ...: plt.show()
    ...:

Note how this function is very similar to the previous one, except that it normalizes each time series before plotting. The same chart can also be produced from a standard Python prompt:

>>> fig = mean_normalized_percent_income_share(data_file)
>>> fig.show()

The preceding commands give us the following graph:

This graph shows us that the wealthier the group, the larger the percentage-wise swings we tend to see in their incomes.

The dataset also breaks the group's income into categories, such as income that includes capital gains versus income without capital gains. Let's take a look at how each group's capital gains income fluctuates over time.

Another common functionality is to compute the difference between two columns and plot the resulting time series. Computing the difference between two NumPy arrays is also very easy, and since it is common for our task, we write yet another function to do the job:

In [29]: def delta(first, second):
    ...:     """
    ...:     Returns an array of deltas for the two arrays.
    ...:     """
    ...:     first = list(first)
    ...:     years = yrange(first)
    ...:     first = np.array(list(d[1] for d in first), dtype="f8")
    ...:     second = np.array(list(d[1] for d in second), dtype="f8")
    ...:     if first.size != second.size:
    ...:         # Pad the start of the first array with four missing values (NaN)
    ...:         first = np.insert(first, [0, 0, 0, 0], [np.nan] * 4)
    ...:     diff = first - second
    ...:     return zip(years, diff)
    ...:
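
The padding step in delta appears to assume that, when the lengths differ, the first series is exactly four entries short and that the gap sits at the start. As a hedged alternative (not the book's method), the difference can be computed by matching on the year itself. The function name delta_by_year and the sample numbers below are invented for illustration.

```python
import math

def delta_by_year(first, second):
    """Return (year, first - second) pairs, matching values by year.

    Years missing from either series get NaN instead of a difference.
    """
    first = dict(first)
    second = dict(second)
    years = sorted(set(first) | set(second))
    return [
        (year, first.get(year, math.nan) - second.get(year, math.nan))
        for year in years
    ]

# Invented values: the first series has no entry for 2008.
with_gains = [(2009, 46.5), (2010, 48.0)]
without_gains = [(2008, 45.0), (2009, 44.0), (2010, 46.0)]

result = delta_by_year(with_gains, without_gains)
print(result)
```

Matching by year avoids the hard-coded padding length, at the cost of building two dictionaries.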

The delta function relies on the following helper function, which yields each distinct year in a time series:

In [30]: def yrange(data):
    ...:     """
    ...:     Get the range of years from the dataset
    ...:     """
    ...:     years = set()
    ...:     for row in data:
    ...:         if row[0] not in years:
    ...:             yield row[0]
    ...:             years.add(row[0])
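
Because yrange is a generator, it yields each year the first time it appears and skips any repeats, preserving the original order. The following quick sketch uses invented rows; the function body is repeated only so that the snippet runs on its own:

```python
def yrange(data):
    """Yield each distinct year (row[0]) once, in first-seen order."""
    years = set()
    for row in data:
        if row[0] not in years:
            yield row[0]
            years.add(row[0])

# Invented rows; 2009 appears twice on purpose.
rows = [(2008, 45.0), (2009, 44.0), (2009, 44.2), (2010, 46.0)]
unique_years = list(yrange(rows))
print(unique_years)
```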

The delta function once again creates NumPy arrays from each dataset, casting the datatype to floats. Note that we need to get the list of years from one of the datasets, so we gather it from the first dataset.

We also need to keep in mind that first.size needs to be the same as second.size; that is, both arrays must have the same length. The difference is then computed, and the years are once again zipped to the data to form a time series. We can now use delta to chart the capital gains lift for each income group:

In [31]: def capital_gains_lift(source):
    ...:     """
    ...:     Computes capital gains lift in top income percentages over time chart
    ...:     """
    ...:     columns = (
    ...:         ("Top 10% income share-including capital gains",
    ...:          "Top 10% income share"),
    ...:         ("Top 5% income share-including capital gains",
    ...:          "Top 5% income share"),
    ...:         ("Top 1% income share-including capital gains",
    ...:          "Top 1% income share"),
    ...:         ("Top 0.5% income share-including capital gains",
    ...:          "Top 0.5% income share"),
    ...:         ("Top 0.1% income share-including capital gains",
    ...:          "Top 0.1% income share"),
    ...:         ("Top 0.05% income share-including capital gains",
    ...:          "Top 0.05% income share"),
    ...:     )
    ...:     source = list(dataset(source))
    ...:     series = [delta(timeseries(source, a), timeseries(source, b))
    ...:               for a, b in columns]
    ...:     return linechart(series,
    ...:                      labels=list(col[1] for col in columns),
    ...:                      ylabel="Percentage Difference")
    ...:
    ...: capital_gains_lift(data_file)
    ...: plt.show()
    ...:

The preceding code stores the columns as tuples of two columns (first and second) and then uses the delta function to compute the difference between the two. Like our previous graphs, it then creates a line chart as shown here:

This is interesting as the graph shows the volatility of the capital gains income over time. If you are familiar with U.S. financial history, you can see the effect on the capital gains income of the well-known stock market booms and busts in this chart.