Julia for Data Science

What is a DataFrame?

A DataFrame is a data structure that has labeled columns, which may each have a different data type. Like a SQL table or a spreadsheet, it is two-dimensional. It can also be thought of as a list of dictionaries, but fundamentally, it is different.

DataFrames are the recommended data structure for statistical analysis. Julia provides a package called DataFrames.jl, which has all the necessary functions to work with DataFrames.

Julia's DataFrames package provides three data types:

  • NA: A missing value in Julia is represented by a specific data type, NA.
  • DataArray: The array type defined in the standard Julia library, though it has many features, doesn't provide any functionality specific to data analysis. The DataArray type provided by DataFrames.jl adds such features (for example, storing missing values in an array).
  • DataFrame: A DataFrame is a 2-D data structure, like a spreadsheet. It is much like the DataFrames of R and pandas, and provides many functionalities to represent and analyze data.

The NA data type and its importance

In the real world, we often come across data with missing values. This is very common, yet it's not supported in Julia by default. The functionality is added by the DataFrames.jl package, which brings with it the DataArrays package. DataArrays provides the NA type and its singleton object NA, which we use to represent missing values. Multiple dispatch is one of the most powerful features of Julia, and NA is one example of its use.

Why is the NA data type needed?

Suppose, for example, we have a dataset of floating-point numbers:

julia> x = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

This will create a six-element Array{Float64,1}.

Now, suppose this dataset has a missing value at position [1]. That means instead of 1.1, there is no value. This cannot be represented by the array type in Julia. When we try to assign an NA value, we get this error:

julia> x[1] = NA
LoadError: UndefVarError: NA not defined
while loading In[2], in expression starting on line 1

Therefore, right now we cannot add NA values to the array that we have created.

So, to load the data into an array that can hold NA values, we use DataArray. This enables us to have NA values in our dataset:

julia> using DataArrays
julia> x = DataArray([1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

This will create a six-element DataArrays.DataArray{Float64,1}.

Now, when we assign an NA value, it works:

julia> x[1] = NA
NA
julia> x
6-element DataArrays.DataArray{Float64,1}:
  NA
 2.2
 3.3
 4.4
 5.5
 6.6

Therefore, by using DataArrays, we can handle missing data. Another feature is that an NA value doesn't always propagate through functions applied to the dataset: a method that doesn't involve the NA value, or is unaffected by it, can be applied as usual; if it does involve the NA value, it will give NA as the result.

In the following example, we apply the mean function and true || x. mean returns NA because it involves the NA value, but the short-circuiting true || x works as expected:

julia> true || x
true
julia> true && x[1]
NA
julia> mean(x)
NA
julia> mean(x[2:6])
4.4

DataArray – a series-like data structure

In the previous section, we discussed how DataArrays are used to store datasets containing missing (NA) values, as Julia's standard Array type cannot do so.

DataArray also offers features similar to Julia's Array type. Analogous to the type aliases Vector (one-dimensional Array) and Matrix (two-dimensional Array), DataArrays provides DataVector and DataMatrix.

Creating a 1-D DataArray is similar to creating an Array:

julia> using DataArrays
julia> dvector = data([10,20,30,40,50])
5-element DataArrays.DataArray{Int64,1}:
10
20
30
40
50

Here, unlike with plain Arrays, we can have NA values. Similarly, we can create a 2-D DataArray, which will be a DataMatrix:

julia> dmatrix = data([10 20 30; 40 50 60])
2x3 DataArrays.DataArray{Int64,2}:
10 20 30
40 50 60
julia> dmatrix[2,3]
60

In the previous example, to calculate the mean, we used slicing. This is not a convenient way to remove or ignore NA values while applying a function. A much better way is to use dropna:

julia> dropna(x)
5-element Array{Float64,1}:
2.2
3.3
4.4
5.5
6.6

DataFrames – tabular data structures

Arguably, this is the most important and most commonly used data type in statistical computing, whether in R (data.frame) or Python (pandas). This is because most real-world data is in a tabular or spreadsheet-like format, which cannot be represented by a simple DataArray:

julia> df = DataFrame(Name = ["Ajava Rhodiumhi", "Las Hushjoin"],
            Count = [14.04, 17.3],
            OS = ["Ubuntu", "Mint"])

This dataset, for example, can't be represented using a DataArray, because it has the following features:

  • This dataset has different types of data in different columns. These different data types cannot be represented using a matrix, which can only contain values of one type.
  • It is a tabular data structure, and records in the same row are related across columns. Therefore, all the columns must be of the same length. Vectors cannot be used because they cannot enforce columns of the same length. Therefore, a column in a DataFrame is represented by a DataArray.
  • The columns are labeled. This labeling helps us easily become familiar with the data and access it without the need to remember its exact position. The columns are accessible using numerical indices as well as their labels.

For these reasons, the DataFrames package is used: DataFrames represent tabular data with DataArrays as columns.

In the preceding example, we constructed the DataFrame using keyword arguments, which define the column names.

Let's take another example by constructing a new DataFrame:

julia> df2 = DataFrame()
julia> df2[:X] = 1:10
julia> df2[:Y] = ["Head", "Tail", "Head", "Head", "Tail",
                  "Head", "Tail", "Tail", "Head", "Tail"]
julia> df2

To find out the size of the DataFrame created, we use the size function:

julia> size(df2)
(10, 2)

Here, 10 refers to the number of rows and 2 refers to the number of columns.

To view the first few lines of the dataset, we use head(), and for the last few lines, we use the tail() function:

julia> head(df2)

As we have given names to the columns of the DataFrame, these can be accessed using these names.

For example:

julia> df2[:X]
10-element DataArrays.DataArray{Int64,1}:
 1
 2
 3
 4
 5
 6
 ...

This simplifies access to the columns as we can give meaningful names to real-world datasets that have numerous columns without the need to remember their numeric indices.

If needed, we can also rename these columns by using the rename! function:

julia> rename!(df2, :X, :newX)

If there is a need to rename multiple columns, then it is done by using this:

julia> rename!(df2, {:X => :newX, :Y => :newY}) 

But right now, we are sticking to old column names for ease of use.

Julia also provides a function called describe(), which summarizes the entire dataset. For a dataset with many columns, it can turn out to be very useful:

julia> describe(df2)
X
Min      1.0
1st Qu.  3.25
Median   5.5
Mean     5.5
3rd Qu.  7.75
Max      10.0
NAs      0
NA%      0.0%

Y
Length  10
Type    ASCIIString
NAs     0
NA%     0.0%
Unique  2

Installing and using DataFrames.jl

Installation is quite straightforward as it is a registered Julia package:

julia> Pkg.update()
julia> Pkg.add("DataFrames")

This installs DataFrames along with all the packages it requires. To bring it into the current namespace, we use:

julia> using DataFrames 

It is also good to have classical datasets that are common for learning purposes. These datasets can be found in the RDatasets package:

julia> Pkg.add("RDatasets")

The list of the R packages available can be found using:

julia> RDatasets.packages()

Here, you can see this:

datasets - The R Datasets Package 

It contains datasets available to R. To use this dataset, simply use the following:

using RDatasets
iris_dataset = dataset("datasets", "iris")

Here, dataset is a function that takes two arguments: the first is the name of the package and the second is the name of the dataset that we want to load.

In the preceding example, we loaded the famous iris dataset into memory. You can see that the dataset() function has returned a DataFrame. The dataset contains five columns: SepalLength, SepalWidth, PetalLength, PetalWidth, and Species. The data is quite easy to understand: a large number of samples were taken for every species, and the length and width of the sepal and petal were measured, which can later be used to distinguish between the species.

Actual data science problems generally do not deal with artificial, randomly generated data or data read from the command line; they work on data loaded from files or other external sources. These files can hold data in any format, and we may have to process it before loading it into a dataframe.

Julia provides the readtable() function, which can be used to read a tabular file into a dataframe. Generally, we come across datasets in comma-separated or tab-separated formats (CSV or TSV), and readtable() works perfectly with them.

We can give the location of the file as UTF8String and the separator type to the readtable() function as arguments. The default separator type is comma (',') for CSV, tab ('\t') for TSV, and whitespace (' ') for WSV.

In the following example, we load the sample iris dataset into a dataframe using the readtable() function.

Although the iris dataset is available in the RDatasets package, we will download the CSV to work with the external datasets. The iris CSV can be downloaded from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/data/iris.csv.

Remember to put the downloaded CSV into the current working directory (from where the REPL was started—generally it is the ~/home/<username> directory):

julia> using DataFrames
julia> df_iris_sample = readtable("iris_sample.csv", separator = ',')
julia> df_iris_sample

It is the same dataset that we used in the previous example, but now we are loading the data from a CSV file.

The readtable() is used in a similar way for other text-based datasets such as TSV, WSV, or TXT. Suppose the same iris dataset is in TSV, WSV, or TXT format. It will be used in a similar way:

julia> df_iris_sample = readtable("iris_dataset.tsv", separator='\t') 

And for example, if we have a dataset without a header and separated by ;, we would use readtable() as follows:

julia> df_random_dataset = readtable("random_dataset.txt", header=false, separator=';') 

The readtable() exploits Julia's functionality of multiple dispatch and has been implemented with different method behaviors:

julia> methods(readtable)
3 methods for generic function readtable:
readtable(io::IO) at /home/anshul/.julia/v0.4/DataFrames/src/dataframe/io.jl:820
readtable(io::IO, nbytes::Integer) at /home/anshul/.julia/v0.4/DataFrames/src/dataframe/io.jl:820
readtable(pathname::AbstractString) at /home/anshul/.julia/v0.4/DataFrames/src/dataframe/io.jl:930

We can see that there are three methods for the readtable() function.

These methods implement some of the advanced options to ease the loading and to support various kinds of data formats:

  • header::Bool: In the iris example we used, we had headers such as Sepal Length, Sepal Width, and so on, which makes it easier to describe the data. But headers are not always available in the dataset. The default value of header is true; therefore, whenever headers are not available, we pass the argument as false.
  • separator::Char: Data in a file must be organized in a way that forms a tabular structure, generally using ,, \t, or ;, or sometimes combinations of these. readtable() guesses the separator type from the file extension, but it is good practice to provide it manually.
  • nastrings::Vector{ASCIIString}: Suppose the dataset marks missing values with particular strings and we want NA to replace them. This is done using nastrings. By default, it takes empty records and replaces them with NA.
  • truestrings::Vector{ASCIIString}: This transforms the strings to Boolean true. It is used when we want a set of strings to be treated as true in the dataset. By default, True, true, T, and t are transformed if no argument is given.
  • falsestrings::Vector{ASCIIString}: This works just like truestrings but transforms the strings to Boolean false. By default, False, false, F, and f are transformed if no argument is given.
  • nrows::Int: If we want only a specific number of rows to be read by readtable(), we use nrows as the argument. By default, it is -1, which means that readtable() will read the whole file.
  • names::Vector{Symbol}: If we want some specific names for our columns, different from what is mentioned in the header, then we use names. Here, we pass a vector having the names of the columns that we want to use. By default, it is [], which means the names in the headers should be used if they are there; otherwise, the numeric indices must be used.
  • eltypes::Vector{DataType}: We can specify the column types by passing a vector, by using eltypes. It is an empty vector ([]) by default if nothing is passed.
  • allowcomments::Bool: In the dataset, we may have records having comments with them. These comments can be ignored. By default, it is false.
  • commentmark::Char: If we are using allowcomments, we will also have to mention the character (symbol) where the comment starts. By default, it is #.
  • ignorepadding::Bool: Our dataset might not be as perfect as we want. The records may contain whitespace characters on either side. This can be ignored using ignorepadding. By default, it is true.
  • skipstart::Int: Our dataset can have some rows describing the data with the header that we might not want, or we just want to skip the first few rows. This is done by skipstart, by specifying the number of rows to skip. By default, it is 0 and will read the entire file.
  • skiprows::Vector{Int}: If we want to skip some specific rows in the data, then skiprows is used. We only need to specify, in a vector, the indices of the rows that we want to skip. By default, it is [] and the entire file is read.
  • skipblanks::Bool: As mentioned earlier, our dataset may not be perfect. There can be some blank lines if we have scraped the data from the Web or extracted the data from other sources. We can skip these blank lines by using skipblanks. By default it is true, but we can choose otherwise if we do not want it.
  • encoding::Symbol: We can specify the encoding of the file if it is other than UTF8.
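
As a sketch, several of these options can be combined in one call. Here we reload the hypothetical random_dataset.txt from earlier; the column names, NA markers, and row limit are illustrative choices, not from the original example:

julia> df_random_dataset = readtable("random_dataset.txt",
                                     header = false,
                                     separator = ';',
                                     nastrings = ["", "NA", "-"],
                                     names = [:id, :height, :weight],
                                     nrows = 100)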

Writing the data to a file

We may also want to output our results or transform a dataset and store it in a file. In Julia we do this by using the writetable() function. It is very similar to the readtable() function that we discussed in the last section.

For example, we want to write the df_iris_sample dataframe into a CSV file:

julia> writetable("output_df_iris.csv", df_iris_sample)

This is the way of writing to a file with the default set of arguments. One visible difference is that we are passing the dataframe that we want to write with the name of the file that we want to write to.

writetable() also accepts various arguments, like readtable().

We could have also written the previous statement like this with the separator defined:

julia> writetable("output_df_iris.csv", df_iris_sample, separator = ',')

Similarly, we can have a header and quote marks in the arguments.
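
For instance, a hypothetical call that suppresses the header row and uses a single quote as the quote mark might look like this (the option values shown are illustrative):

julia> writetable("output_df_iris.csv", df_iris_sample,
                  separator = ',', header = false, quotemark = '\'')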

Working with DataFrames

We will follow some traditional strategies to manipulate data. In this section, we will go through these strategies and methods and discuss how and why they are important to data science.

Understanding DataFrames joins

While working with multiple datasets, we often need to merge the datasets in a particular fashion to make the analysis easier or to use it with a particular function.

We will be using the Road Safety Data published by the Department for Transport, UK, which is open under the Open Government Licence (OGL).

The datasets can be found here: https://data.gov.uk/dataset/road-accidents-safety-data.

We will be using two datasets:

  • Road Safety: Accidents 2015
  • Road Safety: Vehicles 2015

Note

DfTRoadSafety_Accidents_2015 contains columns such as Accident_Index, Location_Easting_OSGR, Location_Northing_OSGR, Longitude, Latitude, Police_Force, Accident_Severity, Number_of_Vehicles, Number_of_Casualties, Date, Day_of_Week, Time, and so on. DfTRoadSafety_Vehicles_2015 contains columns such as Accident_Index, Vehicle_Reference, Vehicle_Type, Towing_and_Articulation, Vehicle_Manoeuvre, Vehicle_Location-Restricted_Lane, Junction_Location, Skidding_and_Overturning, Hit_Object_in_Carriageway, and so on.

We can see that Accident_Index is a common field and is unique. It is used as the index in the dataset.

First we will be making the DataFrames package available and then we will load the data. We load the data into two different dataframes using the readtable function that we discussed earlier:

julia> using DataFrames
julia> DfTRoadSafety_Accidents_2015 = readtable("DfTRoadSafety_Accidents_2015.csv")
julia> head(DfTRoadSafety_Accidents_2015)

The first dataset is loaded into a DataFrame, and we take a quick look at it using head, which returns the first few rows.

If we are more interested in knowing the names of the columns, we can use the names function:

julia> names(DfTRoadSafety_Accidents_2015)
32-element Array{Symbol,1}:
 :_Accident_Index
 :Location_Easting_OSGR
 :Location_Northing_OSGR
 :Longitude
 :Latitude
 :Police_Force
 :Accident_Severity
 :Number_of_Vehicles
 :Number_of_Casualties
 :Date
 :Day_of_Week
 :Time
 :Local_Authority_District_
 ⋮
 :x2nd_Road_Class
 :x2nd_Road_Number
 :Pedestrian_Crossing_Human_Control
 :Pedestrian_Crossing_Physical_Facilities
 :Light_Conditions
 :Weather_Conditions
 :Road_Surface_Conditions
 :Special_Conditions_at_Site
 :Carriageway_Hazards
 :Urban_or_Rural_Area
 :Did_Police_Officer_Attend_Scene_of_Accident
 :LSOA_of_Accident_Location

Similarly, we will be loading the second dataset in a dataframe:

julia> DfTRoadSafety_Vehicles_2015 = readtable("DfTRoadSafety_Vehicles_2015.csv") 

The second dataset is loaded into the memory.

Later we will delve deeper, but for now let's do a full join between the two datasets. A join between these two datasets will tell us which accident involved which vehicles:

julia> full_DfTRoadSafety_2015 = join(DfTRoadSafety_Accidents_2015, DfTRoadSafety_Vehicles_2015, on = :_Accident_Index)

We can see that the full join has worked. Now we have the data, which can tell us the time of the accident, the location of the vehicle, and many more details.

The benefit is that the join is really easy to do and is really quick, even over large datasets.

We have read about the other joins available in relational databases. Julia's DataFrames package provides these joins too:

  • Inner join: The output, which is the DataFrame, contains only those rows that have keys in both the dataframes.
  • Left join: The output DataFrame has the rows for keys that are present in the first (left) DataFrame, irrespective of them being present in the second (right) DataFrame.
  • Right join: The output DataFrame has the rows for keys that are present in the second (right) DataFrame, irrespective of them being present in the first (left) DataFrame.
  • Outer join: The output DataFrame has the rows for the keys that are present in the first or second DataFrame, which we are joining.
  • Semi join: The output DataFrame has only the rows from the first (left) DataFrame for the keys that are present in both the first (left) and second (right) DataFrames. The output contains only rows from the first DataFrame.
  • Anti join: The output DataFrame has the rows for keys that are present in the first (left) DataFrame but rows for the same keys are not present in the second (right) DataFrame. The output contains only rows from the first DataFrame.
  • Cross join: The output DataFrame has the rows that are the Cartesian product of the rows from the first DataFrame (left) and the second DataFrame (right).

Cross join doesn't involve a key; therefore it is used like this:

julia> cross_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, kind = :cross) 

Here, we have used the kind argument to specify the type of join that we want. The other joins are also performed using this argument.

For a left join, we can use:

julia> left_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, on = :_Accident_Index, kind = :left)

Let's understand the other joins using a simpler dataset. We will create two dataframes and apply different joins to them:

julia> Cities = ["Delhi","Amsterdam","Hamburg"][rand(1:3, 10)] 
 
julia> df1 = DataFrame(Any[[1:10], Cities, 
        rand(10)], [:ID, :City, :RandomValue1]) 
 
julia> df2 = DataFrame(ID = 1:10, City = Cities, 
        RandomValue2 = rand(100:110, 10))  

This creates two dataframes of 10 rows each. The first dataframe, df1, has three columns: ID, City, and RandomValue1. The second dataframe, df2, also has three columns: ID, City, and RandomValue2.

To apply a full join, we can use:

julia> full_df1_df2 = join(df1,df2, on = [:ID, :City]) 

We have used two columns to apply the join.

This will generate a single dataframe in which rows are matched on both the ID and City columns.

Other joins can also be applied using the kind argument. Let's go through our old dataset of accidents and vehicles.

The different joins using kind are:

julia> right_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, on = :_Accident_Index, kind = :right) 
 
julia> inner_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, on = :_Accident_Index, kind = :inner) 
 
julia> outer_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, on = :_Accident_Index, kind = :outer) 
 
julia> semi_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, on = :_Accident_Index, kind = :semi) 
 
julia> anti_DfTRoadSafety_2014 = join(DfTRoadSafety_Accidents_2014, DfTRoadSafety_Vehicles_2014, on = :_Accident_Index, kind = :anti) 

The Split-Apply-Combine strategy

A paper was published by Hadley Wickham (Wickham, Hadley. "The split-apply-combine strategy for data analysis." Journal of Statistical Software 40.1 (2011): 1-29), defining the Split-Apply-Combine strategy for data analysis. In this paper, he explained why it is good to break up a big problem into manageable pieces, independently operate on each piece, obtain the necessary results, and then put all the pieces back together.

This is needed when a dataset contains a large number of columns and for some operations all the columns are not necessary. It is better to split the dataset and then apply the necessary functions; and we can always put the dataset back together.

This is done using the by function, which takes three arguments:

  • DataFrame (this is the dataframe that we would be splitting)
  • The column name (or numerical index) on which the DataFrame would be split
  • A function that can be applied on every subset of the DataFrame

Let's try applying by to a dataset.
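
Since the accident dataset is large, here is a minimal sketch on the small df2 coin-toss dataframe built earlier; taking the mean of :X for each value of :Y is an illustrative choice:

julia> by(df2, :Y, subdf -> mean(subdf[:X]))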

The aggregate() function provides an alternative way to apply the Split-Apply-Combine strategy. It takes the same three arguments:

  • DataFrame (this is the DataFrame that we would be splitting)
  • The column name (or numerical index) on which the DataFrame would be split
  • A function that can be applied on every subset of the DataFrame

The function provided in the third argument is applied to every column that wasn't used to split up the DataFrame, as shown in the sketch below.
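
A matching sketch with aggregate(), again on df2; mean is an illustrative choice of function here, applied to the remaining column :X:

julia> aggregate(df2, :Y, mean)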

Reshaping the data

The use case may require data to be in a different shape than we currently have. To facilitate this, Julia provides reshaping of the data.

Let's use the same dataset that we were using, but before that let's check the size of the dataset:

julia> size(DfTRoadSafety_Accidents_2014)
(146322,32)

We can see that there are more than 100,000 rows. Although we can work on this data, let's take a smaller dataset for simplicity of understanding.

Datasets provided in RDatasets are always good to start with. We will use the tried and tested iris dataset for this.

We will import RDatasets and DataFrames (if we have started a new terminal session):

julia> using RDatasets, DataFrames 

Then, we will load the iris dataset into a DataFrame. We can see that the dataset has 150 rows and 5 columns:
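
Assuming RDatasets is installed, loading it mirrors the earlier example:

julia> iris_dataframe = dataset("datasets", "iris")
julia> size(iris_dataframe)
(150,5)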

Now we use the stack() function to reshape the dataset. Let's use it without any arguments except the DataFrame.

stack works by collapsing the measured columns into two new columns, :variable and :value, creating one row for each original column/value pair:
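
A minimal call, with the iris data bound to iris_dataframe as above:

julia> iris_stack = stack(iris_dataframe)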

We can see that our dataset has been stacked. Here we have stacked all the columns. We can also provide specific columns to stack:

julia> iris_dataframe[:id] = 1:size(iris_dataframe, 1)  # create a new column to track the id of the row
julia> iris_stack = stack(iris_dataframe, [1:4])

The second argument depicts the columns that we want to stack. We can see in the result that columns 1 to 4 have been stacked, which means we have reshaped the dataset into a new dataframe:

julia> iris_stack = stack(iris_dataframe, [1:4])
julia> size(iris_stack)
(600,4)
julia> head(iris_stack)

We can see that there is a new column, :id. That's the identifier of the stacked dataframe; its values are repeated as many times as the rows are repeated.

As all the columns are included in the resultant DataFrame, there is repetition in the identifier columns. Other than the identifier column (:id), there are two more columns, :variable and :value. These are the columns that actually contain the stacked values.

We can also provide a third argument (optional). This is the column whose values are repeated. Using this, we can specify which column to include and which not to include.

The melt() function is similar to the stack function but has some special features. Here we need to specify the identifier columns and the rest are stacked:

julia> iris_melt = melt(iris_dataframe, [:Species, :id])

The remaining columns are stacked with the assumption that they contain measured variables.

Opposite to stack and melt is unstack, which is used to convert from a long format to wide format. We need to specify the identifier columns and variable/value columns to the unstack function:

julia> unstack(iris_melt, :id, :variable, :value) 

:id (identifier) in the arguments of the unstack can be skipped if the remaining columns are unique:

julia> unstack(iris_melt, :variable, :value) 

meltdf and stackdf are two additional functions that work like melt and stack but also provide a view into the original wide DataFrame:

julia> iris_stackdf = stackdf(iris_dataframe)

This seems exactly similar to the stack function, but we can see the difference by looking at their storage representations.

To look at the storage representation, dump is used. Let's apply it to the result of stack:
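
For example, on the stacked dataframe from earlier (the lengthy output is elided; its key type information is summarized below):

julia> dump(iris_stack)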

  • Here, we can see that :variable is of type Array(Symbol,(600,))
  • :value is of type DataArrays.DataArray{Float64,1}(600)
  • Identifier (:Species) is of type DataArrays.PooledDataArray{ASCIIString,UInt8,1}(600)

Now, we will look at the storage representation of stackdf:
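
Again eliding the output, which is summarized below:

julia> dump(iris_stackdf)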

Here, we can see that:

  • :variable is of type DataFrames.RepeatedVector{Symbol}. Variable is repeated n times, where n refers to the number of rows in the original AbstractDataFrame.
  • :value is of type DataFrames.StackedVector. This facilitates the view of the columns stacked together as in the original DataFrame.
  • Identifier (:Species) is of type DataFrames.RepeatedVector{ASCIIString}. The original column is repeated n times, where n is the number of columns stacked.

Using these AbstractVectors, we are able to create views, thus saving memory.

Reshaping functions don't provide the capability to perform aggregation. So, to perform aggregation, the Split-Apply-Combine strategy is used in combination with reshaping.

We will use iris_stack:

julia> iris_stack = stack(iris_dataframe) 
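
One possible sketch: group the stacked data by species and variable with by, and compute the mean of :value for each group (the meanvalue column name is illustrative):

julia> iris_stack_mean = by(iris_stack, [:Species, :variable],
                            subdf -> DataFrame(meanvalue = mean(subdf[:value])))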

Here, we created a new column holding the mean values of the measured columns according to the species. We can now unstack this.

Sorting a dataset

Sorting is one of the most used techniques in data analysis. Sorting is facilitated in Julia by calling the sort or sort! function.

The difference between sort and sort! is that sort! works in place: it sorts the actual array rather than creating a copy.

Let's use the sort! function on the iris dataset:
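
Called with just the dataframe, sort! uses the default ordering:

julia> sort!(iris_dataframe)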

We can see that the rows are not sorted according to [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth], but according to the :Species column.

The sorting function takes some arguments and provides a few features. For example, to sort in reverse, we have:

julia> sort!(iris_dataframe, rev = true) 

To sort some specific columns, we have:

julia> sort!(iris_dataframe, cols = [:SepalLength, :PetalLength]) 

We can also use the by argument with sort! to apply another function to the DataFrame or a single column before sorting.

order is used to specify the ordering of a specific column amongst a set of columns.

Formula - a special data type for mathematical expressions

Data science involves various statistical formulas to get insights from data. The creation and application of these formulas is one of the core processes of data science. A formula maps input variables to an output through some function or mathematical expression.

Julia facilitates this by providing the Formula type in the DataFrames package, which is created with the binary operator ~. For example:

julia> formulaX = A ~ B + C

For statistical modeling, it is recommended to use ModelMatrix, which constructs a Matrix{Float64}, making it better suited to fitting a statistical model. A Formula can also be used to transform a DataFrame into a ModelFrame object, which is a wrapper over it that meets the needs of statistical modeling.

Create a dataframe with random values:
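
A minimal sketch (the column names and the df_random binding are illustrative):

julia> df_random = DataFrame(A = rand(10), B = rand(10), C = rand(10))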

Use formula to transform it into a ModelFrame object:
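
Continuing the sketch, we pass the formula and the dataframe to the ModelFrame constructor:

julia> modelframe = ModelFrame(A ~ B + C, df_random)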

Creating a ModelMatrix from a ModelFrame is quite easy:
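
Continuing the sketch:

julia> modelmatrix = ModelMatrix(modelframe)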

There is an extra column containing only the value 1.0. It is used in a regression model to fit an intercept term.

Pooling data

To analyze huge datasets efficiently, PooledDataArray is used. DataArray uses an encoding that represents a full string for every entry of a vector. This is not very efficient, especially for large datasets and memory-intensive algorithms.

Our use case more often deals with factors involving a small number of levels:
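
As a sketch, we can pool a small vector of repeated city names (the values are illustrative):

julia> using DataArrays
julia> datavector = DataArray(["Delhi", "Amsterdam", "Hamburg", "Delhi", "Hamburg"])
julia> pooleddatavector = PooledDataArray(datavector)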

PooledDataArray uses indices in a small pool of levels instead of strings to represent data efficiently.

PooledDataArray also provides us with the functionality to find out the levels of the factor using the levels function:
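
For example, on the sketch above:

julia> levels(pooleddatavector)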

PooledDataArray even provides a compact function to efficiently use memory:

julia> pooleddatavector = compact(pooleddatavector)

Then, it provides a pool function for converting a single column when factors are encoded not in PooledDataArray columns but in a DataArray or DataFrame:

julia> pooleddatavector = pool(datavector)

PooledDataArray facilitates the analysis of categorical data, as columns in ModelMatrix are treated as 0-1 indicator columns. Each of the levels of PooledDataArray is associated with one column.

Web scraping

Real-world use cases also include scraping data from the Web for analysis. Let's build a small web scraper to fetch Reddit posts.

For this, we will need the JSON and Requests packages:

julia> Pkg.add("JSON")
julia> Pkg.add("Requests")

# import the required libraries
julia> using JSON, Requests

# the Reddit URL to fetch the data from
julia> reddit_url = "https://www.reddit.com/r/Julia/"

# fetch the data and store it in a variable
julia> response = get("$(reddit_url)/.json")
Response(200 OK, 21 headers, 55426 bytes in body)

# parse the data received using JSON.parse
julia> dataReceived = JSON.parse(Requests.text(response))

# create the required objects
julia> nextRecord = dataReceived["data"]["after"]
julia> counter = length(dataReceived["data"]["children"])

Here, we defined a URL from where we will be scraping the data. We are scraping from Julia's section on Reddit.

Then, we are getting the content from the defined URL using the get function from the Requests package. We can see that we've got response 200 OK with the data:

julia> statuscode(response)
200
julia> HttpCommon.STATUS_CODES[200]
"OK"

We then parse the JSON data received using the JSON parser provided by the JSON package of Julia. We can now start reading the record.

We can store the data received in an Array or DataFrame (depending on the use case and ease of use). Here, we are using an Array to store the parsed data. We can check the data stored in an Array.

Suppose we only want to see the titles of these posts to know what we have scraped; we just need to know which field they are stored in.
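
One possible sketch for collecting just the titles; the field layout follows Reddit's JSON API, and titleArray is an illustrative name:

julia> titleArray = [child["data"]["title"] for child in dataReceived["data"]["children"]]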

We can now see the titles of the Reddit posts. But what if we had too many columns, or some missing values? A DataFrame would definitely be a better option.