
Summary
In this chapter, we learned what data munging is and why it is necessary for data science. Julia facilitates data munging through the DataFrames.jl package, which offers features such as these:
NA: A missing value in Julia is represented by NA, a special value with its own data type.
DataArray: The DataArray type provided with DataFrames.jl lets us store missing values in an array.
DataFrame: A DataFrame is a 2-D data structure, like a spreadsheet. It is very similar to the data frames of R or pandas, and provides many functionalities to represent and analyze data. DataFrames have several features well suited to data analysis and statistical modeling:
- A dataset can have different types of data in different columns.
- Values in the same row are related to one another across columns, and all columns have the same length.
- Columns can be labeled. Labels help us become familiar with the data and access it without having to remember numerical column indices.
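The three features above can be sketched together in a few lines. This is a minimal illustration, assuming the legacy DataFrames.jl and DataArrays.jl API that this chapter describes (releases that still use NA); the column names and values are made up for the example.

```julia
# Assumption: legacy DataFrames.jl / DataArrays.jl (the NA-based API of this chapter).
using DataFrames, DataArrays

# A DataArray can hold missing values (NA) next to ordinary ones.
scores = @data([90.5, NA, 78.0])

# Columns are labeled and may have different element types.
df = DataFrame(name = ["Ann", "Bob", "Cid"], score = scores)

# Access a column by its label rather than by a numerical index.
second_is_missing = isna(df[:score][2])

# Drop the NA entries before computing a statistic.
avg = mean(dropna(df[:score]))
```

Here `dropna` returns only the non-missing entries, so the mean is taken over the two known scores.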
We learned about importing data from a file using the readtable() function and exporting data to a file. The readtable() function is flexible: it accepts many optional arguments that control how the file is read.
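As a rough sketch of that round trip, the following assumes the legacy DataFrames.jl API, in which `readtable` takes keyword arguments such as `separator` and `header`, and `writetable` is the export counterpart. The file names and contents are hypothetical and are created in a temporary directory so that the example is self-contained.

```julia
# Assumption: legacy DataFrames.jl, where readtable/writetable are available.
using DataFrames

# Hypothetical semicolon-separated input file, written here for the example.
path = joinpath(tempdir(), "munging_example.csv")
open(path, "w") do io
    write(io, "id;city;temp\n1;Oslo;4.5\n2;Lima;NA\n3;Cairo;30.1\n")
end

# Optional arguments tell readtable how to parse the file.
df = readtable(path, separator = ';', header = true)

# Export the DataFrame; the separator is inferred from the .csv extension.
out = joinpath(tempdir(), "munging_example_out.csv")
writetable(out, df)
```

The string "NA" in the input is read as a missing value, so the second entry of the `temp` column is NA.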
We also explored joining datasets, much like joining tables in an RDBMS. Julia provides several kinds of joins, and we can choose among them according to our use case.
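A small sketch of three join kinds follows, assuming the legacy DataFrames.jl `join` function with its `on` and `kind` keywords. The two tables are invented for illustration.

```julia
# Assumption: legacy DataFrames.jl, where join(df1, df2, on = ..., kind = ...) is available.
using DataFrames

people = DataFrame(id = [1, 2, 3], name = ["Ann", "Bob", "Cid"])
jobs   = DataFrame(id = [2, 3, 4], job = ["Doctor", "Lawyer", "Pilot"])

# Inner join: only ids present in both tables (2 and 3).
inner = join(people, jobs, on = :id, kind = :inner)

# Left join: every row of `people`; `job` is NA where no match exists (id 1).
left = join(people, jobs, on = :id, kind = :left)

# Outer join: every id from either table (1 through 4).
outer = join(people, jobs, on = :id, kind = :outer)
```

The `kind` keyword selects the join, so switching between them only requires changing one symbol.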
We discussed the Split-Apply-Combine strategy, one of the most widely used techniques among data scientists, and why it is needed. We went through reshaping, or pivoting, data using the stack and melt functions (and their counterparts stackdf and meltdf) and explored the various possibilities involved. We were also introduced to PooledDataArray and learned why it is required for efficient memory management.
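The three ideas in this paragraph can be sketched as follows, again assuming the legacy DataFrames.jl and DataArrays.jl API (`by`, `stack`, `PooledDataArray`). The dataset is made up for the example.

```julia
# Assumption: legacy DataFrames.jl / DataArrays.jl (the NA-based API of this chapter).
using DataFrames, DataArrays

df = DataFrame(city = ["Oslo", "Oslo", "Lima", "Lima"],
               year = [2015, 2016, 2015, 2016],
               low  = [1.0, 2.0, 15.0, 16.0],
               high = [9.0, 10.0, 25.0, 26.0])

# Split-Apply-Combine: split by city, apply a mean to each group,
# and combine the per-group results into one DataFrame.
means = by(df, :city, d -> DataFrame(mean_high = mean(d[:high])))

# Reshape from wide to long: the `low` and `high` columns become
# rows of (variable, value), with `city` and `year` kept as identifiers.
long = stack(df, [:low, :high])

# A PooledDataArray stores each distinct label once and keeps small
# integer references to it, which saves memory for repeated values.
cities = PooledDataArray(["Oslo", "Oslo", "Lima", "Lima"])
n_levels = length(levels(cities))
```

With only two distinct cities among four entries, the pool holds two labels; the saving grows with the number of repetitions.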
We were introduced to web scraping, which a data scientist sometimes needs in order to gather data. We also used the Requests package to fetch an HTTP response.