
sophisticated procedures such as regression-based imputation do exist. These methods play important roles mainly in medical and scientific studies, where data collection from patients or subjects is often costly. In most industrial data analytics applications where data are typically abundant, simpler methods of handling missing values are usually sufficient.

      2.1 Data Visualization

      Data visualization represents data using graphical methods. It is one of the most effective and intuitive ways to explore important patterns in the data, such as the distribution of the data, relationships among variables, unexpected clusters, and outliers. Data visualization is a fast-growing area, and a large number and variety of tools have been developed. This section discusses some of the most basic and useful types of graphical methods, or data plots, for industrial data analytics applications.

      2.1.1 Distribution Plots for a Single Variable

      Bar charts can be used to display the distribution of a categorical variable, while histograms and box plots are useful tools to display the distribution of a numerical variable.

      Distribution of a Categorical Variable – Bar Chart

      bodystyle.freq <- table(auto.spec.df$body.style)

      barplot(bodystyle.freq, xlab = "Body Style",

       ylim = c(0, 100))

      Figure 2.1 Bar chart of car body style.
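The bar chart code above works in two steps: table() counts the observations in each category, and barplot() draws one bar per category with height equal to that count. The following is a minimal sketch of the same two steps on a small made-up vector; the values in body.style here are hypothetical and are not taken from the auto_spec data set.

```r
# Hypothetical body-style labels, for illustration only.
body.style <- c("sedan", "hatchback", "sedan", "wagon", "sedan", "hatchback")

# table() counts how many observations fall in each category.
style.freq <- table(body.style)
print(style.freq)

# barplot() draws one bar per category, with height equal to its count.
barplot(style.freq, xlab = "Body Style", ylab = "Frequency")
```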

      Distribution of Numerical Variables – Histogram and Box Plot

      A histogram can be used to approximately represent the distribution of a numerical variable with continuous values. A histogram can be considered as a bar chart extended to continuous numerical variables. To draw a histogram, the entire range of the variable in the data set is divided into a number of consecutive, equal-sized intervals. A “bar” is then drawn for each interval, with its height representing the number of observations that fall in the interval.
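The binning step described above can be sketched directly in R. The example below is only an illustration under stated assumptions: the vector x holds hypothetical values (not taken from the auto_spec data set), and the break points 140, 155, 170, 185, 200 are chosen by hand to give four equal-width intervals. It counts the observations per interval with cut() and table(), then checks that hist() reports the same counts for the same breaks.

```r
# Hypothetical sample values, for illustration only.
x <- c(141.1, 150.0, 155.9, 157.3, 165.3, 166.3, 168.8, 171.2,
       172.4, 173.2, 175.6, 176.6, 176.8, 186.7, 188.8, 192.7)

# Four consecutive, equal-width intervals covering the range of x.
breaks <- seq(140, 200, by = 15)

# Assign each observation to an interval and count per interval.
bin.counts <- table(cut(x, breaks = breaks))
print(bin.counts)

# hist() computes the same counts; plot = FALSE returns them without drawing.
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

Calling hist(x, breaks = breaks) without plot = FALSE draws the bars, one per interval, with heights equal to these counts.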

      Figure 2.3 Histograms and box plots of three numerical variables.

      oldpar <- par(mfrow=c(2,3)) # split the plot into panels

      hist(auto.spec.df$length, xlab = "Length",

      2.1.2 Plots for Relationship Between Two Variables

      The relationship between variables is one of the most useful patterns in industrial data analytics applications. For example, we are often interested in predicting a particular variable of interest, which is referred to as the response variable, based on available input information represented by a number of variables that are referred to as the predictor variables. In this situation, the relationship between the response variable and the predictor variables can help identify the most important predictors. Plotting of two variables can also be used to detect redundant variables and outliers in a data set. Depending on the types of variables being compared, different plots can be used to study the relationship between the variables.

      Relationship Between Two Numerical Variables – Scatter Plot

      In a scatter plot, each observation is represented by a point whose coordinates are the values of the two variables for that observation. The following R code draws the scatter plot for two numerical variables, horsepower and highway.mpg, of the auto_spec data set.

      plot(auto.spec.df$highway.mpg ~ auto.spec.df$horsepower,

       xlab = "Horsepower", ylab = "Highway MPG")

      Figure 2.4 Scatter plot of highway MPG versus horsepower.

      Relationship Between a Numerical Variable and a Categorical Variable – Side-by-Side Box Plot
