
      1 Manually sketch the scatter plot for x1 and x2. Manually sketch the mosaic plot for x3 and x4.

      1 Consider the data set in Exercise 3. Manually calculate the sample mean vector, the sample covariance matrix, and the sample correlation matrix of x = (x1 x2)^T.

      2 Consider the auto_spec data set in the file auto_spec.csv. Use R to draw appropriate plots to display the following information and comment on any patterns that can be found from the plots.
        (a) Distribution of the variables fuel.type and aspiration.
        (b) Distribution of each of the following three variables: width, height, and highway.mpg. Use two types of plots for each variable.
        (c) How does the horsepower affect the city.mpg?
        (d) The relationship between horsepower and body.style.
        (e) The relationship between body.style and fuel.type.

      3 For the auto_spec data, use R to create a new variable named cat.mpg, which is equal to “high” if highway.mpg is at least 30, and “low” otherwise.
        (a) Using R, create a scatter plot of horsepower versus curb.weight, color-coded by the variable cat.mpg. Format the plot with appropriate labels and a legend.
        (b) Use R to find the sample mean vector, the sample covariance matrix, and the sample correlation matrix of highway.mpg and city.mpg.
        (c) Assume that 75% of the mileage of a car is on a highway and 25% is on local roads. Using the results from part (b), manually calculate the sample mean and sample variance of the overall average MPG of the cars in this data set.
        (d) Use R to calculate the overall average MPG of each car in the data set based on the assumption in part (c). Then use R to find the sample mean and sample variance of the overall average MPG. Compare with the results in part (c).

      4 Hot rolling is among the key steel-making processes that convert cast or semi-finished steel into finished products. A typical hot rolling process includes a melting division and a rolling division. The melting division is a continuous casting process that melts scrap metal and solidifies the molten steel into semi-finished steel billets; the rolling division then squeezes each billet through a sequence of stands. Each stand is composed of several rolls. The final long, thin steel billet is coiled for convenient transportation and is therefore often called a coil. Owing to recent developments in computer and sensor technology, the whole hot rolling process is highly automated and monitored by a large number of sensors. Various types of sensors (optical, temperature, force, etc.) are installed in the hot rolling process. The last rolling stands are equipped with infrared sensors. These sensors take photos of the steel billets, and the photos are then processed to detect any defects. We focus on two types of defect: checkings and seams.
        (a) The file hotrolling_defects.csv contains the numbers of checkings and seams of 754 billets. Use R to generate two new variables indicating whether a billet has at least one checking defect and whether it has at least one seam defect, respectively. Use appropriate plots to visualize the distribution of each of these two new variables and the relationship between them.
        (b) The file stand_5_side_temp.csv contains side temperature measurements taken while a steel billet is passing stand 5 of the rolling division. The side temperature is measured at 79 evenly spaced locations along the stand. Use R to draw a scatter plot matrix for the side temperature measurements at the first five locations of stand 5. Comment on noticeable patterns in the relationships among the first five temperature variables.
        (c) Use R to find the sample mean vector, the sample covariance matrix, and the sample correlation matrix of the side temperature measurements at the first five locations of stand 5.
        (d) Use R to draw a heatmap for the correlation of the side temperature measurements at the first 20 locations of stand 5. Which locations have the highest correlation in side temperature measurements?

      Informally, a random variable is a variable whose value depends on the outcome of a random or chance phenomenon. Examples include the highway MPG of a new car randomly sampled from all cars on sale, the quality measurement of a product randomly sampled from a production line, and the temperature measured at a particular moment and location of a machine whose temperature varies randomly over time. Because uncertainty and variation are ubiquitous in industrial systems and processes, most variables of interest in industrial data analytics applications can be considered random variables. Many industrial data analytics problems involve multiple random variables, which form a vector of random variables, also called a random vector. In this chapter, we study the concept of random vectors and the multivariate normal distribution, the most commonly used model for a random vector.

      3.1 Random Vectors

      A random vector is a vector of random variables. Let $\mathbf{X} = (X_1\ X_2\ \dots\ X_p)^T$ denote a random vector. The mean or expected value of a random vector is the vector of the mean values of its elements. The mean vector of $\mathbf{X}$ can be written as

$$
\boldsymbol{\mu} = E(\mathbf{X}) =
\begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{pmatrix},
\qquad
\mu_i = E(X_i) =
\begin{cases}
\displaystyle\int_{-\infty}^{\infty} x_i f_i(x_i)\, dx_i & \text{if } X_i \text{ is a continuous random variable} \\[6pt]
\displaystyle\sum_{x_i} x_i\, p_i(x_i) & \text{if } X_i \text{ is a discrete random variable}
\end{cases},
$$

      where $f_i(x_i)$ is the probability density function of $X_i$ if $X_i$ is continuous and $p_i(x_i)$ is the probability mass function of $X_i$ if $X_i$ is discrete. The mean $\mu_i$ is also called the population mean of $X_i$ because it is the mean of $X_i$ over all possible values in the population. Similarly, the mean vector $\boldsymbol{\mu}$ is the population mean vector of $\mathbf{X}$.
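
      As a small worked illustration of this definition, consider a random vector with p = 2 elements. The two distributions below are hypothetical, chosen only to keep the arithmetic simple: suppose $X_1$ is discrete, taking the values 0, 1, and 2 with probabilities 0.5, 0.3, and 0.2, and $X_2$ is continuous and uniformly distributed on the interval [0, 4], so that its density is 1/4 on that interval and 0 elsewhere. Applying the discrete and continuous cases respectively gives

$$
\mu_1 = \sum_{x_1} x_1\, p_1(x_1) = 0(0.5) + 1(0.3) + 2(0.2) = 0.7,
\qquad
\mu_2 = \int_{0}^{4} x_2 \cdot \tfrac{1}{4}\, dx_2 = \left[ \tfrac{x_2^2}{8} \right]_{0}^{4} = 2,
\qquad
\boldsymbol{\mu} = \begin{pmatrix} 0.7 \\ 2 \end{pmatrix}.
$$

      Each element of the mean vector is computed from the marginal distribution of that element alone; no information about the dependence between $X_1$ and $X_2$ is needed.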

      To further explain the relationship and difference between the population mean and the sample mean introduced in Section 2.2, we first consider a univariate random variable $X$ and its population mean $\mu$. Consider a random sample of observations from the population, say, $X_1, X_2, \dots, X_n$. The sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a random variable because the observations $X_1, X_2, \dots, X_n$ are all random variables with values varying from sample to sample. For example, let $X$ represent the measured intensity of the current of a wafer produced by a semiconductor manufacturing process. Suppose we take a random sample of n = 10 wafers from this process, compute the sample mean of the measured intensities of the current, and get the result $\bar{x} = 1.02$. Now we repeat this process, taking a second sample of n = 10 wafers from the same process, and the resulting sample mean is $\bar{x} = 1.04$. The sample means differ from sample to sample because they are random variables. Consequently, the sample mean, like any other function of the random observations, is a random variable. On the other hand, the population mean $\mu$ does not depend on the samples and is a (usually unknown) constant. When we take a sample with a very large sample size n, the sample mean will be very close to the population mean $\mu$ with high probability. As the sample mean is a random variable, we can evaluate its mean and variance. It is easy to see that $E(\bar{x}) = \mu$ and $\operatorname{var}(\bar{x})$
