Скачать книгу

0).

Schematic illustration of the nearest neighbor (NN) classification algorithm in two dimensions (features x1 and x2).

      More formally, we can define the 1‐NN algorithm as follows:

       Training algorithm: for i = 1, n in the n ‐dimensional training data set D ∣ D ∣ = n : •store training example 〈x [i], f(x [i])〉 Prediction algorithm: closest point≔ None closest distance≔∞ •for i = 1, n: ‐current distance≔d(x [i], x [q]) ‐if current distance < closest distance: * closest distance≔ current distance * closest point≔x [i] prediction h(x [q]) is the target value of closest point

      Unless noted otherwise, the default distance metric (in the context of this section) of NN algorithms is the Euclidean distance (also called L2 distance), which computes the distance between two points, x[a] and x[b]:

d left-parenthesis normal x Superscript left-bracket a right-bracket Baseline comma normal x Superscript left-bracket b right-bracket Baseline right-parenthesis equals StartRoot sigma-summation Underscript j equals 1 Overscript m Endscripts left-parenthesis x Subscript j Superscript left-bracket a right-bracket Baseline minus x Subscript j Superscript left-bracket b right-bracket Baseline right-parenthesis squared EndRoot period Schematic illustration of the plane partitioning of a two-dimensional dataset.

      This partitioning of regions on a plane in 2D is also called a Voronoi diagram or Voronoi tessellation. Given a discrete set of points, a Voronoi diagram can also be obtained by a process known as Delaunay triangulation by connecting the centers of the circumcircles.

      k‐Means clustering: In this section, we will cover k‐means clustering and its components. We will look at clustering, why it matters, and its applications, and then dive into k‐means clustering (including how to perform it in Python on a real‐world dataset).

Schematic illustration of the nearest neighbor (NN) decision boundary.

      Example 2.3

      A university wants to offer to its customers (companies in industry) new continuous education type of courses. Currently, they look at the details of each customer and based on this information, decide which offer should be given to which customer. The university can potentially have a large number of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time. So what can the university do? One option is to segment its customers into different groups. For instance, each department of the university can group the customers for their field based on their technological level (tl; general education background of the employees), say, three groups high (htl), average (atl), and low (ltl). The department can now draw up three different strategies (courses of different level of details) or offers, one for each group. Here, instead of creating different strategies for individual customers, they only have to formulate three strategies. This will reduce the effort as well as the time.

      The groups indicated in the example above are known as clusters, and the process of creating these groups is known as clustering. Formally, we can say that clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data. Clustering is an unsupervised learning problem! In these problems, we have only the independent variables and no target/dependent variable. In clustering, we do not have a target to predict. We look at the data and then try to club similar observations and form different groups. Hence, it is an unsupervised learning problem.

      In order to discuss the properties of the cluster, let us further extend the previous example to include one more characteristic of the customer:

      Example 2.4

      Now, the department can offer a rather focused, high‐level course to cluster C4 since the customer employees have high‐level general technical knowledge with expertise that is close to the content of the course. On the other hand, for cluster C1, the department should offer a course on the broader subject with fewer details.

Schematic illustration of example of clustering. Schematic illustration of k-Means algorithm.

      In k‐means, each cluster is associated with a centroid. The main objective of the k‐means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.

Скачать книгу