ML: Lecture 14

This week

  • in Witten: Chapter 6 (linear methods, instance-based methods, clustering), Sections 6.2, 6.4, 6.5, 6.8
  • If there is a strike, we will meet on Red Square
  • Thursday: quiz, lecture on statistics and ML, followed by the induction lab.

probability

  • Now you try it for two trials (Binomial, n=2); a worked sketch follows the table.

          Xi  |  P(Xi)  |  Xi·P(Xi)  |  deviation Xi - μ  |  squared deviation (Xi - μ)^2
          ----+---------+------------+--------------------+------------------------------
           0  |  0.04   |            |                    |
           1  |  0.32   |            |                    |
           2  |         |            |                    |
                         ----------
                           μ =
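    A minimal worked sketch of the table above, assuming the success probability is p = 0.8 (inferred from P(X=0) = (1-p)^2 = 0.04 with n = 2); the code and variable names are illustrative, not from the lecture.

      from math import comb

      # Binomial distribution for n = 2 trials; p = 0.8 is an assumption,
      # inferred from P(X = 0) = (1 - p)^2 = 0.04 in the table above.
      n, p = 2, 0.8

      rows = []
      for x in range(n + 1):
          px = comb(n, x) * p**x * (1 - p)**(n - x)    # P(X = x)
          rows.append((x, px))

      mu = sum(x * px for x, px in rows)               # mean = sum of x * P(x)
      var = sum((x - mu)**2 * px for x, px in rows)    # variance = E[(X - mu)^2]

      print(" x   P(x)  xP(x)   x-mu  (x-mu)^2")
      for x, px in rows:
          print(f"{x:2d}  {px:5.2f}  {x*px:5.2f}  {x-mu:5.2f}  {(x-mu)**2:8.2f}")
      print(f"mu = {mu:.2f}, variance = {var:.2f}")     # mu = n*p = 1.6, var = n*p*(1-p) = 0.32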
      

Classification Rules

  • You can generate rules from decision trees, but you may get too many (a sketch follows this list).
  • Prune the decision tree first, or successively generate deeper partial trees and generate rules at each step.
  • You should use independent subsets for pruning the tree.
  • It is not easy to measure how good an individual rule is, because it interacts with other rules.
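    As an aside, here is a sketch of reading rules off a decision tree. It assumes scikit-learn and the Iris data (the course material uses Weka, so this is only an illustration), and it approximates pruning with cost-complexity pruning (ccp_alpha).

      # Illustrative sketch only: scikit-learn and Iris are assumptions here.
      from sklearn.datasets import load_iris
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier, export_text

      iris = load_iris()
      # Keep an independent subset, e.g. for validating the pruned tree / rule set.
      X_train, X_val, y_train, y_val = train_test_split(
          iris.data, iris.target, test_size=0.3, random_state=0)

      # ccp_alpha > 0 prunes the tree, so fewer (and simpler) rules are generated.
      tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

      # Every root-to-leaf path in this printout is one classification rule.
      print(export_text(tree, feature_names=iris.feature_names))
      print("accuracy on the held-out subset:", tree.score(X_val, y_val))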

Linear methods

  • You have seen with the Iris data that in some cases it is possible to separate the instances of different classes by a line or hyperplane. If that is not possible, it may still be possible using a curve or curved surface. This can be done by transforming the data with a non-linear mapping; often this increases the dimensionality of the data by adding synthetic attributes.
  • examples
               x         x
    
                    o
     
               x         x
                   (a)
    
               x         o
    
               o         x
                   (b)
      

    For example (a), we can add a feature a3 = a1^2 + a2^2; then we can separate the sets
    by a horizontal plane (a short sketch follows).
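    A minimal sketch of this idea; the coordinates below are invented to mimic pattern (a), they are not data from the lecture.

      import numpy as np

      # Pattern (a): four x's around a single o at the origin (made-up coordinates).
      points = np.array([[-1.0,  1.0], [1.0,  1.0],
                         [-1.0, -1.0], [1.0, -1.0],
                         [ 0.0,  0.0]])
      labels = ['x', 'x', 'x', 'x', 'o']

      # Synthetic attribute a3 = a1^2 + a2^2 (squared distance from the origin).
      a3 = points[:, 0]**2 + points[:, 1]**2
      for (a1, a2), v, lab in zip(points, a3, labels):
          print(f"class {lab}: a1={a1:+.1f}  a2={a2:+.1f}  a3={v:.1f}")

      # In the lifted space, the horizontal plane a3 = 1 separates the classes:
      # every 'x' has a3 = 2.0 and the 'o' has a3 = 0.0.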

  • Here is an example of a transformed hyperplane:
      x = w1·a1^3 + w2·a1^2·a2 + w3·a1·a2^2 + w4·a2^3
      
  • support vector machines: if two sets are linearly separable by a hyperplane, then we can choose the hyperplane that
    maximizes the minimum distance to each of the two sets. This is also called the maximum margin hyperplane.
    The points (vectors) that are closest to this hyperplane are called the support vectors (see P 225 and the sketch below).
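    A small sketch of the maximum margin hyperplane; scikit-learn and the toy points are assumptions for illustration (the book's own examples use Weka).

      import numpy as np
      from sklearn.svm import SVC

      # Two small, linearly separable clusters (made-up data).
      X = np.array([[0, 0], [1, 0], [0, 1],
                    [3, 3], [4, 3], [3, 4]], dtype=float)
      y = np.array([0, 0, 0, 1, 1, 1])

      # A large C approximates the hard-margin (maximum margin) hyperplane.
      clf = SVC(kernel='linear', C=1e6).fit(X, y)

      print("support vectors:")      # the training points closest to the hyperplane
      print(clf.support_vectors_)
      print("w =", clf.coef_[0], " b =", clf.intercept_[0])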

  • Computing the SVM from the training data is slow.
    Instead of transforming the attribute space, one can often just transform the distance function (the dot product).
    This is called the ‘kernel trick’: one can either map each point explicitly with a non-linear function, e.g. a polynomial,
    or one can construct a product (kernel) function that computes the same dot product directly. In the previous example,
    the explicit mapping is (x, y) -> (x, y, x^2 + y^2); a sketch of the kernel trick follows below.
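    A sketch of the kernel trick with a degree-3 polynomial kernel, matching the cubic monomials of the transformed hyperplane above; the specific points and the sqrt(3) weights are illustrative assumptions.

      import numpy as np

      def phi(u):
          # Explicit cubic feature map for a 2-D point: the monomials of the
          # transformed hyperplane above, with binomial-coefficient weights.
          u1, u2 = u
          return np.array([u1**3,
                           np.sqrt(3) * u1**2 * u2,
                           np.sqrt(3) * u1 * u2**2,
                           u2**3])

      def poly_kernel(u, v):
          # The same quantity computed directly on the original 2-D points.
          return float(np.dot(u, v)) ** 3

      x = np.array([1.0, 2.0])
      z = np.array([0.5, -1.0])

      print(np.dot(phi(x), phi(z)))   # transform first, then take the dot product
      print(poly_kernel(x, z))        # kernel trick: no transformation needed
      # Both print the same number (-3.375 for these points).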

  • Sigmoid function: f(x) = 1/(1 + e^-x)
  • Multilayer perceptron, feedforward neural network.
  • backpropagation, learning rate. For each misclassified instance, adjust the weights [for the hyperplane] to bring the prediction a little closer. In Chapter 4 (P 128) the update was based on the learning rate α and the following formula, i.e. add or subtract
      wi += α·ai   or   wi -= α·ai
       

    However, the actual formula used is more complicated because of the sigmoid function: its derivative enters the weight update (see the sketch below).
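    A sketch of the sigmoid-unit update, assuming stochastic gradient descent on squared error; the data and learning rate below are made up, and a real multilayer perceptron also propagates such deltas back through the hidden layers.

      import math
      import random

      def sigmoid(x):
          return 1.0 / (1.0 + math.exp(-x))

      # Made-up training data: (attribute vector with a bias term, class 0/1).
      data = [([1.0, 0.2, 0.7], 1),
              ([1.0, 0.9, 0.1], 0),
              ([1.0, 0.4, 0.8], 1),
              ([1.0, 0.8, 0.3], 0)]

      w = [0.0, 0.0, 0.0]   # weights of the hyperplane, including the bias
      alpha = 0.5           # learning rate

      for step in range(1000):
          a, y = random.choice(data)          # stochastic: one instance at a time
          pred = sigmoid(sum(wi * ai for wi, ai in zip(w, a)))
          # Squared-error gradient through the sigmoid: the plain +/- alpha*ai step
          # is scaled by (y - pred) * pred * (1 - pred), the sigmoid's derivative term.
          delta = (y - pred) * pred * (1 - pred)
          w = [wi + alpha * delta * ai for wi, ai in zip(w, a)]

      print("learned weights:", w)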

  • gradient descent and stochastic gradient descent (contrasted in the sketch below).
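    A short sketch contrasting the two, assuming a one-dimensional least-squares fit y ≈ w·a on made-up data.

      import random

      data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]   # (a, y) pairs, roughly y = 2a
      alpha = 0.01

      # Gradient descent: every step uses the average gradient over the whole set.
      w = 0.0
      for step in range(200):
          grad = sum(-(y - w * a) * a for a, y in data) / len(data)
          w -= alpha * grad
      print("gradient descent:            w =", round(w, 3))

      # Stochastic gradient descent: every step uses one randomly chosen instance.
      w = 0.0
      for step in range(800):
          a, y = random.choice(data)
          w += alpha * (y - w * a) * a
      print("stochastic gradient descent: w =", round(w, 3))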