machine learning
Bethany Leffler stopped by last week to deliver a talk about machine learning. We’re all pretty comfortable with systems where you predefine a (possibly very complex) set of rules to handle whatever data will be thrown at it. Bethany’s area of expertise is in situations where you want the machine to figure things out for itself instead of just following explicit orders. We’ve been getting more and more interested in that around here, for completely legitimate purposes and not at all for dubious SkyNet-style applications.
My primary goal was to get a handle on the vocabulary so that I could ask semi-intelligent questions to get more information. That worked out well, because Bethany gave us an excellent high-level introduction to the field.
- supervised learning – the system uses a reference set of labelled data to classify new data points
- unsupervised learning – no reference data set
- reinforcement learning – the system gets feedback, positive or negative, from its decisions and optimizes its behaviour accordingly
We talked about classification problems a lot. Imagine an 2D space like the one below with a bunch of data points that we can intuitively group into clusters. This could represent some set of objects you’re trying to group into categories based on two measured properties, say a voice recognition system that figures out words based on pitch and length (a bad example). The question is how to draw boundaries around those clusters, especially when some points are reasonably close to two or more clusters.

Some unsupervised learning algorithms:
The k-Means algorithm made the most sense to me. It works by first picking cluster centres at random and associating each data point with a cluster according to the nearest centre. It then refines the cluster centres by computing the centroid of the points within each cluster. These two steps are repeated until the solution converges (i.e. the centres stop moving around). Unfortunately, you need to tell the algorithm how many clusters there are at the beginning and the system doesn’t always converge. These problems are general to unsupervised learning. K-means is cheap enough that you can run it multiple times with random starting centre points and compare the outcomes to increase confidence in your solution.
Some supervised learning algorithms:
The k-nearest neighbours algorithm seems pretty straightforward here. The training data are a bunch of labelled data points. For each test input, you find the k nearest labelled neighbours. The test input is then assigned to the same cluster as the majority of those k nearest neighbours. If you put the newly labelled input into the pile of reference data (i.e. let the system learn as it goes), this can make your clusters drift over time. I suspect this can get dangerous if you’re not careful.
Bayesian classification uses P(C|x)= P(x|C)P(C)/P(x) which sounds vaguely familiar from the old stats and physics courses. Best to think about that some more before discussing it. Decision trees looked more straightforward to read, but I think I’d have a hard time implementing something that creates its own. I’ve had very mild introductions to neural networks a few times and they always seem like a whole mess of complexity that I don’t want to deal with.
Bethany talked about some of the difficulties with supervised learning, such as the amount of training data and time required, along with the associated problems of overfitting or underfitting the data. If your training data are too closely coupled, you risk misclassifying input data that’s a bit further out. Similarly, if your training data are too loose, you risk problems of cluster boundaries overlapping. On the topic of the amount of data and time required, Bethany made the excellent point that people don’t like the effort required for supervised machine learning, but we spend orders of magnitude more on training babies and then marvel at how much they learn. That’s not to say that we should spend more time with machines instead of babies; rather, it’s just important to realize that even an exquisitely refined and capable machine like a human brain still needs a lot of training input before it can do anything useful.
Reinforcement learning algorithms fall into two categories:
- model-based
- model-free
Those reminded me of model predictive control and other feedback systems like PID from the control theory courses I took in school. In the machine learning context, you also want the system to guess when it encounters unknown data. In reinforcement learning, the machine gets a particular reward or punishment for classifying/misclassifying each new input, and it classifies new input based on its own self-interest. In the model-free algorithms (e.g. Q-learning, function approximation, policy iteration), the machine has some way of learning based only on the rewards and punishments of its classifications. For unknown data, they do things like guess randomly (Q-learning) or work out probabilities of reward/punishment based on past decisions (policy iteration). In the model-based algorithms (R-max, KWIK), the machine has some model of the environment that gets refined based on input. This can take a lot of work from the programmer to determine a good model, but it sounds like a properly tuned algorithm can classify very reliably.
I hope I’ve described that reasonably accurately. I did a bunch of probability in various math and physics courses back in university, so while reading about these topics, they dance between understandable and just plain Greek. I also built an autonomous robot for a course, so I appreciate just how hard it is to get machines to do anything remotely intelligent-looking in the real world.
A big thanks to Bethany for her informative talk! I’m sure it will fuel many arguments around the water cooler.
Tags: machine learning, tech talks
