Machine Learning: Linear Regression (Supervised Learning)


1. Model Representation

First, we will establish notations for future use.

  • Input variables (input feature) as x(i)
  • Output that we are trying to predict as y(i)
  • A pair of (x(i), y(i)) is called a training example
  • A list of m-training examples is called a training set

(*Note: The superscript i is the index if the example in the set)

A typical Machine Learning Process is as follow:


(Source: Machine Learning Blog Series –

  1. In the very first step, we will be given raw data.
  2. Then from that raw data, we will build our training set.
  3. After that, a training algorithm will be build based on the data. Our goal is to learn a function h: X → Y so that h(x) or so called a hypothesis is a “good” predictor or not.
  4. Finally, we can use it to predict future corresponding values of y that are not in the training  set.

We can also model our process as follow: h6qtdzmyeeaagxl7xdfkxa_2f0f671110e8f7446bb2b5b2f75a8874_screenshot-2016-10-23-20-14-58

(Source:  Andrew Ng’s Machine Learning course on Coursera)

When the target variable that we’re trying to predict is continuous, such as predict price of a house, we call the learning problem a regression problem.

When y can take on only a small number of discrete values, such as  classify whether a mail is spam or not, we call it a classification problem.

2. Linear Regression with one variable (Univariate)

Our first learning algorithm is the Linear Regression Algorithm. In this algorithm, given a set of data, we will have to find a line that “best fits” with the given examples.


Linear Regression Problem

As we can see in the picture, the red line is the “best fitting” line for our training set. As describe in the process, our main target is to build a algorithm that will learn to fit with the data set through measure the accuracy of a hypothesis or in the other words:

How to find the equation of the “best fitting” line ?

One approach is using the Cost Function. This takes an average difference of all the results of the hypothesis with inputs from x’s and the actual output y’s.

Cost function:

J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2

where h_\theta (x_{i}) = \theta_0 + \theta_1 x_{i} is the function of the “best fitting” line.

(This function is otherwise called “Squared error function” or “Mean squared error”.)

Now, the problem is to try to minimise the Cost Function J(\theta_0, \theta_1) regarded  \theta_0, \theta_1 . The idea is to choose \theta_0, \theta_1  so that h_\theta (x_{i}) is close to y for our training example (x, y) .

That’s where Gradient Descent comes in. The Gradient Descent algorithm is:

repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

where j= 0, 1 represents the feature index number.

Remember that, at each iteration j, one must simultaneously update all the parameters. Otherwise, updating a specific parameter prior to calculating another one would yield to a wrong implementation.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

Plug our Cost Function for the Linear Regression Problem in, we will then get the result for \theta_0, \theta_1

repeat until convergence:

 \theta_0 :=  \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i})

\theta_1 :=  \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right)

Note that we have separated out the two cases for θj into separate equations for θ0  and θ; and that for θ1  we are multiplying xi  at the end due to the derivative.
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.

That’s it!!! After a number of iterations updating θ, it will then converge. Using the result, we will have our “best fitting” line for the training examples and we can predict corresponding values of y.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s