## 1. Model Representation

First, we will establish some notation for later use.

- Input variables (input features) are denoted **x**^{(i)}
- The output that we are trying to predict is denoted **y**^{(i)}
- A pair **(x^{(i)}, y^{(i)})** is called **a training example**
- A list of **m training examples** is called **a training set**

*(Note: The superscript i is the index of the example in the set, not an exponent.)*
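As a minimal sketch of this notation in Python (the house sizes and prices below are made-up illustrative values, not data from the course), a training set can be held as a list of pairs:

```python
# Hypothetical toy data: house sizes (x) and prices (y). The values are made up.
x = [50.0, 80.0, 120.0]
y = [150.0, 240.0, 360.0]

# The training set is the list of (x^(i), y^(i)) pairs.
training_set = list(zip(x, y))
m = len(training_set)  # m = number of training examples

# The notation counts examples from 1, while Python lists start at 0,
# so the i-th training example sits at index i - 1.
i = 2
x_i, y_i = training_set[i - 1]
print(m, x_i, y_i)
```

Here i only selects an example; it never raises x to a power.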

A typical Machine Learning process is as follows:

(Source: *Machine Learning Blog Series – Oliviaklose.com*)

- In the very first step, we are given **raw data**.
- Then, from that raw data, we build our **training set**.
- After that, **a training algorithm** is run on the data. Our goal is to learn a function h: X → Y so that h(x), the so-called **hypothesis**, is a “good” predictor for the corresponding value of y.
- Finally, we can use it to **predict** future values of y that are not in the training set.

We can also model our process as follows:

(Source: *Andrew Ng’s Machine Learning course on Coursera*)

When the target variable that we’re trying to predict is **continuous**, such as when predicting the price of a house, we call the learning problem **a regression problem**.

When y can take on only a small number of **discrete values**, such as when classifying whether an email is spam or not, we call it **a classification problem**.

## 2. Linear Regression with one variable (Univariate)

Our first learning algorithm is the Linear Regression Algorithm. In this algorithm, given a set of data, we have to find the line that “best fits” the given examples.

Linear Regression Problem

As we can see in the picture, the red line is the “best fitting” line for our training set. As described in the process above, our main target is to **build an algorithm that learns to fit the data set by measuring the accuracy of a hypothesis**, or in other words:

**How to find the equation of the “best fitting” line ?**

One approach is to use the *Cost Function*. This takes an average of the squared differences between the results of the hypothesis for the inputs x and the actual outputs y.

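Cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where h_{θ}(x) = θ_{0} + θ_{1}x is the function of the “best fitting” line.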

*(This function is otherwise called “Squared error function” or “Mean squared error”.)*
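The cost function can be sketched in a few lines of Python. This assumes the hypothesis is the line h(x) = θ_{0} + θ_{1}x and that the cost is the sum of squared errors divided by 2m. The function names and the toy data are illustrative choices, not part of the course material.

```python
def hypothesis(theta0, theta1, x):
    """Value of the line h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost: sum of squared errors divided by 2m."""
    m = len(xs)
    squared_errors = (
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return sum(squared_errors) / (2 * m)


# Made-up data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect = cost(0.0, 2.0, xs, ys)  # the line y = 2x passes through every point
worse = cost(0.0, 1.0, xs, ys)    # the line y = x misses every point
print(perfect, worse)
```

A line that passes through every point has zero cost, and the cost grows as the line moves away from the data.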

Now, the problem is to **minimise the Cost Function** J(θ_{0}, θ_{1}) with respect to θ_{0} and θ_{1}. The idea is to choose θ_{0} and θ_{1} so that h_{θ}(x) is close to y for our training examples (x, y).

That’s where *Gradient Descent* comes in. The Gradient Descent algorithm is:

repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

where j = 0, 1 represents the feature index number.

Remember that, at each iteration, one must **simultaneously** update all the parameters θ_{j}. Updating one parameter before computing the update for another would make the second update use an already changed value, which yields a wrong implementation.
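The difference between a simultaneous and a sequential update can be seen in a small Python sketch. The cost J(θ_{0}, θ_{1}) = θ_{0}² + θ_{0}θ_{1} + θ_{1}² used here is a made-up example, chosen only because each partial derivative depends on both parameters. The helper names are hypothetical.

```python
def d_theta0(theta0, theta1):
    """Partial derivative of the toy cost with respect to theta0."""
    return 2 * theta0 + theta1


def d_theta1(theta0, theta1):
    """Partial derivative of the toy cost with respect to theta1."""
    return theta0 + 2 * theta1


def simultaneous_step(theta0, theta1, alpha):
    # Both updates are computed from the OLD values, then assigned together.
    temp0 = theta0 - alpha * d_theta0(theta0, theta1)
    temp1 = theta1 - alpha * d_theta1(theta0, theta1)
    return temp0, temp1


def sequential_step(theta0, theta1, alpha):
    # Incorrect: theta0 is overwritten before theta1's derivative is evaluated.
    theta0 = theta0 - alpha * d_theta0(theta0, theta1)
    theta1 = theta1 - alpha * d_theta1(theta0, theta1)
    return theta0, theta1


correct = simultaneous_step(1.0, 1.0, 0.1)
wrong = sequential_step(1.0, 1.0, 0.1)
print(correct, wrong)
```

Starting from (1, 1), the two versions already disagree on θ_{1} after a single step, because the sequential version evaluates the second derivative at the new θ_{0}.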

The way we do this is by taking the derivative (the slope of the tangent line to a function) of our cost function. The slope of the tangent is the derivative at that point, and it gives us a direction to move in. We make steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
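To get a feel for the role of α, here is a minimal sketch on a one-parameter toy cost J(θ) = θ², whose derivative is 2θ. This function is an illustrative assumption, not the regression cost. Each step multiplies θ by (1 − 2α), so a small α shrinks θ towards the minimum at 0, while a too-large α overshoots further on every step.

```python
def descend(theta, alpha, steps):
    """Run gradient descent on the toy cost J(theta) = theta ** 2."""
    for _ in range(steps):
        gradient = 2 * theta          # derivative of theta ** 2
        theta = theta - alpha * gradient
    return theta


small_alpha = descend(1.0, 0.1, 20)   # each step scales theta by 0.8
large_alpha = descend(1.0, 1.1, 20)   # each step scales theta by -1.2
print(small_alpha, large_alpha)
```

With α = 0.1 the parameter ends close to 0 after 20 steps. With α = 1.1 it moves away from the minimum instead of towards it.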

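Plugging in our Cost Function for the Linear Regression Problem, we get the following update rules:

repeat until convergence:

$$\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{aligned}$$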

Note that we have separated out the two cases for θ_{j} into separate equations for θ_{0} and θ_{1}, and that for θ_{1} we multiply by x^{(i)} at the end because of the derivative.
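To see where that extra factor comes from, the partial derivatives can be worked out with the chain rule. This assumes the squared error cost with the 1/(2m) factor and the line hypothesis h_{θ}(x) = θ_{0} + θ_{1}x:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
&= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})
\end{aligned}$$

Since $\frac{\partial}{\partial \theta_0} h_\theta(x^{(i)}) = 1$ and $\frac{\partial}{\partial \theta_1} h_\theta(x^{(i)}) = x^{(i)}$, only the θ_{1} update carries the factor x^{(i)}. The 2 from the square cancels the 1/2 in the cost function, which is why the update uses 1/m.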

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.

That’s it!!! After a number of iterations updating θ, it will then converge. Using the result, we will have our “best fitting” line for the training examples and we can predict corresponding values of y.
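Putting the pieces together, here is a minimal, dependency-free Python sketch of gradient descent for a one-variable linear regression. It makes several assumptions: the function name `fit_line`, the learning rate, the fixed iteration count (a stand-in for a real convergence check), and the toy data are all illustrative choices.

```python
def fit_line(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess for the hypothesis
    for _ in range(iterations):
        # Prediction errors h(x^(i)) - y^(i) under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up training set lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = fit_line(xs, ys)
prediction = theta0 + theta1 * 5.0  # predict y for an unseen input x = 5
print(theta0, theta1, prediction)
```

On this toy data the parameters approach θ_{0} = 1 and θ_{1} = 2, so the prediction for x = 5 is close to 11. On real data, α and the number of iterations usually need tuning.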