Unlike regression problems, where our objective is to predict a scalar value for a given set of feature values, in classification problems our objective is to choose a discrete value from a limited set of choices. Therefore, we do not use Linear Regression for classification problems. Based on how many categories the input can be classified into, there are two types of problems: binary classification and multiclass classification.

In binary classification problems, there are only two possible outputs for the classifier: 0 or 1. Multiclass classification problems can have several output values such as 0, 1, 2, ..., but the set of values is still discrete and finite. To solve these problems, we first develop machine learning models that solve binary classification. When a problem has more than two classes, we use multiple binary classification models, one to check for each class.

**Logistic Regression:** To keep the prediction output between 0 and 1, we use the sigmoid (logistic) function as the hypothesis function, which is why this technique is called Logistic Regression. The hypothesis function of logistic regression is as follows.

$$h_\theta (x)=\frac{1}{1+e^{-\theta^T x}}$$

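The vectorized representation of this hypothesis looks like the following, where \(g\) is the sigmoid function applied to each element and \(X\) is the matrix whose rows are the training examples.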

$$h=g(X\theta)$$

For any input \(x\), this hypothesis function outputs a value between 0 and 1. If the value is closer to 1, we classify the input as class 1; if it is closer to 0, we classify it as class 0. We can also interpret the output as the estimated probability that the input belongs to class 1. For example, an output of 0.6 means there is a \(60\%\) probability that the input belongs to class 1, and an output of 0.3 means there is a \(30\%\) probability that it belongs to class 1.
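
To make the hypothesis concrete, here is a minimal sketch in plain Python. The parameter values, the input, and the 0.5 cut-off used to turn the probability into a class label are assumptions chosen for illustration; the text above only says that values closer to 1 are treated as class 1.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z is negative here, so e lies between 0 and 1
    return e / (1.0 + e)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x[0] is assumed to be the constant 1."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def predict_class(theta, x, threshold=0.5):
    """Map the probability to a class label (the 0.5 cut-off is an assumption)."""
    return 1 if hypothesis(theta, x) >= threshold else 0

theta = [-1.0, 2.0]  # illustrative parameters, not fitted to any data
x = [1.0, 0.7]       # bias feature followed by one real feature
p = hypothesis(theta, x)
print(round(p, 3), predict_class(theta, x))  # prints: 0.599 1
```

Here \(\theta^T x = 0.4\), so the model reports roughly a \(60\%\) probability of class 1 and the input is labelled 1.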

**Cost Function:** The cost function of logistic regression is different from the cost function of linear regression. It is designed so that when the model predicts the correct class with \(100\%\) confidence, the cost is zero, while as the model approaches \(100\%\) confidence in the wrong class, the cost grows towards infinity.

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m} [y^i \log(h_\theta(x^i)) + (1-y^i) \log(1-h_\theta(x^i))]$$

A vectorized implementation of the cost function would look like the following.

$$J(\theta)=\frac{1}{m}(-y^T\log(h) - (1-y)^T\log(1-h))$$
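
The behaviour of this cost can be checked with a small sketch. It follows the summation form of \(J(\theta)\) in plain Python; the `eps` clipping is an assumption added here so that `log` is never evaluated at exactly 0, and the single training example is made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cost(theta, X, y, eps=1e-15):
    """Average cost J(theta) over the m training examples in X, y."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        h = min(max(h, eps), 1.0 - eps)  # assumption: keep log() away from 0
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

X = [[1.0, 10.0]]   # one example: bias feature and one real feature
theta = [0.0, 1.0]  # gives h very close to 1 for this example

confident_right = cost(theta, X, [1])  # true label agrees with the prediction
confident_wrong = cost(theta, X, [0])  # true label contradicts the prediction
undecided = cost([0.0, 0.0], X, [1])   # h = 0.5, so the cost is log(2)
```

With these values the cost is close to zero when the confident prediction is right, about 10 when it is wrong, and \(\log 2 \approx 0.69\) when the model outputs 0.5.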

**Gradient Descent:** To adjust the parameter vector \(\theta\) until it fits the training data set properly, we run the gradient descent algorithm. The following update has to be repeated, and applied simultaneously for every \(j\), where \(j\) indexes the parameters in \(\theta\). In this gradient descent algorithm, \(\alpha\) is the learning rate.

$$\theta_j=\theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^i) - y^i)x_j^i$$

A vectorized implementation of the gradient descent algorithm would look like the following.

$$\theta = \theta - \frac{\alpha}{m}X^T(g(X\theta) - y)$$
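
The update rule can be sketched as a short training loop. This is one possible implementation in plain Python rather than a definitive one: the toy data set, the zero initialization, the learning rate of 0.1 and the fixed count of 5000 iterations are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    """h_theta(x) for one example x (x[0] is the bias feature 1)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(y)
    # Errors are computed once from the current theta, so every j uses
    # the same old parameter values (the "simultaneous" requirement).
    errors = [predict_proba(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
    return [
        theta_j - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
        for j, theta_j in enumerate(theta)
    ]

def fit(X, y, alpha=0.1, iterations=5000):
    """Run a fixed number of gradient descent steps starting from zeros."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta

# Toy data, made up for illustration: small feature values are class 0.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = fit(X, y)
predictions = [1 if predict_proba(theta, x_i) >= 0.5 else 0 for x_i in X]
```

After training, `predictions` matches `y` on this toy set. In practice the loop would usually stop when the cost stops decreasing instead of after a fixed number of steps.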

**Multiclass Classification:** The above technique works for classifying data into two classes. When we encounter a multiclass classification problem, we train one logistic regression model for each class. For example, if there are 3 classes, we need three logistic regression models, each trained to distinguish between its targeted class and all the others. In this way, we can solve multiclass classification problems using logistic regression.
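
The one-model-per-class idea can be sketched as follows. The helper names (`fit_binary`, `fit_one_vs_rest`, `predict_one_vs_rest`) and the two-feature toy data are hypothetical. Choosing the class whose model reports the highest probability is an assumption about how the per-class answers are combined; the text above does not specify it.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def fit_binary(X, y, alpha=0.1, iterations=10000):
    """Plain batch gradient descent for one binary logistic model."""
    m = len(y)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [predict_proba(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
        theta = [
            t - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
            for j, t in enumerate(theta)
        ]
    return theta

def fit_one_vs_rest(X, labels):
    """Train one model per class: target class -> 1, every other class -> 0."""
    return {
        c: fit_binary(X, [1 if label == c else 0 for label in labels])
        for c in sorted(set(labels))
    }

def predict_one_vs_rest(models, x):
    """Pick the class whose model reports the highest probability."""
    return max(models, key=lambda c: predict_proba(models[c], x))

# Toy data, made up for illustration: bias feature plus two real features,
# three well-separated clusters labelled 0, 1 and 2.
X = [
    [1.0, 0.0, 0.0], [1.0, 0.5, 0.5], [1.0, 1.0, 0.0],
    [1.0, 4.0, 0.0], [1.0, 4.5, 0.5], [1.0, 5.0, 0.0],
    [1.0, 2.0, 4.0], [1.0, 2.5, 4.5], [1.0, 3.0, 4.0],
]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = fit_one_vs_rest(X, labels)
predicted = [predict_one_vs_rest(models, x_i) for x_i in X]
```

With three classes, `models` holds three parameter vectors, and `predicted` reproduces `labels` on this toy set.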

**Overfitting and Underfitting:** Overfitting and underfitting are two problems that can occur in both Linear Regression and Logistic Regression. Overfitting occurs when the model fits the training data set so closely that it does not represent the general case properly. Underfitting occurs when the model does not fit even the training data set properly.

*Regularization* is a useful technique for reducing overfitting. It keeps the values in the parameter vector \(\theta\) within a smaller range, which stops the model's curve from bending too aggressively to follow the training data. This is achieved by adding an extra penalty term to the cost function that grows with the size of the parameters. It prevents the model from overfitting to the training dataset. We can use this technique in both linear regression and logistic regression.

**Regularized Linear Regression:** The gradient descent algorithm for linear regression after adding regularization looks like the following. Both updates have to be repeated in each step of gradient descent. Here, \(j\) stands for 1, 2, 3, ... and indexes the \(\theta\) parameters in the hypothesis other than \(\theta_0\), which is not regularized. \(\lambda\) is the regularization parameter.

$$\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_0^i$$

$$\theta_j = \theta_j - \alpha [(\frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_j^i) + \frac{\lambda}{m} \theta_j]$$
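
One way to read the second update is to collect the \(\theta_j\) terms. This is only a rearrangement of the same rule:

$$\theta_j = \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_j^i$$

For small \(\alpha\) and moderate \(\lambda\), the factor \(1 - \alpha \frac{\lambda}{m}\) is slightly less than 1, so every step first shrinks \(\theta_j\) a little and then applies the usual unregularized update. The \(\frac{\lambda}{m}\theta_j\) term is what differentiating a squared penalty of the form \(\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\) would produce. That form is assumed here, with \(n\) denoting the number of features, since the regularized cost function itself is not written out in this post.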

**Regularized Logistic Regression:** The gradient descent algorithm for logistic regression after adding regularization looks like the following.

$$\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_0^i$$

$$\theta_j = \theta_j - \alpha [(\frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_j^i) + \frac{\lambda}{m} \theta_j]$$

These updates look the same as those for regularized linear regression. However, it is important to remember that in this case the hypothesis function \(h_\theta (x)\) is the logistic function, unlike in linear regression.
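
The two regularized updates can be sketched in plain Python as below. The function names, the toy data and the choice of \(\lambda = 1\) are assumptions made for illustration; the point is only that \(\theta_0\) is updated without the penalty term while every other \(\theta_j\) gets the extra \(\frac{\lambda}{m}\theta_j\).

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def regularized_step(theta, X, y, alpha, lam):
    """One gradient descent step; the penalty applies only to j >= 1."""
    m = len(y)
    errors = [
        sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for x_i, y_i in zip(X, y)
    ]
    new_theta = []
    for j, theta_j in enumerate(theta):
        grad = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        if j > 0:  # theta_0 (the bias parameter) is not penalized
            grad += lam / m * theta_j
        new_theta.append(theta_j - alpha * grad)
    return new_theta

def fit(X, y, alpha=0.1, lam=0.0, iterations=5000):
    """Fixed number of steps from a zero start (both are assumptions)."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = regularized_step(theta, X, y, alpha, lam)
    return theta

# Toy data, made up for illustration: first column is the bias feature 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta_plain = fit(X, y, lam=0.0)  # no regularization
theta_reg = fit(X, y, lam=1.0)    # penalized: the feature weight stays smaller
```

On this toy set the regularized weight `theta_reg[1]` ends up smaller in magnitude than `theta_plain[1]`, which is the "smaller range" effect that regularization is meant to have.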
