Thursday, October 26, 2017

3. Notes on Machine Learning: Basics of Neural Networks



Neural networks are an interesting branch of machine learning which attempts to mimic the functionality of neurons in the human brain. A neural network consists of the input feature vector \(X\) fed into a node, the hypothesis function (sometimes called the activation function) running inside the node, and finally the output of that function. Instead of having a single activation unit, we can have multiple layers of activation nodes. The input vector layer is considered the first layer, while there are multiple hidden layers (layer 2, layer 3, etc.) before the output layer.

In this case, the \(\theta\) parameter set is not a single vector like in linear regression and logistic regression. In neural networks, we have a \(\theta\) parameter set between every two layers. For example, in the above figure we have three layers, and therefore we have two \(\theta\) sets. The arrows going from layer 1 to layer 2 represent the parameter set \(\theta^{(1)}\). The arrows going from layer 2 to layer 3 represent the parameter set \(\theta^{(2)}\). The superscript number within the brackets represents the layer this parameter set originates from. Furthermore, \(\theta^{(1)}\) is a matrix with 3x4 dimensions. There, every row represents the set of arrows coming from the layer 1 features to a node in layer 2. For example, the element \(\theta^{(1)}_{10}\) represents the arrow to \(a_1^{(2)}\) from \(x_0\). The element \(\theta^{(1)}_{20}\) represents the arrow to \(a_2^{(2)}\) from \(x_0\).

$$\theta^{(1)} = \begin{bmatrix}\theta^{(1)}_{10} & \theta^{(1)}_{11} & \theta^{(1)}_{12} & \theta^{(1)}_{13}\\\theta^{(1)}_{20} & \theta^{(1)}_{21} & \theta^{(1)}_{22} & \theta^{(1)}_{23}\\\theta^{(1)}_{30} & \theta^{(1)}_{31} & \theta^{(1)}_{32} & \theta^{(1)}_{33}\end{bmatrix}$$

Meanwhile, \(\theta^{(2)}\) is a row vector (1x4) in this case. This is because there are 4 arrows (one from each layer 2 node, including the bias unit) going to the single node in layer 3.

$$\theta^{(2)} = \begin{bmatrix}\theta^{(2)}_{10} & \theta^{(2)}_{11} & \theta^{(2)}_{12} & \theta^{(2)}_{13}\end{bmatrix}$$


The hypothesis function in these neural networks is a logistic function, just like in logistic regression.
 $$h_\theta (x) = \frac{1}{1 + e^{- \theta^T x}}$$
For a neural network like the one shown in the above figure, we can calculate the activations and get the final output in the following way. There, \(a_1^{(2)}\) represents activation node 1 in layer 2 (the hidden layer). Similarly, \(a_2^{(2)}\) represents activation node 2 in layer 2, and so on.

$$a_1^{(2)} = g(\theta_{10}^{(1)} x_{0} + \theta_{11}^{(1)} x_{1} + \theta_{12}^{(1)} x_{2} + \theta_{13}^{(1)} x_{3})$$

$$a_2^{(2)} = g(\theta_{20}^{(1)} x_{0} + \theta_{21}^{(1)} x_{1} + \theta_{22}^{(1)} x_{2} + \theta_{23}^{(1)} x_{3})$$

$$a_3^{(2)} = g(\theta_{30}^{(1)} x_{0} + \theta_{31}^{(1)} x_{1} + \theta_{32}^{(1)} x_{2} + \theta_{33}^{(1)} x_{3})$$

$$h_\theta (x) = a_1^{(3)} = g(\theta_{10}^{(2)} a_{0}^{(2)} + \theta_{11}^{(2)} a_{1}^{(2)} + \theta_{12}^{(2)} a_{2}^{(2)} + \theta_{13}^{(2)} a_{3}^{(2)})$$
Since the hypothesis function is a logistic function, the final output we get is a value between 0 and 1. To build a multiclass classifier, we have multiple nodes in the output layer. Then we get a separate output value from each node in the output layer, representing a specific class.
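To make the forward propagation above concrete, here is a minimal sketch in Python/NumPy of a network with 3 input features plus a bias, 3 hidden units, and a single output. The weight values and the input vector are invented purely for illustration; in a real network the \(\theta\) matrices are learned.

import numpy as np

def sigmoid(z):
    # The logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values; Theta1 and Theta2 would normally be learned.
x = np.array([1.0, 0.5, -1.2, 0.3])   # x0 = 1 is the bias unit, followed by 3 features
Theta1 = np.full((3, 4), 0.1)         # layer 1 (3 features + bias) -> layer 2 (3 units)
Theta2 = np.full((1, 4), 0.1)         # layer 2 (3 units + bias) -> layer 3 (1 output unit)

a2 = sigmoid(Theta1 @ x)              # hidden layer activations a^(2)
a2 = np.concatenate(([1.0], a2))      # prepend the bias unit a0^(2) = 1
h = sigmoid(Theta2 @ a2)              # final hypothesis h_theta(x), a value between 0 and 1
print(h)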

~************~

Tuesday, October 17, 2017

2. Notes on Machine Learning: Logistic Regression



Unlike regression problems, where our objective is to predict a scalar value for a given set of feature values, in classification problems our objective is to decide on a discrete value from a limited set of choices. Therefore, we are not going to use Linear Regression for classification problems. Based on how many categories we have to classify our input into, we have two types of problems: binary classification and multiclass classification.



In binary classification problems, there are only two possible outputs for the classifier: 0 or 1. Multiclass classification problems, on the other hand, can have multiple output values such as 0, 1, 2, ..., but it is still a discrete set of values, not an infinite one. When solving these classification problems, we develop machine learning models which solve binary classification problems. When we have multiple classes in a problem, we use multiple binary classification models, one to check for each class.

Logistic Regression:

In order to keep the prediction output between 0 and 1, we use a sigmoid/logistic function as the hypothesis function. Therefore, we call this technique Logistic Regression. The hypothesis function of logistic regression is as follows.
$$h_\theta (x)=\frac{1}{1+e^{-\theta^T x}}$$
The vectorized representation of this hypothesis would look like the following.
$$h=g(X\theta)$$
For any input vector \(x\), this logistic regression hypothesis function outputs a value between 0 and 1. If the value is closer to 1, we can consider it a classification into class 1. Similarly, if the value is closer to 0, we can consider the classification to be class 0. Alternatively, we can interpret the output value between 0 and 1 as the probability of belonging to class 1. For example, if we receive the output 0.6, it means there is a \(60\%\) probability that the input data belongs to class 1. Similarly, if we receive the output 0.3, it means there is a \(30\%\) probability that the input data belongs to class 1.
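As a rough sketch (the parameter and feature values below are made up purely for illustration), computing the hypothesis and interpreting it as a probability could look like this in Python/NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(theta, x):
    # h_theta(x) = g(theta^T x), interpreted as the probability of class 1
    return sigmoid(theta @ x)

theta = np.array([-1.0, 0.8, 0.4])    # hypothetical learned parameters
x = np.array([1.0, 2.0, 1.5])         # x0 = 1 plus two feature values
p = predict_probability(theta, x)     # roughly 0.77 for these values
print("class 1" if p >= 0.5 else "class 0")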

Cost Function:

The cost function of logistic regression is different from the cost function of linear regression. This cost function is designed so that if the logistic regression model makes a perfectly correct prediction, it generates zero cost, while if it makes a completely wrong prediction, it generates an infinite penalty value.
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m} [y^i log(h_\theta(x^i)) + (1-y^i) log(1-h_\theta(x^i))]$$
A vectorized implementation of the cost function would look like the following.
$$J(\theta)=\frac{1}{m}(-y^Tlog(h) - (1-y^T)log(1-h))$$
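A minimal NumPy version of this vectorized cost might look like the following sketch, assuming X already contains the leading column of ones and y holds 0/1 labels; it is not code from any particular library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h))
    m = len(y)
    h = sigmoid(X @ theta)
    return (1.0 / m) * (-y @ np.log(h) - (1 - y) @ np.log(1 - h))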

Gradient Descent:

In order to adjust the parameter vector \(\theta\) until it fits the training data set properly, we need to run the gradient descent algorithm. The following update has to be repeated for each \(j\) simultaneously, where \(j\) indexes the parameters in \(\theta\). In this gradient descent algorithm, \(\alpha\) is the learning rate.
$$\theta_j=\theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^i) - y^i)x_j^i$$
A vectorized implementation of the gradient descent algorithm would look like the following.
$$\theta = \theta - \frac{\alpha}{m}X^T(g(X\theta) - y)$$
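A corresponding sketch of the gradient descent loop follows; the learning rate and the number of iterations are arbitrary illustrative choices, and in practice one would monitor the cost for convergence.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    # Repeats theta := theta - (alpha/m) * X^T (g(X theta) - y)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta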

Multiclass Classification:

The above technique works for classifying data into two classes. When we encounter a multiclass classification problem, we should train a logistic regression model for each class; this is known as the one-vs-all approach. For example, if there are 3 classes, we need three logistic regression models, each trained to distinguish between its target class and the others. To classify a new input, we run all the models and pick the class whose model gives the highest output, as shown in the sketch below. In this way, we can solve multiclass classification problems using logistic regression.
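A rough sketch of this one-vs-all idea, reusing the hypothetical sigmoid and gradient_descent functions from the sketches above (class labels are assumed to be 0, 1, ..., num_classes - 1):

import numpy as np

def one_vs_all(X, y, num_classes, alpha=0.1, iterations=1000):
    # Trains one binary logistic regression model per class
    all_theta = np.zeros((num_classes, X.shape[1]))
    for c in range(num_classes):
        binary_y = (y == c).astype(float)   # 1 for the target class, 0 for all others
        all_theta[c] = gradient_descent(X, binary_y, alpha, iterations)
    return all_theta

def predict_class(all_theta, x):
    # Picks the class whose model outputs the highest probability
    return int(np.argmax(sigmoid(all_theta @ x)))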

Overfitting and Underfitting:

Overfitting and underfitting are two problems which can occur in both Linear Regression and Logistic Regression. The former occurs when our model fits the training data set too closely, so that it does not represent the general case properly. The latter occurs when our model does not even fit the training data set properly.

Regularization is a nice technique to address the problem of overfitting. What happens there is that we keep the values of the \(\theta\) parameter vector in a smaller range in order to stop the learned curve from adjusting too aggressively. This is achieved by adding an extra penalty term on the parameters to the cost function, which prevents the model from overfitting to the training dataset. We can use this technique in both linear regression and logistic regression.
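For example, a commonly used regularized cost function for linear regression adds a penalty on all the parameters except \(\theta_0\):
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m}(h_\theta(x^i) - y^i)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$
Here, \(\lambda\) controls how strongly large parameter values are penalized.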

Regularized Linear Regression:

The gradient descent algorithm for linear regression after adding regularization looks like the following. Both update rules have to be applied in each iteration of gradient descent. There, \(j\) stands for 1, 2, 3, ..., representing each \(\theta\) parameter in the hypothesis except \(\theta_0\), which is conventionally not regularized and therefore has its own update rule. \(\lambda\) is the regularization parameter.
$$\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_0^i$$
$$\theta_j = \theta_j - \alpha [(\frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_j^i) + \frac{\lambda}{m} \theta_j]$$

Regularized Logistic Regression:

The gradient descent algorithm for logistic regression after adding regularization looks like the following.
$$\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_0^i$$
$$\theta_j = \theta_j - \alpha [(\frac{1}{m} \sum_{i=1}^{m}(h_\theta(x^i) - y^i) x_j^i) + \frac{\lambda}{m} \theta_j]$$
Of course, it looks just like regularized linear regression. However, it is important to remember that in this case the hypothesis function \(h_\theta (x)\) is a logistic function, unlike in linear regression.
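Since only the hypothesis differs between the two cases, a single regularized update step can serve both; the sketch below passes the hypothesis in as a function (the plain product \(X\theta\) for linear regression, the sigmoid of it for logistic regression). This is illustrative code, not taken from any particular library.

import numpy as np

def regularized_step(theta, X, y, hypothesis, alpha, lam):
    # One regularized gradient descent step; theta_0 (the bias) is not penalized
    m = len(y)
    error = hypothesis(X @ theta) - y       # h_theta(x) - y for every training example
    grad = (1.0 / m) * (X.T @ error)
    reg = (lam / m) * theta
    reg[0] = 0.0                            # exclude theta_0 from regularization
    return theta - alpha * (grad + reg)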

~**********~ 

Thursday, October 12, 2017

Ireland's Space Week in 2017

Last week was Space Week in Ireland, where the focus was on promoting space exploration and related science and technologies among people. This time is so special for Ireland because they are building a cube satellite with the joint contribution of universities and Irish space technology companies. During this Space Week, there were many events organized by different institutions all over Ireland. Even though I was busy with my work, I finally managed to attend an interesting event organized by University College Dublin.

The event was titled Designing for Extreme Human Performance in Space and was conducted by two very interesting personalities. The first was Dava J. Newman, a former deputy administrator of NASA who currently works at MIT. The second was Guillermo Trotti, a professional architect who has worked with NASA on interesting projects. Seeing the profiles of these two speakers attracted me to the event. The session ran for about an hour and a half, with the two speakers sharing the time to talk about the two different areas they are interested in. Finally, the session concluded with a Q&A session.

In her presentation, Dava talked about the extreme conditions in space which raise the requirement of designing life support systems to assist astronauts. When she asked the famous astronaut Scott Kelly (@StationCDRKelly), who spent a year on the ISS, what would be the most needed thing if we are to improve space technology, he responded that life support systems to ease the work of astronauts in space are the most needed thing. Dava presented the work she is involved in, designing a new kind of space suit for astronauts to use on other planets such as Mars. The pictures she showed depict a skin-tight suit which is custom designed to the body specifications of an astronaut, very much like a suit from a sci-fi movie.

Gui Trotti, in his presentation, talked specifically about his architectural interest in building habitable structures for humans on the Moon and Mars. As a professional architect, he is very inspired to bring his skills to human colonies in outer space. During that presentation, he mentioned three things that inspired me very much. The first is the fact that when an astronaut goes to space and turns back to look at his home planet, all the borders and nationalistic pride go away, and the feeling comes that we are all one human race and that planet Earth is the only home we have. Secondly, he described his tour around the world in a sailing boat, which reminded him that space exploration is another form of human courage to explore and see the world. Finally, he said that his dream is to build a university on the Moon one day, to enable students from Earth to visit and do research while appreciating our home planet.

During the Q&A session, a lot of people asked interesting questions. Among those, one question was about the commercialization of space. The speakers responded with the important point that there is potential for performing commercial activities such as manufacturing in space, especially things which can be done more easily in zero-gravity environments than on the surface of the Earth. Various things such as growing food plants and 3D printing have been tried on the ISS in this direction. In the near future, perhaps a decade down the line, we will be able to see much more activity from the private sector in space than today. They are very positive about the progress in this area.

Even though I don't work anywhere related to space exploration, I'm always fascinated by this topic and will continue to be.

~*********~

Thursday, October 5, 2017

1. Notes on Machine Learning: Linear Regression


Machine Learning is a hot topic these days since various kinds of applications rely on Machine Learning algorithms to get things done. While learning this topic, I will be writing my own notes about it as an article series in this blog for my own future reference. The contents might be highly abstract, as this is not a tutorial aimed at teaching somebody Machine Learning from these notes.

Definition:

According to Tom Mitchell, Machine Learning is defined as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." Basically, we are enabling computers to do things without explicitly programming them to do so.

Categories of Machine Learning:

There are two broad categories of Machine Learning algorithms. The first is Supervised Learning and the second is Unsupervised Learning. There are various sub-categorizations under these categories, such as regression and classification under supervised learning, and clustering under unsupervised learning.


Machine Learning Models:

A model is a function (i.e., a hypothesis) \( h(x) \) which provides the output value \( y \) for given input values \( x \), based on a previously given training dataset \( X \) and output set \( Y \). The input values \( x \) are called features. A hypothesis function in Linear Regression with one feature would look like the following.
$$ h(x) = \theta_{0} x_{0} + \theta_{1} x_{1} $$
The first feature \( x_{0} \) is always set to 1, while the second feature \( x_{1} \) is the actual feature used in this model. The parameters \( \theta_{0} \) and \( \theta_{1} \) are the weights of the features on the final output, and therefore they are the values we are looking for in order to build the linear regression model for a specific dataset. The reason we have an extra feature at the beginning which is always set to 1 is that it makes it easy to perform vectorized calculations (using matrix-based tools), as the sketch below illustrates.
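As a small illustration of why the constant feature \( x_{0} = 1 \) is convenient: once a column of ones is prepended, the hypothesis for a whole dataset becomes a single matrix product. The numbers below are made up.

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])                 # one real feature for three examples
X = np.column_stack([np.ones_like(x1), x1])    # prepend x0 = 1 to every example
theta = np.array([0.5, 2.0])                   # [theta_0, theta_1]

predictions = X @ theta                        # h(x) for all examples at once
print(predictions)                             # [2.5, 4.5, 6.5]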

Cost Function:

In order to measure the accuracy of a particular hypothesis function, we use another function called the cost function. It is actually a mean squared error function of the difference between the predicted output value and the true output value of the hypothesis. By adjusting the values of the parameters \( \theta_{0} \) and \( \theta_{1} \) in the hypothesis, we can minimize the cost function and make the hypothesis more precise.
$$J(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^i) - y^i)^{2}$$
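A minimal NumPy version of this cost function, assuming X already includes the column of ones, might look like the following sketch.

import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/(2m) * sum of squared prediction errors
    m = len(y)
    error = X @ theta - y
    return (1.0 / (2 * m)) * np.sum(error ** 2)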

Gradient Descent:

In this algorithm, what we are doing is adjusting the parameters \( \theta_{0} \) and \( \theta_{1} \) until the cost function eventually reaches the minimum it can get. That means we have found the most accurate model for the training dataset distribution. So, in order to adjust the parameters \( \theta_{0} \) and \( \theta_{1} \), we perform the following step over and over again for \( \theta_{0} \) and \( \theta_{1} \). In this equation, \( j \) is 0 and 1 for the two update steps.
$$\theta_{j} = \theta_{j} - \alpha \frac{\partial }{\partial \theta_{j}} J(\theta_{0},\theta_{1})$$
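Working out the partial derivative gives the familiar update \(\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^i) - y^i)x_j^i\), which can be sketched in NumPy as follows; the learning rate and iteration count are arbitrary illustrative choices.

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    # Repeats theta := theta - (alpha/m) * X^T (X theta - y) for a fixed number of steps
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta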

~*********~