The easiest and most interesting way to start your journey with machine learning is probably linear regression and all the peculiarities connected to it. It will help you understand the main components of a model and the breaking points that you should pay attention to.
For the record, from now on, when I say just "regression", I am referring to simple linear regression, as opposed to multiple regression or models that are not linear, which we will hopefully get to later.
Regression allows us to model mathematically the relationship between two or more variables using very simple algebra. For now, we'll be working with just two variables: an independent variable and a dependent variable.
The truth is, when we talk about how "good" a regression model is, we are actually comparing it to another specific model.
I would like to start my article with a problem, and a relatively real-world one. We'll call this one tip for service. Let's assume that you are a small restaurant owner or a waiter in a nice restaurant.
Tips are a very important part of a waiter's pay. Most of the time, the dollar amount of the tip is related to the dollar amount of the total bill, so a $5 bill would have a smaller tip than a $50 bill. As the waiter or the owner, you would like to develop a model that lets you predict what tip to expect for any given bill amount. So one evening you collect data for six meals. Unfortunately, when you begin to look at your data, you realize you forgot something: you recorded the tip amount for each meal but not the bill amount that goes with it. For now, this is the best data you have: a random sample of six meals and the tip amount for each one of those meals. So $5, $17, $11 and so on.
There's only one variable here, the tip amount. The meal number's just a descriptor. So we have one variable, the tip amount. But I still want to challenge you to come up with a model that will allow you to predict within some reason what the next tip is going to be.
Think about it. The first thing we're going to do is visualize our data, so we'll make a graph of our tips. On the x-axis, along the bottom, we have our meal number. That's not a variable; it's just a descriptor of which meal we're graphing.
On the y-axis, or the vertical axis, we will graph our tip amount. Let's go ahead and see what this looks like. Meal one, with a tip of $5, gets graphed at around $5. Meal two, with a tip of $17, goes way up there. Meal three, with a tip of $11, goes there. Meal four, with a tip of $8, goes there. Meal five was a $14 tip. And meal number six was a $5 tip.
So here are our data points.
Remember, we're only dealing with one variable, that's the tip amount, and the meals along the bottom just describe where we're graphing each point. And the order does not matter. We could have graphed these in any order. This just happens to be the one we ended up with.
The mean of the tip amounts is $10, and I want to label it y bar, the mean of y. That's for two reasons. One, the tip amount will be our dependent variable as we progress, and the dependent variable is always the y of the x and y axes. Two, we're graphing it on the y-axis, so its mean should be y bar.
So here it is, the basic concept I really want you to keep in your head as you go forward. Obviously, simple linear regression is about two variables, but we're starting here because this is where it all begins. With only one variable and no other information, the best prediction for the next measurement is the mean of the sample itself.
The tips are not on the line; they're above and below it. That variability in the tip amounts can only be explained by the tips themselves, because that's all we have.
So the way they sit above and below the line is just the natural variation in the tips. But the basic point is this: with only one variable, the best way, the only way, we can predict the next tip amount is the mean. So our best prediction for the tip on meal number seven is $10.
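To make the point concrete, here is a minimal Python sketch that computes the prediction from the six tips listed above. The variable names (`tips`, `predicted_tip`) are mine, chosen just for illustration.

```python
# Tip amounts (in dollars) for the six sampled meals from the example.
tips = [5, 17, 11, 8, 14, 5]

# With only one variable available, the sample mean is the prediction.
predicted_tip = sum(tips) / len(tips)

print(f"Predicted tip for meal 7: ${predicted_tip:.2f}")
# prints "Predicted tip for meal 7: $10.00"
```

The six tips add up to $60, so the mean works out to exactly $10.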
Next, we want to know how well this line fits the observed data points. One way to do that is to measure the distance each point is from that best fit line. We did this to some degree when we were talking about standard deviation. Remember, there we were talking about the distance each data point is from the mean. But guess what we're doing here?
We're measuring the distance that each data point is from the mean, because the mean is our line at $10. So, for meal number one, our tip was $5. That's $5 below our mean of $10, so that's negative five. Meal number two got a tip of $17; that was $7 above our mean. Meal three was $11, $1 above our mean. Meal four was $8, that's $2 below our mean. Meal five was $14, that's $4 above our mean. And meal six was $5, that's $5 below our mean.
So these are the distances, in this case dollar amounts, by which each observed value deviates from the mean of $10. We have a name for these: they're called residuals. A residual is the distance from the best fit line, which in this case, because there's only one variable, is the line at $10, to an observed value.
Residuals are also called errors, because each one is how far off the observed value is from the best fit line.
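As a rough sketch, the residuals described above can be computed in Python like this. The names `mean_tip` and `residuals` are my own, and I'm using the sign convention from the text: observed value minus the mean, so tips below the line come out negative.

```python
# Tip amounts (in dollars) for the six sampled meals from the example.
tips = [5, 17, 11, 8, 14, 5]

# The best fit line with one variable is the horizontal line at the mean.
mean_tip = sum(tips) / len(tips)

# Residual = observed tip minus the mean (negative means below the line).
residuals = [tip - mean_tip for tip in tips]

for meal, (tip, residual) in enumerate(zip(tips, residuals), start=1):
    print(f"Meal {meal}: tip ${tip}, residual {residual:+.0f}")
```

Notice that the residuals add up to zero: the amounts above the mean exactly cancel the amounts below it, which is one reason we can't just add them up to measure fit.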
But if you remember in standard deviation, one of the steps was that we took the deviations from the mean and we squared them.
We're gonna do the exact same thing here. The residual for meal one was negative five, since the tip was $5 below the mean, so we square that and get 25. Meal number two was $7 above, so we square seven and that's 49. So on and so forth. In the right-hand column of our table, we have our squared residuals. Now the question is, why do we square them?
Well, we square them for the same reasons we square the deviations when calculating the standard deviation. Number one, it makes them all positive. If we square a negative number, it obviously becomes positive. And number two, it emphasizes the larger deviations. A deviation of two will square to four, but a deviation of five will square to 25. So the squaring really exaggerates the points that are further away. Now what we can do is take these squared residuals in the right-hand column and add them up. The result is called the sum of squared residuals, or the sum of squared errors, or the SSE.
Now where have you heard that before?
You've obviously heard it in standard deviations, and you've heard it in ANOVA. Same idea, sum of the squared errors. It's a fancy way of saying we add up the squared residuals. And when we do so, we get 120. Now when we say squaring the residuals, we literally mean squaring them. So the 25 for meal one is negative five squared, the 49 for meal two is seven squared, and so forth.
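Here is a short Python sketch of that calculation, using the six tips from the example. The names `squared_residuals` and `sse` are just illustrative choices.

```python
# Tip amounts (in dollars) for the six sampled meals from the example.
tips = [5, 17, 11, 8, 14, 5]
mean_tip = sum(tips) / len(tips)

# Square each residual: this makes every term positive and
# weights the larger deviations more heavily.
squared_residuals = [(tip - mean_tip) ** 2 for tip in tips]

# The SSE is simply the sum of the squared residuals.
sse = sum(squared_residuals)

print(squared_residuals)  # [25.0, 49.0, 1.0, 4.0, 16.0, 25.0]
print(sse)                # 120.0
```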
So at the end we are defining the best-fitting line to describe our data.
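One way to see why the mean counts as the "best" line here, at least in the sum-of-squared-errors sense, is to try a few other horizontal lines and compare their SSE. The sketch below does that. The helper `sse_for` and the list of candidate values are my own choices for illustration, not part of the original example.

```python
# Tip amounts (in dollars) for the six sampled meals from the example.
tips = [5, 17, 11, 8, 14, 5]


def sse_for(values, prediction):
    """Sum of squared errors if every value is predicted as `prediction`."""
    return sum((value - prediction) ** 2 for value in values)


# Hypothetical candidate lines to compare; 10 is the sample mean.
candidates = [8, 9, 9.5, 10, 11, 12]

for candidate in candidates:
    print(f"Line at ${candidate}: SSE = {sse_for(tips, candidate)}")

# Pick the candidate with the smallest SSE.
best = min(candidates, key=lambda c: sse_for(tips, c))
print(f"Smallest SSE among these candidates: line at ${best}")
```

Among these candidates, the line at the mean of $10 gives the smallest SSE (120), and the SSE grows as the line moves away from the mean in either direction.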