Learning From Data [Q&A]: Lecture 01
Question 1: How do you determine if a set of points is linearly separable, and what do you do if they're not separable?
The linear separability assumption is a very simplistic assumption, and mostly doesn't apply in practice. And I chose it only because it goes with a very simple algorithm, which is the perceptron learning algorithm. There are two ways to deal with the case of linear inseparability. There are algorithms, and most algorithms actually deal with that case, and there's also a technique that we are going to study next week, which will take a set of points which is not linearly separable, and create a mapping that makes them linearly separable.
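To make that second idea concrete, here is a minimal sketch in Python with NumPy. The specific transform z = (x1², x2²) and the toy circular target are illustrative choices, not the ones from the course; the point is only that points which are not linearly separable in the original space can become separable after a nonlinear transform.

```python
import numpy as np

# Toy data: +1 inside the unit circle, -1 outside.  In the original
# (x1, x2) coordinates this target is not linearly separable.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)

# Illustrative nonlinear transform (my choice): z = (x1^2, x2^2).
# In z-space the true boundary z1 + z2 = 1 is a straight line.
Z = X**2

# A linear rule in z-space now classifies every point correctly.
predictions = np.sign(1.0 - Z[:, 0] - Z[:, 1])
print("separable after the transform:", bool(np.all(predictions == y)))
```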
So there is a way to deal with it. However, for the question of how you determine whether it's linearly separable, the right way of doing it in practice is that, when someone gives you data, you assume in general it's not linearly separable. It will hardly ever be, and therefore you take techniques that can deal with that case as well. There is a simple modification of the perceptron learning algorithm, which is called the pocket algorithm, that applies the same rule with a very minor modification, and deals with the case where the data is not separable. However, if you apply the perceptron learning algorithm, which is guaranteed to converge to a correct solution in the case of linear separability, to data that is not linearly separable, bad things happen. Not only is it not going to converge; obviously it is not going to converge, because it terminates when there are no misclassified points, right? If there is a misclassified point, then there's always a next iteration. So since the data is not linearly separable, we will never come to a point where all the points are classified correctly.
So this is not what is bothering us. What is bothering us is that, as you go from one step to another, you can go from a very good solution to a terrible solution, in the case of no linear separability. So it's not an algorithm that you would like to use and just terminate by force at some iteration. A modification of it can be used this way, and I'll mention it briefly when we talk about linear regression and other linear methods.
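Here is a rough sketch of the pocket modification mentioned above (the function name and the stopping rule are my own choices, not from the lecture): it runs the usual perceptron update, w ← w + yₙ·xₙ on a misclassified point, but keeps in its "pocket" the best weight vector seen so far, so you can stop after a fixed number of iterations even when the data is not separable.

```python
import numpy as np

def pocket_pla(X, y, max_iters=1000, rng=None):
    """Perceptron learning with a 'pocket': keep the best weights seen so far.

    X : (N, d) array of inputs, y : (N,) array of labels in {-1, +1}.
    Returns the pocketed weight vector, with a bias weight prepended.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    Xb = np.hstack([np.ones((len(X), 1)), X])   # add the constant coordinate x0 = 1
    w = np.zeros(Xb.shape[1])
    best_w, best_errors = w.copy(), np.inf

    for _ in range(max_iters):
        misclassified = np.where(np.sign(Xb @ w) != y)[0]
        if len(misclassified) < best_errors:    # pocket the best-so-far weights
            best_errors, best_w = len(misclassified), w.copy()
        if len(misclassified) == 0:             # separable case: PLA has converged
            break
        n = rng.choice(misclassified)           # standard PLA step on one bad point
        w = w + y[n] * Xb[n]

    return best_w
```

Returning the pocketed best_w, rather than whatever the final iteration happens to hold, is what guards against the jump from a good solution to a terrible one described above.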
Question 2: There's also a question of how the rate of convergence of the perceptron changes with the dimensionality of the data.
Badly.
That's the answer. Let me put it this way. You can build pathological cases where it really will take forever. However, I did not give the perceptron learning algorithm in the first lecture to tell you that this is the great algorithm that you need to learn.
I gave it in the first lecture because this is the simplest algorithm I could give. By the end of this course, you'll be saying, what? Perceptron? Never heard of it. So it will go out of contention after we get to the more interesting stuff. But as a method that can be used, it indeed can be used, and can be explained in five minutes, as you have seen.
Question 3: Regarding the items for learning, you mentioned that there must be a pattern. So can you be more specific about that? How do you know if there's a pattern?
You don't. My answers seem to be very abrupt, but that's the way it is. When we get to the theory -- is learning feasible -- it will become very clear that there is a separation between the target function -- there is a pattern to detect -- and whether we can learn it. It is very difficult for me to explain it in two minutes; it will take a full lecture to get there. But the essence of it is that you take the data, you apply your learning algorithm, and there is something you can explicitly detect that will tell you whether you learned or not. So in some cases, you're not going to be able to learn. In some cases, you'll be able to learn. And the key is that you're going to be able to tell by running your algorithm. And I'm going to explain that in more detail later on. So basically, I'm also resisting taking the data, deciding whether it's linearly separable, looking at it and seeing.
You will realize as we go through that it's a no-no to actually look at the data. What? That's what data is for, to look at. Bear with me. We will come to the level where we ask why we don't look at the data -- just looking at it and then saying: it's linearly separable, let's pick the perceptron. That's bad practice, for reasons that are not obvious now. They will become obvious once we are done with the theory. So when someone knocks on my door with a set of data, I can ask them all kinds of questions about the data -- not the particular data set that they gave me, but about the general data that is generated by their process. They can tell me this variable is important, the function is symmetric; they can give me all kinds of information that I will take to heart. But I will try, as much as I can, to avoid looking at the particular data set that they gave me, lest I should tailor my system toward this data set, and be disappointed when another data set comes about. You don't want to get too close to the data set. This will become very clear as we go through the theory.
Question 4: In general, how does machine learning relate to other statistical techniques, especially econometric techniques?
Statistics is, in the form I said, machine learning where the target -- it's not a function in this case -- is a probability distribution. Statistics is a mathematical field. And therefore, you put in the assumptions that you need in order to be able to rigorously prove the results you have, and get the results in detail. For example, linear regression. When we talk about linear regression, it will have very few assumptions, and the results will apply to a wide range, because we didn't make too many assumptions. When you study linear regression under statistics, there is a lot of mathematics that goes with it, and a lot of assumptions, because that is the purpose of the field.
In general, machine learning tries to make the least assumptions and cover the most territory. These go together. So it is not a mathematical discipline, but it's not a purely applied discipline either. It spans the mathematical side, to a certain extent, but it is willing to actually go into territory where we don't have mathematical models, and still wants to apply our techniques. So that is what characterizes it the most. And then there are other fields.
You can also find machine learning under the name computational learning, or statistical learning. Data mining has a huge intersection with machine learning. There are lots of disciplines around that actually share some value. But the point is, the premise that you saw is so broad that it shouldn't be surprising that people at different times developed a particular discipline, with its own jargon, to deal with it. So what I'm giving you is machine learning as the mainstream goes, and as it can be applied as widely as possible, both to practical applications and scientific applications. You will see: here is a situation, I have an experiment, here is a target, I have the data. How do I produce the target in the best way I want? And then you apply machine learning.
Question 5: Do machine learning algorithms perform global optimization methods, or just local optimization methods?
Optimization is a tool for machine learning. So we will pick whatever optimization does the job for us. And sometimes there is a very specific optimization method. For example, in support vector machines, it will be quadratic programming. It happens to be the one that works with that. But optimization is not something that machine learning people study for its own sake. They obviously study it to understand it better, and to choose the correct optimization method. Now, the question is alluding to something that will become clear when we talk about neural networks, which is local minima versus global minima. And it is impossible to put this in any perspective before we get to the details of neural networks, so I will defer it until we get to that lecture.
Question 6: Can the hypothesis set be continuous, or does it have to be discrete and finite?

The hypothesis set can be anything, in principle. So it can be continuous, and it can be discrete. For example, in the next lecture I take the simplest case, where we have a finite hypothesis set, in order to make a certain point. In reality, almost all the hypothesis sets that you find are continuous and infinite. Very infinite! And the level of sophistication of the hypothesis set can be huge. And nonetheless, we will be able to see that under one condition, which comes from the theory, we'll be able to learn even if the hypothesis set is huge and complicated.
Question 7: I think I understood, more or less, the general idea, but I don't understand the second example you gave about credit approval. So how do we collect our data? Should we give credit to everyone, or should we make our data biased? Because we can't determine whether we should have given credit or not to the persons we rejected.
Correct. This is a good point. Every time someone asks a question, the lecture number comes to my mind; I know when I'm going to talk about it. So what you describe is called sampling bias, and I will describe it in detail. But when you use the biased data -- let's say the bank uses historical records. So it sees the people who applied and were accepted, and for those guys, it can actually predict what the credit behavior is, because it has their credit history.
They charged and repaid and maxed out, and all of this. And then they decide: is this a good customer or not? For those who were rejected, there's really no way to tell in this case whether they were falsely rejected -- whether they would have been good customers or not. Nonetheless, if you take the customer base that you have, and base your decision on it, the boundary works fairly decently. Actually, pretty decently, even for the other guys, because the other guys usually are deeper into the classification region than the boundary guys that you accepted and who turned out to be bad. But the point is well taken. The data set in this case is not completely representative, and there is a particular principle in learning that we'll talk about, which is sampling bias, that deals with this case.
Question 8: So how do you decide how much data is required for a particular problem?
So let me tell you the theoretical and the practical answer. The theoretical answer is that this is exactly the crux of the theory part that we're going to talk about. In the theory, we are going to see: can we learn? And with how much data? So all of this will be answered in a mathematical way. That is the theoretical answer. The practical answer is: that's not under your control. Someone knocks on your door: here is the data, I have 500 points. I tell him, I will give you a fantastic system if you just give me 2000. But I don't have 2000, I have 500. So now you go and you use your theory to do something to your system, such that it can work with the 500. There was one case -- I worked with data in different applications -- where at some point we had almost 100 million points. You were swimming in data. You wouldn't complain about data. Data was wonderful. And in another case, there were fewer than 100 points. And you had to handle the data with gloves! Because if you use them the wrong way, they are contaminated, which is an expression we will see, and then you have nothing. And you will produce a system, and you are proud of it, but you have no idea whether it will perform well or not. And you cannot give this to the customer, and have the customer come back to you and say: what did you do!? So there is a question of what performance you can achieve given the data size you have. But in practice, you really have no control over the data size in almost all cases, almost all the practical cases.
Question 9: Another question I have is regarding the hypothesis set. The larger the hypothesis set is, the better I'll probably be able to fit the data. But that, as you were explaining, might be a bad thing to do, because when a new data point comes, there might be trouble. So how do you decide the size of your hypothesis set?
You are asking all the right questions, and all of them are coming up. This is again part of the theory, but let me try to explain it. As we mentioned, learning is about being able to predict. So you are using the data not to memorize it, but to figure out what the pattern is. And if you figure out a pattern that applies to all the data, and it's a reasonable pattern, then you have a chance that it will generalize outside. Now the problem is that, if I give you 50 points and you use a 7000th-order polynomial, you will fit the heck out of the data. You will fit it so much, with so many degrees of freedom to spare, but you haven't learned anything. You just memorized it in a fancy way. You put it in a polynomial form that actually carries all the information about the data that you have, and then some. So you don't expect at all that this will generalize outside. And that intuitive observation will be formalized when we talk about the theory. There will be a measurement of the hypothesis set that you give me that measures its sophistication, and it will tell you: with that sophistication, you need that amount of data in order to be able to make any statement about generalization. So that is what the theory is about.
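A small numerical illustration of this point, with a toy target and noise level of my own choosing (the 50 points and 7000th-order polynomial above are rhetorical; the sketch below uses 10 points and a 9th-order polynomial to the same effect): the fit is nearly perfect on the training points and poor on fresh points from the same target.

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    return np.sin(2 * np.pi * x)                 # an arbitrary target, for illustration only

# Few training points, very flexible hypothesis: a 9th-order polynomial
# has as many parameters as we have data points.
x_train = rng.uniform(0, 1, 10)
y_train = target(x_train) + 0.1 * rng.standard_normal(10)
coeffs = np.polyfit(x_train, y_train, deg=9)     # interpolates the data almost exactly

# Fresh points from the same process reveal how little was learned.
x_test = rng.uniform(0, 1, 1000)
y_test = target(x_test) + 0.1 * rng.standard_normal(1000)

err_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
err_out = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"error on the training points: {err_in:.2e}")   # essentially zero: memorization
print(f"error on fresh points:        {err_out:.2e}")  # typically far larger
```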
Question 10: Suppose, I mean, here whatever we discussed, it is like I had a data set, and I came up with an algorithm and gave the output. But won't it also be important to see, OK, we came up with the output, and using that, what was the feedback? Are there techniques where you take the feedback and try to correct your hypothesis?
You are alluding to different techniques here. But one of them would be validation, which is, after you learn, you validate your solution. And this is an extremely established and core technique in machine learning that will be covered in one of the lectures.
Question 11: In practice, how many dimensions would be considered easy, medium, and hard for a perceptron problem?
The "hard", in most people's minds before they get into machine learning, is the computational time. If something takes a lot of time, then it's a hard problem. If something can be computed quickly, it's an easy problem. For machine learning, the bottleneck in my case has never been the computation time, even on incredibly big data sets. The bottleneck for machine learning is to be able to generalize outside the data that you have seen. So to answer your question, the perceptron behaves badly in terms of computational behavior. But we will be able to predict its generalization behavior, based on the number of dimensions and the amount of data. This will be given explicitly. And therefore, the perceptron algorithm is bad computationally, and good in terms of generalization. If you actually can get away with perceptrons, your chances of generalizing are good, because it's a simplistic model and therefore its ability to generalize is good, as we will see.
Question 12: Also, in the example you explained the use of a binary function. So can you use multi-valued or real-valued functions?
Remember when I told you that there is a topic that is out of sequence. There was a logical sequence to the course, and then I took part of the linear models and put it very early on, to give you something a little bit more sophisticated than perceptrons to try your hand on. That happens to be for real-valued functions. And obviously there are hypotheses that cover all types of co-domains. Y could be anything as well.
Question 13: Another question is, in the learning process you showed, when do you pick your learning algorithm, when do you pick your hypothesis set, and what liberty do you have?
The hypothesis set is the most important aspect in determining the generalization behavior that we'll talk about. The learning algorithm does play a role, although it is a secondary role, as we will see in the discussion. In general, the learning algorithm has the form of minimizing an error function. So you can think of the perceptron: what does the algorithm do? It tries to minimize the classification error. That is your error function, and you're minimizing it using this particular update rule. And in other cases, we'll see that we are minimizing an error function. Now, the minimization aspect is an optimization question, and once you determine that this is indeed the error function that I want to minimize, then you go and minimize it as much as you can, using the most sophisticated optimization technique that you find. So the question now translates into: what is the choice of the error function or error measure that will help or not help? And that will be covered next week under the topic Error and Noise. When I talk about error, we'll talk about error measures, and this translates directly to the learning algorithm that goes with them.
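To make the "error function" framing concrete, here is a minimal sketch (notation and function names are mine, not the lecture's; X is assumed to already include a constant coordinate for the threshold): the quantity the perceptron update is implicitly pushing down is the fraction of misclassified points, and one iteration is the rule w ← w + yₙ·xₙ applied to a point that currently contributes to that error.

```python
import numpy as np

def classification_error(w, X, y):
    """In-sample error measure: the fraction of points where sign(w.x) != y."""
    return float(np.mean(np.sign(X @ w) != y))

def perceptron_step(w, X, y):
    """One perceptron iteration: fix the first misclassified point, if any."""
    misclassified = np.where(np.sign(X @ w) != y)[0]
    if len(misclassified) == 0:
        return w                                  # error is already 0; nothing to do
    n = misclassified[0]
    return w + y[n] * X[n]                        # the update rule w <- w + y_n x_n
```

Note that a single step can temporarily increase this error measure, which is exactly the behavior the pocket idea from Question 1 guards against.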
Question 14: Back to the perceptron. So what happens if your hypothesis gives you exactly 0 in this case?
So remember that the quantity you compute and compare with the threshold was your credit score. I told you what happens if you are above threshold, and what happens if you're below threshold. So what happens if you're exactly at the threshold? Your score is exactly that. The informal answer is that it depends on the mood of the credit officer that day. If they had a bad day, you will be denied! But the serious answer is that there are technical ways of defining that point. You can define it as 0, so the sign of 0 is 0, in which case you are always making an error, because you are never +1 or -1 when you should be. Or you could make it belong to the +1 category or to the -1 category. There are ramifications for all of these decisions that are purely technical. Nothing conceptual comes out of them. That's why I decided not to include it, because it clutters the main concept with something that really has no ramification. As far as you're concerned, the easiest way to consider it is that the output will be 0, and therefore you will be making an error regardless of whether the target is +1 or -1.
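A tiny sketch of that convention, with made-up numbers: a score exactly at the threshold gives output 0, which counts as an error against either label (NumPy's sign already returns 0 at 0).

```python
import numpy as np

w = np.array([1.0, -2.0, 1.0])     # hypothetical weights, with the threshold folded in as w0
x = np.array([1.0, 1.0, 1.0])      # a point whose score lands exactly on the threshold

score = w @ x                       # 1 - 2 + 1 = 0: exactly at the threshold
output = np.sign(score)             # np.sign(0.0) == 0.0, neither +1 nor -1
for true_label in (+1, -1):
    # Under the sign(0) = 0 convention the point is an error for either label.
    print(f"true label {true_label:+d}: misclassified = {output != true_label}")
```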
Question 15: Is there a kind of problem that cannot be learned even if there's a huge amount of data?
Correct. For example, if I go to my computer and use a pseudo-random number generator to generate the target over the entire domain, then patently, nothing I can give you will make you learn the other guys. So remember the three -- let me try to -- the essence of machine learning. The first one was: a pattern exists. If there's no pattern, there is nothing to learn. Let's say that it's like a baby, and stuff is happening, and the baby is just staring. There is nothing to pick up from that thing. Once there is a pattern, you can see the smile on the baby's face. Now I can see what is going on. So whatever you are learning, there needs to be a pattern. Now, how to tell whether there's a pattern or not, that's a different question. But the main ingredient is that there's a pattern. The second one is that we cannot pin it down mathematically. If we can pin it down mathematically, and you decide to do the learning, then you are really lazy, because you could just write the code. But fine. You can use learning in this case, but it's not the recommended method, because it has certain errors in performance, whereas if you have the mathematical definition, you just implement it and you'll get the best possible solution. And the third one: you have data, which is key. So if you have plenty of data, but the first one is off, you are simply not going to learn. And it's not like I have to answer each of these questions at random. The theory will completely capture what is going on. So there's a very good reason for going through the four lectures in the outline that are mathematically inclined. This is not for the sake of math. I don't like to do math hacking, if you will. I pick the math that is necessary to establish a concept. And these will establish it, and they are very much worth being patient with and going through. Because once you're done with them, you basically have it cold: what are the components that make learning possible, how do we tell, and all of the questions that have been asked.
Question 16: Was the perceptron inspired by how neurons work in the brain, and how does that relate to neural networks?

I will discuss this in neural networks, but in general, when you take a neuron and synapses, and you look at what function gets to the neuron, you find that the neuron fires, which is +1, if the signal coming to it, which is roughly a combination of the stimuli, exceeds a certain threshold. So that was the initial inspiration, and the initial inspiration was that the brain does a pretty good job, so maybe if we mimic the function, we will get something good. But you mimic one neuron, and then you put it together and you get the neural network that you are talking about. And I will discuss the analogy with biology, and the extent to which it can be benefited from, when we talk about neural networks, because that will be the more proper context for it.
Question 17: Another question is, regarding the hypothesis set, are there Bayesian hierarchical procedures to narrow down the hypothesis set?
The choice of the hypothesis set, and the model in general, is model selection, and there's quite a bit that we are going to talk about in model selection when we talk about validation. In general -- the word Bayesian was mentioned here -- if you look at machine learning, there are schools that deal with the subject differently. So for example, the Bayesian school puts a complete mathematical framework on it. And then everything can be derived, and that is based on Bayesian principles. I will talk about that at the very end, so it's last but not least. And I will make a very specific point about it, for what it's worth. But what I'm talking about in the course, in all of the details, are the most commonly useful methods in practice. That is my criterion for inclusion. So I will get to that when we get there. In terms of a hierarchy, there are a number of hierarchical methods. For example, structural risk minimization is one of them. There are methods of hierarchies, and there are ramifications of them for generalization. I may touch upon it when I get to support vector machines. But again, there's a lot of theory, and if you read a book on machine learning written by someone from pure theory, you would think that you are reading about a completely different subject. It's respectable stuff, but different from the other stuff that is practiced. So one of the things that I'm trying to do is to pick, from all the components of machine learning, the big picture that gives you the understanding of the concept, and the tools to use it in practice. That is the criterion for inclusion.