
to have learned or been “trained” from the data. This is, at its most essential and rudimentary level, what statistical learning actually means in many (not all) contexts. If we then subject that model to new data, thus “sharpening” its scalars, the model “updates” what its estimators should be in order to continue optimizing a function. Note that this more or less parallels the idea of human learning, in that the model (or “you”) is “learning from experience” as each new experience is incorporated into knowledge. For example, a worker learns how to maximize his or her potential in a job through trial and error, otherwise known as “experience.” If one day his or her boss corrects him or her, that new “data” is incorporated into the learning mechanism. If on another day the individual is reinforced for doing something right, that is also incorporated into the learning mechanism. Of course, we cannot see the scalars or estimators (they are largely metaphorical in this case), but you get the idea. Learning “optimizes” some function through exposure to new experience. In classical learning theory in psychology, for instance, the rat in a Skinner box learns that if he presses the lever, he will receive a pellet of food. If he doesn’t press the lever, he doesn’t receive food. The rat is optimizing the function (it’s in his little brain, and it’s metaphorical; we can’t see it) that will allow him to distinguish which response gets the food. This is learning! When the rat is “trained” enough, he starts making predictions nearly perfectly, with very few errors. So it is with the statistical model; it does an increasingly good job at “getting it right” as it is trained on increasingly more data (i.e. more “experience”). It also “learns” from what it did wrong, just as the rat learns that if he doesn’t press the lever, he doesn’t eat.
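To make the idea of “updating” an estimator concrete, here is a minimal sketch in Python. It is only an illustration under simple assumptions: the “model” is nothing more than a sample mean (the value that minimizes the sum of squared deviations from the observations), and the function name `update_mean` and the data values are hypothetical, invented for this example.

```python
def update_mean(mean, n, x_new):
    """Return the updated mean and count after seeing one new observation."""
    n_new = n + 1
    # Shift the old estimate toward the new observation by 1/n_new of the gap.
    return mean + (x_new - mean) / n_new, n_new


# "Train" on three made-up observations.
mean, n = 0.0, 0
for x in [4.0, 6.0, 5.0]:
    mean, n = update_mean(mean, n, x)
mean_after_training = mean

# A new "experience" arrives and the estimate is revised.
mean, n = update_mean(mean, n, 9.0)
print(mean_after_training, mean)
```

Each new observation nudges the estimate, and the nudge shrinks as the number of observations grows, which loosely mirrors the way additional experience refines what has already been learned.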

      Now, in the spirit of statistical learning and “training,” validating a model has become equally emphasized, in the sense that after a model is trained on one set of data, it should be applied to a similar set of data to estimate the error rate on that new set. But what does this mean? How can we understand this idea? Easily! Here are some easy examples of where this occurs:

       The pilot learns in the simulator or test flights and then his or her knowledge is “validated” on a new flight. The pilot was “trained” in landing in a thunderstorm yesterday and now that knowledge (model) will be evaluated in a new flight on a new storm.

       Rafael Nadal, tennis player, learns from his previous match how to not make errors when returning the ball. That learning is evaluated on new data, which is a new tennis match.

       A student in a statistics class learns from the first test how to adjust his or her study strategies. That knowledge is validated on test 2 to see how much was learned.

      In this book, while it can be said that we do “train” models by fitting them, we do not cross-validate them on new data. Since it is essentially an introduction and primer, we do not take that additional step. However, you should know that such a step is often a good one to take if you have data at your disposal that make cross-validation doable. In many cases, scientists may not have such cross-validation data available to them, at least not yet. Hence, “splitting the sample” into a training and test set may not be doable due to the size of the data. However, that does not necessarily mean testing cannot be done. It can be, on a new data set that is assumed to be drawn from the same population as the original sample. Techniques for cross-validation do exist that minimize having to collect very large validation samples (e.g. see James et al., 2013). Further, to use one of our previous metaphors, validating the pilot’s skill may be delayed until a new storm is available; it does not necessarily have to be done today. Hence, and in general, when you fit a model, you should always have it in mind to validate that model on new data, data that was not used in the training of the model. Why is this last point important? Quite simply because if the pilot is testing his or her skills on the same storm in which he or she was trained, it’s hardly a test at all. He or she already knows the intricacies and details of that particular storm, so it is not really a test of new skills; it is more akin to a test of how well he or she remembers how to deal with that specific storm and (returning to our statistical discussion) capitalizes on chance factors. This is why if you are to cross-validate a model, it should be done on new “test” data, never the original training data.
If you do not cross-validate the model, you can generally expect its fit on the training data to be optimistic, such that the model will appear to fit “better” than it actually would on new data. This is the primary reason why cross-validation of a model is strongly encouraged. Either way, clear communication of your results is the goal: if you fit a model to training data and do not cross-validate it on test data, inform your audience of this so they know what you have done. If you do cross-validate it, likewise inform them. Hence, in this respect, it is not “essential” that you cross-validate immediately, but what is essential is that you are honest and open about what you have done with your data and clearly communicate to your readers why the current estimates of model fit are likely to be a bit inflated due to not immediately testing the model on new data. In your next study, if you are able to collect a sample from the same population, evaluate your model on the new data to see how well it fits. That will give you a more honest assessment of how good your model really is. For further details on cross-validation, see James et al. (2013), and for a more thorough and deeper theoretical treatment, see Hastie et al. (2009).
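The distinction between training fit and test fit can be sketched in a few lines of Python. This is an illustration only, not a procedure used in this book: the data are simulated, the “true” relationship (y = 2 + 0.5x plus noise), the 30/10 split, and the helper names `fit_line` and `mse` are all assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mse(a, b, xs, ys):
    """Mean squared error of the line y = a + b*x on the given data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Simulated sample of 40 cases from an assumed population.
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# "Split the sample": 30 cases for training, 10 held out for testing.
train_x, test_x = xs[:30], xs[30:]
train_y, test_y = ys[:30], ys[30:]

# Fit on the training data only, then evaluate on both sets.
a, b = fit_line(train_x, train_y)
train_mse = mse(a, b, train_x, train_y)
test_mse = mse(a, b, test_x, test_y)
print(f"training MSE: {train_mse:.3f}   test MSE: {test_mse:.3f}")
```

Because least squares chooses the line that minimizes error on the training cases, no other line, not even the true one, can have a smaller training error. That is the sense in which training fit is optimistic. The test error is computed on cases that played no role in the fitting; in any single small sample it may come out higher or lower than the training error, but on average it is the more honest figure.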

      1.10 Where We Are Going From Here: How to Use This Book
