Test Sets and Training Sets

In the classification section, we were introduced to classifiers that can label data points with classes, but how do we know whether a classifier does better than random guessing? How well would it perform on new data? These questions motivate the idea of a training set and a test set, which involves breaking the data into two groups:

Training set: The purpose of the training set is to build and improve your model. This set of data should include the true labels, so that the model can learn to predict the true categories.

Test set: The purpose of the test set is to check whether your trained model generalizes, rather than being specifically attuned to the training set you passed in. During testing, the true classes are hidden from the classifier; its predictions are then compared with the true classes to compute the proportion it gets right.

After training the classifier with the training set and checking its accuracy on the test set, we can use the proportion of correctly classified results as a measure of how well the classifier will perform on new data.
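As a minimal sketch of what that measure looks like (in Python, with made-up predictions and labels purely for illustration), the score is just a proportion:

```python
# A minimal sketch: accuracy is the proportion of test points the
# classifier labels correctly. These predictions and labels are made up.
predictions = ["apple", "peach", "apple", "orange", "peach"]
true_labels = ["apple", "peach", "orange", "orange", "peach"]

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)
print(accuracy)  # 0.8 -> 4 of the 5 test points were classified correctly
```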

Think about what having a bad classifier might mean. In a medical setting, for example, people with life-threatening diseases could be told they don't have the disease, while people without those diseases could be misdiagnosed as having them.

Train and Test Split

To get a training and test set, we shuffle the data to randomize it and then do a train-test split. This split is important because if we trained and tested on the same data, the model could simply memorize how to classify that data correctly. For example, imagine that a student needed to take a multiple choice exam and somehow had the solutions. If they just memorized the letters corresponding to the answers and answered accordingly, they could get 100% on that particular test. That said, if the professor realized the student had the solutions and changed the order of the possible answers, the memorization approach would immediately fail. The student only memorized the order, not the underlying material; a classifier tested on its own training set is in the same position.

Never train on your test set! This means that when you divide your data into a training and testing set, there should be NO overlap!
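Below is a minimal sketch of a shuffle-then-split in Python with NumPy, assuming a placeholder array whose rows are the labeled points and using an 80/20 split (one of the conventions discussed next). Because the split is done by slicing a single shuffled array, no row can end up in both sets.

```python
import numpy as np

# A minimal sketch of shuffle-then-split, using placeholder data.
rng = np.random.default_rng(seed=42)      # fixed seed so the split is reproducible
data = np.arange(100).reshape(50, 2)      # placeholder: 50 points, 2 features each

shuffled = rng.permutation(data)          # shuffle the rows to randomize
cutoff = int(0.8 * len(shuffled))         # 80% of the points go to training

train_set = shuffled[:cutoff]             # first 80% of the shuffled rows
test_set = shuffled[cutoff:]              # remaining 20%: no overlap with train_set
```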

Train-Test Split Conventions

Typically, we split with one of the following guides:

  • 75% of data in training set, 25% in test set

  • 80% of data in training set, 20% in test set

  • 90% of data in training set, 10% in test set

Notice that all of these splits have much more than 50% of the data in the training set! Why?

In these examples the test set is much smaller proportionally, but the test set still needs to be large enough to generate meaningful results and be representative of the underlying data.

For now, you should choose between these (or similar train-test splits, based on the size of your data). For larger data sets, you can err on the side of putting more into your test set, because with plenty of training points there is less worry about underfitting the model.

The Process

Given data and a potential classifier whose performance you want to evaluate, a general process goes as follows (sketched in code after the list):

  1. Shuffle your data

  2. Split into training and test set by the conventions above

  3. Use your training set to train your classifier

  4. Use your trained classifier on your test set

  5. See how well your classifier scores on the test set
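Here is a minimal sketch of those five steps in Python, assuming scikit-learn with a k-nearest-neighbors classifier standing in for whatever classifier you want to test, and scikit-learn's built-in iris data as a placeholder data set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data; swap in your own features and labels.
features, labels = load_iris(return_X_y=True)

# Steps 1 and 2: shuffle and split (train_test_split shuffles by default).
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

# Step 3: train the classifier on the training set only.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Steps 4 and 5: classify the test set and score the proportion correct.
accuracy = classifier.score(X_test, y_test)
print(f"Proportion of test points classified correctly: {accuracy:.2f}")
```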

Over- and Underfitting

We circle back to the student with the upcoming exam. Earlier we mentioned that memorizing the answers is specific to a particular test and is not a strategy that carries over to other tests. On the flip side, what if the student had followed the canonical advice of "always choose C"? They would take the same approach on any test; in other words, the particular test in front of them has no impact on their answers. This strategy is also far from ideal, as they would presumably want to answer differently on a math test than on an English test. The first approach gives the intuition behind the dangers of overfitting your model, where the model changes every time the data is even slightly different, and the second gives the intuition behind underfitting, where the model makes such strong assumptions that it predicts the same thing regardless of the actual data. Both extremes will likely lead to many points lost on the test!

The optimal solution takes a little from each approach as a happy medium to reduce the points lost (and hopefully, in the real world, this maps to having a strong conceptual understanding of the material and guidelines for how to approach problems, rather than a mix of finding the solutions beforehand and always choosing one letter!). This problem does not necessarily have a neat solution, and in practice we rely on approximations to minimize the loss.

As a general note, choosing a train-test split and judging whether a model is over- or underfitting the data can get complicated; sophisticated libraries and experienced machine learning practitioners handle much of that work. That said, in practice with more complex data, we can import specialized libraries and split our data even further to find a good split and to tune other choices for the smallest error.
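As one example of that kind of further splitting (an illustration, not the only option), scikit-learn's cross-validation helper repeatedly re-splits the data so that each part takes a turn as the test set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data and classifier, as in the earlier sketch.
features, labels = load_iris(return_X_y=True)
classifier = KNeighborsClassifier(n_neighbors=5)

# Split the data into 5 folds; each fold takes a turn as the test set
# while the other 4 are used for training, and the 5 scores are averaged.
scores = cross_val_score(classifier, features, labels, cv=5)
print(scores.mean())
```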

Garbage In, Garbage Out

Your classifier is only going to be as good as the data it's given. Let's circle back to our fruit classifier. Say, instead of our data looking like this...

...it looked like this.

fruit_type | skin_texture | skin_color | shape
orange     | fuzzy        | pink       | round
apple      | smooth       | red        | oblong
peach      | rough        | orange     | round
apple      | smooth       | yellow     | round

If we input a new fruit that is orange, round, and rough-skinned, our classifier will categorize it as a peach, even though such a fruit is almost certainly an orange.
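To make that concrete, here is a hypothetical nearest-match classifier in Python, trained on the garbage table above; it simply predicts the fruit type of the training row that shares the most feature values with the new fruit:

```python
# A hypothetical nearest-match classifier trained on the garbage table above:
# it predicts the fruit_type of the training row that shares the most
# feature values with the new fruit.
training_rows = [
    {"fruit_type": "orange", "skin_texture": "fuzzy",  "skin_color": "pink",   "shape": "round"},
    {"fruit_type": "apple",  "skin_texture": "smooth", "skin_color": "red",    "shape": "oblong"},
    {"fruit_type": "peach",  "skin_texture": "rough",  "skin_color": "orange", "shape": "round"},
    {"fruit_type": "apple",  "skin_texture": "smooth", "skin_color": "yellow", "shape": "round"},
]

def classify(new_fruit):
    def matches(row):
        # Count how many of the new fruit's feature values this row shares.
        return sum(row[feature] == value for feature, value in new_fruit.items())
    return max(training_rows, key=matches)["fruit_type"]

# A fruit that is really an orange: rough, orange-colored, and round.
print(classify({"skin_texture": "rough", "skin_color": "orange", "shape": "round"}))
# -> 'peach', because the garbage training data taught the classifier
#    that rough, orange, round fruit are peaches.
```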

Garbage In, Garbage Out has real-life applications. In 1986, St George's Hospital Medical School was found guilty of discriminating against women and ethnic minorities when screening the CVs of applicants. As many as 60 out of 2,000 applicants were denied an interview based on their names, which served as an indication of gender and racial background. The classifier the hospital used for initial screening of applicants was unfairly classifying these people as unqualified. The reason? The program was written after careful analysis of how professionals made admissions decisions, and those professionals carried implicit biases against women and ethnic minorities.
