jDataLab


Predictive learning is a process in which a model is trained on predictor attributes and then used to predict a continuous value or a categorical label for a variable of interest. Because the model learns from instances whose outcomes are already known, predictive learning is a form of supervised learning.

Classification predicts a categorical label. Regression predicts a continuous quantitative value.

A practical predictive learning pipeline consists of the following steps:

  • Data Collection
  • Data Inspection
  • Data Cleaning
  • Data Partition
  • Model Building and Selection
  • Model Evaluation
  • Model Improvement
  • Model Deployment
  • Model Integration

Data Collection

First, determine the problem, i.e., the specific variable to be predicted. Next, construct a dataset of samples (instances), each of which has a value for the predicted variable and for the other, regular variables. Variables are usually stored in columns and samples in rows.

Note: Every sample in the dataset must have an accompanying value for the predicted variable. In other words, the dataset must be labeled.

1. Data Inspection

After the dataset has been prepared, inspect each variable for its data type, summary statistics, distribution, missing values and outliers. It is very important to assign each variable the correct data type; wrong attribute types lead to erroneous learning results and, in turn, to misclassification errors.
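As a rough illustration of this inspection step, the sketch below summarizes a single column using only the Python standard library. The function name `inspect_column`, the use of `None` to mark a missing entry, and the toy columns are assumptions made for the example, not part of any particular tool.

```python
import statistics

def inspect_column(values):
    """Summarize one column: inferred type, missing count, and basic statistics."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary["type"] = "numeric"
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
    else:
        # Anything that is not purely numeric is treated as categorical here.
        summary["type"] = "categorical"
        summary["levels"] = sorted(set(map(str, present)))
    return summary

# Hypothetical toy columns; None marks a missing entry.
ages = [23, 35, None, 41, 29]
colors = ["red", "blue", "red", None, "green"]
age_summary = inspect_column(ages)
color_summary = inspect_column(colors)
```

A summary like this makes a wrongly typed variable easy to spot, for example a numeric code that should really be treated as a category.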

2. Data Cleaning

Real data often lacks quality and is not well conditioned for learning. At this step we handle incorrect and inconsistent data entries, missing values and outliers.
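One possible way to handle missing values and outliers is sketched below. Median imputation and the 1.5 × IQR rule are common conventions chosen here for illustration; the text does not prescribe them, and the function names and the `incomes` column are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical column with one missing entry and one suspicious value.
incomes = [42, 38, None, 45, 40, 400, 39]
filled = impute_median(incomes)
outliers = iqr_outliers(filled)
```

Whether a flagged value should be corrected, removed or kept depends on the data; the sketch only identifies candidates.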

3. Data Partition

Before building the learning model, partition the dataset into two or more subsets for use in the later model evaluation step. The subsets serve two stages of model building: training and testing.

We train a model on the training set and test it on the test set. The data in the test set is never used in building the model.

When data is in limited supply, several partition schemes help make the most of it, including holdout, k-fold cross validation, leave-one-out and bootstrap.

Partition Schemes

  • Holdout: Reserve part of the data for testing, commonly one third, and use the remaining two thirds for training.

  • k-fold Cross Validation: Partition the data into k equally sized subsets; each in turn is used for testing and the remainder for training. Ten (10) is about the right number of folds to get the best estimate of error, a choice that is backed by some theoretical evidence. It is standard procedure to repeat 10-fold cross validation 10 times and average the results.

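  • Leave-one-out: Each instance in turn is left out, the model is trained on the remaining instances and then tested on the one left out. This is k-fold cross validation with k equal to the number of instances.

  • Bootstrap: Sample the dataset with replacement to form a training set; the instances that are never selected form the test set.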
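The partition schemes above can be sketched as index-based splits in plain Python. The function names and the fixed seeds are assumptions made for the example; in the bootstrap function, using the never-drawn instances as the test set is a common convention assumed here.

```python
import random

def holdout_split(n, test_fraction=1/3, seed=0):
    """Shuffle indices 0..n-1 and reserve a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)

def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_split(n, seed=0):
    """Draw n indices with replacement for training; unused indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

train, test = holdout_split(30)
folds = list(k_fold_splits(30, k=10))
loo = list(k_fold_splits(30, k=30))  # leave-one-out is k-fold with k = n
boot_train, boot_test = bootstrap_split(30)
```

Working with indices rather than rows keeps the splits independent of how the dataset itself is stored.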

For models that involve parameter optimization, three sets are used. The training data is used to construct the learners. The validation data is used to optimize the learners' parameters or to select one learner. The test data is used to calculate the error rate or success rate.
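A minimal sketch of the three-set idea follows, assuming a one-dimensional k-nearest-neighbor learner whose parameter k is chosen on the validation set. The learner, the candidate values (1, 3, 5) and the toy data are illustrative assumptions, not something the text prescribes.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def success_rate(train, data, k):
    """Fraction of (x, label) pairs in data that the k-NN learner labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Hypothetical (feature, label) pairs; (3.4, "b") is a deliberately noisy point.
train = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (4.0, "a"), (3.4, "b"),
         (6.0, "b"), (7.0, "b"), (8.0, "b"), (9.0, "b")]
validation = [(3.3, "a"), (6.6, "b"), (1.4, "a"), (8.3, "b")]
test = [(2.4, "a"), (7.6, "b"), (5.3, "b"), (4.8, "a")]

# The validation set selects the parameter; the test set is used only once, at the end.
best_k = max((1, 3, 5), key=lambda k: success_rate(train, validation, k))
test_rate = success_rate(train, test, best_k)
```

Because the test set plays no part in choosing k, the final rate is not biased by the parameter search.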

4. Model Building and Model Selection

At this step, we select a learning scheme that suits the prediction type (continuous or categorical), build a predictive model from the training set and, where applicable, validate the model on the validation set.

5. Model Evaluation

To evaluate how good the model is, apply it to the test set and measure its success rate, i.e., the fraction of test instances it predicts correctly.

6. Model Improvement

Repeat the previous two steps, for example under k-fold cross validation, until we obtain a model that meets the required performance.
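The build-and-evaluate loop might look like the sketch below, which scores candidate learners by repeated k-fold cross validation and keeps the first one that reaches a target success rate. The two learners, the target of 0.9 and the toy data are hypothetical choices made for illustration.

```python
import random
from collections import Counter

def majority_learner(train):
    """Baseline: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_mean_learner(train):
    """Learn one mean per class; predict the class whose mean is closest to x."""
    sums, counts = Counter(), Counter()
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(means, key=lambda y: abs(means[y] - x))

def cross_validate(data, learner, k=10, repeats=10, seed=0):
    """Average success rate over `repeats` rounds of k-fold cross validation."""
    rng = random.Random(seed)
    rates = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for i in range(k):
            test = shuffled[i::k]
            train = [row for j, row in enumerate(shuffled) if j % k != i]
            model = learner(train)
            rates.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(rates) / len(rates)

# Hypothetical, well-separated data: ten "low" and ten "high" instances.
data = [(float(x), "low") for x in range(10)] + [(float(x), "high") for x in range(20, 30)]

target = 0.9  # assumed performance requirement
selected = None
for name, learner in [("majority", majority_learner), ("nearest mean", nearest_mean_learner)]:
    if cross_validate(data, learner) >= target:
        selected = name
        break
```

In practice the candidates would be different learning schemes or parameter settings, and the loop would stop once one of them meets the required performance.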

7. Model Deployment

Apply the model directly to unlabeled data. For each instance, the model should produce a value for the label.

8. Model Integration

The model can also be integrated into a computer application to augment its analytic functions.