How to Improve Your Predictive Model
With machine learning libraries now so widely available, building a predictive model is much simpler than it used to be. First, load your data into a data structure, then do some preprocessing, and finally, depending on the type of prediction problem you have (classification or regression), pick an algorithm from a free machine learning library (e.g. scikit-learn or pylearn) to train your model. As a result, you will have a predictive model that can make predictions on new, unseen data. The most important question at this point is: what else can I do to make my results better? In this blog post, I will give some tips on techniques that can help you improve your results.
Judge Your Data First, Not Your Algorithm:
Check The Number of Samples
Increase the number of samples in your data. In general, the larger the sample size, the more accurate the predictive model, especially if you are trying to learn a nonlinear model.
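One way to check whether more samples are likely to help is a learning curve. The sketch below assumes scikit-learn is installed and uses a synthetic dataset and a logistic regression model purely as stand-ins for your own data and algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your own data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train on 10%..100% of each training fold and score on the held-out fold.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(size):4d} training samples -> mean CV accuracy {score:.3f}")
```

If the validation score is still climbing at the largest training size, collecting more samples is probably worth the effort. If it has flattened out, the other tips in this post are more likely to pay off.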
Use Resampling Techniques
If you don't have enough data samples and more are not easily accessible, try to generate samples artificially with the same statistical characteristics as your available data. This can be done by augmenting or permuting the existing samples, or by sampling from a probabilistic model fitted to them.
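As a rough sketch of two of these options, the snippet below uses NumPy only. The small "real" dataset is made up, and the Gaussian model and the jitter scale of 0.05 are assumptions chosen for illustration. For a classification problem you would typically do this separately for each class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small "real" dataset: 50 samples, 2 features (made up for illustration).
X = rng.normal(loc=[5.0, -2.0], scale=[1.0, 0.5], size=(50, 2))

# Option 1: augmentation, resample existing rows and add a little noise.
idx = rng.integers(0, len(X), size=200)
X_jittered = X[idx] + rng.normal(scale=0.05, size=(200, 2))

# Option 2: fit a simple probabilistic model (a Gaussian) and sample from it.
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
X_synthetic = rng.multivariate_normal(mean, cov, size=200)

print("original mean: ", mean)
print("synthetic mean:", X_synthetic.mean(axis=0))
```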
Remove Bad Samples
You can safely remove corrupted samples from your dataset. Corrupted samples are the ones that come from a bad measurement, and they can degrade the model's performance. In addition, it is a good idea to identify the outlier samples in your data and remove them as well. Outliers are samples that fall a long way outside the distribution of the other samples and can mislead the training process. For example, in a normal distribution, outliers may be values far out on the tails of the distribution.
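Here is a minimal sketch of both steps on a single feature. The readings are hypothetical, and the 1.5 * IQR rule is just one common convention for flagging outliers, not the only one.

```python
import numpy as np

# Hypothetical sensor readings: one failed measurement (NaN) and one outlier.
values = np.array([10.1, 9.8, 10.3, np.nan, 9.9, 10.0, 55.0, 10.2])

# Step 1: drop corrupted samples (here, non-finite values).
clean = values[np.isfinite(values)]

# Step 2: drop outliers with the 1.5 * IQR rule.
q1, q3 = np.percentile(clean, [25, 75])
iqr = q3 - q1
mask = (clean >= q1 - 1.5 * iqr) & (clean <= q3 + 1.5 * iqr)
kept = clean[mask]

print(kept)  # the NaN and the 55.0 reading are gone
```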
Improve The Data Distribution
If some classes are heavily over- or under-represented, you might need to over-sample the rare cases or under-sample the common ones so that your samples follow a more balanced distribution.
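A simple version of over-sampling is to draw minority-class samples with replacement until both classes have the same size. The toy data below is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw extra minority samples with replacement until the classes match.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_balanced))  # class counts after over-sampling
```

Resample only the training portion of your data. If duplicated samples leak into the test set, the evaluation will look better than it really is.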
Scale The Data Features
Normalisation and standardisation of the data features can boost the performance of the predictive model when the learning algorithm uses a distance-based strategy (e.g. a KNN classifier) or weighted inputs (e.g. neural network classifiers) to train.
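To see the effect on a distance-based method, the sketch below compares a KNN classifier with and without standardisation. It assumes scikit-learn and uses its bundled wine dataset, whose features have very different scales, as an example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# KNN on the raw features.
raw_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# KNN on standardised features; the pipeline fits the scaler on each
# training fold only, so nothing leaks from the validation fold.
scaled_scores = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5
)

print(f"raw:    {raw_scores.mean():.3f}")
print(f"scaled: {scaled_scores.mean():.3f}")
```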
Select Just Important Features
Note that not every feature in a dataset carries useful information for discriminating between samples; some features are redundant or irrelevant, which can degrade performance. In addition, removing such features reduces the dimensionality of the data and consequently helps to avoid overfitting.
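One simple option is univariate feature selection. The sketch below assumes scikit-learn and a synthetic dataset in which only 3 of the 20 features are informative. The ANOVA F-test score and k=3 are illustrative choices, since in practice you rarely know the right number of features in advance.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 3 of which carry class information.
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Keep the 3 features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("new shape:", X_selected.shape)
```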
Think About Your Prediction Problem One More Time
Let me start with an example. Imagine that you have a binary classification problem and your data is imbalanced: one class (e.g. the negative samples) has far fewer samples than the other. In this situation, instead of trying to fix the problem with your data (here, the imbalance), you can treat your classification problem as an anomaly detection problem, which needs just one class for training and detects the other class as anomalies. So, if you don't have suitable data for a given prediction problem, think about the problem again and try to see it as a different problem that your data fits better.
Check Your Evaluation Methods:
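As a sketch of this reframing, the snippet below trains a one-class SVM on the well-represented class only and flags everything that looks different as an anomaly. It assumes scikit-learn. The two Gaussian clusters and the value nu=0.05 are made up for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Plenty of samples from the well-represented class ...
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
# ... and only a handful from the rare class, far away from the rest.
X_rare = rng.normal(loc=8.0, scale=0.3, size=(10, 2))

# Train on the well-represented class only.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

# predict() returns +1 for inliers and -1 for anomalies.
pred_normal = detector.predict(X_normal)
pred_rare = detector.predict(X_rare)

print("flagged among normal samples:", int((pred_normal == -1).sum()))
print("flagged among rare samples:  ", int((pred_rare == -1).sum()))
```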
Use a Cross-Validation Technique
Cross-validation is a method for evaluating a predictive model, and it is more informative than looking at residuals alone. Residual evaluation only tells you about the difference between each observed sample and its predicted value, while cross-validation gives you an indication of how well the model will do when it is asked to make predictions for data it has not already seen. Using K-fold or leave-one-out cross-validation is highly recommended.
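A minimal K-fold example, assuming scikit-learn, with the bundled iris dataset and a logistic regression model as placeholders for your own data and algorithm:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: each sample is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("accuracy per fold:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

For leave-one-out cross-validation, you can pass `LeaveOneOut()` from the same module as `cv` instead, at the cost of training one model per sample.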
Use the Right Evaluation Metric
The right evaluation metric depends heavily on the type of machine learning problem you are solving: is it regression or classification? Among all the metrics for classification problems, classification accuracy is the one that is most often misused, because it is only meaningful when each class has roughly the same number of observations, which is a very rare situation. Therefore, try to use a metric that is suitable for your problem. It can be a ROC curve or a confusion matrix for a classification problem, or the R^2 metric for a regression problem.
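The following sketch shows why accuracy can mislead on imbalanced data. The labels are made up: 95 negatives, 5 positives, and a useless "classifier" that always predicts the negative class. The metric functions come from scikit-learn.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Made-up labels: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A "classifier" that always predicts the negative class.
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
rec = recall_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("accuracy:", acc)                 # looks great
print("confusion matrix:\n", cm)        # but every positive is missed
print("recall (positive class):", rec)
print("balanced accuracy:", bal_acc)
```

The accuracy is 0.95 even though the model never finds a single positive sample, which the confusion matrix and the recall make obvious.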
Use a Baseline to Compare Your Results
For a classification algorithm, a good baseline is one that uses the label of the class with the largest number of observations as the predicted label for every unseen sample. If the problem is regression, you can use the mean or the median of the target variable as the prediction for all unseen data.
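scikit-learn ships these baselines as `DummyClassifier` and `DummyRegressor`. The sketch below uses made-up targets, and the features are all zeros because the baselines ignore them anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Features are ignored by the dummy models, so zeros are enough here.
X = np.zeros((100, 1))

# Classification baseline: always predict the most frequent class.
y_class = np.array([0] * 80 + [1] * 20)
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
baseline_accuracy = baseline_clf.score(X, y_class)

# Regression baseline: always predict the median of the training target.
X_reg = np.zeros((5, 1))
y_reg = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
baseline_reg = DummyRegressor(strategy="median").fit(X_reg, y_reg)
baseline_prediction = baseline_reg.predict(X_reg[:1])[0]

print("baseline accuracy:", baseline_accuracy)
print("baseline regression prediction:", baseline_prediction)
```

Any real model should beat these numbers clearly. If it does not, it has not learned anything useful from the features.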
Check Your Algorithm, It's Time Now:
Switch To Use a Non-Linear Method
Linear problems are the ones in which the data is linearly separable based on the labels. Most real-world data is not 2D or 3D, so it is difficult to visualise and see whether it is linearly separable or not. The best approach is therefore to start with a linear method (such as Naive Bayes or an SVM with a linear kernel for classification problems, and linear regression for regression problems), as these methods don't need a large amount of data, are easy to understand and are fast to train. However, if your results are not good enough, it is better to switch to a non-linear method, such as a multi-layer perceptron for classification problems or support vector regression with an RBF kernel for regression problems.
Check Your Algorithm Configuration
I am not talking about tuning the parameters, which I will discuss later. I am talking about the minimum configuration that every machine learning method needs in order to work well.
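Here is a small regression sketch of that switch, assuming scikit-learn. The noisy sine data is made up so that the relationship is clearly non-linear, and the SVR is left at its default settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Made-up non-linear data: a noisy sine wave.
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Start with the linear method ...
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
# ... then try the non-linear one on the same split.
svr_r2 = SVR(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear regression R^2: {linear_r2:.3f}")
print(f"RBF-kernel SVR R^2:    {svr_r2:.3f}")
```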
Check the Academic Literature
Find a good academic paper (check that it has been published in a highly ranked conference or journal) and see what type of algorithm the authors employed for solving your problem, or whether they proposed any extension to your current algorithm.
Tune Your Algorithm's Parameters
In machine learning, we have two classes of parameters. The first class contains the parameters whose values are learned during training of the model, and the second contains the ones that have to be set before training starts. The parameters in the second class are called hyperparameters, and they differ from algorithm to algorithm. For example, for an SVM classifier with an RBF kernel, the regularisation constant C and the kernel coefficient gamma are two hyperparameters that have to be tuned in advance. Several optimization algorithms can be employed for hyperparameter tuning, such as grid search, random search, Bayesian optimization and gradient-based optimization.
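A grid search over C and gamma for an RBF-kernel SVM might look like the sketch below. It assumes scikit-learn. The iris dataset and the candidate values in the grid are arbitrary choices for illustration, not recommended defaults.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values for the two hyperparameters (illustrative only).
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
}

# Every (C, gamma) combination is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

Grid search tries every combination, so its cost grows quickly with the number of hyperparameters. `RandomizedSearchCV` from the same module samples a fixed number of combinations instead.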
Combine Predictions Using Ensemble Methods
Ensembles combine several classifiers with the goal of improving performance, e.g. in terms of accuracy, compared to the best single classifier. Ensemble methods can be heterogeneous or homogeneous: the former combine different types of classifiers and the latter combine classifiers of the same type. One example of a homogeneous ensemble method is Random Forest, where the predictions of many decision trees are combined by majority voting. The decision tree classifiers are built on bootstrap replicates of the training set; this type of ensemble method is called Bagging (Bootstrap AGGregatING). Another type of ensemble method is Boosting. In Bagging, every sample has the same probability of appearing in a new data set. In Boosting, however, the observations are weighted, so some of them will take part in the new sets more often.
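To try both families side by side, the sketch below compares a Random Forest (bagging) with AdaBoost (boosting) using scikit-learn. AdaBoost is my choice of boosting example here, the breast cancer dataset is just a convenient bundled dataset, and the number of estimators is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: each tree is trained on a bootstrap replicate of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest_scores = cross_val_score(forest, X, y, cv=5)

# Boosting: each new learner puts more weight on the samples that the
# previous learners got wrong.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted_scores = cross_val_score(boosted, X, y, cv=5)

print(f"Random Forest mean CV accuracy: {forest_scores.mean():.3f}")
print(f"AdaBoost mean CV accuracy:      {boosted_scores.mean():.3f}")
```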