Several data preprocessing steps are usually required before any machine learning algorithm can be applied to data. These steps are dictated by the nature of the available data and of the algorithms. Listed below are a few common cases where data preprocessing is required. Recall that, in this context, attributes are variables (columns in the data spreadsheet) and each row in a column holds a data value. A target or label attribute is the dependent variable being predicted.
- Data Scaling: Some algorithms, such as k-nearest neighbors, are sensitive to the scale of the data. If one attribute has a range spanning millions (of dollars, for example) while another ranges over a few tens, the larger-scale attribute will dominate the outcome. To eliminate or minimize such bias, we must normalize the data. Normalization means transforming all the attributes to a common range. This is easily achieved by, for instance, dividing each attribute by its largest value, which is called Range Normalization. Different software packages have different ways to achieve this. For example, in Dell Statistica 12 you can do this with ‘Standardisation’ in the data tab, whereas in RapidMiner it can be done with Normalization.
- Data Transformation: Some algorithms do not work on the value types of certain attributes. For example, if the target or label variable is categorical, you cannot use linear regression (or other models that expect a continuous target) to predict it. In this case you will need to transform the label attribute into a continuous variable. Conversely, if the label attribute is multinomial (more than two categories), you cannot use a classification algorithm such as standard logistic regression, which only works with binomial (yes/no, true/false) variables. In any such situation, data type transformations are required. For Statistica, see this example; in RapidMiner, it is called Normalization.
- Data set is too large or imbalanced: The first of these scenarios is quite common, and the entire field of big data analytics is dedicated to addressing these kinds of issues. However, big data infrastructure may not always be available and you may have to analyze data in-memory. If you know that the data quality is high, you may be better off simply sampling the data to reduce the computational expense. Sampling means that we use only a portion of the data set. The question then becomes which portion to use. The simplest solution is to randomly select rows from the data set. However, this may create bias in the testing and training samples if the random sample has either too few or too many examples from one of the outcome classes. To avoid this, you may want to stratify the sampling. This ensures that the sample has approximately the same proportion of classes as the full data set. A related problem is that of imbalanced data sets, which requires a separate discussion of its own. For the sake of brevity, one solution for RapidMiner is given here and here.
- Data has missing values: Values may be missing because some attributes could not be measured or because of errors. Some algorithms, such as neural networks or support vector machines, cannot handle missing data. In such situations, we have two options. The first is to replace missing values with the mean, median, minimum, maximum, or zero. A better option, however, is to impute the missing values: treat the attribute that has missing values as an intermediate target variable and predict the missing values using the examples for which values are not missing. Clearly, for either of these approaches to work, the proportion of missing values must be “small”. As a rule of thumb, attributes with more than a third of their values missing should be reconsidered for inclusion in the modeling process.
- Despite the growing capabilities of big data analytics, from an efficiency point of view it still makes sense to measure and weight the influence of different variables on the label attribute before building any models. The objective is to identify and rank the influence of each attribute on the label. It makes sense to remove attributes that have either a low influence on the target or a high correlation with other independent variables. Very little prediction accuracy is lost by removing such attributes. In Statistica, you can use correlation matrices in the Basic Statistics tab, whereas in RapidMiner you can refer to this blog post.
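To make three of the steps above more concrete (range normalization, replacing missing values with the mean, and stratified sampling), here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not how Statistica or RapidMiner implement these operations. The function names `range_normalize`, `impute_mean` and `stratified_sample` are hypothetical, missing values are assumed to be represented as `None`, and dividing by the largest absolute value (so that negative numbers are handled too) is a choice made here, not something the tools above prescribe.

```python
import random
from collections import defaultdict
from statistics import mean


def range_normalize(values):
    """Divide every value by the largest absolute value in the column."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        # An all-zero column cannot be rescaled; return a copy unchanged.
        return list(values)
    return [v / largest for v in values]


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones.

    Assumes at least one value in the column is observed.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        # Keep at least one example per class so no class disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Small usage example with made-up data.
incomes = [2_000_000, 500_000, 1_000_000]
print(range_normalize(incomes))  # [1.0, 0.25, 0.5]

ages = [30, None, 50]
print(impute_mean(ages))

rows = [(i, "no") for i in range(80)] + [(i, "yes") for i in range(20)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.1)
print(len(subset))
```

Because each class is sampled separately, the 80/20 split of the made-up labels is preserved in the 10% subset up to rounding, which plain random row selection does not guarantee.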
In most real-world data analytics situations we will encounter one or more of these scenarios. Poorly prepared or unprepared data will render the predictive models unreliable. While the time spent on any or all of these steps may seem high, it is justified: experienced data miners understand and live by the maxim that 80% of data mining is really data preparation.