The resulting data may contain missing values, which will cause problems because most statistical methods do not handle them well.
Data may be missing for various reasons. Some of these reasons are random, while others are not. Several possible causes of missing values are listed in the following table.
Improper data entry or collection may lead to missing values
|1|The data entry is improperly done.|
|2|The client did not provide the information.|
|3|The acquiring service did not provide the information.|
|4|The internal data processing failed to keep the information.|
If the data are missing due to a programming error, the error is systematic and, therefore, not random. Fixing the error and re-processing the data may resolve the issue. Often, however, this is not feasible because the data are not recoverable; in these cases, the missing data must be managed.
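One rough way to look for systematic gaps is to compare the share of missing values across groups, such as processing runs. The sketch below is only an illustration under stated assumptions: the records, the `batch` and `amount` field names, and the helper `missing_rate_by_group` are all hypothetical, and `None` stands for a missing value.

```python
from collections import defaultdict

# Hypothetical records: `batch` identifies a processing run, None marks a missing value.
rows = [
    {"batch": "A", "amount": 10.0},
    {"batch": "A", "amount": 12.5},
    {"batch": "B", "amount": None},
    {"batch": "B", "amount": None},
    {"batch": "B", "amount": 7.0},
]


def missing_rate_by_group(rows, group_key, field):
    """Return the fraction of rows with a missing `field`, per value of `group_key`."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row[group_key]] += 1
        if row[field] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(rows, "batch", "amount")
```

A missing rate concentrated in one group, as in batch `B` here, hints that the gaps may be systematic rather than random, although it does not prove it.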
Approaches to manage missing values
The traditional ways to manage missing values are as follows:
- drop the cases that have missing information;
- impute the missing values with the average;
- impute the missing values with expectation-maximization;
- carry out multiple imputation.
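The first two approaches can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the toy records, the field names `age` and `income`, and the helpers `drop_incomplete` and `impute_mean` are hypothetical, and `None` stands for a missing value. Expectation-maximization and multiple imputation need a statistical model and are not shown.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 45, "income": None},
    {"age": 29, "income": 61000.0},
]


def drop_incomplete(rows):
    """Listwise deletion: keep only the rows that have no missing field."""
    return [row for row in rows if all(value is not None for value in row.values())]


def impute_mean(rows, field):
    """Return copies of the rows with missing `field` values set to the observed mean."""
    observed = [row[field] for row in rows if row[field] is not None]
    if not observed:
        raise ValueError(f"no observed values for {field!r}")
    fill = mean(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


complete_cases = drop_incomplete(records)
imputed = impute_mean(impute_mean(records, "age"), "income")
```

Dropping cases keeps only two of the four records here, while mean imputation keeps all four but reduces the variance of the imputed fields, because every filled value sits exactly at the mean.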
- Wikipedia. Imputation, 2017.
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.