Confronting a new dataset for the first time can be overwhelming. You may be presented with hundreds or thousands of features without even a brief description of each. Where do you start? This is where feature engineering comes in.
A good first step is to rank the features with a feature utility metric, a function that measures the association between a feature and the target. You can then choose a smaller set of the most useful features to develop first, and be more confident that your time is well spent[1].
The metric we’ll be using is called “mutual information.” Like correlation, mutual information measures the relationship between two quantities. Its advantage over correlation is that it can detect any kind of relationship, whereas correlation detects only linear ones[2].
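The difference between the two metrics can be illustrated with a small synthetic sketch. It assumes scikit-learn's `mutual_info_regression` estimator and an invented quadratic relationship; the variable names and the data are illustrative and not taken from any real dataset.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic, assumed data: y depends on x through a symmetric (non-linear) curve.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(scale=0.01, size=x.size)

# Pearson correlation only captures the linear part of the relationship.
pearson_r = np.corrcoef(x, y)[0, 1]

# Mutual information is estimated with a nearest-neighbour method;
# scikit-learn expects a 2-D feature matrix, hence the reshape.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson_r:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

In this setup the correlation should come out close to zero while the mutual information estimate is clearly positive, because knowing x tells you almost everything about y even though the relationship is not linear.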
When developing new features[3], mutual information is an excellent general-purpose metric. It is especially valuable at the start of the process, when you may not yet know which model you want to use. Mutual information is:
• easy to use and interpret,
• computationally efficient,
• theoretically well-founded,
• resistant to overfitting, and
• able to detect any kind of relationship.
Mutual Information and What It Measures
Mutual information (MI) describes relationships in terms of uncertainty. The MI between two quantities measures the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target?
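The idea of "reduction in uncertainty" can be written down using the standard information-theoretic definition for discrete variables. The symbols are introduced here for illustration and do not appear in the text above: X is a feature, Y is the target, H denotes entropy, and p denotes a probability distribution.

```latex
I(X; Y) = H(Y) - H(Y \mid X)
        = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x)\, p(y)}
```

MI is never negative, and it equals zero exactly when X and Y are independent, that is, when knowing the feature tells you nothing about the target.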
Several considerations should be kept in mind when employing mutual information:
• MI can help you understand the relative potential of a feature as a predictor of the target, considered on its own.
• A feature can be very informative when it interacts with other features, yet not so informative by itself. MI cannot detect interactions between features. It is a univariate metric.
• The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. Just because a feature has a high MI score does not mean your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
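The univariate limitation can be sketched with a synthetic XOR example. Everything here is an assumption made for illustration: the binary features `a` and `b`, the target defined as their XOR, and the hypothetical engineered column `a_xor_b`. The scores come from scikit-learn's `mutual_info_classif` with the features marked as discrete.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic, assumed data: two independent binary features whose
# XOR defines the target, so neither is informative by itself.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=5000)
b = rng.integers(0, 2, size=5000)
target = a ^ b

# Third column: a hypothetical engineered feature that makes the interaction explicit.
a_xor_b = a ^ b
X = np.column_stack([a, b, a_xor_b])

scores = mutual_info_classif(X, target, discrete_features=True, random_state=0)

for name, score in zip(["a", "b", "a_xor_b"], scores):
    print(f"{name:8s} MI = {score:.4f}")
```

Scored individually, `a` and `b` should receive MI values near zero even though together they determine the target completely. Only the combined feature scores highly, which is why a low MI score alone is not enough to discard a feature.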
Example in Python[4-5]
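What follows is a minimal sketch rather than a definitive implementation. It assumes the example is meant to use scikit-learn[4] with the Breast Cancer Wisconsin (Diagnostic) data[5], a copy of which ships with scikit-learn as `load_breast_cancer`. The choice of `mutual_info_classif`, the fixed `random_state`, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Load the bundled copy of the Breast Cancer Wisconsin (Diagnostic) dataset.
data = load_breast_cancer()
X, y = data.data, data.target

# Estimate the MI between each feature and the (binary) diagnosis target.
# The estimator is randomized, so random_state is fixed for repeatability.
scores = mutual_info_classif(X, y, random_state=0)

# Rank the features from most to least informative.
ranking = sorted(zip(data.feature_names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranking[:10]:
    print(f"{name:25s} {score:.3f}")
```

The resulting ranking is a starting point for deciding which features to examine first. As noted above, a high score does not guarantee that a particular model can exploit the feature, and a low score does not rule out useful interactions.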
References:
[2] Sahoo, S. et al. (2020). Classification of spammer and nonspammer content in online social network using genetic algorithm-based feature selection. Enterprise Information Systems, 14(5), 710-736.
[3] Sahoo, S. R., et al. (2020). Popularity-based detection of malicious content in Facebook using machine learning approach. In First International Conference on Sustainable Technologies for Computational Intelligence (pp. 163-176). Springer, Singapore.
[4] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[5] UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) Accessed: August 2021.