Feature Engineering

Encountering a new dataset for the first time can be overwhelming. You may be presented with hundreds or thousands of features without even a brief description of each. What do you do first? This is where feature engineering comes in.

A good first step is to build a ranking with a feature utility metric, which is a function that measures the association between a feature and the target. You can then focus on a smaller set of the most useful features to develop first, with more confidence that your time is being well spent[1].

The metric we’ll be using is called “mutual information.” Mutual information is similar to correlation in that it measures the relationship between two quantities. Its advantage over correlation is that it can detect any kind of relationship, whereas correlation detects only linear relationships[2].
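
The difference can be illustrated with a small synthetic experiment. The sketch below is only an illustration under stated assumptions: the data are made up (a target that depends on the square of the feature), "correlation" is taken to mean the Pearson coefficient, and mutual information is estimated with scikit-learn's `mutual_info_regression`[4].

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (an assumption for illustration): the target depends on
# the feature through a purely nonlinear, symmetric relationship.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(scale=0.01, size=x.size)

# Pearson correlation only captures the linear part of the relationship.
pearson_r = np.corrcoef(x, y)[0, 1]

# Mutual information (nearest-neighbor estimate, in nats) also picks up
# the nonlinear dependence.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r:          {pearson_r:.3f}")
print(f"Mutual information: {mi:.3f}")
```

On data like this, the correlation should come out close to zero even though the feature almost completely determines the target, while the mutual information estimate is clearly positive.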

Mutual information is an excellent general-purpose metric for feature development[3], and it is especially valuable at the start of the process, when you may not yet know which model you want to use. It is:

• simple to use and understand,

• computationally efficient,

• theoretically well-founded,

• resistant to overfitting, and

• capable of detecting any type of relationship.

Mutual Information and What It Measures

Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between two quantities measures the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target?
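
The idea of "reduction in uncertainty" can be made concrete with the standard identity I(X; Y) = H(Y) - H(Y | X), where H is the Shannon entropy. The following is a minimal sketch using only the Python standard library; the joint probability table is hypothetical and chosen purely for illustration.

```python
from math import log2

# Hypothetical joint distribution p(x, y) of a binary feature X and a
# binary target Y (the values are made up for illustration).
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

def marginal(joint, axis):
    """Sum the joint distribution over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {value: prob}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Uncertainty about the target before seeing the feature.
h_y = entropy(p_y)

# Average uncertainty about the target once the feature value is known:
# H(Y | X) = sum over x of p(x) * H(Y | X = x).
h_y_given_x = sum(
    p_x[x] * entropy({y: joint.get((x, y), 0.0) / p_x[x] for y in p_y})
    for x in p_x
)

# Mutual information is the reduction in uncertainty.
mi = h_y - h_y_given_x  # about 0.278 bits for this table

print(f"H(Y) = {h_y:.3f} bits, H(Y|X) = {h_y_given_x:.3f} bits, MI = {mi:.3f} bits")
```

Here the target starts with 1 bit of uncertainty, and knowing the feature removes roughly 0.28 bits of it. An MI of zero would mean the feature tells you nothing about the target.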

Keep the following in mind when using mutual information:

• MI can help you understand the relative potential of a feature as a predictor of the target, considered on its own.

• A feature may be very informative when combined with other features, yet not very informative by itself. MI cannot detect interactions between features; it is a univariate metric.

• The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. Just because a feature has a high MI score does not mean your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
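
The point about interactions can be seen in a small synthetic case. The sketch below assumes a made-up target that is the exclusive-or (XOR) of two binary features, and uses `mutual_info_score` from scikit-learn[4], which computes MI between two discrete variables in nats. The combined feature `pair` is a hypothetical transformation introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Synthetic data (an assumption for illustration): two independent binary
# features whose XOR defines the target.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=5000)
b = rng.integers(0, 2, size=5000)
target = a ^ b

# Scored one at a time, each feature looks uninformative.
mi_a = mutual_info_score(a, target)
mi_b = mutual_info_score(b, target)

# Encoding the pair (a, b) as a single categorical feature exposes the
# relationship that the univariate scores miss.
pair = 2 * a + b
mi_pair = mutual_info_score(pair, target)

print(f"MI(a; target)      = {mi_a:.4f} nats")
print(f"MI(b; target)      = {mi_b:.4f} nats")
print(f"MI((a,b); target)  = {mi_pair:.4f} nats")
```

Each feature alone should score close to zero, while the combined feature scores close to the full entropy of the target. A low univariate MI score is therefore not proof that a feature is useless.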

Example in Python[4-5]

A Python program to calculate feature scores based on correlation
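
A minimal sketch of such a program follows. It rests on several assumptions: it uses the copy of the Breast Cancer Wisconsin (Diagnostic) dataset[5] bundled with scikit-learn[4] (`load_breast_cancer`), scores each feature with `mutual_info_classif`, and reports the absolute Pearson correlation with the class label alongside for comparison. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Load the dataset bundled with scikit-learn (assumed to correspond to [5]).
data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

# Mutual information between each feature and the class label (in nats).
mi = mutual_info_classif(X, y, random_state=0)

# Absolute Pearson correlation between each feature and the class label.
corr = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)

# Rank features from highest to lowest MI score.
order = np.argsort(mi)[::-1]
ranking = [(names[j], mi[j], corr[j]) for j in order]

# Show the ten highest-scoring features.
for name, mi_score, corr_score in ranking[:10]:
    print(f"{name:<25s} MI={mi_score:.3f}  |r|={corr_score:.3f}")
```

The MI estimate is based on nearest neighbors and involves a small amount of randomness, so `random_state` is fixed here to make the ranking reproducible. The scores are best read as a relative ordering of features rather than as absolute quantities.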

References:

[1] Alweshah, M., et al. (2020). The monarch butterfly optimization algorithm for solving feature selection problems. Neural Computing and Applications, 1-15.

[2] Sahoo, S., et al. (2020). Classification of spammer and nonspammer content in online social network using genetic algorithm-based feature selection. Enterprise Information Systems, 14(5), 710-736.

[3] Sahoo, S. R., et al. (2020). Popularity-based detection of malicious content in Facebook using machine learning approach. In First International Conference on Sustainable Technologies for Computational Intelligence (pp. 163-176). Springer, Singapore.

[4] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

[5] UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) (accessed August 2021).
