Feature engineering is the process of taking unrefined raw data and converting it into meaningful features that your model can understand better, giving it a better decision boundary. To give an example, suppose you are a retailer and you want to target customers who you think will not do business with you again. You have your customer information and your customer transaction information, and you take this raw data and create features that capture the recency and frequency of each customer's visits. Recency can be today's date minus the date the customer last made a purchase. Frequency can be the number of times the customer purchased from you in the last 7 days, the last 14 days, the last 30 days, the last 60 days, and so on. By feeding such features to the model, the model may be better able to predict whether the customer will come back and purchase from you or not. That is what transforming raw data into meaningful insight means.
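The recency and frequency features described above can be sketched as follows. This is a minimal illustration with made-up data; the column names (customer_id, purchase_date), the reference date, and the choice of strictly-greater-than window cutoffs are all assumptions, not part of the original example.

```python
# Sketch of recency/frequency features from a raw transactions table.
# All column names and dates here are illustrative assumptions.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2024-05-01", "2024-05-20", "2024-05-28", "2024-04-01", "2024-05-25"]
    ),
})
today = pd.Timestamp("2024-06-01")

features = transactions.groupby("customer_id").agg(
    last_purchase=("purchase_date", "max"),
    total_purchases=("purchase_date", "count"),
)
# Recency: today's date minus the date of the last purchase.
features["recency_days"] = (today - features["last_purchase"]).dt.days

# Frequency: number of purchases in the last 7 / 14 / 30 days.
for window in (7, 14, 30):
    cutoff = today - pd.Timedelta(days=window)
    recent = transactions[transactions["purchase_date"] > cutoff]
    features[f"freq_{window}d"] = (
        recent.groupby("customer_id").size().reindex(features.index, fill_value=0)
    )

print(features[["recency_days", "freq_7d", "freq_14d", "freq_30d"]])
```

Each row of `features` is now one customer, ready to feed to a churn model alongside the label of whether that customer returned.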
Now, where can feature engineering come from? It can arise out of your domain understanding: you may already have some understanding of your business, and you want to incorporate that understanding as features in your model. It can come from your data analysis or exploratory data analysis phase as well: while exploring the raw data, you may find some insight that you can convert into features. Sometimes features can also come from an external data provider. For example, suppose you are trying to predict whether a particular customer will default. You have your own customer information, but you may also use a third-party provider who can give you more information about the customer, such as the customer's external delinquency rate or whether the customer has filed for bankruptcy.
There are two steps in feature engineering. The first step is more algorithm specific; you can also call it the data pre-processing step. Most algorithms expect the data to be in good shape in order to work correctly and efficiently. Some models are more sensitive to outliers. Some models work better if the data is scaled; for example, gradient descent converges faster when the features are scaled. Most algorithms require categorical values to be numerically encoded, and some specifically require those encoded categories to be one-hot encoded, so that no artificial order is introduced into the data.
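The two pre-processing needs mentioned above, scaling a numeric feature and one-hot encoding a categorical one, can be sketched like this. The data and column names (income, segment) are illustrative assumptions.

```python
# Sketch of algorithm-specific pre-processing: scaling plus one-hot encoding.
# The toy data and column names are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, 45_000, 60_000, 120_000],
    "segment": ["bronze", "silver", "gold", "gold"],
})

# Standardize income to zero mean / unit variance, so that a
# gradient-descent-based model converges faster.
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

# One-hot encode the category so the model sees no artificial order
# (bronze < silver < gold would be implied by a plain integer encoding).
encoded = pd.get_dummies(df, columns=["segment"])
print(encoded.columns.tolist())
```

Note that tree-based models are largely insensitive to monotonic scaling, so which of these steps you need depends on the algorithm you pick.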
To give an example, take this particular scenario: the chart shows data points with an outlier in them, and if you look at the regression line, the outlier distorts it. The slope is pulled towards the outlier, and the model has a high residual error. Whereas after the outlier treatment, once the outliers are removed, the data fits the line much better. So if your model is sensitive to outliers, you may want to do an outlier treatment.

The second part is feature engineering from domain understanding. This can be a feature that represents time aggregates or events. It can be the customer's behavior pattern or the customer journey that led to your business. It can be a count, frequency, or ratio of a particular entity that you are trying to model. It can be bucketing your data in such a way that a nonlinear relationship becomes linear, so that your model can understand the data better. There are countless scenarios in which you can do feature engineering, and there is no predefined or single best way; it all depends on your creativity and curiosity when you start analyzing your data. The first benefit of feature engineering is that you can keep your model as simple as possible: even a simple algorithm with the right set of engineered features can give a pretty high lift in performance. The second is better explainability: since you know what features you have created, you can build a more explainable model. The third is that you can remove unwanted bias: for example, some scenarios may be underrepresented in the data, and you can handle that through feature engineering so that the model's outcome is not biased.
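The outlier effect on the regression line can be reproduced numerically: fitting a least-squares line with and without one extreme point shows how much the slope is pulled towards the outlier. The data here is made up for illustration.

```python
# Sketch of how a single outlier distorts a least-squares fit.
# The data is synthetic: y roughly follows y = x, except the last point.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 30.0])  # last point is an outlier

# Degree-1 polyfit returns (slope, intercept).
slope_with, _ = np.polyfit(x, y, 1)          # fit including the outlier
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)  # fit after outlier removal

# The slope with the outlier is several times steeper than the
# roughly unit slope recovered once the outlier is removed.
print(slope_with, slope_without)
```

After the outlier is dropped, the fitted slope returns to about 1, matching the underlying trend of the remaining points, which is exactly the before/after picture the chart illustrates.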