Thursday, January 23, 2020

What is Feature Engineering?

Feature engineering is the process of taking unrefined raw data and converting it into meaningful features that your model can understand better and that help it find a better decision boundary. To give an example, suppose you are a retailer and you want to target customers who you think will not do business with you again. You have your customer information and your customer transaction information, and from this raw data you create features that capture the recency and frequency of each customer's purchases. Recency can be today's date minus the date of the customer's last purchase. Frequency can be the number of times the customer purchased from you in the last 7 days, the last 14 days, the last 30 days, the last 60 days, and so on. By creating such features and feeding them to the model, the model may be better able to predict whether the customer will come back and purchase from you or not. That is what transforming raw data into meaningful insight means.
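As a rough illustration, here is a minimal pandas sketch of the recency and frequency features described above. The column names (customer_id, purchase_date) and the reference date are assumptions made up for the example.

import pandas as pd

# Hypothetical raw transaction data: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "purchase_date": pd.to_datetime([
        "2020-01-20", "2020-01-05", "2020-01-22",
        "2020-01-15", "2019-12-30", "2019-11-01",
    ]),
})

today = pd.Timestamp("2020-01-23")

# Recency: days since the customer's most recent purchase.
recency = (today - transactions.groupby("customer_id")["purchase_date"].max()).dt.days

# Frequency: number of purchases in the last 7 / 30 / 60 days.
def purchases_in_last(days):
    recent = transactions[transactions["purchase_date"] >= today - pd.Timedelta(days=days)]
    return recent.groupby("customer_id").size()

features = pd.DataFrame({
    "recency_days": recency,
    "freq_7d": purchases_in_last(7),
    "freq_30d": purchases_in_last(30),
    "freq_60d": purchases_in_last(60),
}).fillna(0)

print(features)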
Now, coming back to feature engineering: features can arise out of your domain understanding. You may already understand something about your business and want to incorporate that understanding as features in your model. Features can also come out of the data analysis or exploratory data analysis phase; while exploring the raw data you may find an insight that you can turn into a feature. Sometimes features can come from an external data provider as well. For example, suppose you are trying to predict whether a particular customer will default; you have your own customer information, but you may also use an external third-party provider who can give you more information about the customer, such as the customer's external delinquency rate or whether the customer has filed for bankruptcy. So features can also come from an external data provider.

There are two steps in feature engineering. The first step is more algorithm specific; you can also call it the data pre-processing step. Most algorithms expect the data to be in good shape in order to work correctly and efficiently. Some models are more sensitive to outliers. Some models work better if the data is scaled; for example, the gradient descent algorithm converges faster when the features are scaled. Most algorithms require categorical values to be numerically encoded, and some algorithms further require those encoded categories to be one-hot encoded so that no artificial order is implied in the data.
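A minimal scikit-learn and pandas sketch of these two pre-processing steps, assuming a toy dataset with one numeric and one categorical column:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "engine_size": [1.6, 2.4, 3.0, 5.7],            # numeric feature to scale
    "fuel_type": ["gas", "diesel", "gas", "gas"],    # categorical feature to encode
})

# Scale the numeric column (helps gradient-descent-based models converge faster).
scaled = StandardScaler().fit_transform(df[["engine_size"]])

# One-hot encode the categorical column so no artificial order is implied.
encoded = pd.get_dummies(df["fuel_type"], prefix="fuel")

print(scaled)
print(encoded)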

To give an example, consider a chart of data points that contains an outlier. Because of the outlier, the regression line is distorted: the slope is pulled towards the outlier and the model has a high residual error. After the outlier treatment, when the outlier is removed, the data fits the line much better. So if your model is sensitive to outliers, you may want to do an outlier treatment.

The second part is feature engineering from domain understanding. Such features can represent time aggregates or events, the customer behaviour pattern or the customer journey that led to your business, or a count, frequency or ratio of the entity you are trying to model. It can also mean bucketing your data in such a way that a non-linear relationship becomes linear, so that your model can understand the data better. There are countless scenarios in which you can do feature engineering, and there is no predefined or single best way; it all depends on your creativity and curiosity when you start analyzing your data.

The benefits of feature engineering are threefold. First, you can keep your model as simple as possible: even a simple algorithm with the right set of engineered features can give a pretty big lift in performance. Second, you get better explainability, since you know what features you have created and why. Third, you can remove unwanted bias: underrepresented scenarios in the data, for example, can be handled through feature engineering so that the model's outcome is not biased.
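As a rough sketch of two of the techniques mentioned above (outlier treatment and bucketing), assuming a single made-up numeric column:

import pandas as pd

df = pd.DataFrame({"income": [25_000, 40_000, 52_000, 61_000, 75_000, 1_200_000]})

# Outlier treatment: clip values that fall outside 1.5 * IQR.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_clipped"] = df["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Bucketing: turn a continuous value into coarse bands so that a
# non-linear relationship can be captured by a simple model.
df["income_band"] = pd.cut(df["income_clipped"], bins=3, labels=["low", "mid", "high"])

print(df)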

Thursday, October 10, 2019

Permutations and Combinations

Permutations
When the order does matter, it is called a Permutation. In other words, a Permutation is an ordered Combination.
(A "combination lock" should really be called a "permutation lock", because the order you enter the numbers matters!)

There are basically two types of permutation:
  • Repetition is Allowed: such as the lock above. It could be "00000".
  • No Repetition: for example the first three people in a running race. You can't be first and second.

1. Permutations with Repetition
These are the easiest to calculate. When a thing has n different types... we have n choices each time!

For example: choosing 3 of those things, the permutations are:
n × n × n
(n multiplied 3 times)

More generally: choosing r of something that has n different types, the permutations are:
n × n × ... (r times)

So, the formula is:
n^r
(where n is the number of things to choose from, and we choose r of them; repetition is allowed and order matters)
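A quick sanity check of the n^r rule in Python, using a hypothetical 3-dial lock with 10 digits per dial:

# Permutations with repetition: choosing r of n types gives n ** r possibilities.
n, r = 10, 3          # e.g. a 3-dial lock, digits 0-9 on each dial
print(n ** r)         # -> 1000 possible codes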

2. Permutations without Repetition
In this case, we have to reduce the number of available choices each time.

Example: what order could 16 pool balls be in?
After choosing, say, number "14" we can't choose it again.

So, our first choice has 16 possibilities, and our next choice has 15 possibilities, then 14, 13, 12, 11, ... etc. And the total permutations are:
16 × 15 × 14 × 13 × ... = 20,922,789,888,000

 But maybe we don't want to choose them all, just 3 of them, and that is then:
16 × 15 × 14 = 3,360

In other words, there are 3,360 different ways that 3 pool balls could be arranged out of 16 balls.

Without repetition, our choices get reduced each time.

So, the formula is:
n! / (n − r)!
Here n is the number of things to choose from, and we choose r of them; no repetitions, order matters.
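A small Python check of this formula against the pool-ball example above:

from math import factorial

n, r = 16, 3
print(factorial(n) // factorial(n - r))   # -> 3360, matching 16 x 15 x 14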


Combinations
When the order doesn't matter, it is a Combination.

There are also two types of combinations (remember the order does not matter now):

  • Repetition is Allowed: such as coins in your pocket (5,5,5,10,10)
  • No Repetition: such as lottery numbers (2,14,15,27,30,33)

1. Combinations with Repetition
Let us say there are five flavours of ice cream: banana, chocolate, lemon, strawberry and vanilla.

We can have three scoops. How many variations will there be?

Let's use letters for the flavours: {b, c, l, s, v}. Example selections include
{c, c, c} (3 scoops of chocolate)
{b, l, v} (one each of banana, lemon and vanilla)
{b, v, v} (one of banana, two of vanilla)

So, what about our example, what is the answer? The formula for combinations with repetition is
(r + n − 1)! / (r! (n − 1)!)
and with n = 5 flavours and r = 3 scoops this gives 7! / (3! × 4!) = 35. So there are 35 ways of having 3 scoops from five flavours of ice cream.
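The same count can be verified by brute force in Python, listing every possible multiset of 3 scoops from the 5 flavours:

from itertools import combinations_with_replacement

flavours = ["b", "c", "l", "s", "v"]
scoops = list(combinations_with_replacement(flavours, 3))
print(len(scoops))   # -> 35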

2. Combinations without Repetition
Let's say we just want to know which 3 pool balls are chosen, not the order. We already know that 3 out of 16 gave us 3,360 permutations. But many of those are the same to us now, because we don't care about the order!
For example, let us say balls 1, 2 and 3 are chosen. These are the possibilities:
1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, 3 2 1
So, the permutations have 6 times as many possibilities (3! = 6 ways of ordering 3 balls).
So we adjust our permutations formula to reduce it by how many ways the objects could be in order:
n! / (r! (n − r)!)
That formula is so important it is often just written in big parentheses as the binomial coefficient "n choose r", where n is the number of things to choose from, and we choose r of them; no repetition, order doesn't matter.
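Checking the pool-ball example: 3,360 ordered choices divided by the 3! = 6 orderings gives 560 unordered combinations. In Python:

from math import comb, factorial

n, r = 16, 3
print(factorial(n) // (factorial(r) * factorial(n - r)))   # -> 560
print(comb(n, r))                                          # built-in shortcut (Python 3.8+)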

Mean vs Variance vs Standard Deviation

Mean: The mean is the average of the numbers.
Variance: Variance is the average of the squared deviations from the mean.
Standard Deviation: Standard Deviation is the square root of the variance.

Definition of Variance
In statistics, variance is defined as the measure of variability that represents how far members of a group are spread out. It finds the average degree to which each observation varies from the mean. When the variance of a data set is small, it shows the closeness of the data points to the mean, whereas a greater value of variance indicates that the observations are widely dispersed around the arithmetic mean and from each other.

Definition of Standard Deviation
Standard deviation is a measure that quantifies the amount of dispersion of the observations in a dataset. A low standard deviation indicates that the scores are close to the arithmetic mean, while a high standard deviation indicates that the scores are dispersed over a wider range of values.

Key Differences Between Variance and Standard Deviation
The difference between standard deviation and variance can be drawn clearly on the following grounds:
  1. Variance is a numerical value that describes the variability of observations from its arithmetic mean. Standard deviation is a measure of dispersion of observations within a data set.
  2. Variance is nothing but an average of squared deviations. On the other hand, the standard deviation is the root mean square deviation.
  3. Variance is denoted by sigma-squared (σ2) whereas standard deviation is labelled as sigma (σ).
  4. Variance is expressed in squared units, which are usually larger than the values in the given dataset, whereas standard deviation is expressed in the same units as the values in the data set.
  5. Variance measures how far individuals in a group are spread out. Conversely, standard deviation measures how much the observations of a data set differ from their mean.

Illustration

Mean: x̄ = (Σ xᵢ) / n
Variance: σ² = Σ (xᵢ − x̄)² / n
Standard Deviation: σ = √( Σ (xᵢ − x̄)² / n )
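A small Python example tying the three together, using the population forms of the formulas above on a made-up list of numbers:

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)                                  # -> 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)     # -> 4.0
std_dev = variance ** 0.5                                     # -> 2.0

print(mean, variance, std_dev)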



Monday, September 30, 2019

Simple Linear regression


Let's take a look at this dataset. It's related to the Co2 emission of different cars. It includes Engine size, Cylinders, Fuel Consumption and Co2 emissions for various car models. The question is: given this dataset, can we predict the Co2 emission of a car using another field, such as Engine size? Quite simply, yes! We can use linear regression to predict a continuous value such as Co2 Emission by using other variables. Linear regression is the approximation of a linear model used to describe the relationship between two or more variables. In simple linear regression, there are two variables: a dependent variable and an independent variable. The key point in linear regression is that the dependent value should be continuous and cannot be a discrete value. However, the independent variable(s) can be measured on either a categorical or continuous measurement scale.
There are two types of linear regression models. They are simple regression and multiple regression. Simple linear regression is when one independent variable is used to estimate a dependent variable. For example, predicting Co2 emission using the EngineSize variable. When more than one independent variable is present, the process is called multiple linear regression.


To understand linear regression, we can plot our variables here. We show Engine size as an independent variable and Emission as the target value that we would like to predict. A scatterplot clearly shows the relationship between variables where changes in one variable "explain" or possibly "cause" changes in the other variable. Also, it indicates that these variables are linearly related.
With linear regression, you can fit a line through the data. For instance, as the EngineSize increases, so do the emissions. With linear regression, you can model the relationship between these variables.
A good model can be used to predict what the approximate emission of each car is.

We're going to predict the target value, y. In our case, we use the independent variable "Engine Size," represented by x1. The fit line is traditionally written as a polynomial. In a simple regression problem (a single x), the form of the model is ŷ = θ0 + θ1·x1. In this equation, ŷ is the dependent variable or the predicted value, and x1 is the independent variable; θ0 and θ1 are the parameters of the line that we must adjust. θ1 is known as the "slope" or "gradient" of the fitted line and θ0 is known as the "intercept." θ0 and θ1 are also called the coefficients of the linear equation. You can interpret this equation as ŷ being a function of x1, or ŷ being dependent on x1. Now the questions are: "How would you draw a line through the points?" And, "How do you determine which line 'fits best'?"
Linear regression estimates the coefficients of the line. This means we must calculate θ0 and θ1 to find the best line to ‘fit’ the data. This line would best estimate the emission of the unknown data points. Let’s see how we can find this line, or to be more precise, how we can adjust the
parameters to make the line the best fit for the data.
For a moment, let's assume we've already found the best fit line for our data. Now, let's go through all the points and check how well they align with this line. Best fit, here, means that if we have, for instance, a car with engine size x1 = 5.4 and actual Co2 = 250, its Co2 should be predicted very close to the actual value of y = 250, based on historical data. But if we use the fit line, or better said, use our polynomial with known parameters to predict the Co2 emission, it will return ŷ = 340.
Now, if you compare the actual value of the emission of the car with what we predicted using our model, you will find that we have a 90-unit error. This means our prediction line is not accurate. This error is also called the residual error. So, we can say the error is the distance from the data point to the fitted regression line. The mean of all squared residual errors shows how poorly the line fits the whole dataset; mathematically it is expressed by the mean squared error, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)². Our objective is to find a line where the mean of all these errors is minimized. In other words, the mean error of the prediction using the fit line should be minimized. Let's re-word it more technically. The objective of linear regression is to minimize this MSE equation, and to minimize it, we should find the best parameters, θ0 and θ1. Now, the question is, how do we find θ0 and θ1 in such a way that this error is minimized? How can we find such a perfect line? Or, said another way, how should we find the best parameters for our line? Should we move the line around randomly, calculate the MSE value every time, and choose the minimum one?
Not really! Actually, we have two options here: Option 1 - we can use a mathematical approach. Or, Option 2 - we can use an optimization approach.
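As an aside, here is what residual errors and the MSE look like in code. The numbers are toy values; only the first pair (actual 250, predicted 340) comes from the example in the text:

# Residual errors and mean squared error for a set of predictions.
y_actual = [250, 200, 310]       # hypothetical actual Co2 values
y_predicted = [340, 210, 300]    # hypothetical predictions from the fit line

residuals = [ya - yp for ya, yp in zip(y_actual, y_predicted)]
mse = sum(r ** 2 for r in residuals) / len(residuals)
print(residuals, mse)            # the first residual corresponds to the 90-unit error from the text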

Let's see how we can use a mathematical formula to find θ0 and θ1. As mentioned before, θ0 and θ1 in simple linear regression are the coefficients of the fit line. We can use a simple set of equations to estimate these coefficients. That is, given that it's a simple linear regression with only 2 parameters, and knowing that θ0 and θ1 are the intercept and slope of the line, we can estimate them directly from our data. It requires that we calculate the mean of the independent column and of the dependent (target) column from the dataset. Notice that all of the data must be available to traverse and calculate the parameters. It can be shown that the intercept and slope can be calculated using these equations:

θ1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
θ0 = ȳ − θ1·x̄

We can start off by estimating the value for θ1. This is how you can find the slope of a line based on the data. x̄ is the average value of the engine size in our dataset; please consider that we have 9 rows here, row 0 to 8. First, we calculate the average of x1 and the average of y. Then we plug them into the slope equation to find θ1. The xᵢ and yᵢ in the equation refer to the fact that we need to repeat these calculations across all values in our dataset, and i refers to the i-th value of x or y. Applying all values, we find θ1 = 39; it is our second parameter. It is then used to calculate the first parameter, the intercept of the line. Plugging θ1 into the line equation, it is easily calculated that θ0 = 125.74. So, these are the two parameters of the line, where θ0 is also called the bias coefficient and θ1 is the coefficient for the EngineSize column. As a side note, you really don't need to remember the formulas for calculating these parameters, as most of the libraries used for machine learning in Python, R, and Scala can easily find them for you. But it's always good to understand how it works.
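A small sketch of this closed-form calculation in Python. The engine-size and Co2 numbers are made up for illustration; only the form of the equations matches the text:

# Closed-form estimate of theta0 (intercept) and theta1 (slope)
# for simple linear regression, using made-up (x, y) pairs.
x = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]    # hypothetical engine sizes
y = [196, 221, 136, 255, 244, 230, 232, 255, 267]     # hypothetical Co2 emissions

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

theta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
theta0 = y_bar - theta1 * x_bar

print(theta0, theta1)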
Now, we can write down the polynomial of the line. So, we know how to find the best fit for our data, and its equation. Now the question is: "How can we use it to predict the emission of a new car based on its engine size?"
After we have found the parameters of the linear equation, making predictions is as simple as solving the equation for a specific set of inputs. Imagine we are predicting Co2 Emission (y) from EngineSize (x) for the automobile in record number 9. Our linear regression model representation for this problem would be ŷ = θ0 + θ1·x1, or, mapped to our dataset, Co2Emission = θ0 + θ1·EngineSize. As we saw, we can find θ0 and θ1 using the equations that we just talked about.
Once found, we can plug in the equation of the linear model.
For example, let's use θ0 = 125 and θ1 = 39. So, we can rewrite the linear model as Co2Emission = 125 + 39 × EngineSize. Now, let's plug in the 9th row of our dataset and calculate the Co2 Emission for a car with an EngineSize of 2.4. So Co2Emission = 125 + 39 × 2.4.
Therefore, we can predict that the Co2 Emission for this specific car would be 218.6.
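In practice you would rarely compute the coefficients by hand; a library such as scikit-learn does it for you. A minimal sketch, reusing the hypothetical numbers from the earlier snippet:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2.0], [2.4], [1.5], [3.5], [3.7]])    # hypothetical engine sizes
y = np.array([196, 221, 136, 255, 267])               # hypothetical Co2 emissions

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])                # theta0 and theta1
print(model.predict([[2.4]]))                          # predicted Co2 for EngineSize = 2.4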


Wednesday, September 25, 2019

Regression Models

The regression model is a powerful method that allows you to examine the relationship between two or more variables of interest. In other words, regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If your independent variable is time, then you are forecasting future values; otherwise, your model is predicting present but unknown values.

Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent variable(s) (predictors). This technique is used for forecasting, time series modelling and finding the causal-effect relationship between variables.

Regression techniques range from Linear Regression to SVR and Random Forest Regression.

Look at this dataset. It's related to Co2 emissions from different cars. It includes Engine size, number of Cylinders, Fuel Consumption and Co2 emission from various automobile models. The question is, "Given this dataset, can we predict the Co2 emission of a car using other fields, such as EngineSize or Cylinders?" Let’s assume we have some historical data from different cars and assume that a car, such as in row 9, has not been manufactured yet, but we're interested in estimating its approximate Co2 emission, after production. Is it possible?
We can use regression methods to predict a continuous value, such as CO2 Emission, using some other variables. Indeed, Regression is the process of predicting a continuous value.
In regression, there are two types of variables: a dependent variable and one or more independent variables.
The dependent variable can be seen as the "state", "target" or "final goal" we study
and try to predict, and the independent variables, also known as explanatory variables, can be seen as the "causes" of those "states". The independent variables are shown conventionally by x, and the dependent variable is notated by y.

A regression model relates y, the dependent variable, to a function of x, i.e., the independent variables. The key point in regression is that the dependent value should be continuous and cannot be a discrete value. However, the independent variable or variables can be measured on either a categorical or continuous measurement scale. So, what we want to do here is to use the historical data of some cars, using one or more of their features, and from that data, make a model.
We use regression to build such a regression/estimation model. Then the model is used to predict the expected Co2 emission for a new or unknown car. Basically, there are 2 types of regression models: simple regression and multiple regression.
Simple regression is when one independent variable is used to estimate a dependent variable. It can be either linear or non-linear. For example, predicting Co2 emission using the variable EngineSize. The linearity of regression is based on the nature of the relationship between the independent and dependent variables.
When more than one independent variable is present, the process is called multiple regression. For example, predicting Co2 emission using EngineSize and the number of Cylinders in any given car. Again, depending on the relation between dependent and independent variables, it can be either linear or non-linear regression.
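As an illustration, here is a minimal scikit-learn sketch contrasting simple and multiple linear regression, with hypothetical EngineSize, Cylinders and Co2 values:

import numpy as np
from sklearn.linear_model import LinearRegression

engine_size = np.array([[1.6], [2.0], [2.4], [3.5]])   # hypothetical values
cylinders = np.array([[4], [4], [4], [6]])
co2 = np.array([150, 180, 210, 255])

# Simple regression: one independent variable.
simple = LinearRegression().fit(engine_size, co2)

# Multiple regression: more than one independent variable.
X_multi = np.hstack([engine_size, cylinders])
multiple = LinearRegression().fit(X_multi, co2)

print(simple.coef_, multiple.coef_)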

The following machine learning regression models are commonly used:
  • Simple Linear Regression
  • Multiple Linear Regression
  • Polynomial Regression
  • Support Vector for Regression (SVR)
  • Decision Tree Regression
  • Random Forest Regression

Wednesday, September 18, 2019

Understanding Deep Learning with TensorFlow playground


The TensorFlow playground can be used to illustrate that deep learning uses multiple layers of abstraction.

First, notice blue represents +1, orange represents -1, and white represents 0.

Let’s start with the default classification example. There are 4 datasets.
The four datasets: circular, 4 quadrants, 2 clusters, and a swirl

The datasets all have 2 input features and 1 output label. The 2 input features, X1 and X2, are represented by the coordinates. X1 is the horizontal axis and X2 is the vertical axis. You can infer that from the feature inputs below.
Graph of input features: X1 and X2



The output label is the color of the dots, blue (+1) or orange (-1).
Features: X1, and X2 the horizontal and vertical axes. The label: blue(+1) or orange(-1) dots.

The first 3 datasets can be solved with the default setting of 2 hidden layers. However, the 4th, the swirl dataset, cannot. When you click the play button, it actually trains a neural network that runs in your browser. The background color of the output changes from light shades (representing 0) to blue and orange patterns that illustrate what the network will predict for new input.
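For comparison, here is a rough Keras sketch of a small network with 2 hidden layers classifying 2D points, similar in spirit to the playground's default setup. The layer sizes, activations and training settings are assumptions for illustration, not the playground's exact configuration:

import numpy as np
import tensorflow as tf

# Toy 2D dataset: label 1 inside a circle, 0 outside,
# loosely resembling the playground's circular dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="tanh"),     # hidden layer 1
    tf.keras.layers.Dense(2, activation="tanh"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output: probability of class 1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, verbose=0)
print(model.evaluate(X, y, verbose=0))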

What is TensorFlow? Introduction, Architecture

What is TensorFlow?

Currently, the most famous deep learning library in the world is Google's TensorFlow. Google uses machine learning in all of its products to improve search, translation, image captioning and recommendations.

To give a concrete example, Google users can experience a faster and more refined search with AI. If the user types a keyword in the search bar, Google provides a recommendation about what the next word could be.

Google wants to use machine learning to take advantage of their massive datasets to give users the best experience. Three different groups use machine learning:

  • Researchers
  • Data scientists
  • Programmers

They can all use the same toolset to collaborate with each other and improve their efficiency.

Google does not just have any data; they have one of the world's most massive computing infrastructures, so TensorFlow was built to scale. TensorFlow is a library developed by the Google Brain Team to accelerate machine learning and deep neural network research.

It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has several wrappers in several languages like Python, C++ or Java.



History of TensorFlow

A couple of years ago, deep learning started to outperform all other machine learning algorithms when given massive amounts of data. Google saw it could use these deep neural networks to improve its services:
  • Gmail
  • Photos
  • Google search engine

They built a framework called TensorFlow to let researchers and developers work together on AI models. Once developed and scaled, it allows lots of people to use it.

It was first made public in late 2015, and the first stable version appeared in 2017. It is open source under the Apache 2.0 license. You can use it, modify it, and redistribute the modified version, even for a fee, without paying anything to Google.