ANOVA for Feature Selection in Machine Learning
Applications of ANOVA in Feature selection
The biggest challenge in machine learning is selecting the best features to train the model. We need only the features that are highly dependent on the response variable. But what if the response variable is continuous and the predictor is categorical?
ANOVA (Analysis of Variance) helps us complete the job of selecting the best features.
In this article, I will take you through
a. Impact of Variance
b. F-Distribution
c. ANOVA
d. One Way ANOVA with example
Impact of Variance
Variance is a measurement of the spread between numbers in a variable. It measures how far each number lies from the mean, and hence from every other number in the variable.
The variance of a feature determines how much it impacts the response variable. If the variance is low, the feature has little or no impact on the response, and vice versa.
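As a quick illustration, here is a minimal sketch of computing variance by hand; the sample values are made up:

```python
# A minimal sketch of variance: the average squared distance from the mean.
# Sample values below are made up for illustration.
def variance(xs, sample=True):
    mean = sum(xs) / len(xs)
    squared_devs = [(x - mean) ** 2 for x in xs]
    # Sample variance divides by n - 1 (one degree of freedom is spent
    # estimating the mean); population variance divides by n.
    n = len(xs)
    return sum(squared_devs) / (n - 1 if sample else n)

print(variance([2, 4, 4, 4, 5, 5, 7, 9]))                 # sample variance
print(variance([2, 4, 4, 4, 5, 5, 7, 9], sample=False))   # population variance
```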
F-Distribution
The F-distribution is a probability distribution generally used for the analysis of variance. It assumes the hypotheses:
H0: Two variances are equal
H1: Two variances are not equal
Degrees of Freedom
Degrees of freedom refers to the maximum number of logically independent values, i.e., values that are free to vary. In simple words, it is the total number of observations minus the number of independent constraints imposed on the observations.
Df = N − 1, where N is the sample size.
F-Value
It is the ratio of two chi-square distributions, each divided by its degrees of freedom:

F = (χ₁² / df₁) / (χ₂² / df₂)

For normal samples, χ² = (n − 1)s² / σ², so each term reduces to a sample variance over a population variance, and under H0 (equal population variances) the ratio simplifies to F = s₁² / s₂². In the real world we always deal with samples, so comparing the sample variances is effectively comparing the population variances.
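As a quick sketch of this idea, an F-value for two made-up samples can be computed directly as the ratio of their sample variances:

```python
# Sketch: an F-value for comparing the spread of two samples is the ratio
# of their sample variances. Both samples below are made up.
def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

a = [12, 15, 11, 14, 13]   # sample 1: tightly clustered
b = [22, 29, 20, 27, 24]   # sample 2: more spread out
f_value = sample_variance(b) / sample_variance(a)
print(round(f_value, 2))   # a value well above 1 suggests unequal variances
```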
In the figure above, we can observe that the shape of the F-distribution depends on its degrees of freedom.
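To see this dependence numerically, here is a small sketch using SciPy's F-distribution; the degree-of-freedom pairs are arbitrary choices:

```python
from scipy.stats import f  # the F-distribution

# The shape (and mean) of the F-distribution changes with its two
# degrees of freedom; the (df1, df2) pairs below are arbitrary.
for df1, df2 in [(1, 10), (5, 10), (20, 50)]:
    density_at_1 = f.pdf(1.0, df1, df2)  # density near the "equal variances" point
    print(f"df1={df1:2d}, df2={df2:2d}: "
          f"pdf(1) = {density_at_1:.3f}, mean = {f.mean(df1, df2):.3f}")
```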
ANOVA
Analysis of Variance is a statistical method used to check whether the means of two or more groups are significantly different from each other. It assumes the hypotheses:
H0: The means of all groups are equal.
H1: At least one group mean is different.
How does a comparison of means transform into a comparison of variances?
Consider two distributions and their behavior in the figure below.
From the figure, we can say that if the distributions overlap or are close, the grand mean will be similar to the individual means, whereas if the distributions are far apart, the grand mean and the individual means differ by a larger distance.
This difference reflects the variation between the groups, since the values in each group are different. So in ANOVA we compare between-group variability to within-group variability.
ANOVA uses the F-test to check whether there is any significant difference between the groups. If there is no significant difference between the groups, i.e., all group variances are equal, ANOVA's F-ratio will be close to 1.
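A minimal sketch with made-up groups shows this behavior using SciPy's `f_oneway`: near-identical groups give a small F-ratio, while a shifted group gives a large one:

```python
from scipy.stats import f_oneway  # SciPy's one-way ANOVA

# Three made-up groups with nearly the same mean: between-group variability
# is tiny, so the F-ratio stays small (well below any critical value).
similar = f_oneway([5, 6, 7, 5, 6], [6, 5, 7, 6, 5], [5, 6, 7, 5, 7])

# Shift the third group far away: between-group variability dominates.
different = f_oneway([5, 6, 7, 5, 6], [6, 5, 7, 6, 5], [15, 16, 17, 15, 16])

print(f"similar groups:   F = {similar.statistic:.2f}")
print(f"different groups: F = {different.statistic:.2f}")
```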
One Way ANOVA with example
- One Way ANOVA tests the relationship between a categorical predictor and a continuous response.
- Here we check whether the groups of the categorical feature have equal variance with respect to the continuous response.
- If there is equal variance between the groups, the feature has no impact on the response and need not be considered for model training.
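The selection rule in the bullets above can be sketched as follows; the data frame, column names, values, and 0.05 threshold are all made-up assumptions for illustration:

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data frame (values made up): `grade` is the continuous response;
# `guardian` and `activities` are candidate categorical predictors.
df = pd.DataFrame({
    "guardian":   ["mother", "father", "other"] * 6,
    "activities": ["yes", "no"] * 9,
    "grade":      [14, 11, 13, 15, 12, 12, 13, 10, 14,
                   16, 11, 13, 15, 12, 12, 14, 10, 14],
})

# Keep a categorical feature only when ANOVA rejects equal group means.
results = {}
for feature in ["guardian", "activities"]:
    samples = [grp["grade"].tolist() for _, grp in df.groupby(feature)]
    f_stat, p_value = f_oneway(*samples)
    results[feature] = (f_stat, p_value)
    verdict = "keep" if p_value < 0.05 else "drop"
    print(f"{feature}: F = {f_stat:.2f}, p = {p_value:.4f} -> {verdict}")
```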
Let’s consider a school dataset containing data about students’ performance. We have to predict the final grade of each student based on features like age, guardian, study time, failures, activities, etc.
Using One Way ANOVA, let us determine whether the guardian has any impact on the final grade. Below is the data.
We can see the guardian categories (mother, father, other) as columns and the students’ final grades in rows.
Steps to perform One Way ANOVA
- Define Hypothesis
- Calculate the Sum of Squares
- Determine degrees of freedom
- F-value
- Accept or Reject the Null Hypothesis
Define Hypothesis
H0: All levels or groups in guardian have equal variance
H1: At least one group is different.
Calculate the Sum of Squares
The sum of squares is a statistical technique used to determine the dispersion of data points. It is a measure of deviation and can be written as the sum of the squared distances of the observations from a reference mean.
As stated above, we perform the F-test by comparing the variance between the groups to the variance within the groups. This is done using sums of squares, defined as follows.
Total Sum of Squares
The distance of each observed point x from the grand mean x̄ is x − x̄. If you calculate this distance for each data point, square each distance, and add up all the squared distances, you get SS_total = Σ (x − x̄)².
Between-Group Sum of Squares
The distance of each group mean g from the grand mean x̄ is g − x̄. Proceeding as for the total sum of squares, and weighting each squared distance by the group size n_g, we get SS_between = Σ n_g (g − x̄)².
Within-Group Sum of Squares
The distance of each observed value x within a group from its group mean g is x − g. Proceeding as for the total sum of squares, we get SS_within = Σ (x − g)².
The total sum of squares decomposes as SS_total = SS_between + SS_within.
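This decomposition can be verified numerically with a minimal sketch (the grade samples below are made up, not the article's dataset):

```python
# Made-up grade samples per guardian group (not the article's dataset),
# used to check SS_total = SS_between + SS_within numerically.
groups = {
    "mother": [14, 15, 13, 16],
    "father": [11, 12, 10, 11],
    "other":  [13, 12, 14, 13],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = sum(all_values) / len(all_values)

ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)
ss_within = sum(
    (x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g
)

print(f"SS_total    = {ss_total:.4f}")
print(f"SS_b + SS_w = {ss_between + ss_within:.4f}")  # matches SS_total
```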
Determine degrees of freedom
We already discussed the definition of degrees of freedom; now we will calculate it for between groups and within groups.
- Since we have 3 groups (mother, father, other), the degrees of freedom for between groups is 3 − 1 = 2.
- Having 18 samples in each group, the degrees of freedom for within groups is the sum of the degrees of freedom of all groups, i.e., (18 − 1) + (18 − 1) + (18 − 1) = 51.
F-value
Since we are comparing the variance between the groups to the variance within the groups, the F-value is given as
F = (SS_between / df_between) / (SS_within / df_within) = MS_between / MS_within
Calculating the sums of squares and the F-value, here is the summary.
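Putting the steps together, here is a minimal sketch computing the F-value from the sums of squares and degrees of freedom; the grades are made up and use 6 students per group instead of the article's 18, so the resulting F differs from the article's value:

```python
# Made-up grades (6 students per guardian group, for illustration only).
groups = [
    [14, 15, 13, 16, 15, 14],  # mother
    [11, 12, 10, 11, 12, 10],  # father
    [13, 12, 14, 13, 12, 14],  # other
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k

f_value = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_value:.2f}")
```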
Accept or reject the Null Hypothesis
With 95% confidence (alpha = 0.05), df1 = 2, and df2 = 51, the critical F-value from the F-table is 3.179, while the calculated F-value is 18.49.
In the figure above, we see that the calculated F-value falls in the rejection region, beyond our confidence level, so we reject the null hypothesis.
To conclude: since the null hypothesis is rejected, variance exists between the groups, which means the guardian has an impact on the student's final score. So we will include this feature for model training.
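The table lookup can be reproduced with SciPy's percent-point function; this is a sketch, where 18.49 is the calculated F-value reported in the article:

```python
from scipy.stats import f

alpha, df_between, df_within = 0.05, 2, 51
# Critical value at 95% confidence for (2, 51) degrees of freedom.
critical = f.ppf(1 - alpha, df_between, df_within)
calculated = 18.49  # the F-value computed from the article's data

print(f"critical F = {critical:.3f}")  # close to the table value 3.179
print("reject H0" if calculated > critical else "fail to reject H0")
```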
Using One Way ANOVA we can check only a single predictor against the response. But what if you have two predictors? Then we use Two Way ANOVA, and if there are more than two factors we go for multi-factor ANOVA.
Using two-way or multi-factor ANOVA we can answer questions about the response such as:
- Does the guardian impact the student's final grade?
- Do the student's activities impact the final grade?
- Do the guardian and the student's activities together impact the final grade?
Drawing all of the above conclusions from a single test is always interesting, right? I am working on an article on two-way and multi-factor ANOVA and will make it even more interesting.
Here we dealt with a continuous response and a categorical predictor. If both the response and the predictor are categorical, please check my article Chi-Square Test for Feature Selection in Machine Learning.
Hope you enjoyed it! Stay tuned! Please comment with any queries or suggestions!