Commonly Used Statistical Tests in Data Science

8 min read · Mar 11, 2024

A Comprehensive Guide to Essential Statistical Tests and Their Applications

Commonly Used Statistical Tests in Data Science — Photo by Campaign Creators on Unsplash

In the digital era, making informed choices requires the nuanced skill of analyzing statistics and trends.

This article describes the statistical tests widely employed in data science to help make these informed judgments.

Each statistical test, such as the T-test and Chi-square test, is explained, calculated, and implemented in Python, and project recommendations are included. Let us start with the T-test.

If you want to know more about statistical tests, check out this article on the basic types of statistical tests in data science.

T-Test

A T-test determines whether the means of two groups differ significantly from each other. It is based on the t-distribution, which is used to make statistical decisions when sample sizes are small or the population variance is unknown.

There are three main types of T-tests.

Independent samples T-test: Compares the means of two groups that are not related to each other.
Paired sample T-test: Compares means from the same group at different times.
One-sample T-test: Compares the mean of a single group to a known mean.
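The paired and one-sample variants can be run in much the same way as the independent test shown later in this section. The following is a minimal sketch, assuming made-up "before" and "after" measurements for the same 30 subjects and an assumed reference mean of 5.0; the variable names and the seed are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Hypothetical paired data: the same 30 subjects measured twice
before = rng.normal(5.0, 1.5, 30)
after = before + rng.normal(0.5, 1.0, 30)

# Paired sample T-test: tests whether the mean of the differences is zero
t_paired, p_paired = stats.ttest_rel(before, after)

# One-sample T-test: compares the sample mean with a known value (here 5.0)
t_one, p_one = stats.ttest_1samp(after, 5.0)

print(f"Paired: t={t_paired:.3f}, p={p_paired:.4f}")
print(f"One-sample: t={t_one:.3f}, p={p_one:.4f}")
```

A paired test is equivalent to a one-sample test on the per-subject differences, which is why it is the right choice when the two sets of measurements come from the same subjects.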

Calculation

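Simply put, the T-test divides the difference between the two group means by a measure of the data’s variability. For the independent samples T-test with a pooled variance, the formula is:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $$

Here $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $s^2$ is the pooled variance, and $n_1$ and $n_2$ are the sample sizes.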
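To make the calculation concrete, the sketch below computes the statistic by hand and compares it with SciPy. It assumes that s2 is the pooled variance of the two samples, which is what `stats.ttest_ind` uses by default (`equal_var=True`); the seed and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
group_a = rng.normal(5.0, 1.5, 30)
group_b = rng.normal(6.0, 1.5, 30)

n1, n2 = len(group_a), len(group_b)

# Pooled variance: weighted average of the two sample variances (ddof=1)
s2 = ((n1 - 1) * group_a.var(ddof=1) + (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2)

# Difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))

# SciPy's independent T-test assumes equal variances by default
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

print(f"Manual t: {t_manual:.4f}, SciPy t: {t_scipy:.4f}")
```

The two values should agree up to floating-point error.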

Simple Python Implementation

Let’s see a simple implementation of it.

import numpy as np
from scipy import stats

# Sample data: Group A and Group B
group_a = np.random.normal(5.0, 1.5, 30)
group_b = np.random.normal(6.0, 1.5, 30)

# Performing an Independent T-test
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"T-Statistic: {t_stat}, P-Value: {p_val}")

One run of the code produced the output below. Because the samples are drawn at random and no seed is set, your numbers will differ.

T-Statistic: -3.06
P-Value: 0.003

This P-value is smaller than 0.05, which indicates a statistically significant difference in means between the two groups. The negative T-statistic shows that group A has a lower sample mean than group B.

Further Project Suggestion

Effectiveness of Sleep Aids: Compare the average sleep duration of subjects taking a new herbal sleep aid versus a placebo.

Educational Methods: Evaluate students’ test scores using traditional methods against those taught via e-learning platforms.

Fitness Program Results: Assess the impact of two different 8-week fitness programs on weight loss among similar demographic groups.

Productivity Software: Compare the average task completion time for two groups using different productivity apps.

Food Preference Study: Measure and compare the perceived taste rating of a new beverage with a standard competitor’s product across a sample of consumers.

Chi-Square Test

The Chi-Square Test determines whether there is a statistically significant association between two categorical variables. There are two types of Chi-Square tests.

Chi-Square Test of Independence: Determines whether two categorical variables are independent of each other.
Chi-Square Goodness of Fit Test: Determines whether a sample distribution matches an expected population distribution.
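The test of independence is implemented later in this section. For the goodness of fit variant, here is a minimal sketch using `scipy.stats.chisquare`; the die-roll counts are invented for illustration.

```python
from scipy.stats import chisquare

# Hypothetical counts from 120 rolls of a six-sided die
observed = [18, 22, 16, 25, 19, 20]

# With no f_exp argument, chisquare assumes every category is equally
# likely, so the expected count here is 120 / 6 = 20 per face
chi2_gof, p_gof = chisquare(observed)

print(f"Chi2 Statistic: {chi2_gof:.2f}, P-value: {p_gof:.3f}")
```

With these counts the statistic is 2.5 on 5 degrees of freedom, and the P-value is well above 0.05, so there is no evidence that the die is unfair.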

Calculation

The formula for the Chi-Square statistic is:

$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$

$O_i$ is the observed count in category $i$ and $E_i$ is the count we would expect.

Simply put, the statistic summarizes the difference between observed and expected frequencies in a single value. The larger this value, the less likely it is that the observed differences are due to random chance alone.

Simple Python Implementation

Let’s see a simple implementation of it.

from scipy.stats import chi2_contingency
import numpy as np

# Example data: Gender vs. Movie Preference
data = np.array([[30, 10], [5, 25]])
chi2, p, dof, expected = chi2_contingency(data)
print(f"Chi2 Statistic: {chi2}, P-value: {p}")

The Python code’s output for the Chi-Square Test is:

Chi-Square Statistic: 21.06
P-Value: 0.00000446

The Chi-Square statistic is 21.06 with a P-value of approximately 0.00000446. This very low P-value suggests a significant association between gender and movie preference at a 5% significance level.
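The statistic can also be reproduced by hand from observed and expected counts. The sketch below uses the same 2x2 table and additionally shows that SciPy applies Yates' continuity correction to 2x2 tables by default, which is why the reported value is 21.06 rather than the uncorrected 23.33. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10], [5, 25]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Plain Chi-Square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False turns off Yates' continuity correction
chi2_plain, p_plain, dof, _ = chi2_contingency(observed, correction=False)

# Default call: the correction is applied because the table is 2x2
chi2_yates, p_yates, _, _ = chi2_contingency(observed)

print(f"Manual: {chi2_manual:.2f}, SciPy uncorrected: {chi2_plain:.2f}")
print(f"SciPy with Yates' correction: {chi2_yates:.2f}")
```

Either way the P-value is far below 0.05, so the conclusion does not change for this table.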

Further Project Suggestion

Election Prediction: Examine the relationship between voter age groups and their preferences for particular political topics.

Marketing Campaign: Determine if there is a difference in responses to two distinct marketing campaigns across geographies.

Education Level and Technology Use: Explore the link between educational level and the adoption of new technologies in a community.

Illness Outbreak: Explore the relationship between illness spread and population density in the most affected areas.

Customer Satisfaction: Find out the relationship between customer satisfaction and the time of day they get service in retail.

ANOVA (Analysis of Variance)

ANOVA is used to compare the means of three or more groups. It helps determine whether at least one group’s mean is statistically different from the others.

One-Way ANOVA: Compares means across one independent variable with three or more levels (groups).
Two-Way ANOVA: Compares means considering two independent variables.
Repeated Measures ANOVA: Used when the same subjects are measured in all groups.

Calculation

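The formula for the one-way ANOVA F-statistic is:

$$ F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between} / (k - 1)}{SS_{within} / (N - k)} $$

Here $k$ is the number of groups, $N$ is the total number of observations, and $SS$ denotes a sum of squares.

In simpler terms, ANOVA calculates an F-statistic, the ratio of the variance between groups to the variance within groups. A higher F-value gives stronger evidence that at least one group mean differs from the others.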
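As a rough illustration of that ratio, the sketch below computes the between-group and within-group mean squares by hand and compares the result with `stats.f_oneway`. The seed and variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
groups = [rng.normal(mu, 1.5, 30) for mu in (5.0, 6.0, 7.0)]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}")
```

The two F-values should match up to floating-point error.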

Simple Python Implementation

Let’s see a simple implementation of it.

from scipy import stats
import numpy as np

# Sample data: Three different groups
group1 = np.random.normal(5.0, 1.5, 30)
group2 = np.random.normal(6.0, 1.5, 30)
group3 = np.random.normal(7.0, 1.5, 30)

# Performing One-Way ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)
print(f"F-Statistic: {f_stat}, P-Value: {p_val}")

One run of the code produced the output below. Because the samples are drawn at random and no seed is set, your numbers will differ.

F-Statistic: 15.86
P-Value: 0.00000134

The F-statistic is 15.86 with a P-value of approximately 0.00000134. This extremely low P-value indicates that at least one group mean differs significantly from the others at a 5% significance level. ANOVA alone does not tell you which group differs.

Further Project Suggestion

Agricultural Crop Yields: Compare the average yields of various wheat types across numerous areas to determine which are the most productive.

Employee Productivity: Compare staff productivity across a firm's departments to evaluate whether there is a substantial variation.

Therapeutic Strategies: Evaluate the effectiveness of various therapy techniques, for instance in reducing anxiety levels.

Gaming Platforms: Determine whether meaningful differences in average frame rates exist across several gaming systems running the same video game.

Dietary Effects on Health: Investigate the influence of different diets, such as vegan and vegetarian, on specific health indicators in participants.

Pearson Correlation

Pearson Correlation measures the linear relationship between two continuous variables. It produces a value between -1 and 1, indicating the strength and direction of the association.

Pearson Correlation is one specific type of correlation. It differs from others such as Spearman’s correlation, which works on ranks and can therefore capture monotonic relationships that are not linear.
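The difference between the two coefficients shows up on data that rise steadily but not along a straight line. The sketch below uses an invented cubic relationship for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 3  # always increasing, but not linear

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)

print(f"Pearson: {r_pearson:.3f}, Spearman: {rho_spearman:.3f}")
```

Spearman's coefficient is 1 here because the ordering of x and y matches perfectly, while Pearson's coefficient is below 1 because the points do not lie on a straight line.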

Calculation

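The formula for the Pearson Correlation coefficient is:

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$

Here $x_i$ and $y_i$ are the paired observations, and $\bar{x}$ and $\bar{y}$ are their means.

Simply put, it calculates how closely one variable changes along with another.

A value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates little or no linear relationship.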
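As a small worked example, the coefficient can be computed directly from deviations around the means and checked against `pearsonr`. The data points below are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired observations
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 6.0, 6.0, 10.0])

# Deviations of each observation from its variable's mean
dx, dy = x - x.mean(), y - y.mean()

# Sum of cross-products divided by the square root of the
# product of the two sums of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, _ = pearsonr(x, y)
print(f"Manual r: {r_manual:.4f}, SciPy r: {r_scipy:.4f}")
```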

Simple Python Implementation

Let’s see a simple implementation of it.

import numpy as np
from scipy.stats import pearsonr

# Sample data
x = np.array([10, 20, 30, 40, 50])
y = np.array([15, 25, 35, 45, 55])

# Calculating Pearson Correlation
corr, _ = pearsonr(x, y)
print(f"Pearson Correlation Coefficient: {corr}")

The Python code’s output for the Pearson Correlation test is:

Pearson Correlation Coefficient: 1.0

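This shows a perfect positive linear relationship between the two variables, which is expected here because every y value is simply x + 5. In an ML project, a correlation this strong between a feature and the target should make you suspect a problem such as overfitting.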

Further Project Suggestion

Economic Indicators: Investigate the link between consumer confidence and retail sales volume.

Healthcare Analysis: Explore the link between the number of hours spent physically active and blood pressure levels.

Educational Achievement: Examine the link between the amount of time students spend on homework and their academic success.

Technology Use: Investigate the relationship between time spent on social media and perceived stress or happiness.

Real Estate Pricing: Research the link between a property's characteristics, such as its size, and its sale price.

Mann-Whitney U Test

The Mann-Whitney U Test is a nonparametric test that evaluates differences between two independent groups when the data do not follow a normal distribution.

It is a substitute for the independent samples T-test when the data do not meet the normality assumption.
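One possible way to decide whether the normality assumption is in doubt is a Shapiro-Wilk test. The article does not prescribe a particular check, so treat the sketch below as an assumption; the skewed sample and the 0.05 cutoff are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed so the run is repeatable

# Hypothetical, strongly right-skewed data (for example, waiting times)
skewed = rng.exponential(scale=2.0, size=100)

# Shapiro-Wilk tests the null hypothesis that the data are normally distributed
w_stat, p_normal = stats.shapiro(skewed)

if p_normal < 0.05:
    print("Normality rejected: consider the Mann-Whitney U Test")
else:
    print("No evidence against normality: a T-test may be appropriate")
```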

Calculation

The Mann-Whitney U statistic is calculated from the ranks of the observations in the combined dataset. One common form is:

$$ U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2}, \qquad U_2 = R_2 - \frac{n_2 (n_2 + 1)}{2}, \qquad U = \min(U_1, U_2) $$

Where

U is the Mann-Whitney U statistic.
R1 and R2 are the sums of ranks for the first and second groups, respectively.
n1 and n2 are the sample sizes of the two groups.
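The rank-based calculation can be sketched as follows. It assumes the convention U1 = R1 - n1(n1 + 1)/2 (and likewise for the second group); depending on the SciPy version, `mannwhitneyu` reports either the U value for the first sample or the smaller of the two, so the comparison below allows for both. The seed and variable names are illustrative.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(7)  # fixed seed so the run is repeatable
group1 = rng.normal(5.0, 1.5, 30)
group2 = rng.normal(6.0, 1.5, 30)
n1, n2 = len(group1), len(group2)

# Rank all observations together (tied values would receive average ranks)
ranks = rankdata(np.concatenate([group1, group2]))
r1, r2 = ranks[:n1].sum(), ranks[n1:].sum()

# U for each group: rank sum minus the smallest rank sum it could have
u1 = r1 - n1 * (n1 + 1) / 2
u2 = r2 - n2 * (n2 + 1) / 2

u_scipy, p_val = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"U1: {u1}, U2: {u2}, SciPy U: {u_scipy}, P-Value: {p_val:.4f}")
```

The two U values always add up to n1 * n2, so either one carries the same information.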

Simple Python Implementation

Let’s see a simple implementation of it.

from scipy.stats import mannwhitneyu
import numpy as np

# Sample data: Two groups
group1 = np.random.normal(5.0, 1.5, 30)
group2 = np.random.normal(6.0, 1.5, 30)

# Performing Mann-Whitney U Test
u_stat, p_val = mannwhitneyu(group1, group2)
print(f"U Statistic: {u_stat}, P-Value: {p_val}")

One run of the code produced the output below. Because the samples are drawn at random and no seed is set, your numbers will differ.

U Statistic: 305.0
P-Value: 0.032

This P-value is below the typical alpha level of 0.05, so the difference between the two groups is statistically significant at the 5% level. The Mann-Whitney U Test result suggests that the distributions of the two groups are not equal, with values in one group tending to be larger than those in the other.

Further Project Suggestion

Medication Response: Compare the change in symptom severity before and after taking two distinct drugs in non-normally distributed patient data.

Job Satisfaction: Compare the levels of job satisfaction among employees in a company’s high- and low-stress departments.

Teaching Materials: Determine the impact of two teaching materials on student involvement in a classroom where the data is not normally distributed.

E-commerce Delivery Times: Compare the delivery times of two courier services for e-commerce packages.

Exercise Impact on Mood: Investigate the impact of two distinct forms of short-term exercise on mood improvement in individuals, with a focus on nonparametric data.

Conclusion

We've looked at everything from the T-test to the Mann-Whitney U Test, including Python implementations and real-world project ideas.

Remember that becoming a skilled data scientist requires practice. Exploring these tests through hands-on projects strengthens your understanding and sharpens your analytical abilities.

To do that, visit our platform and do data projects like Student Performance Analysis. Here, you’ll have a chance to do Chi-square tests.

Originally published at https://www.stratascratch.com.

Written by Nathan Rosidi