SciPy: Statistics and Probability

The scipy.stats module is SciPy's library for statistical computing in Python. It provides a large collection of continuous and discrete probability distributions, statistical tests, and descriptive statistics, and it is widely used for hypothesis testing, data analysis, and modeling.

1. Probability Distributions

scipy.stats offers objects for many common probability distributions. Each distribution object provides the following methods:

* pdf() (Probability Density Function) / pmf() (Probability Mass Function): For continuous distributions, pdf() gives the probability density at a point (a density, not a probability). For discrete distributions, pmf() gives the probability of that exact value.
* cdf() (Cumulative Distribution Function): Probability of the random variable being less than or equal to a given value.
* ppf() (Percent Point Function / Quantile Function): The inverse of cdf(). Returns the value x such that the probability of the random variable being less than or equal to x is p.
* sf() (Survival Function): 1 - cdf(x). Probability of the random variable being greater than x.
* isf() (Inverse Survival Function): The inverse of sf().
* rvs() (Random Variates): Generates random samples from the distribution.
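The relationships between these methods can be checked numerically. The following is a minimal sketch using a standard normal distribution; the evaluation point x = 1.5 is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.stats import norm

dist = norm(loc=0, scale=1)  # frozen standard normal distribution
x = 1.5                      # arbitrary evaluation point
p = dist.cdf(x)              # P(X <= x)

# ppf() inverts cdf(), and isf() inverts sf()
x_from_ppf = dist.ppf(p)
x_from_isf = dist.isf(dist.sf(x))

# sf() is the complement of cdf(), so the two sum to 1
total = dist.sf(x) + dist.cdf(x)

print(f"cdf({x}) = {p:.4f}")
print(f"ppf(cdf({x})) = {x_from_ppf:.4f}")
print(f"isf(sf({x})) = {x_from_isf:.4f}")
print(f"sf({x}) + cdf({x}) = {total:.4f}")
```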

a. Normal (Gaussian) Distribution (`norm`)

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create a standard normal distribution (mean=0, std=1)
# loc: mean, scale: standard deviation
std_normal = norm(loc=0, scale=1)

# Generate some x values
x = np.linspace(-3, 3, 100)

# Plot PDF
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, std_normal.pdf(x), label='PDF')
plt.title('Standard Normal Distribution PDF')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.grid(True)

# Plot CDF
plt.subplot(1, 2, 2)
plt.plot(x, std_normal.cdf(x), label='CDF')
plt.title('Standard Normal Distribution CDF')
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.tight_layout()
plt.show()

# Get specific values
print("PDF at x=0:", std_normal.pdf(0))
print("CDF at x=1 (P(X <= 1)):", std_normal.cdf(1))
print("PPF at 0.975 (value below which 97.5% of data falls):", std_normal.ppf(0.975))

# Generate random samples
random_samples = std_normal.rvs(size=10)
print("\n10 random samples from standard normal:\n", random_samples)
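The loc and scale parameters are not limited to the standard case. The sketch below assumes a hypothetical normal variable with mean 170 and standard deviation 10 (values chosen only for illustration). It shows that probabilities match those of the standardized variable, and that passing random_state to rvs() makes the samples reproducible.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 170.0, 10.0  # hypothetical parameters, for illustration only
dist = norm(loc=mu, scale=sigma)

# P(X > 180), computed directly and via the standardized value z = (x - mu) / sigma
p_above = dist.sf(180.0)
z = (180.0 - mu) / sigma
p_above_std = norm.sf(z)
print(f"P(X > 180) = {p_above:.4f} (standardized: {p_above_std:.4f})")

# Central interval containing 95% of the probability mass
lower, upper = dist.interval(0.95)
print(f"Central 95% interval: ({lower:.2f}, {upper:.2f})")

# The same seed produces the same samples
samples_a = dist.rvs(size=5, random_state=42)
samples_b = dist.rvs(size=5, random_state=42)
print("Reproducible samples:", np.array_equal(samples_a, samples_b))
```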

b. Other Distributions (e.g., Uniform, Poisson, t-distribution)

import numpy as np
from scipy.stats import uniform, poisson, t

# Uniform Distribution (loc: lower bound, scale: width)
uni_dist = uniform(loc=0, scale=10) # Uniform between 0 and 10
print("\nUniform(0, 10) PDF at 5:", uni_dist.pdf(5))
print("Uniform(0, 10) 3 random samples:", uni_dist.rvs(size=3))

# Poisson Distribution (mu: expected number of events)
poisson_dist = poisson(mu=3) # Poisson with lambda=3
print("\nPoisson(lambda=3) PMF at 2 (P(X=2)):", poisson_dist.pmf(2))
print("Poisson(lambda=3) 5 random samples:", poisson_dist.rvs(size=5))

# Student's t-distribution (df: degrees of freedom)
t_dist = t(df=10)
print("\nt-distribution (df=10) PDF at 0:", t_dist.pdf(0))
print("t-distribution (df=10) 3 random samples:", t_dist.rvs(size=3))
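The other methods work the same way on these distributions. As a small illustrative sketch: for a discrete distribution, cdf() equals the sum of pmf() over all values up to that point, and the t-distribution's heavier tails give a larger critical value than the normal distribution at the same probability level.

```python
import numpy as np
from scipy.stats import norm, poisson, t

# Discrete case: P(X <= 2) is the sum of P(X = 0), P(X = 1), and P(X = 2)
poisson_dist = poisson(mu=3)
p_at_most_2 = poisson_dist.cdf(2)
p_summed = poisson_dist.pmf([0, 1, 2]).sum()
print(f"Poisson cdf(2) = {p_at_most_2:.4f}, summed pmf = {p_summed:.4f}")

# Two-sided 95% critical values: t with 10 degrees of freedom vs. standard normal
t_crit = t(df=10).ppf(0.975)
z_crit = norm.ppf(0.975)
print(f"t critical value (df=10): {t_crit:.3f}")
print(f"z critical value: {z_crit:.3f}")
```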

2. Descriptive Statistics

scipy.stats can calculate various descriptive statistics.

import numpy as np
from scipy import stats

data = np.array([12, 15, 13, 18, 16, 14, 20, 11])

# Basic descriptive statistics
print("Data:", data)
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Variance:", np.var(data)) # Population variance
print("Standard Deviation:", np.std(data)) # Population std dev

# Using stats.describe()
description = stats.describe(data)
print("\nStats Describe:\n", description)
print(f"Number of observations: {description.nobs}")
print(f"Min/Max: {description.minmax}")
print(f"Mean: {description.mean:.2f}")
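print(f"Variance: {description.variance:.2f}")  # Sample variance (ddof=1), unlike np.var above
print(f"Skewness: {description.skewness:.2f}")
print(f"Kurtosis: {description.kurtosis:.2f}")  # Fisher (excess) kurtosis: 0 for a normal distribution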

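# Mode (most frequent value). Every value in this data set occurs once,
# so mode() returns the smallest value with a count of 1.
# keepdims was added in SciPy 1.9; keepdims=False returns scalars for 1-D input.
mode_result = stats.mode(data, keepdims=False)
print(f"\nMode: {mode_result.mode}, Count: {mode_result.count}")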
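One detail worth checking is that the two variance figures above differ. np.var() divides by n by default (population variance), while stats.describe() divides by n - 1 by default (sample variance). A minimal sketch of the comparison, reusing the same data:

```python
import numpy as np
from scipy import stats

data = np.array([12, 15, 13, 18, 16, 14, 20, 11])

pop_var = np.var(data)                         # ddof=0: divide by n
sample_var = np.var(data, ddof=1)              # ddof=1: divide by n - 1
described_var = stats.describe(data).variance  # uses ddof=1 by default

print(f"Population variance (ddof=0): {pop_var:.3f}")
print(f"Sample variance (ddof=1): {sample_var:.3f}")
print(f"stats.describe variance: {described_var:.3f}")
```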

3. Statistical Tests (Hypothesis Testing)

SciPy provides numerous functions for hypothesis testing, allowing you to test assumptions about your data.

a. T-test (`ttest_ind`, `ttest_rel`, `ttest_1samp`)

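T-tests determine whether there is a significant difference between means: ttest_ind compares two independent groups, ttest_rel compares paired measurements on the same subjects, and ttest_1samp compares a sample mean to a known value.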

import numpy as np
from scipy import stats

# Two independent samples
group1 = np.array([20, 22, 23, 25, 21, 24])
group2 = np.array([25, 26, 28, 24, 27, 29])

# Independent t-test (equal_var=True, the default, assumes equal variances;
# equal_var=False performs Welch's t-test instead)
t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=True)
print("Independent T-test:")
print(f"T-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("There is a significant difference between the means of the two groups.")
else:
    print("There is no significant difference between the means of the two groups.")

# Paired sample t-test (ttest_rel) for before/after measurements
# One sample t-test (ttest_1samp) for comparing sample mean to a constant
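The two remaining variants follow the same calling pattern. The sketch below uses hypothetical before/after measurements invented for illustration, and an arbitrary reference value of 75 for the one-sample test. It also shows that a paired t-test is equivalent to a one-sample t-test on the pairwise differences against zero.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements on the same six subjects
before = np.array([72, 75, 78, 70, 74, 77])
after = np.array([70, 74, 75, 69, 71, 76])

# Paired t-test: are the before/after means different?
t_rel, p_rel = stats.ttest_rel(before, after)
print(f"Paired t-test: t = {t_rel:.3f}, p = {p_rel:.3f}")

# Equivalent formulation: one-sample t-test on the differences against 0
t_diff, p_diff = stats.ttest_1samp(before - after, popmean=0)
print(f"One-sample on differences: t = {t_diff:.3f}, p = {p_diff:.3f}")

# One-sample t-test: does the 'before' mean differ from a reference value of 75?
t_one, p_one = stats.ttest_1samp(before, popmean=75)
print(f"One-sample t-test vs. 75: t = {t_one:.3f}, p = {p_one:.3f}")
```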

b. ANOVA (Analysis of Variance) (`f_oneway`)

Used to compare the means of two or more samples to see if at least one group mean is significantly different from the others.

import numpy as np
from scipy import stats

group1 = np.array([10, 12, 11, 13, 10])
group2 = np.array([15, 14, 16, 17, 15])
group3 = np.array([20, 22, 21, 19, 20])

# One-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)
print("\nOne-way ANOVA:")
print(f"F-statistic: {f_statistic:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("There is a significant difference among the means of the groups.")
else:
    print("There is no significant difference among the means of the groups.")
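To make the F-statistic less of a black box, it can be reproduced by hand as the ratio of the between-group mean square to the within-group mean square. The following is a sketch of that textbook calculation using the same three groups, compared against f_oneway:

```python
import numpy as np
from scipy import stats

groups = [
    np.array([10, 12, 11, 13, 10]),
    np.array([15, 14, 16, 17, 15]),
    np.array([20, 22, 21, 19, 20]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group mean square) / (within-group mean square)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual: F = {f_manual:.3f}, p = {p_manual:.2e}")
print(f"SciPy:  F = {f_scipy:.3f}, p = {p_scipy:.2e}")
```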

c. Chi-squared Test (`chi2_contingency`)

Tests whether two categorical variables are independent, given a contingency table of observed frequencies.

import numpy as np
from scipy import stats

# Contingency table (observed frequencies)
# Rows: Gender (Male, Female), Columns: Preference (A, B)
contingency_table = np.array([[30, 20],
                              [15, 35]])

# For 2x2 tables (dof=1), Yates' continuity correction is applied by default (correction=True)
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print("\nChi-squared Test:")
print(f"Chi-squared statistic: {chi2:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("There is a significant association between gender and preference.")
else:
    print("There is no significant association between gender and preference.")
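The expected frequencies returned by chi2_contingency come from the row and column totals under the assumption of independence. The sketch below recomputes them by hand for the same table and compares the uncorrected statistic (correction=False) with the default, which applies Yates' continuity correction to 2x2 tables.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 20],
                     [15, 35]])

# Expected count for each cell: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic without continuity correction
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_raw, p_raw, dof, expected = stats.chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(observed)  # default: correction=True

print("Expected frequencies:\n", expected)
print(f"Uncorrected: chi2 = {chi2_raw:.3f}, p = {p_raw:.4f}")
print(f"Yates-corrected: chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}")
```

The corrected statistic is smaller, so the default is the more conservative of the two.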

Further Topics:

* Non-parametric tests (e.g., Wilcoxon, Mann-Whitney U, Kruskal-Wallis).
* Multivariate statistics.
* Kernel density estimation (gaussian_kde).
* Regression: scipy.stats covers simple linear regression (linregress), while non-linear curve fitting lives in scipy.optimize (curve_fit).
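As a brief taste of two of these topics, the sketch below runs a Mann-Whitney U test on the two groups from the t-test example and fits a Gaussian kernel density estimate to synthetic data. The synthetic samples, the seed, and the evaluation grid are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Mann-Whitney U: a rank-based alternative to the independent t-test
group1 = np.array([20, 22, 23, 25, 21, 24])
group2 = np.array([25, 26, 28, 24, 27, 29])
u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_value:.4f}")

# Kernel density estimation on synthetic, normally distributed samples
rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
samples = rng.normal(loc=0.0, scale=1.0, size=200)
kde = stats.gaussian_kde(samples)

grid = np.linspace(-4, 4, 9)
density = kde(grid)  # estimated density at each grid point
for x, d in zip(grid, density):
    print(f"x = {x:+.1f}, estimated density = {d:.4f}")
```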

The scipy.stats module combines theoretical probability distributions with practical hypothesis tests, which covers much of the routine statistical analysis that data scientists and researchers do in Python.