NumPy: Random Number Generation and Statistical Functions

NumPy's random module is essential for generating random numbers and samples from various probability distributions. This is crucial for simulations, bootstrapping, creating synthetic datasets, and initializing neural network weights. Additionally, NumPy provides a comprehensive suite of statistical functions for data analysis.

1. Random Number Generation (`numpy.random`)

The numpy.random module generates pseudorandom numbers. The Generator API (introduced in NumPy 1.17), obtained by calling np.random.default_rng(), is recommended over the legacy functions such as np.random.rand() because it gives better control and reproducibility.

Basic Random Numbers

rng.random(size=None): Returns random floats from the uniform distribution over [0.0, 1.0). size is an int or a tuple of ints giving the output shape.
rng.standard_normal(size=None): Returns random floats from the standard normal (Gaussian) distribution (mean 0, variance 1).
rng.integers(low, high=None, size=None, dtype=np.int64, endpoint=False): Returns random integers from low (inclusive) to high (exclusive). Pass endpoint=True to include high.

Note: rand(), randn(), and randint() belong to the legacy np.random interface and are not methods of a Generator.

import numpy as np

# It's good practice to create a Generator instance for reproducibility
rng = np.random.default_rng(seed=42) # Seed for reproducibility

# Random floats in [0.0, 1.0)
rand_floats = rng.random(3)
print("3 random floats (uniform [0,1)):\n", rand_floats)

rand_2d = rng.random((2, 3)) # 2x3 array; the shape is passed as a tuple
print("\n2x3 random floats:\n", rand_2d)

# Random floats from standard normal distribution
rand_normal = rng.standard_normal(4)
print("\n4 random floats (standard normal):\n", rand_normal)

# Random integers between low (inclusive) and high (exclusive)
rand_int_single = rng.integers(0, 10) # Single integer between 0 and 9
print("\nSingle random integer [0,9]:", rand_int_single)

rand_int_array = rng.integers(5, 10, size=(2, 3)) # 2x3 array of integers in [5,9]
print("\n2x3 random integers [5,9]:\n", rand_int_array)
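
To make the reproducibility claim concrete, here is a minimal sketch showing that two generators built from the same seed yield identical draws. The seed value 123 and the variable names are arbitrary choices for illustration. It also shows the endpoint=True option of integers(), which makes the upper bound inclusive.

```python
import numpy as np

# Two generators created from the same seed produce the same stream.
rng_a = np.random.default_rng(seed=123)
rng_b = np.random.default_rng(seed=123)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)
same_stream = np.array_equal(draws_a, draws_b)
print("Same seed gives same draws:", same_stream)

# integers() excludes `high` by default; endpoint=True includes it,
# which is convenient for simulating a six-sided die.
die_rolls = rng_a.integers(1, 6, size=1000, endpoint=True)
print("Die roll range:", die_rolls.min(), "to", die_rolls.max())
```

Creating the generator once and passing it to the functions that need randomness keeps a whole experiment reproducible from a single seed.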

Samples from Specific Distributions

The Generator offers many methods to draw samples from various probability distributions.

import numpy as np
rng = np.random.default_rng(seed=42)

# Normal distribution (mean, standard deviation, size)
normal_samples = rng.normal(loc=10, scale=2, size=5) # Mean 10, Std Dev 2
print("5 samples from Normal(mean=10, std=2):\n", normal_samples)

# Uniform distribution (low, high, size)
uniform_samples = rng.uniform(low=-1, high=1, size=5)
print("\n5 samples from Uniform(-1, 1):\n", uniform_samples)

# Binomial distribution (n, p, size)
binomial_samples = rng.binomial(n=10, p=0.5, size=5) # 5 samples, each counting successes in 10 trials with p=0.5
print("\n5 samples from Binomial(n=10, p=0.5):\n", binomial_samples)

# Choice from an array
choices = rng.choice(['apple', 'banana', 'cherry'], size=3, replace=False) # Without replacement
print("\n3 unique choices:", choices)

choices_with_replacement = rng.choice(['A', 'B'], size=5, p=[0.7, 0.3]) # With replacement, specified probabilities
print("5 choices with replacement and prob:", choices_with_replacement)

# Permutations and Shuffling
arr_perm = np.arange(5)
rng.shuffle(arr_perm) # Shuffles the array IN-PLACE
print("\nShuffled array:", arr_perm)

permutation = rng.permutation(5) # Returns a NEW permutation
print("New permutation:", permutation)
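
One way to check what the distribution parameters mean is to draw a large sample and compare its summary statistics with the theoretical values. The sketch below assumes a sample size of 100,000 (an arbitrary choice) and reuses the parameters from the examples above; the printed values are only approximately equal to the theoretical ones.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 100_000  # arbitrary, large enough for stable estimates

# Normal(loc=10, scale=2): the sample mean should be near 10, the sample std near 2.
normal_draws = rng.normal(loc=10, scale=2, size=n_samples)
print("Normal mean:", normal_draws.mean(), "std:", normal_draws.std())

# Binomial(n=10, p=0.5): each draw is a success count in 0..10,
# and the theoretical mean is n * p = 5.
binomial_draws = rng.binomial(n=10, p=0.5, size=n_samples)
print("Binomial mean:", binomial_draws.mean())
print("Binomial range:", binomial_draws.min(), "to", binomial_draws.max())
```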

2. Statistical Functions

NumPy provides many functions for basic statistical analysis.

Descriptive Statistics

np.min(), np.max(), np.argmin(), np.argmax(): Minimum, maximum values and their indices.
np.mean(): Arithmetic mean.
np.median(): Median.
np.std(): Standard deviation (population form by default, ddof=0).
np.var(): Variance (population form by default, ddof=0).
np.percentile(): Compute the q-th percentile of the data along the specified axis.

import numpy as np

data = np.array([[10, 12, 11],
                 [15, 8, 14],
                 [9, 13, 10]])
print("Data Array:\n", data)

# Min, Max, Mean, Std across entire array
print("\nMinimum:", np.min(data))
print("Maximum:", np.max(data))
print("Mean:", np.mean(data))
print("Standard Deviation:", np.std(data))

# Operations along an axis
print("\nMin along columns (axis=0):", np.min(data, axis=0)) # [9, 8, 10]
print("Max along rows (axis=1):", np.max(data, axis=1))     # [12, 15, 13]

# Median
print("\nMedian of all elements:", np.median(data))

# Percentiles
print("25th percentile:", np.percentile(data, 25))
print("75th percentile along rows (axis=1):", np.percentile(data, 75, axis=1))
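
A detail that is easy to miss: np.std() and np.var() divide by N by default (ddof=0), which is the population form. Passing ddof=1 gives the sample form, which divides by N - 1. The following sketch checks both against a manual calculation, using the same nine values as the data array above, flattened; the variable names are illustrative.

```python
import numpy as np

values = np.array([10, 12, 11, 15, 8, 14, 9, 13, 10])
n = values.size

# Sum of squared deviations from the mean, shared by both formulas.
squared_dev = np.sum((values - values.mean()) ** 2)

pop_std = np.std(values)             # default ddof=0, divides by N
sample_std = np.std(values, ddof=1)  # divides by N - 1

manual_pop = np.sqrt(squared_dev / n)
manual_sample = np.sqrt(squared_dev / (n - 1))

print("Population std:", pop_std, "manual:", manual_pop)
print("Sample std:", sample_std, "manual:", manual_sample)

# The 50th percentile is the median.
median_matches = np.percentile(values, 50) == np.median(values)
print("50th percentile equals median:", median_matches)
```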

Correlation and Covariance

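np.corrcoef(x, y): Returns the matrix of Pearson product-moment correlation coefficients; element [0, 1] is the correlation between x and y.
np.cov(m): Estimates a covariance matrix, treating each row of m as a variable by default and normalizing by N - 1.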

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
z = np.array([1, 1.2, 2.8, 3.5, 4.8])

# Correlation coefficient between x and y
print("\nCorrelation between x and y:", np.corrcoef(x, y)[0, 1]) # Expected negative correlation

# Correlation coefficient between x and z
print("Correlation between x and z:", np.corrcoef(x, z)[0, 1]) # Expected positive correlation

# Covariance matrix (each row of the input is a variable, or each column if rowvar=False)
data_for_cov = np.vstack((x, z))
covariance_matrix = np.cov(data_for_cov)
print("\nCovariance matrix for x and z:\n", covariance_matrix)
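
To show how the two functions relate, the sketch below recomputes the covariance and Pearson correlation of x and z by hand and compares them with NumPy's results. It relies on np.cov() normalizing by N - 1 by default; the intermediate variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = np.array([1.0, 1.2, 2.8, 3.5, 4.8])

# Sample covariance: mean-centered products divided by N - 1.
cov_manual = np.sum((x - x.mean()) * (z - z.mean())) / (x.size - 1)

# Pearson r: covariance divided by the product of the sample standard deviations.
r_manual = cov_manual / (np.std(x, ddof=1) * np.std(z, ddof=1))

cov_numpy = np.cov(x, z)[0, 1]
r_numpy = np.corrcoef(x, z)[0, 1]

print("Covariance manual vs NumPy:", cov_manual, cov_numpy)
print("Correlation manual vs NumPy:", r_manual, r_numpy)
```

The diagonal of the covariance matrix holds the sample variances of x and z, so np.cov(x, z)[0, 0] equals np.var(x, ddof=1).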

Histograms

np.histogram(): Computes the histogram of a set of data.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42) # Generator must be defined before sampling
data = rng.normal(loc=0, scale=1, size=1000) # 1000 samples from standard normal

# Compute histogram
hist, bin_edges = np.histogram(data, bins=30)
print("\nHistogram counts:", hist[:5], "...") # Show first 5 counts
print("Bin edges:", bin_edges[:5], "...") # Show first 5 edges

# Visualizing a histogram (requires Matplotlib)
plt.hist(data, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Histogram of Normal Distribution Samples')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()
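
A few properties of np.histogram() output are worth verifying directly. The sketch below, which needs only NumPy, checks that there is one more bin edge than there are counts, that the counts sum to the number of samples, and that with density=True the bar areas sum to 1. The seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0, scale=1, size=1000)

counts, edges = np.histogram(samples, bins=30)

# 30 bins are delimited by 31 edges.
print("Counts:", len(counts), "Edges:", len(edges))

# With the default range (min to max), every sample lands in some bin.
total_count = counts.sum()
print("Total count:", total_count)

# density=True rescales counts so that the area under the bars is 1.
density, _ = np.histogram(samples, bins=30, density=True)
area = np.sum(density * np.diff(edges))
print("Area under density histogram:", area)
```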

Further Topics:

Advanced random number generation techniques (e.g., seeding, bit generators).
Hypothesis testing (typically done with scipy.stats).
Weighted averages with np.average().
np.bincount() for counting occurrences of non-negative integers.
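
As a brief taste of two of these topics, here is a minimal sketch of np.average() with weights and np.bincount(). The scores, weights, and labels are made-up illustrative values.

```python
import numpy as np

# Weighted average: sum(score * weight) / sum(weight).
scores = np.array([80.0, 90.0, 70.0])
weights = np.array([0.5, 0.3, 0.2])
weighted_mean = np.average(scores, weights=weights)
print("Weighted mean:", weighted_mean)  # 40 + 27 + 14 = 81.0

# bincount: index i of the result holds the number of times i appears.
labels = np.array([0, 1, 1, 3, 2, 1, 3])
counts = np.bincount(labels)
print("Counts per label:", counts)  # [1 3 1 2]
```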

Random number generation and statistical functions are fundamental tools in many scientific and data-driven fields. NumPy's vectorized implementations keep these operations fast even on large arrays.