Pandas: Group By Operations

The groupby() method is one of the most powerful and frequently used functions in Pandas. It allows you to split your data into groups based on some criteria, apply a function to each group independently, and then combine the results into a single DataFrame. This "split-apply-combine" strategy is central to many data analysis tasks.

The Split-Apply-Combine Strategy

Split: Data is divided into groups based on one or more keys (column values).
Apply: A function (e.g., aggregation, transformation, filtration) is applied to each group independently.
Combine: The results of the applied functions are combined back into a single data structure.

Let's create a sample DataFrame to illustrate groupby() operations:

import pandas as pd
import numpy as np

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value1': [10, 12, 15, 11, 13, 16, 14, 18],
    'Value2': [100, 110, 105, 120, 115, 125, 130, 135],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'SF', 'NY', 'LA']
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

1. Grouping and Aggregation

Aggregation involves computing a summary statistic (like sum, mean, count, min, max, std) for each group.

import pandas as pd
import numpy as np

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value1': [10, 12, 15, 11, 13, 16, 14, 18],
    'Value2': [100, 110, 105, 120, 115, 125, 130, 135],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'SF', 'NY', 'LA']
}
df = pd.DataFrame(data)

# Group by 'Category' and calculate the mean of numerical columns
print("\nMean of numerical columns grouped by 'Category':\n", df.groupby('Category').mean())

# Group by 'Category' and calculate the sum of 'Value1'
print("\nSum of 'Value1' grouped by 'Category':\n", df.groupby('Category')['Value1'].sum())

# Group by multiple columns ('Category' and 'City')
print("\nMean grouped by 'Category' and 'City':\n", df.groupby(['Category', 'City']).mean())

# Apply multiple aggregation functions to a single column
print("\nAggregating 'Value1' with multiple functions:\n",
      df.groupby('Category')['Value1'].agg(['sum', 'mean', 'max', 'min']))

# Apply different aggregation functions to different columns
print("\nAggregating different columns with different functions:\n",
      df.groupby('Category').agg({'Value1': 'sum', 'Value2': 'mean'}))

# Custom aggregation function using apply or lambda
print("\nCustom aggregation (range) on 'Value1' grouped by 'Category':\n",
      df.groupby('Category')['Value1'].apply(lambda x: x.max() - x.min()))

2. Transformation

Transformation returns an object that is of the same size as the group. It performs a group-specific computation and broadcasts the results.

import pandas as pd
import numpy as np

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value': [10, 12, 15, 11, 13, 16, 14, 18]
}
df_transform = pd.DataFrame(data)
print("Original DataFrame for transformation:\n", df_transform)

# Subtract the mean of each group from its values
df_transform['Value_demeaned'] = df_transform.groupby('Category')['Value'].transform(lambda x: x - x.mean())
print("\nDataFrame with 'Value_demeaned' (group-wise mean subtracted):\n", df_transform)

# Fill missing values with the group-wise mean
df_missing = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value': [10, 12, np.nan, 11, 13, 16, np.nan, 18]
})
df_missing['Value_filled'] = df_missing.groupby('Category')['Value'].transform(lambda x: x.fillna(x.mean()))
print("\nDataFrame with missing 'Value' filled by group mean:\n", df_missing)

3. Filtration

Filtration discards entire groups based on a group-wise condition.

import pandas as pd
import numpy as np

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value': [10, 12, 15, 11, 13, 16, 14, 18]
}
df_filter = pd.DataFrame(data)
print("Original DataFrame for filtration:\n", df_filter)

# Filter out groups where the mean of 'Value' is less than 13
df_filtered = df_filter.groupby('Category').filter(lambda x: x['Value'].mean() >= 13)
print("\nDataFrame after filtering groups with mean Value < 13:\n", df_filtered)

# Filter out groups that have less than 3 entries
df_less_entries = df_filter.groupby('Category').filter(lambda x: len(x) >= 3)
print("\nDataFrame after filtering groups with less than 3 entries:\n", df_less_entries)

4. Iterating Through Groups

You can iterate through the groups, which can be useful for more complex, custom operations.

import pandas as pd
import numpy as np

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Value': [10, 12, 15, 11, 13, 16, 14, 18]
}
df = pd.DataFrame(data)

print("\nIterating through groups:")
for name, group in df.groupby('Category'):
    print(f"\nGroup Name: {name}")
    print(group)

Further Topics:

get_group() to select a single group
size() to get the number of items in each group
ngroup() to get the numerical indicator of the group the row belongs to.
Advanced custom functions with apply()

The groupby() method is indispensable for summarizing data, performing group-specific calculations, and cleaning data based on group properties. Mastering it is key to advanced data analysis with Pandas.