Pandas: Group By Operations
The groupby() method is one of the most powerful and frequently used functions in Pandas. It allows you to split your data into groups based on some criteria, apply a function to each group independently, and then combine the results into a single DataFrame. This "split-apply-combine" strategy is central to many data analysis tasks.
The Split-Apply-Combine Strategy
- Split: Data is divided into groups based on one or more keys (column values).
- Apply: A function (e.g., aggregation, transformation, filtration) is applied to each group independently.
- Combine: The results of the applied functions are combined back into a single data structure.
Let's create a sample DataFrame to illustrate groupby() operations:
import pandas as pd
import numpy as np
data = {
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'Value1': [10, 12, 15, 11, 13, 16, 14, 18],
'Value2': [100, 110, 105, 120, 115, 125, 130, 135],
'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'SF', 'NY', 'LA']
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
1. Grouping and Aggregation
Aggregation involves computing a summary statistic (like sum, mean, count, min, max, std) for each group.
import pandas as pd
import numpy as np
data = {
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'Value1': [10, 12, 15, 11, 13, 16, 14, 18],
'Value2': [100, 110, 105, 120, 115, 125, 130, 135],
'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'SF', 'NY', 'LA']
}
df = pd.DataFrame(data)
# Group by 'Category' and calculate the mean of numerical columns
print("\nMean of numerical columns grouped by 'Category':\n", df.groupby('Category').mean())
# Group by 'Category' and calculate the sum of 'Value1'
print("\nSum of 'Value1' grouped by 'Category':\n", df.groupby('Category')['Value1'].sum())
# Group by multiple columns ('Category' and 'City')
print("\nMean grouped by 'Category' and 'City':\n", df.groupby(['Category', 'City']).mean())
# Apply multiple aggregation functions to a single column
print("\nAggregating 'Value1' with multiple functions:\n",
df.groupby('Category')['Value1'].agg(['sum', 'mean', 'max', 'min']))
# Apply different aggregation functions to different columns
print("\nAggregating different columns with different functions:\n",
df.groupby('Category').agg({'Value1': 'sum', 'Value2': 'mean'}))
# Custom aggregation function using apply or lambda
print("\nCustom aggregation (range) on 'Value1' grouped by 'Category':\n",
df.groupby('Category')['Value1'].apply(lambda x: x.max() - x.min()))
2. Transformation
Transformation returns an object that is of the same size as the group. It performs a group-specific computation and broadcasts the results.
import pandas as pd
import numpy as np
data = {
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'Value': [10, 12, 15, 11, 13, 16, 14, 18]
}
df_transform = pd.DataFrame(data)
print("Original DataFrame for transformation:\n", df_transform)
# Subtract the mean of each group from its values
df_transform['Value_demeaned'] = df_transform.groupby('Category')['Value'].transform(lambda x: x - x.mean())
print("\nDataFrame with 'Value_demeaned' (group-wise mean subtracted):\n", df_transform)
# Fill missing values with the group-wise mean
df_missing = pd.DataFrame({
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'Value': [10, 12, np.nan, 11, 13, 16, np.nan, 18]
})
df_missing['Value_filled'] = df_missing.groupby('Category')['Value'].transform(lambda x: x.fillna(x.mean()))
print("\nDataFrame with missing 'Value' filled by group mean:\n", df_missing)
3. Filtration
Filtration discards entire groups based on a group-wise condition.
import pandas as pd
import numpy as np
data = {
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'Value': [10, 12, 15, 11, 13, 16, 14, 18]
}
df_filter = pd.DataFrame(data)
print("Original DataFrame for filtration:\n", df_filter)
# Filter out groups where the mean of 'Value' is less than 13
df_filtered = df_filter.groupby('Category').filter(lambda x: x['Value'].mean() >= 13)
print("\nDataFrame after filtering groups with mean Value < 13:\n", df_filtered)
# Filter out groups that have less than 3 entries
df_less_entries = df_filter.groupby('Category').filter(lambda x: len(x) >= 3)
print("\nDataFrame after filtering groups with less than 3 entries:\n", df_less_entries)
4. Iterating Through Groups
You can iterate through the groups, which can be useful for more complex, custom operations.
import pandas as pd
import numpy as np
data = {
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'Value': [10, 12, 15, 11, 13, 16, 14, 18]
}
df = pd.DataFrame(data)
print("\nIterating through groups:")
for name, group in df.groupby('Category'):
print(f"\nGroup Name: {name}")
print(group)
Further Topics:
get_group()to select a single groupsize()to get the number of items in each groupngroup()to get the numerical indicator of the group the row belongs to.- Advanced custom functions with
apply()
The groupby() method is indispensable for summarizing data, performing group-specific calculations, and cleaning data based on group properties. Mastering it is key to advanced data analysis with Pandas.