Pandas: Merging, Joining, and Concatenating DataFrames

In data analysis, it's common to work with data that is spread across multiple DataFrame objects. Pandas provides several flexible operations to combine these DataFrames based on various criteria, including merging, joining, and concatenating. Understanding these operations is crucial for assembling comprehensive datasets.

1. Concatenating DataFrames (`pd.concat()`)

pd.concat() is used to stack DataFrames either vertically (along rows) or horizontally (along columns). It's generally used when DataFrames have the same columns (for vertical concatenation) or the same index (for horizontal concatenation), or when you want to combine data without a specific key lookup.

Parameters:

objs: A sequence or mapping of Series or DataFrame objects.
axis: The axis to concatenate along (0 for rows, 1 for columns). Default is 0.
join: How to handle indexes on other axis ('outer' for union, 'inner' for intersection). Default is 'outer'.
ignore_index: If True, do not use the index values along the concatenation axis. Useful when you are just stacking data and don't care about the original indices.

import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                   index=[0, 1])

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                   index=[2, 3])

df3 = pd.DataFrame({'C': ['C0', 'C1'],
                    'D': ['D0', 'D1']},
                   index=[0, 1])

print("df1:\n", df1)
print("\ndf2:\n", df2)
print("\ndf3:\n", df3)

# Concatenate along rows (axis=0, default)
result_rows = pd.concat([df1, df2])
print("\nConcatenated along rows (default outer join):\n", result_rows)

# Concatenate along rows, ignoring original index
result_rows_ignore_index = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated along rows (ignore index):\n", result_rows_ignore_index)

# Concatenate along columns (axis=1, default outer join)
result_cols_outer = pd.concat([df1, df3], axis=1)
print("\nConcatenated along columns (outer join, default):\n", result_cols_outer)

# Concatenate along columns (inner join)
result_cols_inner = pd.concat([df1, df3], axis=1, join='inner')
print("\nConcatenated along columns (inner join):\n", result_cols_inner)

# Concatenate with differing columns (outer join by default)
df4 = pd.DataFrame({'A': ['A4', 'A5'], 'C': ['C4', 'C5']}, index=[4, 5])
result_diff_cols = pd.concat([df1, df4], axis=0)
print("\nConcatenated with differing columns (outer join by default):\n", result_diff_cols)

2. Merging DataFrames (`pd.merge()`)

pd.merge() combines DataFrames based on common values in one or more columns (keys). This is similar to SQL JOIN operations.

Parameters:

left, right: The DataFrames to merge.
how: Type of merge to be performed.
- 'inner' (default): Use intersection of keys from both frames.
- 'left': Use keys from left frame only.
- 'right': Use keys from right frame only.
- 'outer': Use union of keys from both frames.
on: Column or list of column names to join on. Must be present in both DataFrames.
left_on, right_on: Column or list of column names in the left/right DataFrame respectively to join on. Use if join keys have different names.
left_index, right_index: Use the index from the left/right DataFrame as the join key(s).

import pandas as pd

# Create sample DataFrames
df_customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

df_orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [2, 1, 3, 2, 5], # Customer 5 does not exist in df_customers
    'amount': [150, 200, 50, 300, 100]
})

print("df_customers:\n", df_customers)
print("\ndf_orders:\n", df_orders)

# Inner Merge (default): Only rows where customer_id exists in both DataFrames
inner_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
print("\nInner Merge (customers with orders):\n", inner_merge)

# Left Merge: Keep all rows from left DataFrame, add matching rows from right
# If no match in right, fill with NaN
left_merge = pd.merge(df_customers, df_orders, on='customer_id', how='left')
print("\nLeft Merge (all customers, with their orders):\n", left_merge)

# Right Merge: Keep all rows from right DataFrame, add matching rows from left
# If no match in left, fill with NaN
right_merge = pd.merge(df_customers, df_orders, on='customer_id', how='right')
print("\nRight Merge (all orders, with customer info):\n", right_merge)

# Outer Merge: Keep all rows from both DataFrames, fill with NaN where no match
outer_merge = pd.merge(df_customers, df_orders, on='customer_id', how='outer')
print("\nOuter Merge (union of customers and orders):\n", outer_merge)

# Merging on different key names (left_on, right_on)
df_products = pd.DataFrame({
    'prod_id': [1, 2, 3],
    'product_name': ['Laptop', 'Mouse', 'Keyboard']
})
df_order_items = pd.DataFrame({
    'order_id': [101, 101, 102],
    'product_code': [1, 3, 2],
    'quantity': [1, 2, 1]
})
merge_diff_keys = pd.merge(df_order_items, df_products, left_on='product_code', right_on='prod_id')
print("\nMerge on different key names:\n", merge_diff_keys)

3. Joining DataFrames (`.join()`)

The .join() method is a convenient way to combine the columns of two potentially differently-indexed DataFrames into a single result DataFrame. It's effectively a pd.merge() with left_index=True and/or right_index=True.

Parameters:

other: DataFrame or Series to join with.
on: Column or index level names to join on. Defaults to joining on indexes.
how: Type of join ('left', 'right', 'outer', 'inner'). Defaults to 'left'.
lsuffix, rsuffix: Suffixes to use if there are overlapping column names.

import pandas as pd

# Create sample DataFrames
df_employee = pd.DataFrame({
    'name': ['John', 'Jane', 'Mike'],
    'department': ['HR', 'IT', 'Sales']
}, index=[101, 102, 103]) # Employee ID as index

df_salary = pd.DataFrame({
    'salary': [70000, 80000, 60000],
    'bonus': [5000, 8000, 3000]
}, index=[102, 101, 104]) # Employee ID as index, with an extra ID

print("df_employee:\n", df_employee)
print("\ndf_salary:\n", df_salary)

# Left Join (default): Join df_salary to df_employee on their indices
left_join = df_employee.join(df_salary, how='left')
print("\nLeft Join (employee info + matching salary):\n", left_join)

# Outer Join: Union of indices
outer_join = df_employee.join(df_salary, how='outer')
print("\nOuter Join (union of employee and salary indices):\n", outer_join)

# Inner Join: Intersection of indices
inner_join = df_employee.join(df_salary, how='inner')
print("\nInner Join (intersection of employee and salary indices):\n", inner_join)

# Join on a key column instead of index (use 'on' parameter in .join)
df_dept_info = pd.DataFrame({
    'department': ['IT', 'HR', 'Marketing'],
    'head': ['Sarah', 'Tom', 'Laura']
})

# To join df_employee (which has 'department' as a column) with df_dept_info
# df_employee.set_index('department').join(df_dept_info.set_index('department'))
# Alternatively, use merge if joining on columns is more natural
merge_on_column = pd.merge(df_employee, df_dept_info, on='department', how='left')
print("\nMerge on 'department' column (similar to join on column):\n", merge_on_column)

When to Use Which?

pd.concat(): Use when you want to stack DataFrames (add rows or columns) without specific key matching. Good for combining datasets that represent different time periods or different sensor readings, etc., where schema is similar.
pd.merge(): Use when you need to combine DataFrames based on common values in one or more columns (keys), similar to relational database joins. This is the most flexible and widely used for combining heterogeneous data sources.
.join(): Use when you want to combine DataFrames based on their indices, or when one DataFrame's index matches a column in another. It's a convenient shorthand for certain types of merges.

Mastering these DataFrame combination techniques is essential for preparing complex datasets for analysis and modeling.

Pandas: Merging, Joining, and Concatenating DataFrames

1. Concatenating DataFrames (pd.concat())

Parameters:

2. Merging DataFrames (pd.merge())

Parameters:

3. Joining DataFrames (.join())

Parameters:

When to Use Which?

1. Concatenating DataFrames (`pd.concat()`)

2. Merging DataFrames (`pd.merge()`)

3. Joining DataFrames (`.join()`)