⬡ Hub
Skip to content

Pandas: Interview Questions

This document compiles a range of common interview questions related to Pandas, covering fundamental concepts to more advanced topics. These questions are designed to test a candidate's understanding of Pandas' architecture, common operations, and practical application in data analysis.

Foundational Concepts

  1. What are the two main data structures in Pandas? Briefly describe each.

    • Answer:
      • Series: A one-dimensional labeled array capable of holding any data type. It's like a single column in a spreadsheet or a Python list with an index.
      • DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It's like a spreadsheet or a SQL table. It's the most commonly used Pandas object.
  2. How do you create a Pandas DataFrame from a Python dictionary? Provide an example.

    • Answer: You can create a DataFrame from a dictionary where keys become column names and values are lists (or Series) representing the column data. python import pandas as pd data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]} df = pd.DataFrame(data) print(df)
  3. Explain the difference between .loc[] and .iloc[] for DataFrame indexing.

    • Answer:
      • .loc[] (Label-based indexing): Used for selection by label (index names and column names). When slicing, both the start and end labels are included.
      • .iloc[] (Integer-location based indexing): Used for selection by position (0-based integer position). When slicing, the start position is included, but the end position is excluded (like standard Python slicing).
  4. How do you handle missing values (NaN) in a Pandas DataFrame? Name a few methods.

    • Answer: Pandas provides several methods to handle missing values:
      • Identification: df.isnull() or df.isna() to get a boolean DataFrame, df.isnull().sum() to count NaNs per column.
      • Dropping: df.dropna() to remove rows/columns with NaNs. how='any' (default) or how='all', and axis=0 (rows) or axis=1 (columns).
      • Filling: df.fillna(value) to replace NaNs with a specific value, column mean/median, or using forward fill (method='ffill') or backward fill (method='bfill').
      • Interpolation: df.interpolate() to estimate missing values based on surrounding data points.
  5. What is the purpose of groupby() in Pandas? Describe the "split-apply-combine" strategy.

    • Answer: groupby() is used to group data based on one or more keys (column values) to perform aggregate analysis. The "split-apply-combine" strategy involves:
      • Split: Dividing data into groups based on some criteria.
      • Apply: Applying a function (e.g., aggregation, transformation, filtration) to each group independently.
      • Combine: Combining the results of the applied functions back into a single data structure.

Intermediate Concepts

  1. How do you merge two DataFrames in Pandas? What are the different types of merges?

    • Answer: You use pd.merge() to combine DataFrames based on common values in one or more columns (keys), similar to SQL JOINs.
    • Types of merges (how parameter):
      • inner (default): Keeps only rows with keys present in both DataFrames (intersection).
      • left: Keeps all rows from the left DataFrame, and matching rows from the right. NaN for non-matching.
      • right: Keeps all rows from the right DataFrame, and matching rows from the left. NaN for non-matching.
      • outer: Keeps all rows from both DataFrames, filling NaN for non-matching keys (union).
  2. Explain the difference between pd.merge() and .join() methods.

    • Answer:
      • pd.merge(): More general, combines DataFrames based on specified common columns (keys) or indices. It's more flexible for arbitrary column-based joins.
      • .join(): A convenient method primarily used for combining DataFrames based on their indices, or when one DataFrame's index matches a column in another (by specifying the on parameter). It's a shorthand for specific merge operations.
  3. How can you apply a function to a Series or DataFrame in Pandas? Differentiate between apply(), map(), and applymap()

    • Answer:
      • apply(): Can be used on Series (element-wise or scalar) or DataFrame (column-wise or row-wise, depending on axis). It's flexible but can be slower than vectorized operations.
      • map(): Only for Series. It maps values from one Series to another based on a dictionary, Series, or a function. Best for element-wise transformation where the transformation logic is simple.
      • applymap(): Only for DataFrames. It applies a function element-wise to every single element in the DataFrame.
  4. You have a column of strings that need cleaning (e.g., removing whitespace, changing case). How would you do this efficiently in Pandas?

    • Answer: Pandas Series objects with string data have a special .str accessor that exposes a variety of vectorized string methods. python df['column_name'] = df['column_name'].str.strip().str.lower().str.replace('old', 'new') This allows for efficient string operations without explicit loops.
  5. How would you convert a column to a datetime object, handling potential errors?

    • Answer: Use pd.to_datetime(). The errors parameter is crucial:
      • errors='raise' (default): Raises an error on invalid parsing.
      • errors='coerce': Invalid parsing will be set as NaT (Not a Time).
      • errors='ignore': Invalid parsing will return the input. python df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')

Advanced Concepts

  1. Describe a scenario where you would use pd.pivot_table(). How does it differ from df.pivot()?

    • Answer: pd.pivot_table() is used to create a "pivot" table similar to those found in spreadsheet software. It aggregates data based on one or more index and columns and applies an aggfunc (e.g., 'mean', 'sum'). It's used for summarizing and reshaping data, especially for multivariate analysis.
      • Difference from df.pivot(): df.pivot() is a simpler function used for "pivoting" without aggregation. It requires the combination of index, columns, and values to result in unique values. If there are duplicates, pivot() will fail, whereas pivot_table() can handle them by applying an aggregation function.
  2. How do you perform feature engineering in Pandas, specifically creating new columns based on existing ones?

    • Answer: Feature engineering in Pandas often involves creating new columns through arithmetic operations, conditional logic, or applying functions to existing columns. python df['new_col'] = df['col1'] + df['col2'] # Arithmetic df['is_high_value'] = np.where(df['sales'] > 1000, True, False) # Conditional df['text_length'] = df['text_col'].str.len() # Applying string methods df['category_ratio'] = df.groupby('Category')['Value'].transform(lambda x: x / x.sum()) # Group-wise transformation
  3. What is a MultiIndex in Pandas, and when would you use it?

    • Answer: A MultiIndex (or hierarchical index) allows you to have multiple levels of indexing on either the rows or columns of a DataFrame. It enables you to work with higher-dimensional data in a two-dimensional structure. You would use it when dealing with complex datasets that have natural hierarchical relationships, such as time series data from multiple sensors at multiple locations, or experimental data with multiple conditions and replicates.
  4. Discuss performance considerations when working with large DataFrames in Pandas. What techniques can improve performance?

    • Answer: Pandas can be memory-intensive and slow with very large datasets. Techniques for performance improvement include:
      • Vectorization: Prefer vectorized Pandas/NumPy operations over explicit Python loops (.apply(), for loops).
      • Choosing appropriate dtypes: Use category dtype for categorical data, int8/int16 for smaller integers, float32 instead of float64 where precision isn't critical. pd.to_numeric(downcast='integer').
      • Chunking data: For extremely large files, read them in chunks using pd.read_csv(..., chunksize=...).
      • Parquet/Feather: Use more efficient binary storage formats like Parquet or Feather instead of CSV for I/O.
      • Numba or Cython: For custom, performance-critical operations that cannot be vectorized, consider writing them in Numba or Cython.
      • Dask: For datasets that genuinely don't fit in memory, consider Dask DataFrames, which provide a Pandas-like API for out-of-core computation.
  5. How would you perform a time series resampling operation in Pandas? Give an example.

    • Answer: Time series resampling involves changing the frequency of time series data (e.g., from daily to weekly, or hourly to minutes). The resample() method is used for this, often followed by an aggregation function. ```python import pandas as pd # Create a sample time series DataFrame dates = pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03']) df_ts = pd.DataFrame({'value': [10, 12, 15, 11, 13, 16]}, index=dates)

      Resample to daily frequency, taking the mean

      daily_mean = df_ts.resample('D').mean() print("\nDaily mean:\n", daily_mean)

      Resample to weekly frequency, taking the sum

      weekly_sum = df_ts.resample('W').sum() print("\nWeekly sum:\n", weekly_sum) ```

Scenario-Based Questions

  1. You have a dataset of customer transactions. Each row is a transaction, and it includes customer_id, product_id, amount, and timestamp. How would you find the total purchase amount for each customer?

    • Answer: Use groupby() and sum(): python total_purchases = df.groupby('customer_id')['amount'].sum()
  2. Given two DataFrames, one with employee_id and employee_name, and another with employee_id and department_id, how would you combine them to get a list of employee names with their department IDs?

    • Answer: Use pd.merge() with an inner join on employee_id: python merged_df = pd.merge(df_employees, df_departments, on='employee_id', how='inner')
  3. You have a DataFrame with a text_column that might contain URLs. How would you extract all URLs into a new column?

    • Answer: Use the .str.extract() method with a regular expression: python df['urls'] = df['text_column'].str.extract(r'(https?://\S+)') # Or if multiple URLs are possible per entry, use .str.findall() and then explode if needed # df['urls'] = df['text_column'].str.findall(r'(https?://\S+)')
  4. How would you identify and remove rows from a DataFrame where a specific column ('price') has outlier values (e.g., more than 3 standard deviations from the mean)?

    • Answer: Calculate the mean and standard deviation for the 'price' column, then use boolean indexing to filter out rows outside the desired range. python mean_price = df['price'].mean() std_price = df['price'].std() df_filtered = df[(df['price'] > mean_price - 3 * std_price) & (df['price'] < mean_price + 3 * std_price)]
  5. You have sensor data with readings every minute, but you need to analyze it on an hourly basis, specifically the maximum reading for each hour. How would you do this?

    • Answer: Assuming the DataFrame has a datetime index, use resample(): python hourly_max = df.resample('H')['reading'].max()