Pandas: Interview Questions
This document compiles a range of common interview questions related to Pandas, covering fundamental concepts to more advanced topics. These questions are designed to test a candidate's understanding of Pandas' architecture, common operations, and practical application in data analysis.
Foundational Concepts
-
What are the two main data structures in Pandas? Briefly describe each.
- Answer:
Series: A one-dimensional labeled array capable of holding any data type. It's like a single column in a spreadsheet or a Python list with an index.DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It's like a spreadsheet or a SQL table. It's the most commonly used Pandas object.
- Answer:
-
How do you create a Pandas DataFrame from a Python dictionary? Provide an example.
- Answer: You can create a DataFrame from a dictionary where keys become column names and values are lists (or Series) representing the column data.
python import pandas as pd data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]} df = pd.DataFrame(data) print(df)
- Answer: You can create a DataFrame from a dictionary where keys become column names and values are lists (or Series) representing the column data.
-
Explain the difference between
.loc[]and.iloc[]for DataFrame indexing.- Answer:
.loc[](Label-based indexing): Used for selection by label (index names and column names). When slicing, both the start and end labels are included..iloc[](Integer-location based indexing): Used for selection by position (0-based integer position). When slicing, the start position is included, but the end position is excluded (like standard Python slicing).
- Answer:
-
How do you handle missing values (NaN) in a Pandas DataFrame? Name a few methods.
- Answer: Pandas provides several methods to handle missing values:
- Identification:
df.isnull()ordf.isna()to get a boolean DataFrame,df.isnull().sum()to count NaNs per column. - Dropping:
df.dropna()to remove rows/columns with NaNs.how='any'(default) orhow='all', andaxis=0(rows) oraxis=1(columns). - Filling:
df.fillna(value)to replace NaNs with a specific value, column mean/median, or using forward fill (method='ffill') or backward fill (method='bfill'). - Interpolation:
df.interpolate()to estimate missing values based on surrounding data points.
- Identification:
- Answer: Pandas provides several methods to handle missing values:
-
What is the purpose of
groupby()in Pandas? Describe the "split-apply-combine" strategy.- Answer:
groupby()is used to group data based on one or more keys (column values) to perform aggregate analysis. The "split-apply-combine" strategy involves:- Split: Dividing data into groups based on some criteria.
- Apply: Applying a function (e.g., aggregation, transformation, filtration) to each group independently.
- Combine: Combining the results of the applied functions back into a single data structure.
- Answer:
Intermediate Concepts
-
How do you merge two DataFrames in Pandas? What are the different types of merges?
- Answer: You use
pd.merge()to combine DataFrames based on common values in one or more columns (keys), similar to SQL JOINs. - Types of merges (
howparameter):inner(default): Keeps only rows with keys present in both DataFrames (intersection).left: Keeps all rows from the left DataFrame, and matching rows from the right.NaNfor non-matching.right: Keeps all rows from the right DataFrame, and matching rows from the left.NaNfor non-matching.outer: Keeps all rows from both DataFrames, fillingNaNfor non-matching keys (union).
- Answer: You use
-
Explain the difference between
pd.merge()and.join()methods.- Answer:
pd.merge(): More general, combines DataFrames based on specified common columns (keys) or indices. It's more flexible for arbitrary column-based joins..join(): A convenient method primarily used for combining DataFrames based on their indices, or when one DataFrame's index matches a column in another (by specifying theonparameter). It's a shorthand for specificmergeoperations.
- Answer:
-
How can you apply a function to a Series or DataFrame in Pandas? Differentiate between
apply(),map(), andapplymap()- Answer:
apply(): Can be used on Series (element-wise or scalar) or DataFrame (column-wise or row-wise, depending onaxis). It's flexible but can be slower than vectorized operations.map(): Only for Series. It maps values from one Series to another based on a dictionary, Series, or a function. Best for element-wise transformation where the transformation logic is simple.applymap(): Only for DataFrames. It applies a function element-wise to every single element in the DataFrame.
- Answer:
-
You have a column of strings that need cleaning (e.g., removing whitespace, changing case). How would you do this efficiently in Pandas?
- Answer: Pandas Series objects with string data have a special
.straccessor that exposes a variety of vectorized string methods.python df['column_name'] = df['column_name'].str.strip().str.lower().str.replace('old', 'new')This allows for efficient string operations without explicit loops.
- Answer: Pandas Series objects with string data have a special
-
How would you convert a column to a datetime object, handling potential errors?
- Answer: Use
pd.to_datetime(). Theerrorsparameter is crucial:errors='raise'(default): Raises an error on invalid parsing.errors='coerce': Invalid parsing will be set asNaT(Not a Time).errors='ignore': Invalid parsing will return the input.python df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')
- Answer: Use
Advanced Concepts
-
Describe a scenario where you would use
pd.pivot_table(). How does it differ fromdf.pivot()?- Answer:
pd.pivot_table()is used to create a "pivot" table similar to those found in spreadsheet software. It aggregates data based on one or moreindexandcolumnsand applies anaggfunc(e.g., 'mean', 'sum'). It's used for summarizing and reshaping data, especially for multivariate analysis.- Difference from
df.pivot():df.pivot()is a simpler function used for "pivoting" without aggregation. It requires the combination ofindex,columns, andvaluesto result in unique values. If there are duplicates,pivot()will fail, whereaspivot_table()can handle them by applying an aggregation function.
- Difference from
- Answer:
-
How do you perform feature engineering in Pandas, specifically creating new columns based on existing ones?
- Answer: Feature engineering in Pandas often involves creating new columns through arithmetic operations, conditional logic, or applying functions to existing columns.
python df['new_col'] = df['col1'] + df['col2'] # Arithmetic df['is_high_value'] = np.where(df['sales'] > 1000, True, False) # Conditional df['text_length'] = df['text_col'].str.len() # Applying string methods df['category_ratio'] = df.groupby('Category')['Value'].transform(lambda x: x / x.sum()) # Group-wise transformation
- Answer: Feature engineering in Pandas often involves creating new columns through arithmetic operations, conditional logic, or applying functions to existing columns.
-
What is a MultiIndex in Pandas, and when would you use it?
- Answer: A MultiIndex (or hierarchical index) allows you to have multiple levels of indexing on either the rows or columns of a DataFrame. It enables you to work with higher-dimensional data in a two-dimensional structure. You would use it when dealing with complex datasets that have natural hierarchical relationships, such as time series data from multiple sensors at multiple locations, or experimental data with multiple conditions and replicates.
-
Discuss performance considerations when working with large DataFrames in Pandas. What techniques can improve performance?
- Answer: Pandas can be memory-intensive and slow with very large datasets. Techniques for performance improvement include:
- Vectorization: Prefer vectorized Pandas/NumPy operations over explicit Python loops (
.apply(),forloops). - Choosing appropriate dtypes: Use
categorydtype for categorical data,int8/int16for smaller integers,float32instead offloat64where precision isn't critical.pd.to_numeric(downcast='integer'). - Chunking data: For extremely large files, read them in chunks using
pd.read_csv(..., chunksize=...). - Parquet/Feather: Use more efficient binary storage formats like Parquet or Feather instead of CSV for I/O.
NumbaorCython: For custom, performance-critical operations that cannot be vectorized, consider writing them inNumbaorCython.Dask: For datasets that genuinely don't fit in memory, considerDask DataFrames, which provide a Pandas-like API for out-of-core computation.
- Vectorization: Prefer vectorized Pandas/NumPy operations over explicit Python loops (
- Answer: Pandas can be memory-intensive and slow with very large datasets. Techniques for performance improvement include:
-
How would you perform a time series resampling operation in Pandas? Give an example.
-
Answer: Time series resampling involves changing the frequency of time series data (e.g., from daily to weekly, or hourly to minutes). The
resample()method is used for this, often followed by an aggregation function. ```python import pandas as pd # Create a sample time series DataFrame dates = pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03']) df_ts = pd.DataFrame({'value': [10, 12, 15, 11, 13, 16]}, index=dates)Resample to daily frequency, taking the mean
daily_mean = df_ts.resample('D').mean() print("\nDaily mean:\n", daily_mean)
Resample to weekly frequency, taking the sum
weekly_sum = df_ts.resample('W').sum() print("\nWeekly sum:\n", weekly_sum) ```
-
Scenario-Based Questions
-
You have a dataset of customer transactions. Each row is a transaction, and it includes
customer_id,product_id,amount, andtimestamp. How would you find the total purchase amount for each customer?- Answer: Use
groupby()andsum():python total_purchases = df.groupby('customer_id')['amount'].sum()
- Answer: Use
-
Given two DataFrames, one with
employee_idandemployee_name, and another withemployee_idanddepartment_id, how would you combine them to get a list of employee names with their department IDs?- Answer: Use
pd.merge()with an inner join onemployee_id:python merged_df = pd.merge(df_employees, df_departments, on='employee_id', how='inner')
- Answer: Use
-
You have a DataFrame with a
text_columnthat might contain URLs. How would you extract all URLs into a new column?- Answer: Use the
.str.extract()method with a regular expression:python df['urls'] = df['text_column'].str.extract(r'(https?://\S+)') # Or if multiple URLs are possible per entry, use .str.findall() and then explode if needed # df['urls'] = df['text_column'].str.findall(r'(https?://\S+)')
- Answer: Use the
-
How would you identify and remove rows from a DataFrame where a specific column (
'price') has outlier values (e.g., more than 3 standard deviations from the mean)?- Answer: Calculate the mean and standard deviation for the
'price'column, then use boolean indexing to filter out rows outside the desired range.python mean_price = df['price'].mean() std_price = df['price'].std() df_filtered = df[(df['price'] > mean_price - 3 * std_price) & (df['price'] < mean_price + 3 * std_price)]
- Answer: Calculate the mean and standard deviation for the
-
You have sensor data with readings every minute, but you need to analyze it on an hourly basis, specifically the maximum reading for each hour. How would you do this?
- Answer: Assuming the DataFrame has a datetime index, use
resample():python hourly_max = df.resample('H')['reading'].max()
- Answer: Assuming the DataFrame has a datetime index, use