Scikit-learn: Interview Questions

This document compiles a range of common interview questions related to Scikit-learn, covering fundamental concepts to more advanced topics. These questions are designed to test a candidate's understanding of Scikit-learn's architecture, common functionalities, and practical application in machine learning workflows.

Foundational Concepts

What is Scikit-learn, and what kind of tasks does it address?
- Answer: Scikit-learn is a free software machine learning library for the Python programming language. It provides simple and efficient tools for predictive data analysis. It addresses tasks such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
What is the general workflow for using a Scikit-learn estimator (e.g., a classifier or regressor)?
- Answer: The typical workflow involves:
  1. Import: Import the desired estimator from sklearn.
  2. Instantiate: Create an instance of the estimator, optionally setting hyperparameters (e.g., model = LogisticRegression()).
  3. Train: Fit the model to the training data using the fit(X_train, y_train) method.
  4. Predict: Make predictions on new data using the predict(X_test) method.
  5. Evaluate: Assess the model's performance using appropriate metrics (accuracy_score, mean_squared_error, etc.).
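- Example: The five steps above can be sketched as follows. This is a minimal illustration, assuming the built-in iris dataset as a stand-in for real data; the max_iter value and the split settings are illustrative choices, not recommendations.

```python
# 1. Import
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data (assumption: iris stands in for your own X and y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Instantiate, optionally setting hyperparameters
model = LogisticRegression(max_iter=1000)

# 3. Train
model.fit(X_train, y_train)

# 4. Predict
y_pred = model.predict(X_test)

# 5. Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```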
Explain the difference between supervised and unsupervised learning algorithms in Scikit-learn.
- Answer:
  - Supervised Learning: Algorithms learn from labeled data (input features X and corresponding output targets y). The goal is to predict y given X. Examples: LinearRegression, SVC, RandomForestClassifier.
  - Unsupervised Learning: Algorithms learn from unlabeled data (only X). The goal is to find hidden patterns or structures in the data. Examples: KMeans (clustering), PCA (dimensionality reduction).
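- Example: In code, the distinction shows up in the fit call: supervised estimators take X and y, unsupervised ones take only X. A small sketch, assuming the iris dataset and three clusters purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are passed to fit
clf = SVC().fit(X, y)
predicted_labels = clf.predict(X[:5])

# Unsupervised: only X is passed; cluster ids are discovered, not given
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = sorted(set(km.labels_.tolist()))
print(predicted_labels, cluster_ids)
```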
Why is data preprocessing important before applying machine learning models? Name a few Scikit-learn tools for preprocessing.
- Answer: Data preprocessing is crucial because raw data is often noisy, inconsistent, and not in a format suitable for ML algorithms. It can significantly impact model performance and convergence.
- Scikit-learn tools:
  - StandardScaler, MinMaxScaler (for scaling numerical features)
  - OneHotEncoder, OrdinalEncoder (for encoding categorical features); LabelEncoder (for encoding the target y rather than input features)
  - SimpleImputer (for handling missing values)
  - PolynomialFeatures (for feature engineering)
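- Example: A short sketch of two of these tools used together, assuming a tiny hand-made array with one missing value (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Two numeric features on very different scales, one missing entry
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0]])

# Replace the missing value with the column mean
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Rescale each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_imputed)
print(X_imputed)
print(X_scaled)
```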
What is the purpose of train_test_split? Why should you use it?
- Answer: train_test_split is used to divide a dataset into two subsets: a training set and a testing (or validation) set.
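- Purpose: To evaluate the generalization capability of a machine learning model. A model should be trained on one part of the data and evaluated on unseen data to assess how well it performs on new, real-world examples. The split does not prevent overfitting by itself, but it exposes it: a large gap between training and test scores indicates the model has overfit.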
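- Example: A minimal sketch of a stratified split, assuming a synthetic 20-row dataset with two balanced classes. The test_size, random_state, and stratify values are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)       # 20 samples, 2 features
y = np.array([0] * 10 + [1] * 10)      # balanced binary target

# stratify=y keeps the class ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```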

Intermediate Concepts

Explain the concept of cross-validation. Why is it better than a single train-test split for model evaluation?
- Answer: Cross-validation is a technique for assessing how the results of a statistical analysis (e.g., a machine learning model) generalize to an independent data set. It repeatedly partitions the data into training and validation sets. K-Fold cross-validation, for example, divides data into K folds, training on K-1 folds and validating on the remaining fold, repeating K times.
- Advantage over single split: A single train-test split can be sensitive to which data points happen to land in the test set, so its performance estimate has high variance. Cross-validation gives a more robust estimate by averaging over several splits, so that every data point is used for both training and validation across iterations, at the cost of fitting the model K times.
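- Example: A minimal sketch of 5-fold cross-validation with cross_val_score, assuming the iris dataset and a LogisticRegression model as placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 trains on 4 folds and validates on the 5th, five times over
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Report the mean and the spread across folds, not a single number
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```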
What are hyperparameters, and how do you tune them in Scikit-learn?
- Answer: Hyperparameters are parameters whose values are set before the learning process begins (e.g., learning rate, number of trees in a Random Forest, C in SVM, max_depth in Decision Trees). They are not learned from the data.
- Tuning methods in Scikit-learn:
  - GridSearchCV: Exhaustively searches through a predefined set of hyperparameter combinations.
  - RandomizedSearchCV: Randomly samples a fixed number of hyperparameter combinations from specified distributions.
  - Both use cross-validation to evaluate each combination and return the best set of parameters.
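- Example: A sketch of an exhaustive search with GridSearchCV, assuming an SVC on the iris dataset. The grid values are arbitrary illustrations; RandomizedSearchCV takes distributions and an n_iter budget in place of the grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 combinations, each scored with 5-fold CV
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```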
Describe the role of Pipeline in Scikit-learn. What are its benefits?
- Answer: A Pipeline in Scikit-learn allows you to chain multiple data preprocessing steps (e.g., imputation, scaling, encoding) and a final estimator (model) into a single object.
- Benefits:
  - Convenience: Simplifies the workflow by encapsulating multiple steps.
  - Data Leakage Prevention: Ensures that transformations are fitted on the training data only and then applied consistently to new data, which prevents fitting transformers on the test set.
  - Reproducibility: Makes the entire workflow easily reproducible.
  - Cleaner Code: Reduces boilerplate code.
  - Hyperparameter Tuning: Allows hyperparameter tuning across the entire pipeline.
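- Example: A sketch of a two-step Pipeline tuned as a single object, assuming the built-in breast cancer dataset. The step names "scaler" and "clf" are arbitrary labels chosen here; parameters of a step are addressed as <step name>__<parameter>.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),                 # preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),   # final estimator
])

# "clf__C" targets the C parameter of the step named "clf"
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```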
When would you choose StandardScaler over MinMaxScaler (or vice-versa)?
- Answer:
  - StandardScaler (Standardization): Scales each feature to have a mean of 0 and standard deviation of 1. It's a good default when features are roughly Gaussian, or when the algorithm is sensitive to feature scale (e.g., regularized Linear or Logistic Regression, SVMs with RBF kernel, PCA). Outliers still shift the mean and standard deviation, but they do not compress the remaining values as severely as with MinMaxScaler; RobustScaler is the usual choice when outliers are a real concern.
  - MinMaxScaler (Normalization): Scales each feature to a fixed range (0 to 1 by default). It's preferred when a bounded range is required or the data is clearly non-Gaussian (e.g., pixel intensities, inputs to neural networks, distance-based methods such as KNN). It is sensitive to outliers, because a single extreme value defines the range and squeezes all other values together.
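- Example: A small sketch of how the two scalers react to a single outlier, assuming a hand-made one-column array (the values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four ordinary values and one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(X)  # mean 0, std 1
normalized = MinMaxScaler().fit_transform(X)      # range [0, 1]

# With MinMaxScaler the outlier maps to 1 and the other four values
# are squeezed into a narrow band near 0
print(standardized.ravel())
print(normalized.ravel())
```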
Explain how OneHotEncoder works and when you would use it.
- Answer: OneHotEncoder converts categorical features (strings or integers) into a one-hot (binary) representation. For each unique category in a feature, it creates a new binary column. If a sample belongs to that category, the new column will have a 1, otherwise 0.
- Use cases: When dealing with nominal (unordered) categorical features, to prevent models from assuming an ordinal relationship that doesn't exist (e.g., encoding 'red', 'green', 'blue' as 0, 1, 2 would imply an ordering). Most tree-based models can handle integer encoding directly, but linear models or SVMs often require one-hot encoding.
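- Example: A minimal sketch, assuming a single hypothetical color column. handle_unknown="ignore" is an illustrative choice that encodes unseen categories as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; densify for display
encoded = encoder.fit_transform(colors).toarray()

# Columns follow the sorted categories: blue, green, red
print(encoder.categories_[0])
print(encoded)

# A category not seen during fit becomes an all-zero row
unseen = encoder.transform([["purple"]]).toarray()
```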

Advanced Concepts

What is data leakage in the context of Scikit-learn, and how do pipelines help prevent it?
- Answer: Data leakage occurs when information from the test dataset "leaks" into the training process. This leads to an overly optimistic evaluation of the model's performance because the model implicitly learned from information it shouldn't have seen.
- How pipelines help: When fit is called, a Pipeline fits every preprocessing step (like a StandardScaler or SimpleImputer) on the training data only. The fitted transformers are then used to transform both the training data and, at prediction time, the test data. This ensures that information from the test set (e.g., its mean or variance) doesn't influence the preprocessing, preventing leakage. Inside cross-validation, the pipeline is refitted on the training portion of each fold, so the same guarantee holds per fold.
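- Example: A sketch that checks where the scaler's statistics come from, assuming the iris dataset and a KNeighborsClassifier as placeholders. make_pipeline names each step after its lowercased class name, hence "standardscaler".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)  # the scaler is fitted on X_train only

# The learned means match the training data, not the full dataset
scaler = pipe.named_steps["standardscaler"]
train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))

# At scoring time the test set is only transformed, never fitted on
test_score = pipe.score(X_test, y_test)
print(train_only, round(test_score, 3))
```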
When would you use ColumnTransformer in conjunction with Pipeline?
- Answer: ColumnTransformer is used when you need to apply different transformers to different columns or subsets of columns in your DataFrame. For example, you might want to apply StandardScaler to numerical columns and OneHotEncoder to categorical columns, and either pass the others through unchanged ('passthrough') or drop them ('drop').
- Usage: It's often used as the first step in a Pipeline to handle heterogeneous data types and apply appropriate transformations to each.
Discuss the various metrics for evaluating classification models in Scikit-learn. When would you prefer precision, recall, or F1-score over accuracy?
- Answer:
  - Accuracy: Overall correct predictions. Good for balanced datasets.
  - Precision: (True Positives) / (True Positives + False Positives). Answers: "Of all predicted positives, how many were actually positive?" Important when false positives are costly.
  - Recall: (True Positives) / (True Positives + False Negatives). Answers: "Of all actual positives, how many did we correctly identify?" Important when false negatives are costly.
  - F1-score: Harmonic mean of precision and recall. A good general measure when there's an uneven class distribution and you need to balance precision and recall.
- Preference: When classes are imbalanced, or when the costs of False Positives and False Negatives are different, accuracy can be misleading. For example, in fraud detection, high recall (catching most fraud) is often more important than high precision (some legitimate transactions might be flagged). In spam detection, high precision (few legitimate emails marked as spam) is crucial.
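- Example: A worked sketch with ten hand-made labels (invented for illustration), giving 2 true positives, 2 false negatives, 1 false positive, and 5 true negatives:

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (2 + 5) / 10 = 0.7
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1) = 0.667
recall = recall_score(y_true, y_pred)        # 2 / (2 + 2) = 0.5
f1 = f1_score(y_true, y_pred)                # harmonic mean = 4/7 = 0.571

print(accuracy, precision, recall, f1)
```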
How do you save and load a trained Scikit-learn model?
- Answer: Use joblib or pickle. joblib is generally preferred for large NumPy arrays within Scikit-learn models as it's more efficient.
```python
import joblib
from sklearn.linear_model import LogisticRegression

# Save model
model = LogisticRegression()  # Assume trained model
joblib.dump(model, 'my_model.joblib')

# Load model
loaded_model = joblib.load('my_model.joblib')
```
What is dimensionality reduction? Name a common Scikit-learn technique for it and explain its purpose.
- Answer: Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It helps to simplify models, reduce noise, prevent overfitting (especially with many features and few samples), and speed up training.
- Technique: PCA (Principal Component Analysis). It transforms high-dimensional data into a lower-dimensional subspace while retaining most of the variance. It finds orthogonal principal components that capture the maximum variance in the data.
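- Example: A minimal sketch of PCA reducing four features to two, assuming the iris dataset. Standardizing first is a common choice made here because PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)    # 150 samples, 2 components

# Fraction of the total variance captured by each component
print(X_reduced.shape, pca.explained_variance_ratio_)
```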

Scenario-Based Questions

You are building a spam email classifier. After training, you get 98% accuracy. Is this a good result? What other metrics would you look at?
- Answer: 98% accuracy sounds good, but for a spam classifier, the dataset is likely highly imbalanced (far fewer spam emails than legitimate ones). Accuracy alone can be misleading.
- Other metrics to look at:
  - Precision: How many of the emails flagged as spam were actually spam? (Crucial to avoid marking legitimate emails as spam).
  - Recall: How many of the actual spam emails did the classifier catch?
  - F1-score: A balance of precision and recall.
  - Confusion Matrix: To see the counts of True Positives, True Negatives, False Positives, and False Negatives.
  - ROC AUC Score: To evaluate the model's ability to distinguish between classes across various threshold settings.
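- Example: A sketch of why 98% accuracy can be meaningless, assuming a hypothetical test set of 100 emails with only 2 spam messages and a degenerate classifier that never flags anything:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array([0] * 98 + [1] * 2)   # 1 = spam, 0 = legitimate
y_pred = np.zeros(100, dtype=int)       # always predicts "legitimate"

accuracy = accuracy_score(y_true, y_pred)   # 98 of 100 correct
spam_recall = recall_score(y_true, y_pred)  # 0 of 2 spam emails caught

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(accuracy, spam_recall)
print(cm)
```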
You have a dataset with some categorical features (e.g., 'Gender': 'Male', 'Female'; 'City': 'NY', 'LA', 'SF'). How would you prepare these for a LogisticRegression model?
- Answer: LogisticRegression expects numerical input, so categorical features need encoding. For nominal categories like 'City' or 'Gender', OneHotEncoder is appropriate to avoid implying any order.
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Assuming 'gender' and 'city' are categorical, 'age' is numerical
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city']),
        ('num', 'passthrough', ['age']),  # Keep numerical as is, or scale them
    ])

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear')),
])

# X_train is assumed to be a DataFrame with these columns, y_train its labels
model_pipeline.fit(X_train, y_train)
```
Your KMeans clustering algorithm always gives different results each time you run it, even with the same data. What could be the reason, and how can you make it reproducible?
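- Answer: K-Means is sensitive to the initial placement of centroids, and that initialization is randomized. With random_state left at its default of None, each run draws different starting centroids and can converge to a different local optimum.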
- To make it reproducible:
  - Set the random_state parameter in KMeans to an integer value (e.g., KMeans(n_clusters=k, random_state=42)).
  - Keep init='k-means++' (the default), which spreads out the initial cluster centers to speed up convergence. It is still randomized, so on its own it does not make runs reproducible; random_state must be set as well.
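- Example: A sketch that checks reproducibility directly, assuming synthetic blob data from make_blobs. The n_init and random_state values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Two independent runs with the same seed
labels_a = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# With a fixed random_state the cluster assignments are identical
identical = np.array_equal(labels_a, labels_b)
print(identical)
```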
You've trained a Random Forest model, and it's performing poorly. What are the first few hyperparameters you would consider tuning using RandomizedSearchCV?
- Answer:
  - n_estimators: The number of trees in the forest. More trees generally improve performance but increase computation.
  - max_depth: The maximum depth of each tree. Controls model complexity and prevents overfitting.
  - min_samples_leaf: The minimum number of samples required to be at a leaf node. Prevents trees from being too specific to the training data.
  - max_features: The number of features to consider when looking for the best split.
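- Example: A sketch of a randomized search over these four hyperparameters, assuming the built-in breast cancer dataset. The ranges, n_iter=5, and cv=3 are small illustrative values chosen to keep the run short, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Lists are sampled uniformly; scipy distributions are sampled via rvs()
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,
    random_state=0,    # makes the sampled combinations reproducible
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```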
How would you check for multicollinearity among numerical features in your dataset, and why is it a concern for some models?
- Answer:
  - Checking:
    - Correlation Matrix/Heatmap: Calculate Pearson correlation coefficients between all pairs of numerical features. High absolute values (close to 1 or -1) indicate strong correlation.
    - Variance Inflation Factor (VIF): Calculate VIF for each feature (often requires statsmodels library). VIF values greater than 5 or 10 are often considered problematic.
  - Concern: Multicollinearity (high correlation between independent variables) is a concern for models sensitive to feature independence, primarily linear models (e.g., LinearRegression, LogisticRegression). It makes it difficult to interpret the individual coefficients (as their standard errors increase), makes the model less stable, and can lead to unreliable feature importance estimates. Tree-based models are generally less affected.
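- Example: A sketch of both checks on synthetic data, where x2 is constructed as a near-copy of x1. VIF is computed here from its definition, 1 / (1 - R^2), by regressing each feature on the others with LinearRegression; the vif helper is a hypothetical function written for this illustration, not a Scikit-learn API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                     # independent feature
X = np.column_stack([x1, x2, x3])

# Check 1: pairwise Pearson correlations (features are columns)
corr = np.corrcoef(X, rowvar=False)

# Check 2: VIF of feature j = 1 / (1 - R^2), where R^2 comes from
# regressing feature j on all remaining features
def vif(X, j):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(corr, 3))
print(np.round(vifs, 1))
```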
