Code Snippet 1:
from sklearn.model_selection import train_test_split from sklearn.datasets import make_classification import numpy as np X, y = make_classification(n_samples=100, n_features=4, random_state=42) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) print("Training set size:", X_train.shape[0]) print("Test set size:", X_test.shape[0]) print("Number of features:", X_train.shape[1]) print("Unique classes:", np.unique(y))
Task: Predict the output of this code snippet and explain what each line does in the context of preparing data for supervised learning.
Code Snippet 2:
from sklearn.linear_model import LogisticRegression from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split X, y = make_classification(n_samples=200, n_features=5, n_informative=3, n_redundant=2, random_state=42) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42) model = LogisticRegression(random_state=42) model.fit(X_train, y_train) train_score = model.score(X_train, y_train) test_score = model.score(X_test, y_test) print("Training accuracy:", round(train_score, 3)) print("Test accuracy:", round(test_score, 3)) print("Number of training samples:", len(X_train))
Task: Explain what this code does step by step. What would it indicate if the training accuracy was 0.99 but the test accuracy was 0.65? What action would you take?
Sample Output:
Training Accuracy: 0.971 Testing Accuracy: 0.978 Predictions for first 5 test samples: [1 0 2 1 1]
Expected Output Format:
Training R² Score: 0.xxx Testing R² Score: 0.xxx
Sample Output:
Split 50-50: Test Accuracy = 0.xxx, Test Samples = xxx Split 70-30: Test Accuracy = 0.xxx, Test Samples = xxx Split 90-10: Test Accuracy = 0.xxx, Test Samples = xxx
Input/Output:
Model 1 (Default) - Train: 0.xxx, Test: 0.xxx Model 2 (C=0.01) - Train: 0.xxx, Test: 0.xxx
Advanced Multi-Model Comparison Framework
Build a comprehensive supervised learning comparison system that:
Constraints:
Expected Output Structure:
Model Performance Comparison: ================================ Model: Logistic Regression Training Accuracy: 0.xxxx Testing Accuracy: 0.xxxx Difference: 0.xxxx Model: Decision Tree Training Accuracy: 0.xxxx Testing Accuracy: 0.xxxx Difference: 0.xxxx Model: K-Nearest Neighbors Training Accuracy: 0.xxxx Testing Accuracy: 0.xxxx Difference: 0.xxxx Best Generalizing Model: [Model Name] Cross-Validation Score (5-fold): 0.xxxx

Q1: Supervised learning uses labeled data where both input features and target outputs are provided, allowing the model to learn the mapping between them. Example: Email spam detection where emails are labeled as spam or not spam. Unsupervised learning works with unlabeled data to discover hidden patterns or structures. Example: Customer segmentation where customers are grouped based on purchasing behavior without predefined categories. The key difference is the presence or absence of labeled target values during training.
Q2: The training set is used to fit the model and learn patterns from the data. The validation set is used during model development to tune hyperparameters and make decisions about model architecture without touching the test set. The test set provides a final, unbiased evaluation of the model's performance on completely unseen data. This separation prevents data leakage and ensures honest assessment of generalization ability.
Q3: Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, resulting in poor performance on new data. It is caused by model complexity being too high relative to the amount of training data. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, performing poorly on both training and test sets. You can identify overfitting when training accuracy is very high but test accuracy is significantly lower, and underfitting when both training and test accuracies are low.
Code Snippet 1 Output:
Training set size: 70 Test set size: 30 Number of features: 4 Unique classes: [0 1]
Explanation:
Line 1-2: Import necessary functions and create a synthetic binary classification dataset with 100 samples and 4 features.
Line 3: Split the data using 30% for testing (test_size=0.3), resulting in 70 training samples and 30 test samples.
Lines 4-7: Print the sizes of the resulting datasets and confirm there are 4 features and 2 classes (0 and 1).
The random_state parameter ensures reproducible splits.
Code Snippet 2 Output:
Training accuracy: (approximately 0.92-0.95) Test accuracy: (approximately 0.88-0.94) Number of training samples: 150
Explanation:
This code creates a synthetic classification dataset with 200 samples and 5 features (3 informative, 2 redundant).
It splits the data into 75% training (150 samples) and 25% testing (50 samples).
A Logistic Regression model is instantiated and trained on the training data.
The score() method calculates classification accuracy on both sets.
If training accuracy was 0.99 and test accuracy was 0.65, this would indicate severe overfitting. The model has memorized the training data but cannot generalize. Actions to take would include: adding regularization, reducing model complexity, gathering more training data, or using cross-validation to better tune hyperparameters.
Q1 Solution:
from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression # Load the Iris dataset iris = load_iris() X, y = iris.data, iris.target # Split into 70% training and 30% testing X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Train Logistic Regression classifier model = LogisticRegression(max_iter=200, random_state=42) model.fit(X_train, y_train) # Print accuracy scores train_acc = model.score(X_train, y_train) test_acc = model.score(X_test, y_test) print(f"Training Accuracy: {train_acc:.3f}") print(f"Testing Accuracy: {test_acc:.3f}") # Make predictions on first 5 test samples predictions = model.predict(X_test[:5]) print(f"Predictions for first 5 test samples: {predictions}")
Approach: This solution loads the classic Iris dataset, splits it using train_test_split with the specified ratio, trains a Logistic Regression model, and evaluates performance. The max_iter parameter is increased to ensure convergence.
Q2 Solution:
from sklearn.datasets import make_regression from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression # Generate synthetic regression dataset X, y = make_regression(n_samples=150, n_features=1, noise=20, random_state=42) # Split into 80% training and 20% testing X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Train Linear Regression model model = LinearRegression() model.fit(X_train, y_train) # Calculate R² scores train_r2 = model.score(X_train, y_train) test_r2 = model.score(X_test, y_test) print(f"Training R² Score: {train_r2:.3f}") print(f"Testing R² Score: {test_r2:.3f}")
Approach: Creates a synthetic regression dataset with specified noise level, splits the data, trains a simple linear regression model, and evaluates using R² score which measures the proportion of variance explained by the model.
Q3 Solution:
from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression # Load breast cancer dataset data = load_breast_cancer() X, y = data.data, data.target # Define split ratios splits = [(0.5, '50-50'), (0.3, '70-30'), (0.1, '90-10')] # Train and evaluate for each split for test_size, label in splits: X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=test_size, random_state=42 ) model = LogisticRegression(max_iter=10000, random_state=42) model.fit(X_train, y_train) test_acc = model.score(X_test, y_test) test_samples = len(X_test) print(f"Split {label}: Test Accuracy = {test_acc:.3f}, Test Samples = {test_samples}")
Approach: This solution demonstrates how different train-test splits affect evaluation. Larger test sets provide more reliable performance estimates but leave less data for training. The loop iterates through different split ratios for comparison.
Q4 Solution:
from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression # Generate dataset with many features but few informative ones X, y = make_classification( n_samples=100, n_features=20, n_informative=2, n_redundant=18, random_state=42 ) # Split data X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=42 ) # Model 1: Default parameters model1 = LogisticRegression(max_iter=10000, random_state=42) model1.fit(X_train, y_train) train_acc1 = model1.score(X_train, y_train) test_acc1 = model1.score(X_test, y_test) # Model 2: Stronger regularization model2 = LogisticRegression(C=0.01, max_iter=10000, random_state=42) model2.fit(X_train, y_train) train_acc2 = model2.score(X_train, y_train) test_acc2 = model2.score(X_test, y_test) print(f"Model 1 (Default) - Train: {train_acc1:.3f}, Test: {test_acc1:.3f}") print(f"Model 2 (C=0.01) - Train: {train_acc2:.3f}, Test: {test_acc2:.3f}") # Determine which generalizes better gap1 = train_acc1 - test_acc1 gap2 = train_acc2 - test_acc2 better_model = "Model 2" if gap2 < gap1="" else="" "model="" 1"="" print(f"{better_model}="" generalizes="" better="" (smaller="" train-test="" gap)")="">
Approach: Creates a scenario prone to overfitting with many features but few informative ones. Compares a default model against one with stronger regularization (lower C value). The model with smaller gap between training and test accuracy generalizes better.
from sklearn.datasets import load_digits from sklearn.model_selection import train_test_split, cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.neighbors import KNeighborsClassifier import numpy as np # Load digits dataset digits = load_digits() X, y = digits.data, digits.target # Stratified train-test split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y ) # Define models models = { 'Logistic Regression': LogisticRegression(max_iter=10000, random_state=42), 'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42), 'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5) } # Store results results = {} print("Model Performance Comparison:") print("=" * 32) # Train and evaluate each model for name, model in models.items(): model.fit(X_train, y_train) train_acc = model.score(X_train, y_train) test_acc = model.score(X_test, y_test) difference = train_acc - test_acc results[name] = { 'train': train_acc, 'test': test_acc, 'diff': difference } print(f"\nModel: {name}") print(f" Training Accuracy: {train_acc:.4f}") print(f" Testing Accuracy: {test_acc:.4f}") print(f" Difference: {difference:.4f}") # Find best generalizing model (smallest difference) best_model_name = min(results, key=lambda x: results[x]['diff']) best_model = models[best_model_name] print(f"\nBest Generalizing Model: {best_model_name}") # Perform cross-validation on best model cv_scores = cross_val_score(best_model, X_train, y_train, cv=5) mean_cv_score = np.mean(cv_scores) print(f"Cross-Validation Score (5-fold): {mean_cv_score:.4f}")
Explanation: This comprehensive solution loads the digits dataset and creates a stratified split to maintain class distribution across train and test sets. It trains three different classifiers and compares their performance metrics. The model with the smallest gap between training and testing accuracy is identified as having the best generalization. Finally, 5-fold cross-validation is performed on the best model to get a more robust estimate of its performance. Stratified splitting and cross-validation are important techniques for reliable model evaluation, especially with imbalanced datasets.