Case Study
Loan Safety with Decision Trees and a Small Random Forest
2025
- Python
- pandas
- scikit-learn
- Decision Trees
Used LendingClub loan data to predict whether a loan is safe (+1) or risky (-1) with decision trees and a simple random forest, and compared training vs validation accuracy across different tree depths.
Problem & Motivation:
Given LendingClub data with features like grade, home ownership, purpose, term, and debt-to-income ratio, predict whether a loan is a safe loan (+1) or a risky loan (-1).
Data & Approach:
- Loaded lending-club-data.csv, created the safe_loans label from bad_loans, and explored features such as grade and home_ownership.
- Selected the assignment’s feature list and used pd.get_dummies to one-hot encode the categorical columns into the numeric format scikit-learn expects.
- Trained DecisionTreeClassifier models with different max_depth values, and used GridSearchCV over max_depth and min_samples_leaf to tune early-stopping settings.
- Implemented a small RandomForest416 class that fits multiple trees on bootstrap samples and predicts by majority vote, then compared its train/validation accuracy to a single tree.
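The loading and labeling steps above might look like the following sketch. A tiny inline DataFrame stands in for lending-club-data.csv (which the real notebook would read with pd.read_csv), and the column subset is illustrative rather than the assignment's exact feature list:

```python
import pandas as pd

# Toy stand-in for lending-club-data.csv; the real case study loads the CSV
# with pd.read_csv("lending-club-data.csv"). Column names follow the write-up.
loans = pd.DataFrame({
    "grade": ["A", "B", "C", "A"],
    "home_ownership": ["RENT", "OWN", "RENT", "MORTGAGE"],
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
    "dti": [12.5, 20.1, 8.3, 15.0],
    "bad_loans": [0, 1, 0, 1],  # 1 means the loan went bad
})

# safe_loans label: +1 for safe loans, -1 for risky ones.
loans["safe_loans"] = loans["bad_loans"].apply(lambda b: +1 if b == 0 else -1)

# One-hot encode the categorical columns; numeric columns (dti) pass through.
features = ["grade", "home_ownership", "term", "dti"]
X = pd.get_dummies(loans[features])
y = loans["safe_loans"]
```

pd.get_dummies leaves numeric columns untouched and expands each categorical column into one indicator column per category (e.g. grade_A, grade_B), which is what makes these features consumable by scikit-learn trees.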
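The depth/leaf-size tuning step could be sketched as below. The data here is synthetic (a stand-in for the encoded loan features), and the specific grid values are assumptions, not the assignment's exact grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the one-hot encoded loan matrix.
X_grid = rng.normal(size=(200, 5))
y_grid = (X_grid[:, 0] + X_grid[:, 1] > 0).astype(int)

# Grid over the two early-stopping knobs mentioned above:
# max_depth (pre-pruning depth limit) and min_samples_leaf (leaf size floor).
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 10], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X_grid, y_grid)
best_params = grid.best_params_
```

GridSearchCV cross-validates every combination in the grid and exposes the winning settings via best_params_, so the tuned tree can be refit on the full training split.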
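A minimal sketch of the RandomForest416 idea (bootstrap-sampled trees plus majority vote) is shown below. The constructor signature and tie-breaking rule are assumptions; the assignment's actual class may differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomForest416:
    """Sketch of a bagged forest: each tree fits a bootstrap sample of
    (X, y); predictions are the majority vote across trees."""

    def __init__(self, n_trees=10, max_depth=None, seed=0):
        self._rng = np.random.default_rng(seed)
        self._trees = [
            DecisionTreeClassifier(max_depth=max_depth, random_state=i)
            for i in range(n_trees)
        ]

    def fit(self, X, y):
        n = len(X)
        for tree in self._trees:
            # Bootstrap: sample n row indices with replacement.
            idx = self._rng.integers(0, n, size=n)
            tree.fit(X[idx], y[idx])
        return self

    def predict(self, X):
        # Stack per-tree predictions: shape (n_trees, n_samples).
        votes = np.stack([tree.predict(X) for tree in self._trees])
        # With +1/-1 labels the sign of the vote sum is the majority class;
        # the small epsilon breaks exact ties toward +1.
        return np.sign(votes.sum(axis=0) + 1e-9).astype(int)
```

Because each tree sees a slightly different resample of the data, the averaged vote tends to cancel out individual trees' overfit splits, which is the mechanism behind the validation-accuracy gains reported below.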
Results:
- Deeper trees fit the training data very well but did not always improve validation accuracy, showing overfitting at large depths.
- The RandomForest416 model generally gave better validation accuracy than a single decision tree at similar depths.
- Categorical features such as grade, sub_grade, home_ownership, purpose, and term became usable by scikit-learn once expanded into one-hot encoded columns.
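The overfitting pattern in the first result can be reproduced with a small depth sweep. The data here is synthetic with deliberately noisy labels (a stand-in for the loan features), so the exact accuracy numbers are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic features with label noise, standing in for the loan data.
X_sweep = rng.normal(size=(500, 8))
y_sweep = ((X_sweep[:, 0] + 0.5 * rng.normal(size=500)) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_sweep, y_sweep, test_size=0.3, random_state=0
)

# Train accuracy keeps climbing with depth; validation accuracy plateaus
# or drops once the tree starts memorizing noise.
for depth in [2, 6, 20]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_va, y_va))
```

Plotting (or printing) both scores per depth makes the train/validation gap visible, which is how the case study diagnosed overfitting at large depths.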
Limitations:
Evaluated only decision trees and a small random forest on train/validation splits, with no separate held-out test set and no model families beyond the assignment scope.