Getting started with competitive data science can be quite intimidating. But it’s actually surprisingly simple!
I recently started messing around with Kaggle and made top 1% on a few competitions. This kernel is a quick overview of how I made top 0.3% on the Advanced Regression Techniques competition.
You can find the code + overview here: How I made top 0.3% on a Kaggle competition
- Each row in the dataset describes the characteristics of a house.
- Our goal is to predict the SalePrice, given these features.
- Our models are evaluated on the Root-Mean-Squared-Error (RMSE) between the log of the SalePrice predicted by our model and the log of the actual SalePrice. Taking logs before computing the error ensures that errors in predicting expensive houses and cheap houses affect our score equally.
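The metric above can be sketched in a few lines. This is a minimal illustration of log-RMSE (RMSLE), not the competition's exact scoring code; the example prices are made up to show why the log matters:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error computed on log-transformed prices."""
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

# The same $10k absolute error penalizes a cheap house far more than
# an expensive one, because only the *relative* error matters:
cheap = rmsle(np.array([100_000.0]), np.array([110_000.0]))      # ~0.0953
pricey = rmsle(np.array([1_000_000.0]), np.array([1_010_000.0]))  # ~0.0100
```

With plain RMSE, both errors would count as the same $10,000; on the log scale the 10% miss on the cheap house dominates.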
Key features of the model training process in this kernel
- Cross Validation: Using 12-fold cross-validation
- Models: On each run of cross-validation I fit six base models (ridge, SVR, gradient boosting, random forest, XGBoost, and LightGBM regressors)
- Stacking: In addition, I trained a StackingCVRegressor meta-model with XGBoost as its meta-regressor
- Blending: All of the trained models overfit the training data to varying degrees, so for the final submission I blended their predictions together to get a more robust estimate.
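The training loop described above can be sketched roughly as follows. This is a simplified stand-in, not the kernel's actual code: the toy data, hyperparameters, and blend weights are all placeholders (the real kernel tunes its blend weights by hand), and the XGBoost/LightGBM models and the stacked meta-model are omitted to keep the sketch self-contained with scikit-learn only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Toy regression data standing in for the (log-transformed) house prices
X, y = make_regression(n_samples=300, n_features=20, noise=15, random_state=42)

# 12-fold cross-validation, as in the kernel
kf = KFold(n_splits=12, shuffle=True, random_state=42)

# A subset of the base models (hyperparameters here are illustrative)
models = {
    "ridge": Ridge(alpha=10.0),
    "svr": SVR(C=20, epsilon=0.01),
    "gbr": GradientBoostingRegressor(random_state=42),
    "rf": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Score each model across the 12 folds
for name, model in models.items():
    scores = -cross_val_score(
        model, X, y, cv=kf, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: RMSE {scores.mean():.3f} (+/- {scores.std():.3f})")

# Refit on the full training set, then blend with fixed weights
for model in models.values():
    model.fit(X, y)
weights = {"ridge": 0.3, "svr": 0.2, "gbr": 0.25, "rf": 0.25}
blended = sum(w * models[name].predict(X) for name, w in weights.items())
```

The blend is just a weighted average of the individual predictions; because the models overfit in different ways, their errors partially cancel, which is why the blend can beat every individual model.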
We can observe from the graph below that the blended model far outperforms the other models, with an RMSLE of 0.075. This is the model I used for making the final predictions.