Learning Machine Learning and Kaggle’s Playground Series
Overview
For this project, I participated in a Kaggle competition with the goal of learning the basics of machine learning.
The task was to build a model that could predict student test scores based on a given dataset.
Along the way, I shared ideas and discussed approaches with my peers, learned machine learning and Kaggle terminology (e.g. overfitting, CV/LB score, encoding), and used AI tools such as ChatGPT to create and improve a predictive model. More importantly, I learned the process of managing data, training models, and gradually improving their accuracy.
Kaggle Competition Results on Public Leaderboard (as of now):
- Kaggle Username: lilrayray
- Place: 916/3929
- Best Score: 8.69287
Strategy & Approach
My general approach looked like this:
- Explored the dataset
- Created graphs to show correlations
- Checked data types
- Identified important features
- Preprocessed the data
- Handled missing values
- Encoded categorical variables
- Scaled numerical features when necessary
- Engineered new features
- Created interaction features
- Example: Sleep Score, which is hours of sleep multiplied by 1, 2, or 3 depending on sleep quality (higher quality = higher multiplier)
- Created square features
- Model selection
- Started with a simple notebook using LightGBM
- Found that generating a first-pass prediction, adding it back into the feature set, and then predicting again worked well
- Experimented with a more advanced model
- Tried to use a neural network and failed miserably
- Opted for a simpler but effective approach by using multiple models
- Inspired by a notebook found online, I trained LightGBM, XGBoost, CatBoost, and used Ridge Stacking to blend the models’ predictions together (this worked the best)
- Evaluation
- Avoided overfitting by validating results
- Submitted to Kaggle to check leaderboard score
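The feature-engineering and stacking steps above can be sketched roughly as follows. This is a minimal illustration in plain NumPy, not the actual notebook code: the real pipeline used LightGBM, XGBoost, and CatBoost as base models, while here two simple least-squares fits stand in for them. The data, column names (`hours_sleep`, `sleep_quality`, `study_hours`), and the ridge penalty value are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the competition dataset (made up for illustration).
n = 200
hours_sleep = rng.uniform(4, 9, n)
sleep_quality = rng.integers(1, 4, n)          # 1 = low, 2 = medium, 3 = high
study_hours = rng.uniform(0, 6, n)

# Interaction feature: hours of sleep weighted by a quality multiplier.
sleep_score = hours_sleep * sleep_quality

X = np.column_stack([hours_sleep, study_hours, sleep_score])
y = 5 * study_hours + 0.8 * sleep_score + rng.normal(0, 2, n)

def lstsq_fit(X, y):
    """Least-squares linear fit (stand-in for a gradient-boosted model)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def lstsq_predict(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w

# Two "base models" trained on different feature subsets.
half = n // 2
train, valid = slice(0, half), slice(half, n)
w1 = lstsq_fit(X[train, :2], y[train])          # model 1: without sleep_score
w2 = lstsq_fit(X[train], y[train])              # model 2: all features

# Ridge stacking: learn blend weights over the base models' held-out predictions.
P = np.column_stack([lstsq_predict(w1, X[valid, :2]),
                     lstsq_predict(w2, X[valid])])
alpha = 1.0                                     # ridge penalty (assumed value)
blend = np.linalg.solve(P.T @ P + alpha * np.eye(2), P.T @ y[valid])
stacked = P @ blend

rmse = np.sqrt(np.mean((stacked - y[valid]) ** 2))
print(f"stacked validation RMSE: {rmse:.3f}")
```

The key idea the snippet shows is the stacking step: the blending weights are fit on held-out predictions rather than training data, which is what helps the blend avoid simply inheriting the base models' overfitting.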
What I Learned
- The importance of Exploratory Data Analysis
- I never realized how important it is to find correlations (graphing also makes this process a whole lot easier)
- The importance of data cleaning and preprocessing
- Not cleaning data might lead to overfitting or inaccurate predictions
- Preprocessing needs to be done so that the models can read the data
- The basics on how some models work
- How Kaggle competitions are structured and evaluated
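As a concrete illustration of the preprocessing point above, here is a minimal sketch of one-hot encoding a categorical column by hand so a numeric model can read it. The column name and category values are invented for the example; the notebooks used library encoders rather than this manual version.

```python
# One-hot encode a categorical feature so numeric models can consume it.
# Column name and category values are made up for illustration.
parental_education = ["high school", "bachelor", "master", "bachelor"]

categories = sorted(set(parental_education))   # fixed column order
encoded = [
    [1 if value == cat else 0 for cat in categories]
    for value in parental_education
]

print(categories)   # column order of the one-hot matrix
for row in encoded:
    print(row)
```

Each original value becomes a row of 0s with a single 1 in the column for its category, which is what lets models that only accept numbers work with categorical data.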
Challenges & Troubles
Some of the main challenges I ran into:
- Understanding how to deal with an unprocessed dataset
- Finding ways to improve my model (half the time a change made my score worse)
- Understanding a lot of the code and terminology in the Kaggle discussions and the code page
Results
While my model wasn’t perfect, it achieved a reasonable prediction accuracy and helped me understand the workflow of a machine learning project.
More importantly, I enjoyed having discussions with my peers in class and learning the process together.
Notebooks & Code
Exploratory Data Analysis (created fully using AI) 01_eda.ipynb (GitHub)
Feature Engineering (created mostly using AI, with the exception of the engineered features) 02_feature_engineering.ipynb (GitHub)
Final Model (Ridge + CatBoost + LightGBM + XGBoost) (mostly done by me with the help of AI and inspiration from the code tab on the competition page) 05_ridge_catlightxg_model.ipynb (GitHub)
Next Steps
- Experiment with additional models
- Explore neural networks
- Participate in more Kaggle competitions