Learning Machine Learning and Kaggle’s Playground Series


Overview

For this project, I participated in a Kaggle competition with the goal of learning the basics of machine learning.
The task was to build a model that could predict student test scores based on a given dataset.

Along the way, I shared ideas and discussed approaches with my peers, learned machine learning and Kaggle terminology (e.g. overfitting, CV/LB score, encoding), and used AI tools such as ChatGPT to create and enhance a predictive model. More importantly, I learned the process of managing data, training models, and gradually improving their accuracy.

Kaggle Competition Results on Public Leaderboard (at the time of writing):

Kaggle Username: lilrayray
Place: 916/3929
Best Score: 8.69287

Strategy & Approach

My general approach looked like this:

Explored the dataset
- Created graphs to show correlations
- Checked data types
- Identified important features
Preprocessed the data
- Handled missing values
- Encoded categorical variables
- Scaled numerical features when necessary
Engineered new features
- Created interaction features
  - Example: Sleep Score, which is hours of sleep multiplied by 1, 2, or 3 depending on sleep quality (higher quality = higher multiplier)
- Created square features
Model selection
- Started with a simple notebook using LightGBM
  - Found that generating a prediction, adding it back into the dataset as a feature, and then predicting again worked well
- Experimented with a more advanced model
  - Tried to use a neural network and failed miserably
- Opted for a simpler but effective approach by using multiple models
  - Inspired by a notebook found online, I trained LightGBM, XGBoost, and CatBoost, then used Ridge stacking to blend the models’ predictions together (this worked best)
Evaluation
- Checked for overfitting by validating results before submitting
- Submitted to Kaggle to check leaderboard score

What I Learned

The importance of Exploratory Data Analysis
- I had never realized how important it is to find correlations (graphing also makes this process a whole lot easier)
The importance of data cleaning and preprocessing
- Not cleaning data might lead to overfitting or inaccurate predictions
- Preprocessing needs to be done so that the models can read the data
The basics of how some models work
How Kaggle competitions are structured and evaluated
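
To illustrate the preprocessing point above, here is a minimal sketch using pandas. The column names and values are hypothetical, not taken from the competition dataset, and median/mode filling is just one common choice for handling missing values.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not the competition's.
df = pd.DataFrame({
    "study_hours": [2.0, None, 5.5, 3.0],
    "sleep_quality": ["poor", "good", None, "average"],
})

# Fill missing numbers with the median and missing categories with the mode.
df["study_hours"] = df["study_hours"].fillna(df["study_hours"].median())
df["sleep_quality"] = df["sleep_quality"].fillna(df["sleep_quality"].mode()[0])

# One-hot encode the text column so every feature is numeric.
encoded = pd.get_dummies(df, columns=["sleep_quality"], dtype=int)

# Standardize the numeric column to zero mean and unit variance.
col = encoded["study_hours"]
encoded["study_hours"] = (col - col.mean()) / col.std(ddof=0)

print(encoded)
```

After these steps the table has no missing values and no text columns, which is what most models need before they can be trained on it.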

Challenges & Troubles

Some of the main challenges I ran into:

Understanding how to deal with an unprocessed dataset
Finding ways to improve my model (half the time my score would be worse)
Understanding much of the code and terminology in the Kaggle discussions and the code tab

Results

While my model wasn’t perfect, it achieved a reasonable score (8.69287 on the public leaderboard, 916th of 3929) and helped me understand the workflow of a machine learning project.

More importantly, I enjoyed having discussions with my peers in class and learning the process together.

Notebooks & Code

Exploratory Data Analysis (created fully using AI) 01_eda.ipynb (GitHub)

Feature Engineering (created mostly using AI, with the exception of the engineered features) 02_feature_engineering.ipynb (GitHub)
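
As a rough illustration of the engineered features described under Strategy & Approach, the Sleep Score and a squared feature could be built like this. This is a sketch, not the notebook's code: the column names, the quality labels ("poor", "average", "good"), and the sample values are all assumptions.

```python
import pandas as pd

# Hypothetical columns and values; the real dataset's names may differ.
df = pd.DataFrame({
    "sleep_hours": [6.0, 8.0, 7.5],
    "sleep_quality": ["poor", "average", "good"],
    "study_hours": [2.0, 4.0, 3.0],
})

# Assumed mapping: higher sleep quality gets a higher multiplier (1, 2, or 3).
quality_multiplier = {"poor": 1, "average": 2, "good": 3}

# Interaction feature: hours of sleep scaled by the quality multiplier.
df["sleep_score"] = df["sleep_hours"] * df["sleep_quality"].map(quality_multiplier)

# Squared feature: gives a model an explicit non-linear term for study time.
df["study_hours_sq"] = df["study_hours"] ** 2

print(df[["sleep_score", "study_hours_sq"]])
```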

Final Model (Ridge + CatBoost + LightGBM + XGBoost) (mostly done by me with the help of AI and inspiration from the code tab on the competition page) 05_ridge_catlightxg_model.ipynb (GitHub)

Next Steps

Experiment with additional models
Explore neural networks
Participate in more Kaggle competitions


Rayan Bashir