Preparing for the Finish Line: Strengthening SoundSoar’s Trend Prediction for the Final Presentation

Pushing SoundSoar toward the finish line: I added five additional models (Logistic Regression, SVM, LDA, Extra Trees, KNN), built a cold-start search to predict songs without history, versioned all model artifacts, and tightened performance and explainability for the final presentation.


Introduction

This month I focused on strengthening SoundSoar’s machine learning pipeline for trend prediction. I integrated several new models, improved how I handle songs without popularity history, and organized all model artifacts for clarity and reproducibility. The cover image shows a single snapshot of the latest model performance view.

New models and cold-start search

I expanded the model roster and added a search function that can predict outcomes for tracks with limited or no popularity history.

  • Logistic Regression: baseline probability classifier on structured features.
  • Support Vector Machine (SVM): solid for high-dimensional separation.
  • Linear Discriminant Analysis (LDA): interpretable dimensionality reduction and classification.
  • Extra Trees: fast, variance-reducing tree ensemble.
  • K-Nearest Neighbors (KNN): distance-based classification and regression.
  • Random Forest: robust tree ensemble I continued to refine.
  • HistGradientBoosting: boosted trees with strong tabular performance.

I also stored starter scripts, README notes, pickled models, and CSV data for each approach so instructors and other developers can run and adapt the work easily.

Lessons about popularity history

Reliance on popularity history improves accuracy when that data exists, but every model struggles when it does not. At the moment KNN and LDA extract the most signal from audio attributes like tempo and danceability. Next, I will either strengthen the models that lean on attributes rather than history, or expand history coverage to shrink the cold-start gap.
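The cold-start routing described above can be sketched as a simple dispatcher: tracks with full popularity history go to the history-aware model, and anything missing history falls back to a model trained only on audio attributes. The feature names, dictionary-style track records, and routing logic here are assumptions for illustration.

```python
# Sketch of a cold-start prediction path: route to the history-aware
# model when history features exist, else fall back to an audio-only
# model. Feature names are placeholders, not the project's real columns.
import numpy as np

AUDIO_FEATURES = ["tempo", "danceability", "energy", "valence"]  # assumed


def predict_trend(track, history_model, audio_model,
                  history_features, audio_features=AUDIO_FEATURES):
    """Predict a trend label for one track (a dict of feature values).

    Uses the history model only when every history feature is present;
    otherwise uses the audio-attribute fallback model.
    """
    if all(track.get(f) is not None for f in history_features):
        X = np.array([[track[f] for f in history_features]])
        return history_model.predict(X)[0]
    X = np.array([[track[f] for f in audio_features]])
    return audio_model.predict(X)[0]
```

This keeps the fallback explicit, so improving the attribute-driven model later only touches one branch.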

Model performance overview

I evaluated seven models on accuracy and class balance across up, down, and stable trends.

  • Random Forest and HistGradientBoosting: ~0.86 accuracy with consistent predictions on down and stable, minimal misclassification.
  • Extra Trees: ~0.83 accuracy; its extra randomization and tuned hyperparameters helped it capture subtler patterns.
  • KNN and SVM: ~0.81 and ~0.78 accuracy respectively; competitive with appropriate neighbors and margins.
  • Logistic Regression and LDA: ~0.69 and ~0.70; more interpretable but struggled to separate up vs. down.
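The comparison above can be reproduced with a straightforward evaluation loop: one stratified held-out split, scored on accuracy plus a per-class report to check balance across the up, down, and stable labels. The split ratio and seed are arbitrary choices for the sketch.

```python
# Sketch of the evaluation loop: hold out a stratified test split,
# then score each fitted model on accuracy and per-class metrics.
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def evaluate_models(models, X, y):
    """Return {name: {"accuracy": float, "report": str}} per model."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        preds = model.predict(X_te)
        results[name] = {
            "accuracy": accuracy_score(y_te, preds),
            "report": classification_report(y_te, preds, zero_division=0),
        }
    return results
```

Stratifying the split matters here: with three trend classes, an unstratified split can hide a model that never predicts the rarest class.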

Feature importance insights

  • Random Forest: standard deviation in popularity and velocity stand out.
  • HistGradientBoosting: velocity and current popularity are strong drivers.
  • Extra Trees: mean and median popularity matter most.
  • SVM: balanced across features with slight lift from velocity and current popularity.
  • KNN: tempo and danceability carry more weight for neighborhood grouping.
  • Logistic Regression: mean and current popularity dominate the linear decision boundary.
  • LDA: velocity and danceability help separate classes after projection.

Product updates

I refined the templates for the active-model page and the review page so users can see accuracy, precision, F1 score, and feature importance at a glance.

Retrospective

What went right

  • Broadened the model set and improved overall predictive strength.
  • Cold-start prediction path is in place for tracks with little history.
  • Model artifacts are organized and reproducible.

Challenges

  • Accuracy drops when popularity history is missing.
  • Lower-complexity models had difficulty separating up and down classes.

Next steps

  • Boost attribute-driven models to handle cold-start cases better.
  • Expand coverage of popularity history where feasible.
  • Continue tuning and validating ensembles for balanced class performance.