

Predicting Skin Cancer Status (Benign vs. Malignant)
2025




Overview

This project built a machine learning binary classification system to predict whether a patient’s skin cancer status is benign or malignant. The dataset combined clinical, demographic, environmental, and behavioral data. Because early detection is critical for improving survival outcomes, the goal was to explore how data-driven methods can inform medical decisions.


Dataset

  • 50,000 training observations
  • 20,000 testing observations
  • 50 predictors categorized as follows:
        - Demographic factors
        - Environmental and sun exposure variables
        - Sun protection and skin care habits
        - Biological and health indicators
        - Lifestyle and behavioral features
  • All sensitive information was removed to ensure privacy compliance


Modeling Approach

Data preprocessing included missing value imputation using missForest, categorical encoding via one-hot encoding, and feature selection using a combination of LASSO regularization, t-tests, and chi-squared tests.
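
As a rough illustration of two of the steps named above, the sketch below one-hot encodes a categorical predictor and computes a Pearson chi-squared statistic against the binary outcome. It is a minimal standard-library sketch, not the project's actual pipeline: the function names, the `skin_tone` and `status` variables, and the toy values are all hypothetical, and the imputation and LASSO steps are not shown.

```python
from collections import Counter


def one_hot(values):
    """One-hot encode a list of category labels.

    Returns the sorted category names and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def chi_squared_statistic(feature, outcome):
    """Pearson chi-squared statistic for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, outcome))
    feature_totals = Counter(feature)
    outcome_totals = Counter(outcome)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for o, o_count in outcome_totals.items():
            # Expected cell count if feature and outcome were independent.
            expected = f_count * o_count / n
            stat += (observed[(f, o)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data for illustration only; not from the project dataset.
skin_tone = ["light", "light", "dark", "dark", "medium", "light"]
status = ["malignant", "malignant", "benign", "benign", "benign", "malignant"]

columns, encoded = one_hot(skin_tone)
stat = chi_squared_statistic(skin_tone, status)
print(columns, encoded[0], stat)
```

A larger statistic suggests a stronger association between the predictor and the outcome. In practice the statistic would be compared against a chi-squared distribution with the appropriate degrees of freedom to obtain a p-value before deciding whether to keep the feature.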

Model selection was based on cross-validated performance rather than training accuracy, so that the chosen model is the one most likely to generalize to unseen data rather than the one that best fits the training set.
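
The idea of scoring a model only on data it was not trained on can be sketched as follows. This is a minimal standard-library illustration of k-fold cross-validation, assuming a generic `fit`/`predict` pair; the majority-class model and the toy labels are hypothetical placeholders, not the models or data used in the project.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(X, y, fit, predict, k=5, seed=0):
    """Mean accuracy over k folds, each scored on held-out data only."""
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Placeholder model (hypothetical): always predicts the majority training class.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Hypothetical toy data: 14 positive and 6 negative labels.
X = [[i] for i in range(20)]
y = [1] * 14 + [0] * 6
cv_acc = cross_validated_accuracy(X, y, fit_majority, predict_majority, k=5)
print(round(cv_acc, 3))
```

Comparing candidate models by this held-out score, rather than by accuracy on the data they were fitted to, is what keeps an overfitted model from looking better than it is.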





Results and Interpretation

Several classification models were evaluated. Logistic regression performed best, with approximately 60.7% accuracy on validation data. It outperformed the more complex models that were considered while remaining highly interpretable, a crucial advantage in healthcare-related applications where understanding what drives a prediction is essential.

Feature analysis indicated that predictors such as age, skin tone, and UV exposure had the strongest influence on malignant classification, which is consistent with established medical risk factors.
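
The interpretability of logistic regression comes from its coefficients: exponentiating a coefficient gives the multiplicative change in the odds of the positive class for a one-unit increase in that predictor. The sketch below illustrates this with the standard library only. The coefficient values and feature names are hypothetical and chosen purely for illustration; they are not fitted values from this project.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """Predicted probability of the positive class for one observation."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return sigmoid(z)


def odds_ratios(coefficients):
    """Odds ratio per one-unit increase in each predictor."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical coefficients for illustration only; NOT fitted project values.
intercept = -1.0
coefficients = {"age_scaled": 0.5, "uv_exposure_scaled": 0.5}
patient = {"age_scaled": 1.0, "uv_exposure_scaled": 1.0}

p = predict_probability(intercept, coefficients, patient)
ratios = odds_ratios(coefficients)
print(p, ratios)
```

An odds ratio above 1 means the predictor raises the odds of a malignant classification, and a ratio below 1 means it lowers them, which is the kind of direct reading that more complex models do not offer as easily.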

View full analysis report -> GitHub



sanghyundkim@outlook.com