This repository contains code and resources for building predictive models using smartphone sensor data. The project encompasses data preprocessing, exploratory data analysis (EDA), feature selection, clustering, and various ML models.
The repository is organized into the following directories:
-
/scripts: Contains Python scripts for various stages of the analysis.feature_selection.py: Performs feature selection using linear regression.modeling.py: General modeling functions and utilities.variables.py: Defines and manages variables used throughout the project.clustering.py: Handles clustering of demographic data.preprocessing.py: Manages data preprocessing tasks such as merging datasets and handling missing values.visualization.py: Provides functions for visualizing data and results.
-
/notebooks: Contains Jupyter notebooks documenting the analysis process.01_preprocessing.ipynb: Data preprocessing steps, including merging datasets and imputing missing values.02_processing_Pipeline.ipynb: Pipeline object for value normalization by-column. Transformation of df into wide and slope/intercept versions, for feature extraction. Testing on Hist Gradient Boosting Regressor. Network-based statistic (NBS) for clinical populations vs. healthy population. Hierarchical agglomerative clustering on features + PCA on each cluster. Correlation structures of PCs.03_demographic_clustering.ipynb: Correlation across demographic information. Hierarchical agglomerative clustering on demographic features to group subjects into clusters.03_feature_pca.ipynb: Correlation across features. Hierarchical agglomerative clustering on demographic features + PCA on each cluster. Correlation structures of extracted PCs.04_prediction.ipynb: Linear regression and nested linear regression of depression score and various covariates. Evaluating six machine learning architectures for prediction of depression score based off passive sensor data. Feature importance analysis of a survey of predictive model types. Visualization of R2 for accuracy of different models for each dataset across different time scales. Comparing imputed vs. nonimputed data.
The project follows these main steps:
-
Preprocessing
- Merging datasets.
- Imputing missing values.
- Encoding missingness based on defined thresholds.
-
Exploratory Data Analysis (EDA)
- Visualizing participant data across different weeks.
- Assessing variable distributions and missingness.
- Feature selection using linear regression techniques.
-
Clustering
- Variable clustering to identify related features.
- Demographic data clustering to segment participants.
-
Visualizing RF Model
- Applying Random Forest to build predictive models.
- Visualiza results
-
Predictive Modelling using many models
- Running a wide variety of models
- Hyperparameter tuning
- Evaluating model performance and comparing results.
To replicate the analysis, pip install the requirements.tct:
pip install -r requirements.txt-
Data Preprocessing: Execute the scripts in the
/scriptsdirectory or run the01_preprocessing.ipynbnotebook to preprocess the data. -
Exploratory Data Analysis: Use the EDA notebooks (
02_processing_Pipeline.ipynb,03_demographic_clustering.ipynb, etc.) to explore and visualize the data. -
Feature Selection and Clustering: Apply feature selection methods and perform clustering analyses using the corresponding notebooks.
-
Modeling: Run the modeling notebooks (
04_prediction.ipynb, etc.) to build and evaluate predictive models.