Curated ML & Data Science projects. I build practical, data-driven solutions you can actually use. (And yes, I had fun making them.)

nandan2003/ML-portfolio

Machine Learning Portfolio

This repository contains two generative-modeling projects: a GPT language model trained entirely on CPU, and a pair of GANs that generate handwritten digits.

Repository Contents

  1. End-to-End CPU-Trained GPT System: A transformer-based language model trained from scratch on the FineWeb-Edu dataset using only CPU resources.
  2. Handwritten Digit Generation (DCGAN & cGAN): A Deep Convolutional Generative Adversarial Network (DCGAN) and Conditional GAN (cGAN) trained on the MNIST dataset.

Project 1: End-to-End CPU-Trained GPT System

Status: Deployed
Demo: https://helixgpt.azurewebsites.net

This project implements a GPT language model trained entirely on CPU hardware. It covers the full pipeline from raw data acquisition to deployment, and shows that training a small transformer is feasible on constrained hardware.

System Architecture

  • Architecture: Decoder-only Transformer (GPT)
  • Parameters: 11.64 Million
  • Context Window: 512 tokens
  • Tokenizer: Byte-Level BPE (Vocab size: 20,257)
  • Training Hardware: Azure Standard E16as v5 (CPU)
  • Inference: FastAPI + Docker (Azure Web App)
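
To make the "Byte-Level BPE" entry concrete, here is a minimal sketch of how byte-level BPE merges are learned. It is not the code in src/tokenizer.py: the function names `merge_pair` and `train_bpe` are hypothetical, special tokens such as <|endoftext|> are ignored, and new token IDs simply start at 256 (an assumption made for illustration).

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = []
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        pair = max(counts, key=counts.get)
        new_id = 256 + step  # assumption: merged tokens are numbered after the bytes
        ids = merge_pair(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(merges[0])  # ((97, 97), 256): the byte pair "aa" is merged first
print(ids)        # [258, 100, 258, 97, 99]
```

A production tokenizer repeats the same loop until the target vocabulary size is reached and stores the merge list so that encoding can replay the merges in order.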

Technical Specifications

Component         Specification   Relative to GPT-2 Small
Layers            8               ~0.67x
Hidden Size       256             ~0.33x
Attention Heads   8               ~0.67x
Total Params      11.64M          ~0.10x
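
As a rough cross-check of the parameter count, the sketch below tallies the weights of a GPT-2-style decoder using the figures above. It assumes details this README does not state: a 4x MLP expansion, biases on all linear layers, learned position embeddings, and an output head tied to the token embedding. The helper `gpt_param_count` is hypothetical and is not part of src/model.py.

```python
def gpt_param_count(vocab_size, n_ctx, d_model, n_layer, mlp_ratio=4, tied_head=True):
    """Count parameters of a GPT-2-style decoder (assumed layout, see lead-in)."""
    tok_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to d_ff, then project back, both with bias.
    d_ff = mlp_ratio * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * (2 * d_model)
    block = attn + mlp + norms
    final_norm = 2 * d_model
    head = 0 if tied_head else vocab_size * d_model
    return tok_emb + pos_emb + n_layer * block + final_norm + head


total = gpt_param_count(vocab_size=20_257, n_ctx=512, d_model=256, n_layer=8)
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")  # 11,635,456 parameters (~11.64M)
```

Under these assumptions the total lands on the 11.64 Million quoted under System Architecture. The token embedding alone accounts for about 5.2M of it, which is why a small vocabulary matters at this scale.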

Data Pipeline

The model was trained on the FineWeb-Edu 100M dataset (approx. 100 million tokens).

  1. Acquisition: Raw text retrieval from HuggingFace.
  2. Tokenization: Custom-trained BPE tokenizer that reserves ID 0 for <|endoftext|>.
  3. Serialization: Data encoded to uint16 binary shards (.bin) for memory-mapped loading.
  4. Splitting: 90% Training / 10% Validation.
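
The serialization step can be sketched with the standard library alone. Because the vocabulary (20,257 entries) fits below 65,536, each token ID can be stored as an unsigned 16-bit integer. The file name `train_000.bin` and the helpers `write_shard` and `read_window` are hypothetical; the actual loader in src/data.py may be organized differently (a NumPy-based loader would typically use `np.memmap`).

```python
import mmap
import os
import tempfile
from array import array

ITEM_SIZE = array("H").itemsize  # bytes per unsigned 16-bit token (2 on common platforms)


def write_shard(path, token_ids):
    """Write token IDs as raw unsigned 16-bit integers in native byte order."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)  # raises OverflowError for IDs >= 65536


def read_window(path, start, length):
    """Memory-map the shard and copy out `length` tokens beginning at `start`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            window = array("H")
            # Only the requested byte range is read from the mapping.
            window.frombytes(mm[start * ITEM_SIZE:(start + length) * ITEM_SIZE])
            return window.tolist()


tokens = [0, 17, 20_256, 5, 0, 42, 99, 7]  # made-up IDs; 0 stands for <|endoftext|>
shard_path = os.path.join(tempfile.mkdtemp(), "train_000.bin")
write_shard(shard_path, tokens)

window = read_window(shard_path, start=2, length=4)
print(window)  # [20256, 5, 0, 42]
```

Memory mapping lets the training loop sample random windows from a shard without loading the whole file into RAM, which helps on a CPU-only machine.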

Training Configuration

  • Optimizer: AdamW (Beta1=0.9, Beta2=0.95)
  • Schedule: Cosine learning rate decay with 1000-step warmup.
  • Batch Size: 8
  • Duration: ~9.5 hours (24,413 steps / 1 epoch).
  • Loss: Initial ~9.95 | Final ~4.75.
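
The schedule and two of the figures above can be sketched in a few lines. The warmup length and total step count come from this README. The peak and floor learning rates (`max_lr`, `min_lr`) are placeholders chosen for illustration, not the values used in train_gpt2.py, and the token arithmetic assumes each step processes 8 full 512-token sequences with no gradient accumulation.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=24_413):
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sanity check 1: tokens processed in one epoch (batch size x context x steps).
tokens_seen = 8 * 512 * 24_413
print(f"{tokens_seen:,}")  # 99,995,648, i.e. roughly 100 million tokens

# Sanity check 2: loss of a model that guesses uniformly over the vocabulary.
uniform_loss = math.log(20_257)
print(f"{uniform_loss:.2f}")  # 9.92, close to the reported initial loss of ~9.95
```

An initial loss near ln(vocab size) is what an untrained model is expected to produce, so the reported starting value is consistent with a correctly wired loss function.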

Directory Structure

End-to-End CPU-Trained GPT System/
├── app/                  # Azure deployment files
├── bpe_tokenizer/        # BPE vocab and merges
├── checkpoints/          # Model weights (.pt)
├── src/
│   ├── data.py           # Binary data loader
│   ├── model.py          # GPT architecture definition
│   ├── tokenizer.py      # BPE logic
│   └── utils.py          # Configuration utilities
├── train_gpt2.py         # Main training loop
└── prepare_dataset.py    # Data processing script

Project 2: Handwritten Digit Generation (DCGAN & cGAN)

This project implements Generative Adversarial Networks to synthesize handwritten digits resembling the MNIST dataset. It includes two distinct architectures: a standard DCGAN for random generation and a Conditional GAN (cGAN) for targeted digit generation.

1. Deep Convolutional GAN (DCGAN)

The DCGAN generates random digit images by learning to map latent noise vectors to the distribution of the training data.

Generator Architecture:

  • Input: Random noise vector (100 dimensions).
  • Dense Layer: Projects noise to 7x7x256 feature map.
  • Upsampling: 3x Conv2DTranspose layers with Batch Normalization and LeakyReLU.
  • Output: 28x28x1 image (Tanh activation).

Discriminator Architecture:

  • Input: 28x28x1 image.
  • Downsampling: 2x Conv2D layers (Strides=2) with LeakyReLU and Dropout (0.3).
  • Output: Binary classification (Real vs. Fake).
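
The spatial sizes in the two architectures above can be traced with simple arithmetic. The sketch assumes 'same' padding throughout and generator strides of (1, 2, 2), which is one plausible way for three transposed convolutions to go from 7x7 to 28x28. The notebook may use a different arrangement, and the helper names are hypothetical.

```python
import math


def conv_transpose_same(size, stride):
    """Output size of a transposed convolution with 'same' padding."""
    return size * stride


def conv_same(size, stride):
    """Output size of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def trace(size, strides, op):
    """Apply `op` once per stride and record every intermediate size."""
    sizes = [size]
    for stride in strides:
        sizes.append(op(sizes[-1], stride))
    return sizes


# Generator: 7x7 feature map upsampled by three Conv2DTranspose layers.
generator_sizes = trace(7, [1, 2, 2], conv_transpose_same)
print(generator_sizes)  # [7, 7, 14, 28]

# Discriminator: 28x28 image downsampled by two stride-2 Conv2D layers.
discriminator_sizes = trace(28, [2, 2], conv_same)
print(discriminator_sizes)  # [28, 14, 7]
```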

Training Hyperparameters:

  • Epochs: 1000
  • Batch Size: 256
  • Optimizer: Adam (Learning Rate: 0.0002, Beta1: 0.5)
  • Loss Function: Binary Crossentropy
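
A minimal sketch of how binary cross-entropy is typically turned into the two GAN objectives, written in plain Python so the arithmetic is visible. It assumes the discriminator outputs probabilities and that the generator uses the common non-saturating loss. The notebook may work on logits instead, and the function names are hypothetical.

```python
import math


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean(values):
    return sum(values) / len(values)


def discriminator_loss(real_probs, fake_probs):
    """Real images should score 1, generated images should score 0."""
    real_term = mean([bce(1.0, p) for p in real_probs])
    fake_term = mean([bce(0.0, p) for p in fake_probs])
    return real_term + fake_term


def generator_loss(fake_probs):
    """The generator is rewarded when its images are scored as real (1)."""
    return mean([bce(1.0, p) for p in fake_probs])


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
d_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])
g_loss = generator_loss([0.5, 0.5])
print(round(d_loss, 3), round(g_loss, 3))  # 1.386 0.693
```

The values 2·ln 2 and ln 2 are useful reference points when reading training curves: losses hovering near them suggest the two networks are roughly balanced.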

2. Conditional GAN (cGAN)

The cGAN extends the architecture by conditioning both the generator and discriminator on class labels (0-9), allowing targeted generation of specific digits. The output still varies with the sampled noise vector.
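
One simple way to picture label conditioning is to append a one-hot class vector to the noise vector before it enters the generator. This is only an illustration: the saved cGAN takes noise and label as separate inputs and may combine them differently (for example through an embedding layer). The names `one_hot` and `conditioned_input` are hypothetical.

```python
import random

NUM_CLASSES = 10  # digits 0-9
NOISE_DIM = 100   # matches the generator's noise dimension


def one_hot(label, num_classes=NUM_CLASSES):
    """Encode a digit label as a vector with a single 1.0 entry."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label must be in [0, {num_classes})")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditioned_input(label, rng):
    """Concatenate fresh Gaussian noise with the one-hot label (assumed scheme)."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return noise + one_hot(label)


rng = random.Random(0)
first = conditioned_input(7, rng)
second = conditioned_input(7, rng)

print(len(first))                   # 110
print(first[100:] == second[100:])  # True: same label part
print(first[:100] == second[:100])  # False: different noise part
```

The label fixes which digit is requested, while the noise controls stroke style and other variation, so two requests for a "7" give two different images.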

Usage

Environment Setup:

pip install tensorflow imageio tensorflow-docs matplotlib

Generating Random Digits (DCGAN): Load the trained model dcgan_generator.keras and pass a noise vector.

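import tensorflow as tf
import matplotlib.pyplot as plt

# Load the trained generator and sample one 100-dimensional noise vector.
model = tf.keras.models.load_model('dcgan_generator.keras')
noise = tf.random.normal([1, 100])

# training=False runs Batch Normalization in inference mode.
prediction = model(noise, training=False)

# The Tanh output lies in [-1, 1]; imshow rescales it for display.
plt.imshow(prediction[0, :, :, 0], cmap='gray')
plt.show()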

Generating Specific Digits (cGAN): Load the trained model cgan_generator.keras and pass both noise and the target label.

import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model('cgan_generator.keras')
noise = tf.random.normal([1, 100])
label = np.array([7])  # Target digit (0-9)

# The generator takes the noise vector and the class label as two inputs.
prediction = model([noise, label], training=False)

File Structure

Handwritten Digits Generator/
├── generate_handwritten_digit_images_DCGAN.ipynb  # Training notebook
├── dcgan_generator.keras                          # Saved DCGAN model
├── cgan_generator.keras                           # Saved cGAN model
└── training_checkpoints/                          # Training artifacts
