This repository contains two generative modeling projects: a CPU-trained GPT language model and GAN-based handwritten digit generators.
- End-to-End CPU-Trained GPT System: A transformer-based language model trained from scratch on the FineWeb-Edu dataset using only CPU resources.
- Handwritten Digit Generation (DCGAN & cGAN): A Deep Convolutional Generative Adversarial Network (DCGAN) and Conditional GAN (cGAN) trained on the MNIST dataset.
Status: Deployed | Demo: https://helixgpt.azurewebsites.net
This project implements a GPT language model trained entirely on CPU hardware. It covers the full pipeline from raw data acquisition to deployment, demonstrating transformer training feasibility on constrained hardware.
- Architecture: Decoder-only Transformer (GPT)
- Parameters: 11.64 Million
- Context Window: 512 tokens
- Tokenizer: Byte-Level BPE (Vocab size: 20,257)
- Training Hardware: Azure Standard E16as v5 (CPU)
- Inference: FastAPI + Docker (Azure Web App)
| Component | Specification | Comparison to GPT-2 Small |
|---|---|---|
| Layers | 8 | ~0.66x |
| Hidden Size | 256 | ~0.33x |
| Attention Heads | 8 | ~0.66x |
| Total Params | 11M | ~0.10x |
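For orientation, a minimal sketch of how these dimensions might be expressed as a config object; the field names follow common GPT implementations and are not necessarily those used in `src/model.py`:

```python
from dataclasses import dataclass

# Illustrative config mirroring the table above; names are assumptions,
# chosen to match widely used GPT codebases.
@dataclass
class GPTConfig:
    block_size: int = 512    # context window
    vocab_size: int = 20257  # byte-level BPE vocabulary
    n_layer: int = 8         # transformer blocks
    n_head: int = 8          # attention heads
    n_embd: int = 256        # hidden size
```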
The model was trained on the FineWeb-Edu 100M dataset (approx. 100 million tokens).
- Acquisition: Raw text retrieval from HuggingFace.
- Tokenization: Custom-trained BPE tokenizer reserving `<|endoftext|>` (ID 0).
- Serialization: Data encoded to `uint16` binary shards (`.bin`) for memory-mapped loading.
- Splitting: 90% Training / 10% Validation.
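A sketch of this pipeline, assuming the HuggingFace `tokenizers` package and a placeholder corpus filename; the repository's own BPE logic lives in `src/tokenizer.py` and may differ in detail:

```python
import numpy as np
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer; the special token listed first takes ID 0
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=['fineweb_edu.txt'], vocab_size=20257,
                special_tokens=['<|endoftext|>'])

# Encode and serialize: a 20,257-token vocabulary fits comfortably in uint16
ids = tokenizer.encode(open('fineweb_edu.txt').read()).ids
np.array(ids, dtype=np.uint16).tofile('train.bin')

# Memory-mapped loading pages tokens in on demand instead of
# holding the full ~100M-token corpus in RAM
tokens = np.memmap('train.bin', dtype=np.uint16, mode='r')
```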
- Optimizer: AdamW (Beta1=0.9, Beta2=0.95)
- Schedule: Cosine learning rate decay with 1000-step warmup (see the sketch after this list).
- Batch Size: 8
- Duration: ~9.5 hours (24,413 steps / 1 epoch).
- Loss: Initial ~9.95 | Final ~4.75.
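A sketch of the warmup-plus-cosine schedule described above; the peak and floor learning rates are left as parameters, since the repository's values are not stated here:

```python
import math

def get_lr(step, max_lr, min_lr, warmup_steps=1000, max_steps=24413):
    # Linear warmup over the first 1000 steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```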
```
End-to-End CPU-Trained GPT System/
├── app/                  # Azure deployment files
├── bpe_tokenizer/        # BPE vocab and merges
├── checkpoints/          # Model weights (.pt)
├── src/
│   ├── data.py           # Binary data loader
│   ├── model.py          # GPT architecture definition
│   ├── tokenizer.py      # BPE logic
│   └── utils.py          # Configuration utilities
├── train_gpt2.py         # Main training loop
└── prepare_dataset.py    # Data processing script
```
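The `app/` directory wraps the trained checkpoint in the FastAPI service mentioned above. A hypothetical sketch of such an endpoint; the route, request schema, and `run_model` stub are illustrative, not the repository's actual API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

def run_model(prompt: str, max_new_tokens: int) -> str:
    # Stand-in for loading checkpoints/*.pt and sampling from the
    # GPT model defined in src/model.py
    return prompt

@app.post('/generate')
def generate(req: GenerateRequest):
    return {'completion': run_model(req.prompt, req.max_new_tokens)}
```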
This project implements Generative Adversarial Networks to synthesize handwritten digits resembling the MNIST dataset. It includes two distinct architectures: a standard DCGAN for random generation and a Conditional GAN (cGAN) for targeted digit generation.
The DCGAN generates random digit images by learning to map a latent noise distribution onto the distribution of the training images.
Generator Architecture:
- Input: Random noise vector (100 dimensions).
- Dense Layer: Projects noise to 7x7x256 feature map.
- Upsampling: 3x Conv2DTranspose layers with Batch Normalization and LeakyReLU.
- Output: 28x28x1 image (Tanh activation).
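A Keras sketch of this generator, mirroring the layer list above; the intermediate filter counts (128, 64) and the 5x5 kernels are assumptions:

```python
from tensorflow.keras import Sequential, layers

def build_generator():
    return Sequential([
        # Project the 100-dim noise vector to a 7x7x256 feature map
        layers.Dense(7 * 7 * 256, use_bias=False, input_shape=(100,)),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Reshape((7, 7, 256)),
        # Upsample: 7x7 -> 7x7 -> 14x14 -> 28x28
        layers.Conv2DTranspose(128, 5, strides=1, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(64, 5, strides=2, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        # Tanh output scales pixels to [-1, 1] for a 28x28x1 image
        layers.Conv2DTranspose(1, 5, strides=2, padding='same',
                               use_bias=False, activation='tanh'),
    ])
```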
Discriminator Architecture:
- Input: 28x28x1 image.
- Downsampling: 2x Conv2D layers (Strides=2) with LeakyReLU and Dropout (0.3).
- Output: Binary classification (Real vs. Fake).
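A matching sketch of the discriminator; filter counts and kernel size are again assumptions:

```python
from tensorflow.keras import Sequential, layers

def build_discriminator():
    return Sequential([
        # Two strided convolutions downsample 28x28 -> 14x14 -> 7x7
        layers.Conv2D(64, 5, strides=2, padding='same', input_shape=(28, 28, 1)),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Conv2D(128, 5, strides=2, padding='same'),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1),  # raw logit: real vs. fake
    ])
```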
Training Hyperparameters:
- Epochs: 1000
- Batch Size: 256
- Optimizer: Adam (Learning Rate: 0.0002, Beta1: 0.5)
- Loss Function: Binary Crossentropy
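Under these hyperparameters, the standard GAN objectives look like the following; `from_logits=True` assumes the discriminator's final layer emits a raw logit, as in the sketch above:

```python
import tensorflow as tf

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    # Real images should score 1, generated images 0
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # The generator wants its fakes classified as real (1)
    return cross_entropy(tf.ones_like(fake_output), fake_output)

generator_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
```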
The cGAN extends the architecture by conditioning both the generator and the discriminator on class labels (0-9), allowing targeted generation of a specific digit class.
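One common way to implement this conditioning (an assumption about the notebook's exact layout) is to embed the label and merge it with the noise vector before upsampling:

```python
from tensorflow.keras import Model, layers

def build_cgan_generator(latent_dim=100, num_classes=10):
    noise = layers.Input(shape=(latent_dim,))
    label = layers.Input(shape=(1,), dtype='int32')
    # Embed the class label (0-9) into a vector the same size as the noise
    label_vec = layers.Flatten()(layers.Embedding(num_classes, latent_dim)(label))
    # Conditioning: concatenate noise and label embedding
    x = layers.Concatenate()([noise, label_vec])
    x = layers.Dense(7 * 7 * 256, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Reshape((7, 7, 256))(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    img = layers.Conv2DTranspose(1, 5, strides=2, padding='same', activation='tanh')(x)
    return Model([noise, label], img)
```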
Environment Setup:

```bash
pip install tensorflow imageio tensorflow-docs
```

Generating Random Digits (DCGAN):
Load the trained model `dcgan_generator.keras` and pass a noise vector.

```python
import tensorflow as tf
import matplotlib.pyplot as plt

# Load the saved generator and sample one image from random noise
model = tf.keras.models.load_model('dcgan_generator.keras')
noise = tf.random.normal([1, 100])  # 100-dim latent vector
prediction = model(noise, training=False)

# The generator outputs a 28x28x1 image in [-1, 1] (tanh)
plt.imshow(prediction[0, :, :, 0], cmap='gray')
plt.show()
```

Generating Specific Digits (cGAN):
Load the trained model `cgan_generator.keras` and pass both noise and the target label.

```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Load the saved conditional generator
model = tf.keras.models.load_model('cgan_generator.keras')
noise = tf.random.normal([1, 100])
label = np.array([7])  # Specify digit here (0-9)

# Condition generation on the label to produce the requested digit
prediction = model([noise, label], training=False)
plt.imshow(prediction[0, :, :, 0], cmap='gray')
plt.show()
```

```
Handwritten Digits Generator/
├── generate_handwritten_digit_images_DCGAN.ipynb  # Training notebook
├── dcgan_generator.keras                          # Saved DCGAN model
├── cgan_generator.keras                           # Saved cGAN model
└── training_checkpoints/                          # Training artifacts
```