adilsondias-engineer/24-cpp-order-gateway

Project 24: Order Gateway (Low-Latency PCIe Passthrough)

Part of FPGA Trading Systems Portfolio

This project is part of a complete end-to-end trading system:

  • Main Repository: fpga-trading-systems
  • Project Number: 24 of 30
  • Category: C++ Application
  • Dependencies: Project 23 (Order Book - PCIe output), Project 25 (Market Maker - Disruptor consumer)

Platform: Linux
Technology: C++20, PCIe (XDMA), LMAX Disruptor
Status: Completed


Overview

The Order Gateway is the ultra-low-latency PCIe passthrough layer of the FPGA trading system, bridging the FPGA hardware and the downstream trading components. It reads BBO (Best Bid/Offer) data from the FPGA over PCIe DMA and immediately publishes it to a lock-free LMAX Disruptor ring buffer for downstream processing.

Design Decision: XGBoost inference was relocated to Project 25 for pipeline parallelism. P24 can process the next BBO while P25 runs GPU inference on the previous one, so inference time no longer stalls the gateway and effective latency under load is reduced.

Data Flow:

FPGA Order Book (P23) → PCIe DMA → Parse BBO → Validate → Disruptor → Market Maker (P25)

Key Components:

  • PCIeListenerV2: Reads 56-byte BBO packets with magic header from /dev/xdma0_c2h_0
  • Magic Header Sync: Reliable packet boundary detection using 0xBB0BB048 marker
  • BBO Validation: Filters corrupted or invalid data
  • Disruptor Producer: Lock-free IPC to Project 25 (~0.5 μs publish)
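
The validation rules themselves are not listed in this README. As a rough illustration of what "filters corrupted or invalid data" could mean, the sketch below applies a few plausibility checks. The `Bbo` struct, its field names, and every rule here are assumptions made for the example, not the project's `BBOValidator`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded BBO. Field names are illustrative only.
struct Bbo {
    std::uint32_t bid_price;  // fixed-point, 4 decimal places
    std::uint32_t bid_size;   // shares
    std::uint32_t ask_price;  // fixed-point, 4 decimal places
    std::uint32_t ask_size;   // shares
    std::uint32_t spread;     // ask - bid, fixed-point
};

// Assumed rules: both sides populated, market neither crossed nor locked,
// and the spread field consistent with the two prices.
inline bool is_plausible_bbo(const Bbo& b) {
    if (b.bid_price == 0 || b.ask_price == 0) return false;
    if (b.bid_size == 0 || b.ask_size == 0) return false;
    if (b.ask_price <= b.bid_price) return false;

    const std::uint32_t expected = b.ask_price - b.bid_price;
    assert(expected > 0);  // guaranteed by the crossed-market check above
    return b.spread == expected;
}
```

A real validator may need to accept one-sided books or locked markets, so treat these rules as a starting point rather than a specification.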

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                   Project 24: Order Gateway                       │
│                                                                   │
│  ┌─────────────────┐     ┌──────────────────────────┐            │
│  │  PCIe Listener  │────→│     BBO Parser           │            │
│  │  /dev/xdma0_    │     │  (56-byte packets)       │            │
│  │  c2h_0          │     │  Magic Header Sync       │            │
│  └─────────────────┘     │  Symbol, Bid/Ask, Spread │            │
│   DMA from FPGA (P23)    └──────────┬───────────────┘            │
│   Gen2 x1 (5 Gbps)                  │                             │
│                                     ↓                             │
│                           ┌──────────────────────┐                │
│                           │   BBO Validator      │                │
│                           │   (Filter invalid)   │                │
│                           └──────────┬───────────┘                │
│                                      │                            │
│                                      ↓                            │
│                           ┌──────────────────────┐                │
│                           │  Disruptor Producer  │                │
│                           │  (Lock-Free Publish) │                │
│                           │  Raw BBO Data        │                │
│                           └──────────┬───────────┘                │
│                                      │                            │
└──────────────────────────────────────┼────────────────────────────┘
                                       │
                  POSIX Shared Memory (/dev/shm/bbo_ring_gateway)
                  Ring Buffer: 1024 entries × 128 bytes = 131 KB
                  Lock-Free IPC: Atomic sequence numbers
                                       │
┌──────────────────────────────────────┼────────────────────────────┐
│                                      ↓                            │
│                           ┌──────────────────────┐                │
│                           │  Disruptor Consumer  │                │
│                           │  (Lock-Free Poll)    │                │
│                           └──────────────────────┘                │
│                                      │                            │
│                                      ↓                            │
│                           ┌──────────────────────┐                │
│                           │  XGBoost Predictor   │                │
│                           │  (GPU via CUDA 13.0) │                │
│                           │  84% accuracy        │                │
│                           └──────────────────────┘                │
│                                                                   │
│                   Project 25: Market Maker FSM                    │
└───────────────────────────────────────────────────────────────────┘

Features

1. PCIe Listener with Magic Header Sync

Reads BBO data directly from FPGA via PCIe DMA with reliable packet synchronization:

  • Device: /dev/xdma0_c2h_0 (Xilinx XDMA driver)
  • Direction: Card-to-Host (C2H) - FPGA sends data to host
  • Packet Size: 56 bytes per BBO update (with magic header)
  • Bandwidth: PCIe Gen2 x1 (5 GT/s raw, about 4 Gbps usable after 8b/10b encoding); each read takes ~1-2 μs
  • Sync Method: Magic header scanning (0x48B00BBB as little-endian uint32)

BBO Packet Format (56 bytes with Magic Header):

| Offset (bytes) | Size (bytes) | Field         | Description                                      |
|----------------|--------------|---------------|--------------------------------------------------|
| 0-3            | 4            | Magic Header  | 0xBB0BB048 (read as 0x48B00BBB on little-endian) |
| 4-7            | 4            | Packet Length | 0x00000038 (read as 0x38000000 on little-endian) |
| 8-15           | 8            | Symbol        | Stock ticker (ASCII, space-padded)               |
| 16-19          | 4            | Bid Price     | Best bid (4 decimal places)                      |
| 20-23          | 4            | Bid Size      | Bid shares                                       |
| 24-27          | 4            | Ask Price     | Best ask (4 decimal places)                      |
| 28-31          | 4            | Ask Size      | Ask shares                                       |
| 32-35          | 4            | Spread        | Ask - Bid (4 decimal places)                     |
| 36-39          | 4            | T1            | ITCH parse timestamp (125 MHz cycles)            |
| 40-43          | 4            | T2            | CDC FIFO write timestamp (125 MHz cycles)        |
| 44-47          | 4            | T3            | BBO ready for PCIe (250 MHz cycles)              |
| 48-51          | 4            | T4            | AXI-Stream TX start (250 MHz cycles)             |
| 52-55          | 4            | Reserved      | Padding                                          |
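
The layout above can be mirrored in a packed struct and checked at compile time. The sketch below is one way to do that. The struct name `BboPacketSketch` and every field name after `symbol` are hypothetical, and the two helpers simply apply the "4 decimal places" and "space-padded" notes from the table. The byte order of the payload fields is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative mirror of the 56-byte layout. Names after `symbol` are assumed.
struct __attribute__((packed)) BboPacketSketch {
    std::uint32_t magic;      // bytes 0-3
    std::uint32_t length;     // bytes 4-7
    char          symbol[8];  // bytes 8-15, ASCII, space-padded
    std::uint32_t bid_price;  // bytes 16-19, 4 decimal places
    std::uint32_t bid_size;   // bytes 20-23
    std::uint32_t ask_price;  // bytes 24-27, 4 decimal places
    std::uint32_t ask_size;   // bytes 28-31
    std::uint32_t spread;     // bytes 32-35, 4 decimal places
    std::uint32_t t1;         // bytes 36-39
    std::uint32_t t2;         // bytes 40-43
    std::uint32_t t3;         // bytes 44-47
    std::uint32_t t4;         // bytes 48-51
    std::uint32_t reserved;   // bytes 52-55
};

// Compile-time checks that the struct matches the table.
static_assert(sizeof(BboPacketSketch) == 56, "packet must be 56 bytes");
static_assert(offsetof(BboPacketSketch, bid_price) == 16, "bid at offset 16");
static_assert(offsetof(BboPacketSketch, t1) == 36, "T1 at offset 36");
static_assert(offsetof(BboPacketSketch, reserved) == 52, "padding at offset 52");

// Convert a fixed-point price with 4 implied decimals to a double.
inline double fixed4_to_double(std::uint32_t raw) { return raw / 10000.0; }

// Strip trailing spaces (and NULs) from the padded ticker field.
inline std::string trim_symbol(const char* sym, std::size_t len = 8) {
    assert(sym != nullptr);
    while (len > 0 && (sym[len - 1] == ' ' || sym[len - 1] == '\0')) --len;
    return std::string(sym, len);
}
```

With these helpers, a raw bid of 3247300 corresponds to the 324.7300 shown in the sample log output.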

Magic Header Synchronization

The PCIeListenerV2 uses magic header scanning for reliable packet boundary detection:

// Magic header constants (byte-swapped for little-endian)
constexpr uint32_t BBO_MAGIC_HEADER = 0x48B00BBB;   // FPGA sends: BB 0B B0 48
constexpr uint32_t BBO_PACKET_LENGTH = 0x38000000;  // FPGA sends: 00 00 00 38

Sync Algorithm:

  1. Scan incoming DMA data for magic header pattern
  2. Verify packet length field matches expected value
  3. If both match, packet boundary is found
  4. Parse BBO data from offset 8 onwards
  5. If sync lost, rescan for next magic header
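
The steps above can be sketched as a byte-by-byte scan over the DMA buffer. This is a minimal illustration, assuming a little-endian host. The function names `load_u32` and `find_packet` are made up for the example and are not taken from `pcie_listener_v2.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

constexpr std::uint32_t kMagicLE    = 0x48B00BBB;  // wire bytes: BB 0B B0 48
constexpr std::uint32_t kLengthLE   = 0x38000000;  // wire bytes: 00 00 00 38
constexpr std::size_t   kPacketSize = 56;

// Load a uint32 from a possibly unaligned position (host byte order).
inline std::uint32_t load_u32(const std::uint8_t* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Return the offset of the first complete packet at or after `start`,
// or std::nullopt if no magic + length pair with 56 bytes behind it is found.
inline std::optional<std::size_t> find_packet(const std::uint8_t* buf,
                                              std::size_t len,
                                              std::size_t start = 0) {
    assert(buf != nullptr || len == 0);
    for (std::size_t off = start; off + kPacketSize <= len; ++off) {
        if (load_u32(buf + off) == kMagicLE &&
            load_u32(buf + off + 4) == kLengthLE) {
            return off;  // steps 1-3: both header words match
        }
    }
    return std::nullopt;  // step 5: caller keeps the tail and rescans later
}
```

Once an offset is returned, BBO fields start 8 bytes after it (step 4). Checking the length word in addition to the magic reduces the chance of locking onto payload bytes that happen to look like the header.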

2. Disruptor IPC (Ultra-Low-Latency)

LMAX Disruptor pattern for lock-free inter-process communication:

  • Shared Memory: /dev/shm/bbo_ring_gateway (POSIX shm)
  • Ring Buffer Size: 1024 entries × 128 bytes = 131,072 bytes (128 KiB)
  • IPC Method: Lock-free atomic operations (memory_order_acquire/release)
  • Consumer: Project 25 (Market Maker FSM)
  • Performance: ~0.5 μs publish latency

Disruptor Pattern Benefits:

  • Zero-copy shared memory (no TCP/socket overhead)
  • Lock-free synchronization (atomic sequence numbers)
  • Cache-line aligned structures (prevents false sharing)
  • Power-of-2 ring buffer (fast modulo using bitwise AND)
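
To make these properties concrete, here is a simplified single-producer/single-consumer ring that uses the same ingredients: cache-line alignment, acquire/release sequence counters, and index masking. It is an in-process sketch under those assumptions. The project's actual ring lives in POSIX shared memory, and `SpscRing` and its member names are invented for this example.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kCacheLine  = 64;
constexpr std::size_t kEntries    = 1024;  // must be a power of two
constexpr std::size_t kEntryBytes = 128;
static_assert((kEntries & (kEntries - 1)) == 0, "ring size must be 2^n");

struct alignas(kCacheLine) Slot {
    std::uint8_t bytes[kEntryBytes];
};
static_assert(sizeof(Slot) * kEntries == 131072, "1024 x 128 bytes");

struct SpscRing {
    // Producer and consumer counters sit on separate cache lines
    // so that updating one does not invalidate the other.
    alignas(kCacheLine) std::atomic<std::uint64_t> head{0};  // next slot to write
    alignas(kCacheLine) std::atomic<std::uint64_t> tail{0};  // next slot to read
    std::array<Slot, kEntries> slots{};

    bool try_publish(const void* data, std::size_t len) {
        assert(len <= kEntryBytes);
        const std::uint64_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == kEntries) return false;  // full
        std::memcpy(slots[h & (kEntries - 1)].bytes, data, len);  // mask, not modulo
        head.store(h + 1, std::memory_order_release);  // makes the slot visible
        return true;
    }

    bool try_consume(void* out, std::size_t len) {
        assert(len <= kEntryBytes);
        const std::uint64_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;  // empty
        std::memcpy(out, slots[t & (kEntries - 1)].bytes, len);
        tail.store(t + 1, std::memory_order_release);  // frees the slot
        return true;
    }
};
```

A cross-process version would place the same structure in a region obtained from `shm_open` and `mmap`, which additionally requires that the atomics be lock-free on the target platform.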

Configuration

config.json:

{
  "log_level": "info",
  "pcie": {
    "device": "/dev/xdma0_c2h_0",
    "gen2_mode": true,
    "verbose": false,
    "poll_mode": "BUSY_SPIN",
    "poll_timeout_us": 100,
    "spin_iterations": 1000
  },
  "disruptor": {
    "enable": true,
    "shm_name": "gateway"
  },
  "performance": {
    "enable_rt": false,
    "rt_cpu": 4,
    "rt_priority": 80,
    "quiet_mode": false,
    "latency_sample_rate": 100
  }
}

Configuration Options

| Option                          | Description                                  | Default          |
|---------------------------------|----------------------------------------------|------------------|
| log_level                       | Log level: trace, debug, info, warn, error   | info             |
| pcie.device                     | PCIe XDMA device path                        | /dev/xdma0_c2h_0 |
| pcie.gen2_mode                  | Enable PCIe Gen2 mode (250 MHz for T3/T4)    | true             |
| pcie.verbose                    | Enable verbose packet logging                | false            |
| pcie.poll_mode                  | Polling mode: SLEEP, POLL, HYBRID, BUSY_SPIN | BUSY_SPIN        |
| pcie.poll_timeout_us            | Timeout for POLL/HYBRID modes (μs)           | 100              |
| pcie.spin_iterations            | Spin iterations for HYBRID mode              | 1000             |
| disruptor.enable                | Enable Disruptor IPC                         | true             |
| disruptor.shm_name              | Shared memory name suffix                    | gateway          |
| performance.enable_rt           | Enable SCHED_FIFO real-time scheduling       | false            |
| performance.rt_cpu              | CPU core for listener thread affinity        | 4                |
| performance.rt_priority         | SCHED_FIFO priority (1-99)                   | 80               |
| performance.quiet_mode          | Suppress console output                      | false            |
| performance.latency_sample_rate | Sample every Nth packet for latency stats    | 100              |

Polling Modes

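| Mode      | Latency | CPU Usage | Description                        |
|-----------|---------|-----------|------------------------------------|
| SLEEP     | ~10 ms  | ~1%       | Sleep-based polling, lowest CPU    |
| POLL      | ~100 μs | ~5%       | poll() syscall, moderate latency   |
| HYBRID    | ~50 μs  | ~20%      | Adaptive poll + spin               |
| BUSY_SPIN | <10 μs  | 100%      | Pure busy-spin, lowest latency     |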
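
As an illustration of the HYBRID trade-off, the sketch below first checks the descriptor a bounded number of times without blocking, then falls back to a timed wait. It is a generic spin-then-poll pattern written against plain POSIX/Linux calls, not the listener's actual code. The function name `hybrid_read` is hypothetical, its parameters only loosely mirror `spin_iterations` and `poll_timeout_us`, and whether the XDMA character device supports `poll()` this way is assumed here.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <ctime>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

// Returns bytes read (> 0), 0 on timeout, or -1 on error.
inline ssize_t hybrid_read(int fd, void* buf, std::size_t len,
                           int spin_iterations, long poll_timeout_us) {
    assert(buf != nullptr && len > 0);
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    // Phase 1: spin with zero-timeout polls to catch data that is
    // already there or arrives almost immediately.
    for (int i = 0; i < spin_iterations; ++i) {
        const int r = ::poll(&pfd, 1, 0);
        if (r > 0) return ::read(fd, buf, len);
        if (r < 0 && errno != EINTR) return -1;
    }

    // Phase 2: block for at most poll_timeout_us so the core can idle.
    timespec ts{};
    ts.tv_sec  = poll_timeout_us / 1000000;
    ts.tv_nsec = (poll_timeout_us % 1000000) * 1000;
    const int r = ::ppoll(&pfd, 1, &ts, nullptr);
    if (r > 0) return ::read(fd, buf, len);
    return r;  // 0 = timeout, -1 = error
}
```

Setting `spin_iterations` to 0 degenerates to the POLL row of the table, while a very large value approaches BUSY_SPIN behaviour at the cost of a fully loaded core.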

Build Instructions

Prerequisites

  • Linux (Ubuntu 22.04+ recommended)
  • GCC 11+ or Clang 14+ (C++20 support)
  • CMake 3.20+
  • Xilinx XDMA driver installed

Dependencies

# Install system dependencies
sudo apt-get install -y build-essential cmake

# Install vcpkg dependencies
vcpkg install spdlog nlohmann-json

Build

mkdir build
cd build
cmake ..
make -j$(nproc)

XDMA Driver Setup

# Clone and build Xilinx XDMA driver
git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
make
sudo make install

# Load driver
sudo modprobe xdma

# Verify device
ls -la /dev/xdma0_*

Usage

# Use default config.json
./order_gateway

# Use custom config file
./order_gateway /path/to/config.json

# With RT optimizations (requires CAP_SYS_NICE and performance.enable_rt set to true in config.json)
sudo setcap cap_sys_nice=eip ./order_gateway
./order_gateway

Expected Output:

[order_gateway] [info] Order Gateway starting...
[order_gateway] [info] [PCIe V2] Packet size: 56 bytes, verbose: false
[order_gateway] [info] [PCIe V2] Synchronized to packet stream (magic header found)
[order_gateway] [info] [SPY] Bid: 324.7300 (22436) | Ask: 324.7600 (72756) | Spread: 0.0300 | FPGA: 0.172 us
[order_gateway] [info] [AAPL] Bid: 312.0000 (70) | Ask: 321.6700 (120) | Spread: 9.6700 | FPGA: 0.308 us

Performance

Expected Latency Breakdown

| Stage             | Latency     | Notes                    |
|-------------------|-------------|--------------------------|
| PCIe DMA Read     | ~1-2 μs     | XDMA C2H transfer        |
| Magic Header Scan | ~0.05 μs    | Byte pattern matching    |
| BBO Parse         | ~0.1-0.5 μs | Binary to struct         |
| Validation        | ~0.1 μs     | Filter invalid data      |
| Disruptor Publish | ~0.5 μs     | Lock-free IPC            |
| Total             | ~2-4 μs     | End-to-end (passthrough) |

FPGA Latency Measurement

The FPGA timestamps enable measuring FPGA-side latency in two stages:

Clock Domains:

  • T1/T2: RGMII RX domain (125 MHz, 8 ns/cycle)
  • T3/T4: AXI/XDMA domain (Gen2: 250 MHz = 4 ns/cycle, Gen1: 125 MHz = 8 ns/cycle)

Latency Breakdown:

  • Latency A: (T2 - T1) × 8 ns, from ITCH parse to CDC FIFO write
  • Latency B: (T4 - T3) × 4 ns in Gen2 mode (× 8 ns in Gen1), from BBO ready to AXI-Stream TX start
  • Total FPGA Latency: Latency A + Latency B

Typical FPGA latency observed: 0.17-0.31 μs (Latency B only)
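
The conversion from cycle counts to nanoseconds can be written out as follows. This is a small sketch that assumes T1-T4 are free-running 32-bit counters, so unsigned subtraction also covers a single rollover between two timestamps. The names `cycles_to_ns` and `fpga_latency` are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed time between two 32-bit cycle counters, in nanoseconds.
// The subtraction wraps modulo 2^32, so one counter rollover is tolerated.
inline std::uint64_t cycles_to_ns(std::uint32_t start, std::uint32_t end,
                                  std::uint32_t ns_per_cycle) {
    assert(ns_per_cycle > 0);
    const std::uint32_t cycles = end - start;
    return static_cast<std::uint64_t>(cycles) * ns_per_cycle;
}

struct FpgaLatency {
    std::uint64_t a_ns;      // Latency A: T1 -> T2 (125 MHz, 8 ns/cycle)
    std::uint64_t b_ns;      // Latency B: T3 -> T4 (4 ns Gen2, 8 ns Gen1)
    std::uint64_t total_ns;  // A + B
};

inline FpgaLatency fpga_latency(std::uint32_t t1, std::uint32_t t2,
                                std::uint32_t t3, std::uint32_t t4,
                                bool gen2_mode) {
    const std::uint64_t a = cycles_to_ns(t1, t2, 8);
    const std::uint64_t b = cycles_to_ns(t3, t4, gen2_mode ? 4 : 8);
    return FpgaLatency{a, b, a + b};
}
```

For example, a T4 - T3 difference of 43 cycles in Gen2 mode gives 43 × 4 ns = 172 ns, the same magnitude as the 0.172 us figure in the sample log output.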

Note: XGBoost GPU inference (~10-100 μs) is performed in Project 25, enabling pipeline parallelism where P24 processes the next BBO while P25 runs inference on the previous one.


Code Structure

24-order-gateway/
├── config.json               # Configuration file
├── CMakeLists.txt            # Build configuration
├── src/
│   ├── main.cpp              # Entry point
│   ├── order_gateway.cpp     # Main gateway orchestration
│   ├── pcie_listener_v2.cpp  # PCIe DMA reader with magic header sync
│   ├── bbo_parser.cpp        # Binary BBO parser
│   └── bbo_validator.cpp     # BBO validation
├── include/
│   ├── order_gateway.h
│   ├── pcie_listener_v2.h    # V2 listener with 56-byte packet support
│   ├── bbo_parser.h
│   ├── bbo_validator.h
│   └── common/
│       ├── perf_monitor.h    # Performance monitoring
│       └── rt_config.h       # RT scheduling utilities
└── vcpkg.json                # Dependency manifest

January 2026 Updates

Magic Header Packet Format

Updated to support Project 23's new 56-byte packet format with magic header:

  • Packet size: Changed from 48 to 56 bytes
  • Magic header: 0xBB0BB048 at packet offset 0
  • Packet length: 0x00000038 at packet offset 4
  • Byte order: FPGA uses byte_swap_64, host reads as little-endian uint32

Code Changes:

// pcie_listener_v2.cpp - Updated constants
constexpr uint32_t BBO_MAGIC_HEADER = 0x48B00BBB;   // Little-endian
constexpr uint32_t BBO_PACKET_LENGTH = 0x38000000;  // Little-endian

// BboPacketRaw struct updated to 56 bytes with header fields
struct __attribute__((packed)) BboPacketRaw {
    uint32_t magic;          // Bytes 0-3: Magic header
    uint32_t length;         // Bytes 4-7: Packet length
    char     symbol[8];      // Bytes 8-15: Symbol
    // ... rest of fields
};

Known Issues

Occasional Resync: The magic header sync occasionally needs to rescan when packet boundaries shift. This occurs even with IOMMU disabled and doesn't affect overall functionality. The Phase 3 multi-FPGA architecture will use Aurora GTX links instead of PCIe, eliminating this issue.

XDMA Driver Byte Count: The XDMA driver may return more bytes than requested in some cases. The PCIeListenerV2 implements a workaround that clamps the byte count to prevent ring buffer overflow.
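
A clamp of this kind can be very small. The helper below is a hedged sketch of the idea, not the code in `PCIeListenerV2`. The name `usable_bytes` is hypothetical, and treating errors as zero usable bytes is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sys/types.h>  // ssize_t

// Number of bytes that may safely be consumed after a read() call.
// Errors and empty reads yield 0; oversized counts are clamped to
// what was requested, so a destination buffer sized for the request
// cannot be overrun.
inline std::size_t usable_bytes(ssize_t returned, std::size_t requested) {
    if (returned <= 0) return 0;
    const std::size_t n =
        std::min(static_cast<std::size_t>(returned), requested);
    assert(n <= requested);  // invariant relied on by the caller
    return n;
}
```

A caller would pass the raw `read()` result through this helper before advancing its write position in the receive buffer.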


Related Projects


Data Flow (Full System)

FPGA (Project 23)
    ↓ PCIe Gen2 x1 (/dev/xdma0_c2h_0)
    ↓ 56-byte BBO packets with magic header
Project 24: Order Gateway  ← YOU ARE HERE
    ├─ PCIeListenerV2 (magic header sync)
    ├─ BBOValidator (filter invalid data)
    └─ Disruptor Producer (raw BBO)
    ↓ Shared Memory (/dev/shm/bbo_ring_gateway)
Project 25: Market Maker
    ├─ Disruptor Consumer (raw BBO)
    ├─ XGBoostPredictor (81% accuracy, ~10-100μs)
    ├─ MarketMakerFSM (strategy logic)
    └─ OrderProducer (orders to P26)
    ↓ Shared Memory (/dev/shm/order_ring_mm)
Project 26: Order Execution
    ├─ Order Consumer
    ├─ Simulated Fill (~50 μs latency)
    └─ Fill Producer (fills to P25)
    ↓ Shared Memory (/dev/shm/fill_ring_oe)
Project 25: (receives fills)
    └─ Updates position, PnL

Build Time: ~30 seconds
Hardware Status: Tested with FPGA PCIe transmitter (Project 23)
Last Updated: January 2026

About

C++ order gateway v2 - Enhanced performance and reliability
