Project 24: Order Gateway (Low-Latency PCIe Passthrough)

Part of FPGA Trading Systems Portfolio

This project is part of a complete end-to-end trading system:

Main Repository: fpga-trading-systems
Project Number: 24 of 30
Category: C++ Application
Dependencies: Project 23 (Order Book - PCIe output), Project 25 (Market Maker - Disruptor consumer)

Platform: Linux
Technology: C++20, PCIe (XDMA), LMAX Disruptor
Status: Completed

Overview

The Order Gateway is the ultra-low-latency PCIe passthrough layer of the FPGA trading system. It bridges the FPGA hardware and the downstream trading components: it reads BBO (Best Bid/Offer) data from the FPGA via PCIe DMA and immediately publishes it over LMAX Disruptor-style lock-free IPC for downstream processing.

Design Decision: XGBoost inference was relocated to Project 25 for pipeline parallelism. P24 can process the next BBO while P25 runs GPU inference on the previous one, so inference time stays off the gateway's critical path.

Data Flow:

FPGA Order Book (P23) → PCIe DMA → Parse BBO → Validate → Disruptor → Market Maker (P25)

Key Components:

PCIeListenerV2: Reads 56-byte BBO packets with magic header from /dev/xdma0_c2h_0
Magic Header Sync: Reliable packet boundary detection using 0xBB0BB048 marker
BBO Validation: Filters corrupted or invalid data
Disruptor Producer: Lock-free IPC to Project 25 (~0.5 μs publish)
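
The validation step is described only as filtering corrupted or invalid data, so the following is a minimal sketch of what such a filter might check, not the project's actual rules. The struct and function names are hypothetical. The symbol and spread checks follow from the packet format (space-padded ASCII symbol, spread equal to ask minus bid); rejecting crossed quotes is an added assumption.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical view of the raw fixed-point fields (prices scaled by 10^4).
struct BboFields {
    std::string_view symbol;  // expected: 8 ASCII characters, space-padded
    std::uint32_t bid_price;
    std::uint32_t bid_size;
    std::uint32_t ask_price;
    std::uint32_t ask_size;
    std::uint32_t spread;
};

// The symbol must be exactly 8 printable ASCII bytes and must not start with padding.
inline bool symbol_is_valid(std::string_view s) {
    if (s.size() != 8 || s[0] == ' ') return false;
    for (char c : s) {
        if (c < 0x20 || c > 0x7E) return false;
    }
    return true;
}

inline bool bbo_is_valid(const BboFields& b) {
    if (!symbol_is_valid(b.symbol)) return false;
    // Assumption: a crossed quote (ask below bid) is treated as corrupt.
    if (b.ask_price < b.bid_price) return false;
    // The spread field is defined as Ask - Bid, so it can be cross-checked.
    return b.spread == b.ask_price - b.bid_price;
}
```

Checking the spread against the two prices is cheap and catches most torn or misaligned packets, because three independent fields have to agree.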

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                   Project 24: Order Gateway                       │
│                                                                   │
│  ┌─────────────────┐     ┌──────────────────────────┐            │
│  │  PCIe Listener  │────→│     BBO Parser           │            │
│  │  /dev/xdma0_    │     │  (56-byte packets)       │            │
│  │  c2h_0          │     │  Magic Header Sync       │            │
│  └─────────────────┘     │  Symbol, Bid/Ask, Spread │            │
│   DMA from FPGA (P23)    └──────────┬───────────────┘            │
│   Gen2 x1 (5 Gbps)                  │                             │
│                                     ↓                             │
│                           ┌──────────────────────┐                │
│                           │   BBO Validator      │                │
│                           │   (Filter invalid)   │                │
│                           └──────────┬───────────┘                │
│                                      │                            │
│                                      ↓                            │
│                           ┌──────────────────────┐                │
│                           │  Disruptor Producer  │                │
│                           │  (Lock-Free Publish) │                │
│                           │  Raw BBO Data        │                │
│                           └──────────┬───────────┘                │
│                                      │                            │
└──────────────────────────────────────┼────────────────────────────┘
                                       │
                  POSIX Shared Memory (/dev/shm/bbo_ring_gateway)
                  Ring Buffer: 1024 entries × 128 bytes = 131 KB
                  Lock-Free IPC: Atomic sequence numbers
                                       │
┌──────────────────────────────────────┼────────────────────────────┐
│                                      ↓                            │
│                           ┌──────────────────────┐                │
│                           │  Disruptor Consumer  │                │
│                           │  (Lock-Free Poll)    │                │
│                           └──────────────────────┘                │
│                                      │                            │
│                                      ↓                            │
│                           ┌──────────────────────┐                │
│                           │  XGBoost Predictor   │                │
│                           │  (GPU via CUDA 13.0) │                │
│                           │  84% accuracy        │                │
│                           └──────────────────────┘                │
│                                                                   │
│                   Project 25: Market Maker FSM                    │
└───────────────────────────────────────────────────────────────────┘

Features

1. PCIe Listener with Magic Header Sync

Reads BBO data directly from FPGA via PCIe DMA with reliable packet synchronization:

Device: /dev/xdma0_c2h_0 (Xilinx XDMA driver)
Direction: Card-to-Host (C2H) - FPGA sends data to host
Packet Size: 56 bytes per BBO update (with magic header)
Bandwidth: PCIe Gen2 x1 (5 GT/s raw, about 4 Gbps after 8b/10b encoding; ~1-2 μs per read)
Sync Method: Magic header scanning (0x48B00BBB as little-endian uint32)

BBO Packet Format (56 bytes with Magic Header):

Offset	Size	Field	Description
0-3	4	Magic Header	0xBB0BB048 (read as 0x48B00BBB on little-endian)
4-7	4	Packet Length	0x00000038 (read as 0x38000000 on little-endian)
8-15	8	Symbol	Stock ticker (ASCII, space-padded)
16-19	4	Bid Price	Best bid (4 decimal places)
20-23	4	Bid Size	Bid shares
24-27	4	Ask Price	Best ask (4 decimal places)
28-31	4	Ask Size	Ask shares
32-35	4	Spread	Ask - Bid (4 decimal places)
36-39	4	T1	ITCH parse timestamp (125 MHz cycles)
40-43	4	T2	CDC FIFO write timestamp (125 MHz cycles)
44-47	4	T3	BBO ready for PCIe (250 MHz cycles)
48-51	4	T4	AXI-Stream TX start (250 MHz cycles)
52-55	4	Reserved	Padding
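
As an illustration of the layout above, the sketch below decodes one 56-byte packet into host-side values. The names `BboDecoded`, `load_be32`, and `decode_bbo` are hypothetical. It assumes every 32-bit data field arrives in the same wire byte order as the magic header (most significant byte first), which the source does not state explicitly for the data fields.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical decoded form of one BBO packet (field names are illustrative).
struct BboDecoded {
    std::string symbol;        // trailing padding spaces removed
    double bid_price = 0.0;    // converted from 4-decimal fixed point
    std::uint32_t bid_size = 0;
    double ask_price = 0.0;
    std::uint32_t ask_size = 0;
    double spread = 0.0;
    std::uint32_t t1 = 0, t2 = 0, t3 = 0, t4 = 0;  // raw cycle counters
};

// Assumption: fields are sent most significant byte first, like the magic header.
inline std::uint32_t load_be32(const std::uint8_t* p) {
    return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
           (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// Prices carry 4 decimal places, so 3247300 means 324.7300.
inline double fixed4_to_double(std::uint32_t raw) { return raw / 10000.0; }

// pkt must point at the magic header of a complete 56-byte packet.
inline BboDecoded decode_bbo(const std::uint8_t* pkt) {
    BboDecoded out;
    out.symbol.assign(reinterpret_cast<const char*>(pkt + 8), 8);
    while (!out.symbol.empty() && out.symbol.back() == ' ') out.symbol.pop_back();
    out.bid_price = fixed4_to_double(load_be32(pkt + 16));
    out.bid_size  = load_be32(pkt + 20);
    out.ask_price = fixed4_to_double(load_be32(pkt + 24));
    out.ask_size  = load_be32(pkt + 28);
    out.spread    = fixed4_to_double(load_be32(pkt + 32));
    out.t1 = load_be32(pkt + 36);
    out.t2 = load_be32(pkt + 40);
    out.t3 = load_be32(pkt + 44);
    out.t4 = load_be32(pkt + 48);
    return out;
}
```

Reading fields byte by byte avoids unaligned loads and keeps the decoder independent of host endianness, at the cost of a few extra shifts per field.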

Magic Header Synchronization

The PCIeListenerV2 uses magic header scanning for reliable packet boundary detection:

// Magic header constants (byte-swapped for little-endian)
constexpr uint32_t BBO_MAGIC_HEADER = 0x48B00BBB;   // FPGA sends: BB 0B B0 48
constexpr uint32_t BBO_PACKET_LENGTH = 0x38000000;  // FPGA sends: 00 00 00 38

Sync Algorithm:

Scan incoming DMA data for magic header pattern
Verify packet length field matches expected value
If both match, packet boundary is found
Parse BBO data from offset 8 onwards
If sync lost, rescan for next magic header
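
The sync steps above can be sketched as a byte-wise scan over a buffer of DMA data. This is a minimal illustration rather than the PCIeListenerV2 code: `find_packet` and `BBO_PACKET_SIZE` are hypothetical names, and the comparison against the byte-swapped constants assumes a little-endian host.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

constexpr std::uint32_t BBO_MAGIC_HEADER  = 0x48B00BBB;  // wire bytes: BB 0B B0 48
constexpr std::uint32_t BBO_PACKET_LENGTH = 0x38000000;  // wire bytes: 00 00 00 38
constexpr std::size_t   BBO_PACKET_SIZE   = 56;          // hypothetical constant name

// Returns the offset of the first complete packet in buf[0, len), if any.
// Assumes a little-endian host, matching the byte-swapped constants above.
inline std::optional<std::size_t> find_packet(const std::uint8_t* buf, std::size_t len) {
    for (std::size_t off = 0; off + BBO_PACKET_SIZE <= len; ++off) {
        std::uint32_t magic = 0;
        std::uint32_t length = 0;
        // memcpy avoids unaligned access, since off can be any byte position.
        std::memcpy(&magic, buf + off, sizeof magic);
        std::memcpy(&length, buf + off + 4, sizeof length);
        if (magic == BBO_MAGIC_HEADER && length == BBO_PACKET_LENGTH) {
            return off;  // packet boundary found; BBO fields start at off + 8
        }
    }
    return std::nullopt;  // no boundary in this buffer; rescan when more data arrives
}
```

Requiring both the magic value and the length field to match lowers the chance of locking onto payload bytes that happen to look like the header.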

2. Disruptor IPC (Ultra-Low-Latency)

LMAX Disruptor pattern for lock-free inter-process communication:

Shared Memory: /dev/shm/bbo_ring_gateway (POSIX shm)
Ring Buffer Size: 1024 entries × 128 bytes = 131 KB
IPC Method: Lock-free atomic operations (memory_order_acquire/release)
Consumer: Project 25 (Market Maker FSM)
Performance: ~0.5 μs publish latency

Disruptor Pattern Benefits:

Zero-copy shared memory (no TCP/socket overhead)
Lock-free synchronization (atomic sequence numbers)
Cache-line aligned structures (prevents false sharing)
Power-of-2 ring buffer (fast modulo using bitwise AND)
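
To make these properties concrete, here is a minimal single-producer, single-consumer ring in the Disruptor style. It is a sketch under stated assumptions, not the project's implementation: `SpscRing` and its methods are hypothetical names, the 64-byte cache line is assumed, the ring lives in ordinary process memory rather than POSIX shared memory, and payloads are copied into slots for simplicity.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kRingEntries = 1024;  // power of two, as in the README
constexpr std::size_t kSlotBytes   = 128;
constexpr std::size_t kCacheLine   = 64;    // assumption: 64-byte cache lines
static_assert((kRingEntries & (kRingEntries - 1)) == 0, "size must be a power of two");

struct alignas(kCacheLine) Slot {
    std::uint8_t data[kSlotBytes];
};

struct SpscRing {
    // Each sequence sits on its own cache line to avoid false sharing.
    alignas(kCacheLine) std::atomic<std::uint64_t> write_seq{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> read_seq{0};
    Slot slots[kRingEntries];

    // Producer side: returns false when the ring is full or n is too large.
    bool try_publish(const void* src, std::size_t n) {
        if (n > kSlotBytes) return false;
        const std::uint64_t w = write_seq.load(std::memory_order_relaxed);
        const std::uint64_t r = read_seq.load(std::memory_order_acquire);
        if (w - r == kRingEntries) return false;
        std::memcpy(slots[w & (kRingEntries - 1)].data, src, n);  // bitwise-AND modulo
        write_seq.store(w + 1, std::memory_order_release);        // make the slot visible
        return true;
    }

    // Consumer side: returns false when nothing is available.
    bool try_consume(void* dst, std::size_t n) {
        if (n > kSlotBytes) return false;
        const std::uint64_t r = read_seq.load(std::memory_order_relaxed);
        const std::uint64_t w = write_seq.load(std::memory_order_acquire);
        if (r == w) return false;
        std::memcpy(dst, slots[r & (kRingEntries - 1)].data, n);
        read_seq.store(r + 1, std::memory_order_release);         // free the slot
        return true;
    }
};
```

The release store on `write_seq` pairs with the consumer's acquire load, so the slot contents are guaranteed visible before the consumer sees the new sequence number. For inter-process use, the same structure would be placed in a mapped shared memory region instead of a local object.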

Configuration

config.json:

{
  "log_level": "info",
  "pcie": {
    "device": "/dev/xdma0_c2h_0",
    "gen2_mode": true,
    "verbose": false,
    "poll_mode": "BUSY_SPIN",
    "poll_timeout_us": 100,
    "spin_iterations": 1000
  },
  "disruptor": {
    "enable": true,
    "shm_name": "gateway"
  },
  "performance": {
    "enable_rt": false,
    "rt_cpu": 4,
    "rt_priority": 80,
    "quiet_mode": false,
    "latency_sample_rate": 100
  }
}

Configuration Options

Option	Description	Default
`log_level`	Log level: trace, debug, info, warn, error	info
`pcie.device`	PCIe XDMA device path	/dev/xdma0_c2h_0
`pcie.gen2_mode`	Enable PCIe Gen2 mode (250 MHz for T3/T4)	true
`pcie.verbose`	Enable verbose packet logging	false
`pcie.poll_mode`	Polling mode: SLEEP, POLL, HYBRID, BUSY_SPIN	BUSY_SPIN
`pcie.poll_timeout_us`	Timeout for POLL/HYBRID modes (μs)	100
`pcie.spin_iterations`	Spin iterations for HYBRID mode	1000
`disruptor.enable`	Enable Disruptor IPC	true
`disruptor.shm_name`	Shared memory name suffix	gateway
`performance.enable_rt`	Enable SCHED_FIFO real-time scheduling	false
`performance.rt_cpu`	CPU core for listener thread affinity	4
`performance.rt_priority`	SCHED_FIFO priority (1-99)	80
`performance.quiet_mode`	Suppress console output	false
`performance.latency_sample_rate`	Sample every Nth packet for latency stats	100
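
For the three `performance.rt_*` options, a plausible mapping onto Linux scheduling calls looks like the sketch below. The helper names are hypothetical and this is not the contents of rt_config.h; it only shows the standard `sched_setaffinity` and `sched_setscheduler` calls that such a utility would typically wrap.

```cpp
#include <sched.h>

// Pin the calling thread to one CPU core (the role of performance.rt_cpu).
inline bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread" on Linux.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Switch the calling thread to SCHED_FIFO (the role of performance.rt_priority).
// Fails without CAP_SYS_NICE or a suitable RLIMIT_RTPRIO.
inline bool enable_sched_fifo(int priority) {
    if (priority < sched_get_priority_min(SCHED_FIFO) ||
        priority > sched_get_priority_max(SCHED_FIFO)) {
        return false;  // on Linux the valid range is 1-99
    }
    sched_param sp{};
    sp.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0;
}
```

Both helpers return false instead of aborting, so a caller can log the failure and continue with default scheduling when the capability is missing.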

Polling Modes

Mode	Latency	CPU Usage	Description
SLEEP	~10ms	~1%	Sleep-based polling, lowest CPU
POLL	~100μs	~5%	poll() syscall, moderate latency
HYBRID	~50μs	~20%	Adaptive poll + spin
BUSY_SPIN	<10μs	100%	Pure busy-spin, lowest latency

Build Instructions

Prerequisites

Linux (Ubuntu 22.04+ recommended)
GCC 11+ or Clang 14+ (C++20 support)
CMake 3.20+
Xilinx XDMA driver installed

Dependencies

# Install system dependencies
sudo apt-get install -y build-essential cmake

# Install vcpkg dependencies
vcpkg install spdlog nlohmann-json

Build

mkdir build
cd build
cmake ..
make -j$(nproc)

XDMA Driver Setup

# Clone and build Xilinx XDMA driver
git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
make
sudo make install

# Load driver
sudo modprobe xdma

# Verify device
ls -la /dev/xdma0_*

Usage

# Use default config.json
./order_gateway

# Use custom config file
./order_gateway /path/to/config.json

# With RT optimizations (requires CAP_SYS_NICE and performance.enable_rt set to true)
sudo setcap cap_sys_nice=eip ./order_gateway
./order_gateway

Expected Output:

[order_gateway] [info] Order Gateway starting...
[order_gateway] [info] [PCIe V2] Packet size: 56 bytes, verbose: false
[order_gateway] [info] [PCIe V2] Synchronized to packet stream (magic header found)
[order_gateway] [info] [SPY] Bid: 324.7300 (22436) | Ask: 324.7600 (72756) | Spread: 0.0300 | FPGA: 0.172 us
[order_gateway] [info] [AAPL] Bid: 312.0000 (70) | Ask: 321.6700 (120) | Spread: 9.6700 | FPGA: 0.308 us

Performance

Expected Latency Breakdown

Stage	Latency	Notes
PCIe DMA Read	~1-2 μs	XDMA C2H transfer
Magic Header Scan	~0.05 μs	Byte pattern matching
BBO Parse	~0.1-0.5 μs	Binary to struct
Validation	~0.1 μs	Filter invalid data
Disruptor Publish	~0.5 μs	Lock-free IPC
Total	~2-4 μs	End-to-end (passthrough)

FPGA Latency Measurement

The FPGA timestamps enable measuring FPGA-side latency in two stages:

Clock Domains:

T1/T2: RGMII RX domain (125 MHz, 8 ns/cycle)
T3/T4: AXI/XDMA domain (Gen2: 250 MHz = 4 ns/cycle, Gen1: 125 MHz = 8 ns/cycle)

Latency Breakdown:

Latency A: (T2 - T1) × 8 ns, from ITCH parse to CDC FIFO write
Latency B: (T4 - T3) × 4 ns on Gen2 (× 8 ns on Gen1), from BBO ready to AXI-Stream TX start
Total FPGA Latency: Latency A + Latency B

Typical FPGA latency observed: 0.17-0.31 μs (Latency B only)
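
The conversion from cycle counts to nanoseconds can be sketched as follows. The function names are hypothetical; the cycle periods come from the clock domains listed above. Unsigned 32-bit subtraction is used so that a single counter wraparound between the two timestamps still yields the correct difference.

```cpp
#include <cstdint>

constexpr double kRgmiiNsPerCycle = 8.0;  // T1/T2: 125 MHz RGMII RX domain

// T3/T4 run at 250 MHz in Gen2 mode and 125 MHz in Gen1 mode.
inline double axi_ns_per_cycle(bool gen2_mode) { return gen2_mode ? 4.0 : 8.0; }

// Latency A: ITCH parse to CDC FIFO write.
inline double latency_a_ns(std::uint32_t t1, std::uint32_t t2) {
    return static_cast<double>(static_cast<std::uint32_t>(t2 - t1)) * kRgmiiNsPerCycle;
}

// Latency B: BBO ready to AXI-Stream TX start.
inline double latency_b_ns(std::uint32_t t3, std::uint32_t t4, bool gen2_mode) {
    return static_cast<double>(static_cast<std::uint32_t>(t4 - t3)) *
           axi_ns_per_cycle(gen2_mode);
}

inline double total_fpga_latency_ns(std::uint32_t t1, std::uint32_t t2,
                                    std::uint32_t t3, std::uint32_t t4,
                                    bool gen2_mode) {
    return latency_a_ns(t1, t2) + latency_b_ns(t3, t4, gen2_mode);
}
```

For example, a T4 - T3 difference of 43 cycles in Gen2 mode is 172 ns, which corresponds to the 0.172 us figure in the sample log output. Because T1/T2 and T3/T4 come from different clock domains, only differences within each pair are meaningful.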

Note: XGBoost GPU inference (~10-100 μs) is performed in Project 25, enabling pipeline parallelism where P24 processes the next BBO while P25 runs inference on the previous one.

Code Structure

24-order-gateway/
├── config.json               # Configuration file
├── CMakeLists.txt            # Build configuration
├── src/
│   ├── main.cpp              # Entry point
│   ├── order_gateway.cpp     # Main gateway orchestration
│   ├── pcie_listener_v2.cpp  # PCIe DMA reader with magic header sync
│   ├── bbo_parser.cpp        # Binary BBO parser
│   └── bbo_validator.cpp     # BBO validation
├── include/
│   ├── order_gateway.h
│   ├── pcie_listener_v2.h    # V2 listener with 56-byte packet support
│   ├── bbo_parser.h
│   ├── bbo_validator.h
│   └── common/
│       ├── perf_monitor.h    # Performance monitoring
│       └── rt_config.h       # RT scheduling utilities
└── vcpkg.json                # Dependency manifest

January 2026 Updates

Magic Header Packet Format

Updated to support Project 23's new 56-byte packet format with magic header:

Packet size: Changed from 48 to 56 bytes
Magic header: 0xBB0BB048 at packet offset 0
Packet length: 0x00000038 at packet offset 4
Byte order: FPGA uses byte_swap_64, host reads as little-endian uint32

Code Changes:

// pcie_listener_v2.cpp - Updated constants
constexpr uint32_t BBO_MAGIC_HEADER = 0x48B00BBB;   // Little-endian
constexpr uint32_t BBO_PACKET_LENGTH = 0x38000000;  // Little-endian

// BboPacketRaw struct updated to 56 bytes with header fields
struct __attribute__((packed)) BboPacketRaw {
    uint32_t magic;          // Bytes 0-3: Magic header
    uint32_t length;         // Bytes 4-7: Packet length
    char     symbol[8];      // Bytes 8-15: Symbol
    // ... rest of fields
};

Known Issues

Occasional Resync: The magic header sync occasionally needs to rescan when packet boundaries shift. This occurs even with IOMMU disabled and doesn't affect overall functionality. The Phase 3 multi-FPGA architecture will use Aurora GTX links instead of PCIe, eliminating this issue.

XDMA Driver Byte Count: The XDMA driver may return more bytes than requested in some cases. The PCIeListenerV2 implements a workaround that clamps the byte count to prevent ring buffer overflow.
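
The clamping idea behind that workaround can be illustrated with a small helper. This is a sketch with a hypothetical function name, not the code in PCIeListenerV2; it shows only the principle of never trusting a returned count beyond what was requested.

```cpp
#include <sys/types.h>  // ssize_t

#include <cstddef>

// Convert a read() return value into a safe byte count:
// errors and EOF become 0, and oversized counts are capped at the request size.
inline std::size_t clamp_read_count(ssize_t returned, std::size_t requested) {
    if (returned <= 0) return 0;
    const auto n = static_cast<std::size_t>(returned);
    return n > requested ? requested : n;
}
```

The caller then advances its buffer write position by the clamped value, so an inflated count cannot push the position past the end of the buffer.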

Related Projects

23-order-book/ - FPGA order book (PCIe data source)
25-market-maker/ - Market Maker FSM (Disruptor consumer)
26-order-execution/ - Order Execution (simulated fills)
28-complete-system/ - System Orchestrator

Data Flow (Full System)

FPGA (Project 23)
    ↓ PCIe Gen2 x1 (/dev/xdma0_c2h_0)
    ↓ 56-byte BBO packets with magic header
Project 24: Order Gateway  ← YOU ARE HERE
    ├─ PCIeListenerV2 (magic header sync)
    ├─ BBOValidator (filter invalid data)
    └─ Disruptor Producer (raw BBO)
    ↓ Shared Memory (/dev/shm/bbo_ring_gateway)
Project 25: Market Maker
    ├─ Disruptor Consumer (raw BBO)
    ├─ XGBoostPredictor (81% accuracy, ~10-100μs)
    ├─ MarketMakerFSM (strategy logic)
    └─ OrderProducer (orders to P26)
    ↓ Shared Memory (/dev/shm/order_ring_mm)
Project 26: Order Execution
    ├─ Order Consumer
    ├─ Simulated Fill (~50 μs latency)
    └─ Fill Producer (fills to P25)
    ↓ Shared Memory (/dev/shm/fill_ring_oe)
Project 25: (receives fills)
    └─ Updates position, PnL

Build Time: ~30 seconds
Hardware Status: Tested with FPGA PCIe transmitter (Project 23)
Last Updated: January 2026
