This project is part of a complete end-to-end trading system:
- Main Repository: fpga-trading-systems
- Project Number: 24 of 30
- Category: C++ Application
- Dependencies: Project 23 (Order Book - PCIe output), Project 25 (Market Maker - Disruptor consumer)
- Platform: Linux
- Technology: C++20, PCIe (XDMA), LMAX Disruptor
- Status: Completed
The Order Gateway is an ultra-low-latency PCIe passthrough layer of the FPGA trading system, acting as a bridge between the FPGA hardware and downstream trading components. It reads BBO (Best Bid/Offer) data from the FPGA via PCIe DMA and immediately publishes to LMAX Disruptor lock-free IPC for downstream processing.
Design Decision: XGBoost inference was relocated to Project 25 for pipeline parallelism. This allows P24 to process the next BBO while P25 runs GPU inference on the previous BBO, reducing effective latency.
Data Flow:
FPGA Order Book (P23) → PCIe DMA → Parse BBO → Validate → Disruptor → Market Maker (P25)
Key Components:
- PCIeListenerV2: Reads 56-byte BBO packets with magic header from /dev/xdma0_c2h_0
- Magic Header Sync: Reliable packet boundary detection using 0xBB0BB048 marker
- BBO Validation: Filters corrupted or invalid data
- Disruptor Producer: Lock-free IPC to Project 25 (~0.5 μs publish)
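The validation rules themselves are not listed here, so the following is only a minimal sketch of what "filters corrupted or invalid data" could mean. The `Bbo` struct, the `is_valid_bbo` name, and every rule in it are assumptions for illustration, not the project's actual checks.

```cpp
#include <cstdint>

// Hypothetical parsed BBO. Prices are fixed-point with 4 decimal places
// (3247300 represents 324.7300), as in the packet format.
struct Bbo {
    std::uint32_t bid_price;
    std::uint32_t bid_size;
    std::uint32_t ask_price;
    std::uint32_t ask_size;
    std::uint32_t spread;
};

// Assumed sanity rules (not taken from the project source):
//  - both sides carry a nonzero price and size
//  - the book is neither crossed nor locked (bid < ask)
//  - the spread field agrees with ask - bid
constexpr bool is_valid_bbo(const Bbo& b) {
    if (b.bid_price == 0 || b.ask_price == 0) return false;
    if (b.bid_size == 0 || b.ask_size == 0) return false;
    if (b.bid_price >= b.ask_price) return false;
    return b.spread == b.ask_price - b.bid_price;
}
```

A real validator would likely also reject unknown symbols or implausible price jumps; those checks depend on details this README does not give.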
┌──────────────────────────────────────────────────────────────────┐
│                    Project 24: Order Gateway                     │
│                                                                  │
│  ┌─────────────────┐     ┌──────────────────────────┐            │
│  │ PCIe Listener   │────→│ BBO Parser               │            │
│  │ /dev/xdma0_     │     │ (56-byte packets)        │            │
│  │ c2h_0           │     │ Magic Header Sync        │            │
│  └─────────────────┘     │ Symbol, Bid/Ask, Spread  │            │
│  DMA from FPGA (P23)     └──────────┬───────────────┘            │
│  Gen2 x1 (5 Gbps)                   │                            │
│                                     ↓                            │
│                          ┌──────────────────────┐                │
│                          │ BBO Validator        │                │
│                          │ (Filter invalid)     │                │
│                          └──────────┬───────────┘                │
│                                     │                            │
│                                     ↓                            │
│                          ┌──────────────────────┐                │
│                          │ Disruptor Producer   │                │
│                          │ (Lock-Free Publish)  │                │
│                          │ Raw BBO Data         │                │
│                          └──────────┬───────────┘                │
│                                     │                            │
└─────────────────────────────────────┼────────────────────────────┘
                                      │
          POSIX Shared Memory (/dev/shm/bbo_ring_gateway)
          Ring Buffer: 1024 entries × 128 bytes = 131 KB
          Lock-Free IPC: Atomic sequence numbers
                                      │
┌─────────────────────────────────────┼────────────────────────────┐
│                                     ↓                            │
│                          ┌──────────────────────┐                │
│                          │ Disruptor Consumer   │                │
│                          │ (Lock-Free Poll)     │                │
│                          └──────────────────────┘                │
│                                     │                            │
│                                     ↓                            │
│                          ┌──────────────────────┐                │
│                          │ XGBoost Predictor    │                │
│                          │ (GPU via CUDA 13.0)  │                │
│                          │ 84% accuracy         │                │
│                          └──────────────────────┘                │
│                                                                  │
│                   Project 25: Market Maker FSM                   │
└──────────────────────────────────────────────────────────────────┘
Reads BBO data directly from FPGA via PCIe DMA with reliable packet synchronization:
- Device: /dev/xdma0_c2h_0 (Xilinx XDMA driver)
- Direction: Card-to-Host (C2H) - FPGA sends data to host
- Packet Size: 56 bytes per BBO update (with magic header)
- Bandwidth: PCIe Gen2 x1 (5 GT/s line rate, 4 Gbps after 8b/10b encoding); ~1-2 μs per read
- Sync Method: Magic header scanning (0x48B00BBB as little-endian uint32)
BBO Packet Format (56 bytes with Magic Header):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0-3 | 4 | Magic Header | 0xBB0BB048 (read as 0x48B00BBB on little-endian) |
| 4-7 | 4 | Packet Length | 0x00000038 (read as 0x38000000 on little-endian) |
| 8-15 | 8 | Symbol | Stock ticker (ASCII, space-padded) |
| 16-19 | 4 | Bid Price | Best bid (4 decimal places) |
| 20-23 | 4 | Bid Size | Bid shares |
| 24-27 | 4 | Ask Price | Best ask (4 decimal places) |
| 28-31 | 4 | Ask Size | Ask shares |
| 32-35 | 4 | Spread | Ask - Bid (4 decimal places) |
| 36-39 | 4 | T1 | ITCH parse timestamp (125 MHz cycles) |
| 40-43 | 4 | T2 | CDC FIFO write timestamp (125 MHz cycles) |
| 44-47 | 4 | T3 | BBO ready for PCIe (250 MHz cycles) |
| 48-51 | 4 | T4 | AXI-Stream TX start (250 MHz cycles) |
| 52-55 | 4 | Reserved | Padding |
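As a sketch of how the table above maps onto a C++ type, the layout can be written as a packed struct and checked at compile time. Only `magic`, `length`, and `symbol` appear in the project's own struct; the remaining field names and the helper functions `load_packet`, `symbol_to_string`, and `fixed4_to_double` are assumptions for illustration. Byte order of the numeric fields is deliberately left untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Wire layout per the packet table. Field names after `symbol` are assumed.
struct __attribute__((packed)) BboPacketRaw {
    std::uint32_t magic;      // bytes 0-3
    std::uint32_t length;     // bytes 4-7
    char          symbol[8];  // bytes 8-15, ASCII, space-padded
    std::uint32_t bid_price;  // bytes 16-19
    std::uint32_t bid_size;   // bytes 20-23
    std::uint32_t ask_price;  // bytes 24-27
    std::uint32_t ask_size;   // bytes 28-31
    std::uint32_t spread;     // bytes 32-35
    std::uint32_t t1;         // bytes 36-39
    std::uint32_t t2;         // bytes 40-43
    std::uint32_t t3;         // bytes 44-47
    std::uint32_t t4;         // bytes 48-51
    std::uint32_t reserved;   // bytes 52-55
};
static_assert(sizeof(BboPacketRaw) == 56, "layout must match the 56-byte wire format");
static_assert(offsetof(BboPacketRaw, symbol) == 8, "symbol starts at byte 8");
static_assert(offsetof(BboPacketRaw, t4) == 48, "T4 starts at byte 48");

// Copy 56 raw bytes into the struct; memcpy avoids alignment and aliasing
// problems. No byte swapping is performed here.
inline BboPacketRaw load_packet(const std::uint8_t* bytes) {
    BboPacketRaw pkt;
    std::memcpy(&pkt, bytes, sizeof pkt);
    return pkt;
}

// Strip the trailing space padding from the 8-byte symbol field.
inline std::string symbol_to_string(const char* sym, std::size_t len = 8) {
    while (len > 0 && sym[len - 1] == ' ') --len;
    return std::string(sym, len);
}

// Convert a fixed-point price with 4 decimal places to a double.
inline double fixed4_to_double(std::uint32_t raw) {
    return static_cast<double>(raw) / 10000.0;
}
```

The `static_assert` lines make a layout mismatch a compile error, which is useful whenever the FPGA-side format changes.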
The PCIeListenerV2 uses magic header scanning for reliable packet boundary detection:
// Magic header constants (byte-swapped for little-endian)
constexpr uint32_t BBO_MAGIC_HEADER = 0x48B00BBB; // FPGA sends: BB 0B B0 48
constexpr uint32_t BBO_PACKET_LENGTH = 0x38000000; // FPGA sends: 00 00 00 38
Sync Algorithm:
- Scan incoming DMA data for magic header pattern
- Verify packet length field matches expected value
- If both match, packet boundary is found
- Parse BBO data from offset 8 onwards
- If sync lost, rescan for next magic header
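The scan steps above can be sketched as a function that walks a DMA buffer byte by byte until both header words match. This is a minimal illustration that assumes a little-endian host; `load_u32` and `find_packet` are hypothetical names, not the actual PCIeListenerV2 interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

constexpr std::uint32_t BBO_MAGIC_HEADER  = 0x48B00BBB;  // wire bytes: BB 0B B0 48
constexpr std::uint32_t BBO_PACKET_LENGTH = 0x38000000;  // wire bytes: 00 00 00 38
constexpr std::size_t   BBO_PACKET_SIZE   = 56;

// Load a uint32 from a possibly unaligned address in host byte order.
inline std::uint32_t load_u32(const std::uint8_t* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Return the offset of the first complete packet in [data, data + size),
// or std::nullopt if no magic + length pair with 56 bytes behind it is found.
inline std::optional<std::size_t> find_packet(const std::uint8_t* data, std::size_t size) {
    for (std::size_t off = 0; off + BBO_PACKET_SIZE <= size; ++off) {
        if (load_u32(data + off) == BBO_MAGIC_HEADER &&
            load_u32(data + off + 4) == BBO_PACKET_LENGTH) {
            return off;  // BBO fields start at off + 8
        }
    }
    return std::nullopt;
}
```

Checking the length word as well as the magic reduces the chance of locking onto payload bytes that happen to contain the magic pattern. After a resync, the caller would restart the scan from the byte following the last good packet.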
LMAX Disruptor pattern for lock-free inter-process communication:
- Shared Memory: /dev/shm/bbo_ring_gateway (POSIX shm)
- Ring Buffer Size: 1024 entries × 128 bytes = 131 KB
- IPC Method: Lock-free atomic operations (memory_order_acquire/release)
- Consumer: Project 25 (Market Maker FSM)
- Performance: ~0.5 μs publish latency
Disruptor Pattern Benefits:
- Zero-copy shared memory (no TCP/socket overhead)
- Lock-free synchronization (atomic sequence numbers)
- Cache-line aligned structures (prevents false sharing)
- Power-of-2 ring buffer (fast modulo using bitwise AND)
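The properties listed above can be illustrated with a single-producer, single-consumer ring. This is a simplified in-process sketch, not the project's implementation: the `Ring`, `try_publish`, and `try_consume` names are hypothetical, the real ring lives in POSIX shared memory, and the full LMAX Disruptor supports multiple consumers with gating sequences.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t RING_ENTRIES = 1024;              // power of two
constexpr std::size_t RING_MASK    = RING_ENTRIES - 1;  // seq & mask == seq % size
constexpr std::size_t ENTRY_BYTES  = 128;
static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of two");

// Each slot spans whole cache lines so neighbouring slots never share one.
struct alignas(64) Entry {
    unsigned char payload[ENTRY_BYTES];
};
static_assert(sizeof(Entry) == ENTRY_BYTES, "entry must be exactly 128 bytes");

struct Ring {
    // Producer and consumer cursors sit on separate cache lines to avoid
    // false sharing between the two sides.
    alignas(64) std::atomic<std::uint64_t> published{0};  // next sequence to write
    alignas(64) std::atomic<std::uint64_t> consumed{0};   // next sequence to read
    Entry slots[RING_ENTRIES];
};

// Producer side: returns false if the ring is full or len is too large.
inline bool try_publish(Ring& r, const void* data, std::size_t len) {
    if (len > ENTRY_BYTES) return false;
    const std::uint64_t seq = r.published.load(std::memory_order_relaxed);
    if (seq - r.consumed.load(std::memory_order_acquire) >= RING_ENTRIES) return false;
    std::memcpy(r.slots[seq & RING_MASK].payload, data, len);
    r.published.store(seq + 1, std::memory_order_release);  // makes the slot visible
    return true;
}

// Consumer side: returns false if nothing new has been published.
inline bool try_consume(Ring& r, void* out, std::size_t len) {
    if (len > ENTRY_BYTES) return false;
    const std::uint64_t seq = r.consumed.load(std::memory_order_relaxed);
    if (seq == r.published.load(std::memory_order_acquire)) return false;
    std::memcpy(out, r.slots[seq & RING_MASK].payload, len);
    r.consumed.store(seq + 1, std::memory_order_release);   // frees the slot
    return true;
}
```

The release store on `published` pairs with the acquire load in `try_consume`, so the consumer never observes a sequence number before the payload bytes it covers. Whether the real gateway drops, overwrites, or waits when the ring is full is not stated here; this sketch simply reports failure.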
config.json:
{
"log_level": "info",
"pcie": {
"device": "/dev/xdma0_c2h_0",
"gen2_mode": true,
"verbose": false,
"poll_mode": "BUSY_SPIN",
"poll_timeout_us": 100,
"spin_iterations": 1000
},
"disruptor": {
"enable": true,
"shm_name": "gateway"
},
"performance": {
"enable_rt": false,
"rt_cpu": 4,
"rt_priority": 80,
"quiet_mode": false,
"latency_sample_rate": 100
}
}
| Option | Description | Default |
|---|---|---|
| log_level | Log level: trace, debug, info, warn, error | info |
| pcie.device | PCIe XDMA device path | /dev/xdma0_c2h_0 |
| pcie.gen2_mode | Enable PCIe Gen2 mode (250 MHz for T3/T4) | true |
| pcie.verbose | Enable verbose packet logging | false |
| pcie.poll_mode | Polling mode: SLEEP, POLL, HYBRID, BUSY_SPIN | BUSY_SPIN |
| pcie.poll_timeout_us | Timeout for POLL/HYBRID modes (μs) | 100 |
| pcie.spin_iterations | Spin iterations for HYBRID mode | 1000 |
| disruptor.enable | Enable Disruptor IPC | true |
| disruptor.shm_name | Shared memory name suffix | gateway |
| performance.enable_rt | Enable SCHED_FIFO real-time scheduling | false |
| performance.rt_cpu | CPU core for listener thread affinity | 4 |
| performance.rt_priority | SCHED_FIFO priority (1-99) | 80 |
| performance.quiet_mode | Suppress console output | false |
| performance.latency_sample_rate | Sample every Nth packet for latency stats | 100 |
| Mode | Latency | CPU Usage | Description |
|---|---|---|---|
| SLEEP | ~10 ms | ~1% | Sleep-based polling, lowest CPU |
| POLL | ~100 μs | ~5% | poll() syscall, moderate latency |
| HYBRID | ~50 μs | ~20% | Adaptive poll + spin |
| BUSY_SPIN | <10 μs | 100% | Pure busy-spin, lowest latency |
- Linux (Ubuntu 22.04+ recommended)
- GCC 11+ or Clang 14+ (C++20 support)
- CMake 3.20+
- Xilinx XDMA driver installed
# Install system dependencies
sudo apt-get install -y build-essential cmake
# Install vcpkg dependencies
vcpkg install spdlog nlohmann-json
mkdir build
cd build
cmake ..
make -j$(nproc)
# Clone and build Xilinx XDMA driver
git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
make
sudo make install
# Load driver
sudo modprobe xdma
# Verify device
ls -la /dev/xdma0_*
# Use default config.json
./order_gateway
# Use custom config file
./order_gateway /path/to/config.json
# With RT optimizations (requires CAP_SYS_NICE)
sudo setcap cap_sys_nice=eip ./order_gateway
./order_gateway
Expected Output:
[order_gateway] [info] Order Gateway starting...
[order_gateway] [info] [PCIe V2] Packet size: 56 bytes, verbose: false
[order_gateway] [info] [PCIe V2] Synchronized to packet stream (magic header found)
[order_gateway] [info] [SPY] Bid: 324.7300 (22436) | Ask: 324.7600 (72756) | Spread: 0.0300 | FPGA: 0.172 us
[order_gateway] [info] [AAPL] Bid: 312.0000 (70) | Ask: 321.6700 (120) | Spread: 9.6700 | FPGA: 0.308 us
| Stage | Latency | Notes |
|---|---|---|
| PCIe DMA Read | ~1-2 μs | XDMA C2H transfer |
| Magic Header Scan | ~0.05 μs | Byte pattern matching |
| BBO Parse | ~0.1-0.5 μs | Binary to struct |
| Validation | ~0.1 μs | Filter invalid data |
| Disruptor Publish | ~0.5 μs | Lock-free IPC |
| Total | ~2-4 μs | End-to-end (passthrough) |
The FPGA timestamps enable measuring FPGA-side latency in two stages:
Clock Domains:
- T1/T2: RGMII RX domain (125 MHz, 8 ns/cycle)
- T3/T4: AXI/XDMA domain (Gen2: 250 MHz = 4 ns/cycle, Gen1: 125 MHz = 8 ns/cycle)
Latency Breakdown:
- Latency A: (T2 - T1) × 8 ns (ITCH parse to CDC FIFO write)
- Latency B: (T4 - T3) × 4 ns on Gen2, or × 8 ns on Gen1 (BBO ready to AXI-Stream TX)
- Total FPGA Latency: Latency A + Latency B
Typical FPGA latency observed: 0.17-0.31 μs (Latency B only)
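The arithmetic above can be sketched as a few constexpr helpers. The names (`FpgaTimestamps`, `cycle_delta`, and so on) are hypothetical. The sketch assumes each timestamp is a free-running 32-bit cycle counter, so unsigned subtraction still gives the right delta across a single rollover.

```cpp
#include <cstdint>

constexpr double RGMII_NS_PER_CYCLE    = 8.0;  // T1/T2 domain, 125 MHz
constexpr double AXI_GEN2_NS_PER_CYCLE = 4.0;  // T3/T4 domain on Gen2, 250 MHz
constexpr double AXI_GEN1_NS_PER_CYCLE = 8.0;  // T3/T4 domain on Gen1, 125 MHz

// The four raw timestamp fields from a BBO packet.
struct FpgaTimestamps {
    std::uint32_t t1, t2, t3, t4;
};

// Unsigned subtraction wraps modulo 2^32, which handles one counter rollover.
constexpr std::uint32_t cycle_delta(std::uint32_t start, std::uint32_t end) {
    return end - start;
}

// Latency A: ITCH parse to CDC FIFO write, in nanoseconds.
constexpr double latency_a_ns(std::uint32_t t1, std::uint32_t t2) {
    return cycle_delta(t1, t2) * RGMII_NS_PER_CYCLE;
}

// Latency B: BBO ready to AXI-Stream TX, in nanoseconds.
constexpr double latency_b_ns(std::uint32_t t3, std::uint32_t t4, bool gen2) {
    return cycle_delta(t3, t4) * (gen2 ? AXI_GEN2_NS_PER_CYCLE : AXI_GEN1_NS_PER_CYCLE);
}

// Total FPGA latency (A + B) in microseconds.
constexpr double total_fpga_latency_us(const FpgaTimestamps& ts, bool gen2) {
    return (latency_a_ns(ts.t1, ts.t2) + latency_b_ns(ts.t3, ts.t4, gen2)) / 1000.0;
}
```

For example, a T4 - T3 difference of 43 cycles on Gen2 gives 43 × 4 ns = 172 ns, which is consistent with the 0.172 us figure in the sample log output.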
Note: XGBoost GPU inference (~10-100 μs) is performed in Project 25, enabling pipeline parallelism where P24 processes the next BBO while P25 runs inference on the previous one.
24-order-gateway/
├── config.json                  # Configuration file
├── CMakeLists.txt               # Build configuration
├── src/
│   ├── main.cpp                 # Entry point
│   ├── order_gateway.cpp        # Main gateway orchestration
│   ├── pcie_listener_v2.cpp     # PCIe DMA reader with magic header sync
│   ├── bbo_parser.cpp           # Binary BBO parser
│   └── bbo_validator.cpp        # BBO validation
├── include/
│   ├── order_gateway.h
│   ├── pcie_listener_v2.h       # V2 listener with 56-byte packet support
│   ├── bbo_parser.h
│   ├── bbo_validator.h
│   └── common/
│       ├── perf_monitor.h       # Performance monitoring
│       └── rt_config.h          # RT scheduling utilities
└── vcpkg.json                   # Dependency manifest
Updated to support Project 23's new 56-byte packet format with magic header:
- Packet size: Changed from 48 to 56 bytes
- Magic header: 0xBB0BB048 at packet offset 0
- Packet length: 0x00000038 at packet offset 4
- Byte order: FPGA uses byte_swap_64, host reads as little-endian uint32
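The relationship between the wire values and the constants the host compares against can be checked with a small constexpr byte swap. This sketch only shows how a little-endian host sees four bytes sent most-significant first; `bswap32` is a hypothetical helper and does not model the FPGA's byte_swap_64 logic.

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit value.
constexpr std::uint32_t bswap32(std::uint32_t v) {
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

// Wire bytes BB 0B B0 48 read as a little-endian uint32 give 0x48B00BBB.
static_assert(bswap32(0xBB0BB048u) == 0x48B00BBBu, "magic header constant");
// Wire bytes 00 00 00 38 read as a little-endian uint32 give 0x38000000.
static_assert(bswap32(0x00000038u) == 0x38000000u, "packet length constant");
```

Comparing against pre-swapped constants keeps the scan loop free of per-word swap instructions.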
Code Changes:
// pcie_listener_v2.cpp - Updated constants
constexpr uint32_t BBO_MAGIC_HEADER = 0x48B00BBB; // Little-endian
constexpr uint32_t BBO_PACKET_LENGTH = 0x38000000; // Little-endian
// BboPacketRaw struct updated to 56 bytes with header fields
struct __attribute__((packed)) BboPacketRaw {
uint32_t magic; // Bytes 0-3: Magic header
uint32_t length; // Bytes 4-7: Packet length
char symbol[8]; // Bytes 8-15: Symbol
// ... rest of fields
};
Occasional Resync: The magic header sync occasionally needs to rescan when packet boundaries shift. This occurs even with IOMMU disabled and doesn't affect overall functionality. The Phase 3 multi-FPGA architecture will use Aurora GTX links instead of PCIe, eliminating this issue.
XDMA Driver Byte Count: The XDMA driver may return more bytes than requested in some cases. The PCIeListenerV2 implements a workaround that clamps the byte count to prevent ring buffer overflow.
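The workaround is described but not shown, so the following is only a guess at its shape: treat the driver's return value as untrusted and never accept more bytes than were requested. The function name `clamp_read_count` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sys/types.h>  // ssize_t

// Number of bytes that may safely be treated as valid after a read() call.
// Errors and zero-length reads yield 0; oversized counts are capped at the
// number of bytes the caller actually asked for.
inline std::size_t clamp_read_count(ssize_t returned, std::size_t requested) {
    if (returned <= 0) return 0;
    return std::min(static_cast<std::size_t>(returned), requested);
}
```

A caller would pass the result of `read(fd, buf, sizeof buf)` together with `sizeof buf`, then advance its write index by the clamped value only.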
- 23-order-book/ - FPGA order book (PCIe data source)
- 25-market-maker/ - Market Maker FSM (Disruptor consumer)
- 26-order-execution/ - Order Execution (simulated fills)
- 28-complete-system/ - System Orchestrator
FPGA (Project 23)
    ↓ PCIe Gen2 x1 (/dev/xdma0_c2h_0)
    ↓ 56-byte BBO packets with magic header
Project 24: Order Gateway ← YOU ARE HERE
    ├─ PCIeListenerV2 (magic header sync)
    ├─ BBOValidator (filter invalid data)
    └─ Disruptor Producer (raw BBO)
    ↓ Shared Memory (/dev/shm/bbo_ring_gateway)
Project 25: Market Maker
    ├─ Disruptor Consumer (raw BBO)
    ├─ XGBoostPredictor (81% accuracy, ~10-100 μs)
    ├─ MarketMakerFSM (strategy logic)
    └─ OrderProducer (orders to P26)
    ↓ Shared Memory (/dev/shm/order_ring_mm)
Project 26: Order Execution
    ├─ Order Consumer
    ├─ Simulated Fill (~50 μs latency)
    └─ Fill Producer (fills to P25)
    ↓ Shared Memory (/dev/shm/fill_ring_oe)
Project 25: (receives fills)
    └─ Updates position, PnL
Build Time: ~30 seconds
Hardware Status: Tested with FPGA PCIe transmitter (Project 23)
Last Updated: January 2026