
Talk: What is Low Latency C++? - Timur Doumler - CppNow 2023

Resources:

Introduction

This is a high-level overview of techniques used in low-latency programming across different industries. It is in part a summary of the talk by Timur linked above.

(I will mostly provide the notes I took while watching this talk, as well as some links to other talks mentioned in it.)

What is this low latency thing about?

  • latency (time between question and answer) vs throughput (how much work per unit of time). latency bifurcates into
    • ultra low latency: minimize average response time, e.g. HFT
    • real-time programming: hard upper bound, optimize for worst case, e.g. audio, gaming
  • efficiency: doing the minimal amount of work needed for a task; thus it lies in between latency and throughput.
  • common terms for low latency (even across industries)
    • hot path: one main thread which has hard time constraints. other threads are more relaxed.
    • real-time: program needs to produce correct result within a time limit (deadline)
  • C++ advantages: manual memory management, zero-cost abstractions, many libraries
  • C++ techniques: efficient programming, programming for deterministic execution time
  • measuring is important and hard

Efficient programming

  • efficient: do the minimal required amount of work for given task
  • important topics but covered more in depth in other talks
    • knowledge of programming language
      • avoid unnecessary copies, e.g.

        ```cpp
        std::vector<std::string> strings = { /* */ };

        for (auto s : strings)        // bad: copies each string, which is costly
        for (const auto& s : strings) // good: no copies
        ```
      • avoid unnecessary function calls: use inline; prefer std::variant, CRTP, “deducing this” (C++23) over virtual functions
      • compile time computations: constexpr, TMP, lookup tables at compile time
    • knowledge of libraries used
    • knowledge of compiler (optimizer, ABI)
  • knowledge of hardware architecture: CPU architecture, cache hierarchy, prefetcher, branch predictor
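
The compile-time computation point can be illustrated with a minimal sketch (my own example, not from the talk): a lookup table built entirely by the compiler with `constexpr`, so the hot path does a single array access instead of arithmetic. The names `make_square_table` and `fast_square` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Build a lookup table at compile time: the loop runs in the compiler,
// not at runtime.
constexpr std::array<int, 256> make_square_table() {
    std::array<int, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = static_cast<int>(i * i);
    return table;
}

inline constexpr auto square_table = make_square_table();

static_assert(square_table[12] == 144); // verified at compile time

int fast_square(unsigned char x) {
    return square_table[x]; // hot path: one O(1) lookup, no multiplication
}
```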

Programming for deterministic execution time

The following must be avoided on the hot path:

  • dynamic memory allocation
    • no data types that allocate
    • no algorithms that allocate, e.g. std::stable_sort, std::stable_partition, std::inplace_merge
    • no data structures that allocate
      • use: std::{array,pair,tuple,optional,variant}
      • don’t use: std::{vector, deque, map, ...} (heap-allocating containers) or std::{any, function} (type erasure may allocate)
      • workaround: custom allocator
        • general-purpose allocators (e.g. tcmalloc, rpmalloc) optimize for average not worst case. thus not useful for low latency
        • preallocate memory (std::pmr allocators)
        • custom data types
      • lambdas do not allocate but coroutines might
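
The preallocation workaround with std::pmr can be sketched roughly like this (my own minimal example, not code from the talk): a fixed buffer is set up before entering the hot path, and a pmr container draws from it without ever touching the global heap.

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Preallocated buffer; all container memory comes from here.
std::array<std::byte, 4096> buffer;

// null_memory_resource() as upstream means that exhausting the buffer
// throws instead of silently falling back to the general-purpose heap.
std::pmr::monotonic_buffer_resource pool{
    buffer.data(), buffer.size(), std::pmr::null_memory_resource()};

// Reserve capacity before entering the hot path; afterwards push_back
// within capacity performs no allocation at all.
std::pmr::vector<int> samples{&pool};
```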
  • block thread
    • problems with mutexes
      • unclear when mutex is released
      • priority inversion: a low-priority thread holds the mutex while a high-priority thread waits for it, effectively turning the high-priority thread into a low-priority one.
    • atomic: indivisible, race-free (but might be implemented with a mutex internally)
    • lock-free: at least one thread makes progress
      • thus no mutex possible (because thread holding mutex might go to sleep and then no progress)
    • wait-free: all threads make progress
      • only use std::atomic, std::atomic_ref (C++20)
      • no CAS loop on atomic
      • two main scenarios:
        • passing data to/from hot path thread: wait-free queue
          • different flavours possible. most used: single-producer single-consumer via ring buffer
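
A single-producer single-consumer ring buffer can be sketched as follows (a minimal wait-free version of my own, not the implementation from the talk): the only shared state is two atomic indices, one written exclusively by the producer and one by the consumer, so neither side ever blocks.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscQueue {
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0}; // advanced by the consumer only
    std::atomic<std::size_t> tail_{0}; // advanced by the producer only
public:
    bool push(const T& v) {            // call from the producer thread only
        auto t = tail_.load(std::memory_order_relaxed);
        auto next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;              // full: caller drops or retries, never blocks
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {           // call from the consumer thread only
        auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;       // empty
        T v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return v;
    }
};
```

Note that one slot is sacrificed to distinguish "full" from "empty", so the usable capacity is N - 1.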
        • sharing data with hot path thread:
          • hot path reads:
            • try_lock: try to lock a mutex to read the data, but fail immediately if it is already locked; needs a fallback strategy

              ```cpp
              struct Coefficients { /* ... */ };

              Coefficients coeffs;
              crill::spin_mutex mtx; // std::mutex involves syscalls; better a
                                     // spinlock with progressive fallback

              // hot path thread
              void process(audio_buffer& buffer) {
                  if (std::unique_lock lock(mtx, std::try_to_lock); lock.owns_lock())
                      process(buffer, coeffs);
                  else
                      ; /* fallback strategy, e.g. use values from the previous loop */
              }

              // non-hot-path thread
              void update_coeffs(Coefficients new_coeffs) {
                  std::lock_guard lock(mtx); // locking here is ok
                  coeffs = new_coeffs;
              }
              ```

            • spin-on-write: keep the current object behind an atomic pointer and make sure that when one thread swaps in a new object, the other thread is not still using the old one

              ```cpp
              std::unique_ptr<Coefficients> storage; // owns the current object
              std::atomic<Coefficients*> coeffs;

              // hot path thread
              void process(audio_buffer& buffer) {
                  // exchange with nullptr so the other thread cannot delete
                  // the object while it is in use
                  auto* current_coeffs = coeffs.exchange(nullptr);
                  process(buffer, *current_coeffs);
                  coeffs.store(current_coeffs);
              }

              // non-hot-path thread
              void update_coeffs(Coefficients new_coeffs) {
                  auto new_storage = std::make_unique<Coefficients>(new_coeffs);
                  for (auto* expected = storage.get();
                       !coeffs.compare_exchange_weak(expected, new_storage.get());
                       expected = storage.get())
                      /* spin */;
                  storage = std::move(new_storage);
              }
              ```

            • issue: two(!) atomic read-modify-write operations (exchange + store) per read on the hot path

            • RCU (Read, Copy, Update)

          • hot path writes:
            • double-buffering
              • writing always in buffer A
              • reader comes along. atomically swaps buffer (now writer writes into buffer B)
              • reader reads buffer A
              • example shown in “What is Low Latency C++? Part 2” 1:07:45 (taken from different talk)
            • SeqLock
              • use atomic counter
              • writer: increment counter, write data, increment counter
              • reader: wait until even counter, read, check if counter still the same
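
The seqlock protocol above can be sketched as follows (a minimal single-writer version of my own; `Data` is a hypothetical struct, and in production code the copies should be done with atomics or memcpy to avoid formal data races):

```cpp
#include <atomic>
#include <cstdint>

struct Data { std::int64_t a, b; };

std::atomic<std::uint32_t> seq{0};
Data shared{};

void write(const Data& d) { // single writer, never blocks on readers
    seq.fetch_add(1, std::memory_order_release); // odd: write in progress
    shared = d;
    seq.fetch_add(1, std::memory_order_release); // even again: consistent
}

Data read() {               // any number of readers
    Data copy;
    std::uint32_t before, after;
    do {
        before = seq.load(std::memory_order_acquire);
        copy = shared;
        after = seq.load(std::memory_order_acquire);
    } while (before != after || (before & 1)); // retry if torn or mid-write
    return copy;
}
```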
          • simplest solution: put data into std::atomic if it fits
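
Whether the data actually fits can be checked at compile time (my own sketch; `Params` is a hypothetical struct that fits into one machine word on typical 64-bit targets):

```cpp
#include <atomic>
#include <cstdint>

struct Params { float gain; std::int32_t mode; }; // 8 bytes, trivially copyable

// Verify there is no hidden mutex inside the atomic.
static_assert(std::atomic<Params>::is_always_lock_free);

std::atomic<Params> params{Params{1.0f, 0}};
```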
  • I/O
    • with other threads: use single-producer single-consumer queue
    • with other processes: shared memory
    • with hardware: direct memory access
  • exceptions: avoid throwing on the hot path; prefer error codes, std::optional, or std::expected (C++23)
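
As a sketch of exception-free error reporting (my own example; `parse_fixed_point` is a hypothetical function, and std::optional is used here instead of C++23's std::expected for portability): failure travels through the return type, so the control flow stays deterministic with no stack unwinding.

```cpp
#include <cstdint>
#include <optional>

// Returns the parsed value, or std::nullopt on malformed input; the
// error path is an ordinary return, not a throw.
std::optional<std::int64_t> parse_fixed_point(const char* s) {
    if (!s || *s == '\0') return std::nullopt;
    std::int64_t value = 0;
    for (; *s; ++s) {
        if (*s < '0' || *s > '9') return std::nullopt;
        value = value * 10 + (*s - '0');
    }
    return value;
}
```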
  • context switching (user/kernel space): use thread priority
  • syscalls: avoid them
  • calling into unknown code
  • loops without a guaranteed upper bound on iterations