Talk: What is Low Latency C++? - Timur Doumler - CppNow 2023
Resources:
- videos on YouTube: part 1, part 2
- slides: part 1, part 2
- Timur's previous talks on low latency:
- C++ in the Audio Industry (CppCon 2015)
- C++ in the Audio Industry, Episode II: Floating Atomics (JUCE Summit 2015)
- Want fast C++? Know your hardware! (CppCon 2016)
- Lock-free programming with modern C++ (ACCU 2017)
- Using Locks in Real-Time Audio Processing, Safely (ADC 2020)
- Real-time Programming with the C++ Standard Library (CppCon 2021)
- A Lock-Free Atomic Shared Pointer in Modern Cpp (CppCon 2022)
- Thread synchronisation in real-time audio processing with RCU (ADC 2022)
Introduction
This is a high-level overview of techniques used in low-latency programming across different industries, in part a summary of the talks by Timur linked above.
(These are mostly the notes I took while watching the talk, plus links to other talks mentioned.)
What is this low latency thing about?
- latency (time between question and answer) vs. throughput (how much work per unit of time). low-latency programming splits into:
- ultra low latency: minimize average response time, e.g. HFT
- real-time programming: hard upper bound, optimize for worst case, e.g. audio, gaming
- efficiency: minimizing the work needed for a given task. it thus lies in between latency and throughput.
- common terms for low latency (even across industries)
- hot path: one main thread which has hard time constraints. other threads are more relaxed.
- real-time: program needs to produce correct result within a time limit (deadline)
- C++ advantages: manual memory management, zero-cost abstractions, many libraries
- C++ techniques: efficient programming, programming for deterministic execution time
- measuring is important and hard
- Fedor Pikus: The Art of Writing Efficient Programs (Packt Publishing, 2021)
- Optimising a Real-Time Audio Processing Library - Dave Rowland - ADC22
- profiling, performance analysis, inspect assembly, benchmarking (see collection of tools on hackingcpp.com - see cpp-resources)
- microbenchmarks are tricky: warm up the cache, randomize the heap, watch out for different optimizations
Efficient programming
- efficient: do the minimal required amount of work for a given task
- important topics but covered more in depth in other talks
- knowledge of programming language
- avoid unnecessary copies, e.g.
```cpp
std::vector<std::string> strings = { /* ... */ };
for (auto s : strings) { /* ... */ }        // bad: copies each string, which is costly
for (const auto& s : strings) { /* ... */ } // good: iterates by const reference
```
- avoid unnecessary function calls: use `inline`; prefer `std::variant`, CRTP, “deducing this” (C++23) over virtual functions
- compile-time computations: `constexpr`, TMP, lookup tables at compile time (see the sketch after this list)
- knowledge of libraries used
- knowledge of compiler (optimizer, ABI)
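- sketch of a compile-time lookup table (my example, not from the talk): precompute a table of bit-reversed bytes with `constexpr`, so the hot path pays only a single array lookup
```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// built once at compile time instead of being computed on the hot path
constexpr std::array<std::uint8_t, 256> make_bit_reverse_table() {
    std::array<std::uint8_t, 256> table{};
    for (std::size_t i = 0; i < 256; ++i) {
        auto v = static_cast<std::uint8_t>(i);
        std::uint8_t r = 0;
        for (int bit = 0; bit < 8; ++bit) {
            r = static_cast<std::uint8_t>((r << 1) | (v & 1u));
            v = static_cast<std::uint8_t>(v >> 1);
        }
        table[i] = r;
    }
    return table;
}

constexpr auto bit_reverse_table = make_bit_reverse_table(); // evaluated at compile time

std::uint8_t bit_reverse(std::uint8_t b) { return bit_reverse_table[b]; } // hot path: one load
```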
- knowledge of hardware architecture: CPU architecture, cache hierarchy, prefetcher, branch predictor
- CPU
- multiple pipelines working in parallel. each pipeline has different stages.
- CPU hazards: interrupting optimal execution of pipelines
- branch hazard: next instruction depends on condition. branch predictor might help.
- branchless programming: avoiding branch mispredicts
- Branchless Programming in C++ - Fedor Pikus (CppCon 2021)
- examples:
```cpp
bool x = /* ... */;
bool y = /* ... */;

// slow: short-circuiting x || y introduces an extra branch
if (x || y) do_this(); else do_that();

// fast: evaluates x and y in one go, leaving a single branch
if (bool(x) + bool(y)) do_this(); else do_that();
```
- `[[likely]]` does not affect the branch predictor, only code layout
- data hazard: result from one pipeline is needed for another pipeline.
- example:
```cpp
// converts a C-style string to its unsigned int representation
unsigned atoui(const char* c) {
    unsigned ret = 0;
    while ((*c >= '0') && (*c <= '9'))
        ret = ret * 10 + (*(c++) - '0'); // hazard! each iteration depends on the previous one
    return ret;
}
```
- Writing Fast Code I - Andrei Alexandrescu (code::dive conference 2015)
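- one way to shorten that dependency chain, in the spirit of the referenced talk (my sketch, not code from the talk): consume two digits per iteration, so more of each step is independent of the previous iteration
```cpp
// processes two digits per iteration; computing the two-digit chunk
// is independent of ret, so it can overlap with the previous multiply-add
unsigned atoui2(const char* c) {
    unsigned ret = 0;
    while (c[0] >= '0' && c[0] <= '9') {
        if (c[1] >= '0' && c[1] <= '9') {
            ret = ret * 100 + (c[0] - '0') * 10 + (c[1] - '0');
            c += 2;
        } else {
            ret = ret * 10 + (c[0] - '0');
            ++c;
        }
    }
    return ret;
}
```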
- hardware hazard: the CPU has limited resources, e.g. a limited number of adders/shifters.
- SIMD: operation on multiple objects in parallel
- usage: auto-vectorization, explicit SIMD libraries
- SWAR (“SIMD Within A Register”): use an ordinary integer register to operate on several smaller values at once
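- classic SWAR trick (my sketch, not from the talk): test whether a 64-bit word contains a zero byte, without looping over the bytes
```cpp
#include <cstdint>

// returns true if any of the 8 bytes in v is zero.
// the subtraction borrows into the high bit of every byte that was zero;
// masking with ~v filters out bytes that already had their high bit set
bool has_zero_byte(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```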
- cache hierarchy
- access times: register < L1 cache < L2 cache < main memory
- data cache (accessing memory) vs instruction cache
- data cache miss: ~100 cycles. needs to be avoided!
- cache-friendly: data locality, cachelines, contiguous traversal
- use `std::flat_{map,set}` (C++23): container adaptors on top of `std::vector`
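- minimal `std::flat_map` sketch (mine, not from the talk; assumes a standard library that already ships the C++23 `<flat_map>` header)
```cpp
#include <flat_map>
#include <string>

void flat_map_demo() {
    // keys and values are each stored contiguously (two std::vectors by default)
    std::flat_map<int, std::string> ids;
    ids[42] = "answer"; // insertion keeps keys sorted: O(n), so best for read-mostly data
    ids[7]  = "lucky";
    // lookup is a binary search over contiguous memory: cache-friendly
    if (ids.contains(42)) { /* ... */ }
}
```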
- cache-friendly algorithms(!) available
- instruction cache misses: avoid branches, virtual functions
- keep cache warm
- data cache: periodically poke data
- instruction cache: periodically run hot path with dummy input
- talks:
- C++ Algorithmic Complexity, Data Locality, Parallelism, Compiler Optimizations, & Some Concurrency - Avi Lachmish (CppCon 2022)
- What Do You Mean by “Cache Friendly”? - Björn Fahller (C++ on Sea 2022)
- Want fast C++? Know your hardware! - Timur Doumler (CppCon 2016)
- Cpu Caches and Why You Care - Scott Meyers (code::dive 2014)
- hardware problems
- CPU throttling: slowdown because of overheating. need to transform code to dissipate heat better.
- low energy mode because of low activity
Programming for deterministic execution time
The following must be avoided on the hot path:
- dynamic memory allocation
- no data types that allocate
- no algorithms that allocate, e.g. `std::stable_sort`, `std::stable_partition`, `std::inplace_merge`
- no data structures that allocate
- use: `std::{array, pair, tuple, optional, variant}`
- don't use: `std::{any, function, vector, deque, ...}` (type erasure)
- workaround: custom allocator
- general-purpose allocators (e.g. tcmalloc, rpmalloc) optimize for average not worst case. thus not useful for low latency
- preallocate memory (`std::pmr` allocators)
- custom data types
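- minimal preallocation sketch with `std::pmr` (mine, not from the talk): back a vector with a buffer allocated up front, so pushes on the hot path never touch the general-purpose heap
```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

void hot_path_setup() {
    // fixed buffer allocated up front; the pool hands out chunks of it
    std::array<std::byte, 4096> buffer;
    std::pmr::monotonic_buffer_resource pool{
        buffer.data(), buffer.size(),
        std::pmr::null_memory_resource()}; // no fallback: running out throws instead of silently hitting the heap
    std::pmr::vector<int> values{&pool};
    values.reserve(512); // growth comes out of the preallocated buffer
    values.push_back(42);
}
```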
- lambdas do not allocate but coroutines might
- blocking the thread
- problems with mutexes
- unclear when mutex is released
- priority inversion: a low-prio thread holds the mutex while a high-prio thread waits for it. the high-prio thread is thus effectively turned into a low-prio thread.
- atomic: indivisible, race-free (but might involve a mutex)
- lock-free: at least one thread makes progress
- thus no mutex possible (because thread holding mutex might go to sleep and then no progress)
- wait-free: all threads make progress
- only use `std::atomic`, `std::atomic_ref` (C++20)
- no CAS loops on atomics
- two main scenarios:
- passing data to/from hot path thread: wait-free queue
- different flavours possible. most used: single-producer single-consumer via ring buffer
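- minimal SPSC ring buffer sketch (mine, not from the talk; capacity N must be a power of two, exactly one producer thread and one consumer thread assumed)
```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N> // N must be a power of two
struct spsc_queue {
    std::array<T, N> buffer;
    std::atomic<std::size_t> head{0}; // written by the consumer only
    std::atomic<std::size_t> tail{0}; // written by the producer only

    bool try_push(const T& item) { // producer thread
        auto t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buffer[t & (N - 1)] = item;
        tail.store(t + 1, std::memory_order_release); // publish the item
        return true;
    }

    std::optional<T> try_pop() { // consumer thread
        auto h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return std::nullopt; // empty
        T item = buffer[h & (N - 1)];
        head.store(h + 1, std::memory_order_release); // free the slot
        return item;
    }
};
```
both operations finish in a bounded number of steps, i.e. the queue is wait-free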
- sharing data with hot path thread:
- hot path reads:
- `try_lock`: tries to read the data by locking a mutex, but fails if the mutex is already locked; needs a fallback strategy
- example:
```cpp
struct Coefficients { /* ... */ };
Coefficients coeffs;
crill::spin_mutex mtx; // std::mutex involves syscalls; better: a spinlock with progressive fallback

// hot path's thread
void process(audio_buffer& buffer) {
    if (std::unique_lock lock(mtx, std::try_to_lock); lock.owns_lock())
        process(buffer, coeffs);
    else
        ; /* fallback strategy, e.g. use values from the previous loop */
}

// non hot path thread
void update_coeffs(Coefficients new_coeffs) {
    std::lock_guard lock(mtx); // locking here is ok
    coeffs = new_coeffs;
}
```
- spin-on-write: use a `std::unique_ptr` as storage plus an atomic raw pointer, and make sure that when one thread changes the pointer, the other thread is not using the old object
```cpp
std::unique_ptr<Coefficients> storage;
std::atomic<Coefficients*> coeffs;

// hot path's thread
void process(audio_buffer& buffer) {
    auto* current_coeffs = coeffs.exchange(nullptr); // nullptr -> make sure the other thread does not delete the object in between
    process(buffer, *current_coeffs);
    coeffs.store(current_coeffs);
}

// non hot path thread
void update_coeffs(Coefficients new_coeffs) {
    auto new_storage = std::make_unique<Coefficients>(new_coeffs);
    for (auto* expected = storage.get();
         !coeffs.compare_exchange_weak(expected, new_storage.get());
         expected = storage.get())
        /* spin */;
    storage = std::move(new_storage); // frees the old object, now safe
}
```
- issue: 2(!) atomic exchanges for reading
- RCU (Read, Copy, Update)
- reading: single load
- writing: copying in new data, deferred reclamation of previous data
- deferred reclamation: “delete later when it’s safe”. 3 alternatives to implement: RCU, Hazard pointer, atomic shared ptr
- atomic shared ptr: lock-free, but lots of CAS loops. slow and not wait-free.
- hot path writes:
- double-buffering
- writing always in buffer A
- reader comes along. atomically swaps buffer (now writer writes into buffer B)
- reader reads buffer A
- example shown in “What is Low Latency C++? Part 2” 1:07:45 (taken from different talk)
- SeqLock
- use atomic counter
- writer: increment counter, write data, increment counter
- reader: wait until even counter, read, check if counter still the same
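- SeqLock sketch along those lines (mine, not from the talk; single writer assumed. the payload words are relaxed atomics because plain reads racing with the writer would formally be undefined behaviour)
```cpp
#include <atomic>
#include <cstdint>
#include <utility>

struct seqlock_pair {
    std::atomic<std::uint32_t> seq{0};
    std::atomic<std::uint64_t> a{0}, b{0}; // payload: two words read as one snapshot

    void write(std::uint64_t x, std::uint64_t y) { // single writer
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);      // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        a.store(x, std::memory_order_relaxed);
        b.store(y, std::memory_order_relaxed);
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);      // even: write finished
    }

    std::pair<std::uint64_t, std::uint64_t> read() const {
        for (;;) {
            auto before = seq.load(std::memory_order_acquire);
            if (before & 1) continue;              // writer active, retry
            auto x = a.load(std::memory_order_relaxed);
            auto y = b.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq.load(std::memory_order_relaxed) == before)
                return {x, y};                     // counter unchanged: snapshot is consistent
        }
    }
};
```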
- simplest solution: put the data into a `std::atomic` if it fits
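- e.g. (my sketch): a small trivially copyable struct shared via a single `std::atomic`; the `static_assert` checks the “if it fits” condition on the target platform
```cpp
#include <atomic>

struct Params { float gain; float pan; }; // 8 bytes, trivially copyable

// verify it is actually lock-free on this platform, i.e. that it "fits"
static_assert(std::atomic<Params>::is_always_lock_free);

std::atomic<Params> params;

// non hot path thread
void update(Params p) { params.store(p); }

// hot path thread
Params snapshot() { return params.load(); }
```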
- I/O
- with other threads: use single-producer single-consumer queue
- with other processes: shared memory
- with hardware: direct memory access
- exceptions: don't throw; use error codes, `std::optional`, or `std::expected` instead
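- e.g. error handling with `std::expected` instead of exceptions (my sketch; assumes C++23. an enum error code keeps it allocation-free)
```cpp
#include <expected>

enum class math_error { divide_by_zero };

// reports failure through the return type instead of throwing
std::expected<double, math_error> safe_divide(double a, double b) {
    if (b == 0.0) return std::unexpected(math_error::divide_by_zero);
    return a / b;
}

void caller() {
    if (auto r = safe_divide(1.0, 3.0)) { /* *r holds the value */ }
    else { /* r.error() holds the error code */ }
}
```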
- context switching (user/kernel space): use thread priority
- syscalls: avoid them
- calling into unknown code
- loops without an upper bound on the number of iterations