Talk: What is Low Latency C++? - Timur Doumler - CppNow 2023
Resources:
- videos on YouTube: part 1, part 2
- slides: part 1, part 2
- Timur's previous talks on low latency:
- C++ in the Audio Industry (CppCon 2015)
- C++ in the Audio Industry, Episode II: Floating Atomics (JUCE Summit 2015)
- Want fast C++? Know your hardware! (CppCon 2016)
- Lock-free programming with modern C++ (ACCU 2017)
- Using Locks in Real-Time Audio Processing, Safely (ADC 2020)
- Real-time Programming with the C++ Standard Library (CppCon 2021)
- A Lock-Free Atomic Shared Pointer in Modern Cpp (CppCon 2022)
- Thread synchronisation in real-time audio processing with RCU (ADC 2022)
Introduction
This is a high-level overview of techniques used in low-latency programming across different industries, in part a summary of the talks by Timur linked above.
(These are mostly the notes I took while watching the talk, plus links to other talks mentioned.)
What is this low latency thing about?
- latency (time between question and answer) vs. throughput (how much work per unit of time). low-latency programming splits into:
- ultra low latency: minimize average response time, e.g. HFT
- real-time programming: hard upper bound, optimize for worst case, e.g. audio, gaming
- efficiency: minimizing the work needed for a given task. it thus lies in between latency and throughput.
- common terms for low latency (even across industries)
- hot path: one main thread which has hard time constraints. other threads are more relaxed.
- real-time: program needs to produce correct result within a time limit (deadline)
- C++ advantages: manual memory management, zero-cost abstractions, many libraries
- C++ techniques: efficient programming, programming for deterministic execution time
- measuring is important and hard
- Fedor Pikus: The Art of Writing Efficient Programs (Packt Publishing, 2021)
- Optimising a Real-Time Audio Processing Library - Dave Rowland - ADC22
- profiling, performance analysis, inspect assembly, benchmarking (see collection of tools on hackingcpp.com - see cpp-resources)
- microbenchmarks are tricky: warm up the cache, randomize the heap, watch out for different optimizations
Efficient programming
- efficient: do the minimal required amount of work for a given task
- important topics but covered more in depth in other talks
- knowledge of programming language
- avoid unnecessary copies, e.g.
```cpp
std::vector<std::string> strings = { /* ... */ };
for (auto s : strings) { /* ... */ }        // bad: copies each string, which is costly
for (const auto& s : strings) { /* ... */ } // good: iterates by const reference
```
- avoid unnecessary function calls: use `inline`; prefer `std::variant`, CRTP, “deducing this” (C++23) over virtual functions
- compile-time computations: `constexpr`, TMP, lookup tables at compile time (see the sketch after this list)
- knowledge of libraries used
- knowledge of compiler (optimizer, ABI)
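- sketch of a compile-time lookup table (my example, not from the talk): precompute a table of bit-reversed bytes with `constexpr`, so the hot path pays only a single array lookup
```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// built once at compile time instead of being computed on the hot path
constexpr std::array<std::uint8_t, 256> make_bit_reverse_table() {
    std::array<std::uint8_t, 256> table{};
    for (std::size_t i = 0; i < 256; ++i) {
        auto v = static_cast<std::uint8_t>(i);
        std::uint8_t r = 0;
        for (int bit = 0; bit < 8; ++bit) {
            r = static_cast<std::uint8_t>((r << 1) | (v & 1u));
            v = static_cast<std::uint8_t>(v >> 1);
        }
        table[i] = r;
    }
    return table;
}

constexpr auto bit_reverse_table = make_bit_reverse_table(); // evaluated at compile time

std::uint8_t bit_reverse(std::uint8_t b) { return bit_reverse_table[b]; } // hot path: one load
```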
- knowledge of hardware architecture: CPU architecture, cache hierarchy, prefetcher, branch predictor
- CPU
- multiple pipelines working in parallel. each pipeline has different stages.
- CPU hazards: interrupting optimal execution of pipelines
- branch hazard: next instruction depends on condition. branch predictor might help.
- branchless programming: avoiding branch mispredicts
- Branchless Programming in C++ - Fedor Pikus (CppCon 2021)
- examples:
```cpp
bool x = /* ... */;
bool y = /* ... */;

// slow: short-circuiting x || y introduces an extra branch
if (x || y) do_this(); else do_that();

// fast: evaluates x and y in one go, leaving a single branch
if (bool(x) + bool(y)) do_this(); else do_that();
```
- `[[likely]]` does not affect the branch predictor, only code layout
- data hazard: result from one pipeline is needed for another pipeline.
- example:
```cpp
// converts a C-style string to its unsigned int representation
unsigned atoui(const char* c) {
    unsigned ret = 0;
    while ((*c >= '0') && (*c <= '9'))
        ret = ret * 10 + (*(c++) - '0'); // hazard! each iteration depends on the previous one
    return ret;
}
```
- Writing Fast Code I - Andrei Alexandrescu (code::dive conference 2015)
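- one way to shorten that dependency chain, in the spirit of the referenced talk (my sketch, not code from the talk): consume two digits per iteration, so more of each step is independent of the previous iteration
```cpp
// processes two digits per iteration; computing the two-digit chunk
// is independent of ret, so it can overlap with the previous multiply-add
unsigned atoui2(const char* c) {
    unsigned ret = 0;
    while (c[0] >= '0' && c[0] <= '9') {
        if (c[1] >= '0' && c[1] <= '9') {
            ret = ret * 100 + (c[0] - '0') * 10 + (c[1] - '0');
            c += 2;
        } else {
            ret = ret * 10 + (c[0] - '0');
            ++c;
        }
    }
    return ret;
}
```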
- hardware hazard: the CPU has limited resources, e.g. a limited number of adders/shifters.
- SIMD: operation on multiple objects in parallel
- usage: auto-vectorization, explicit SIMD libraries
- SWAR (“SIMD Within A Register”): use an ordinary integer register to operate on several smaller values at once
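- classic SWAR trick (my sketch, not from the talk): test whether a 64-bit word contains a zero byte, without looping over the bytes
```cpp
#include <cstdint>

// returns true if any of the 8 bytes in v is zero.
// the subtraction borrows into the high bit of every byte that was zero;
// masking with ~v filters out bytes that already had their high bit set
bool has_zero_byte(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```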
- cache hierarchy
- access times: register < L1 cache < L2 cache < main memory
- data cache (accessing memory) vs instruction cache
- data cache miss: ~100 cycles. needs to be avoided!
- cache-friendly: data locality, cachelines, contiguous traversal
- use `std::flat_{map,set}` (C++23): container adaptors on top of `std::vector`
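- minimal `std::flat_map` sketch (mine, not from the talk; assumes a standard library that already ships the C++23 `<flat_map>` header)
```cpp
#include <flat_map>
#include <string>

void flat_map_demo() {
    // keys and values are each stored contiguously (two std::vectors by default)
    std::flat_map<int, std::string> ids;
    ids[42] = "answer"; // insertion keeps keys sorted: O(n), so best for read-mostly data
    ids[7]  = "lucky";
    // lookup is a binary search over contiguous memory: cache-friendly
    if (ids.contains(42)) { /* ... */ }
}
```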
- cache-friendly algorithms(!) available
- instruction cache misses: avoid branches, virtual functions
- keep cache warm
- data cache: periodically poke data
- instruction cache: periodically run hot path with dummy input
- talks:
- C++ Algorithmic Complexity, Data Locality, Parallelism, Compiler Optimizations, & Some Concurrency - Avi Lachmish (CppCon 2022)
- What Do You Mean by “Cache Friendly”? - Björn Fahller (C++ on Sea 2022)
- Want fast C++? Know your hardware! - Timur Doumler (CppCon 2016)
- Cpu Caches and Why You Care - Scott Meyers (code::dive 2014)
- hardware problems
- CPU throttling: slowdown because of overheating. need to transform code to dissipate heat better.
- low energy mode because of low activity
Programming for deterministic execution time
The following must be avoided on the hot path:
- dynamic memory allocation
- no data types that allocate
- no algorithms that allocate, e.g. `std::stable_sort`, `std::stable_partition`, `std::inplace_merge`
- no data structures that allocate
- use: `std::{array, pair, tuple, optional, variant}`
- don't use: `std::{any, function, vector, deque, ...}` (type erasure)
- workaround: custom allocator
- general-purpose allocators (e.g. tcmalloc, rpmalloc) optimize for average not worst case. thus not useful for low latency
- preallocate memory (`std::pmr` allocators)
- custom data types
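- minimal preallocation sketch with `std::pmr` (mine, not from the talk): back a vector with a buffer allocated up front, so pushes on the hot path never touch the general-purpose heap
```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

void hot_path_setup() {
    // fixed buffer allocated up front; the pool hands out chunks of it
    std::array<std::byte, 4096> buffer;
    std::pmr::monotonic_buffer_resource pool{
        buffer.data(), buffer.size(),
        std::pmr::null_memory_resource()}; // no fallback: running out throws instead of silently hitting the heap
    std::pmr::vector<int> values{&pool};
    values.reserve(512); // growth comes out of the preallocated buffer
    values.push_back(42);
}
```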
- lambdas do not allocate but coroutines might
- blocking the thread
- problems with mutexes
- unclear when mutex is released
- priority inversion: a low-prio thread holds the mutex while a high-prio thread waits for it. the high-prio thread is thus effectively turned into a low-prio thread.
- atomic: indivisible, race-free (but might involve a mutex)
- lock-free: at least one thread makes progress
- thus no mutex possible (because thread holding mutex might go to sleep and then no progress)
- wait-free: all threads make progress
- only use `std::atomic`, `std::atomic_ref` (C++20)
- no CAS loops on atomics
- two main scenarios:
- passing data to/from hot path thread: wait-free queue
- different flavours possible. most used: single-producer single-consumer via ring buffer
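- minimal SPSC ring buffer sketch (mine, not from the talk; capacity N must be a power of two, exactly one producer thread and one consumer thread assumed)
```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N> // N must be a power of two
struct spsc_queue {
    std::array<T, N> buffer;
    std::atomic<std::size_t> head{0}; // written by the consumer only
    std::atomic<std::size_t> tail{0}; // written by the producer only

    bool try_push(const T& item) { // producer thread
        auto t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buffer[t & (N - 1)] = item;
        tail.store(t + 1, std::memory_order_release); // publish the item
        return true;
    }

    std::optional<T> try_pop() { // consumer thread
        auto h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return std::nullopt; // empty
        T item = buffer[h & (N - 1)];
        head.store(h + 1, std::memory_order_release); // free the slot
        return item;
    }
};
```
both operations finish in a bounded number of steps, i.e. the queue is wait-free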
- sharing data with hot path thread:
- hot path reads:
- `try_lock`: tries to read the data by locking a mutex, but fails if the mutex is already locked; needs a fallback strategy
- example:
```cpp
struct Coefficients { /* ... */ };
Coefficients coeffs;
crill::spin_mutex mtx; // std::mutex involves syscalls; better: a spinlock with progressive fallback

// hot path's thread
void process(audio_buffer& buffer) {
    if (std::unique_lock lock(mtx, std::try_to_lock); lock.owns_lock())
        process(buffer, coeffs);
    else
        ; /* fallback strategy, e.g. use values from the previous loop */
}

// non hot path thread
void update_coeffs(Coefficients new_coeffs) {
    std::lock_guard lock(mtx); // locking here is ok
    coeffs = new_coeffs;
}
```
- spin-on-write: use a `std::unique_ptr` as storage plus an atomic raw pointer, and make sure that when one thread changes the pointer, the other thread is not using the old object
```cpp
std::unique_ptr<Coefficients> storage;
std::atomic<Coefficients*> coeffs;

// hot path's thread
void process(audio_buffer& buffer) {
    auto* current_coeffs = coeffs.exchange(nullptr); // nullptr -> make sure the other thread does not delete the object in between
    process(buffer, *current_coeffs);
    coeffs.store(current_coeffs);
}

// non hot path thread
void update_coeffs(Coefficients new_coeffs) {
    auto new_storage = std::make_unique<Coefficients>(new_coeffs);
    for (auto* expected = storage.get();
         !coeffs.compare_exchange_weak(expected, new_storage.get());
         expected = storage.get())
        /* spin */;
    storage = std::move(new_storage); // frees the old object, now safe
}
```
- issue: 2(!) atomic exchanges for reading
- RCU (Read, Copy, Update)
- reading: single load
- writing: copying in new data, deferred reclamation of previous data
- deferred reclamation: “delete later when it’s safe”. 3 alternatives to implement: RCU, Hazard pointer, atomic shared ptr
- atomic shared ptr: lock-free, but lots of CAS loops. slow and not wait-free.
- hot path writes:
- double-buffering
- writing always in buffer A
- reader comes along. atomically swaps buffer (now writer writes into buffer B)
- reader reads buffer A
- example shown in “What is Low Latency C++? Part 2” 1:07:45 (taken from different talk)
- SeqLock
- use atomic counter
- writer: increment counter, write data, increment counter
- reader: wait until even counter, read, check if counter still the same
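- SeqLock sketch along those lines (mine, not from the talk; single writer assumed. the payload words are relaxed atomics because plain reads racing with the writer would formally be undefined behaviour)
```cpp
#include <atomic>
#include <cstdint>
#include <utility>

struct seqlock_pair {
    std::atomic<std::uint32_t> seq{0};
    std::atomic<std::uint64_t> a{0}, b{0}; // payload: two words read as one snapshot

    void write(std::uint64_t x, std::uint64_t y) { // single writer
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);      // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        a.store(x, std::memory_order_relaxed);
        b.store(y, std::memory_order_relaxed);
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);      // even: write finished
    }

    std::pair<std::uint64_t, std::uint64_t> read() const {
        for (;;) {
            auto before = seq.load(std::memory_order_acquire);
            if (before & 1) continue;              // writer active, retry
            auto x = a.load(std::memory_order_relaxed);
            auto y = b.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq.load(std::memory_order_relaxed) == before)
                return {x, y};                     // counter unchanged: snapshot is consistent
        }
    }
};
```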
- simplest solution: put the data into a `std::atomic` if it fits
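- e.g. (my sketch): a small trivially copyable struct shared via a single `std::atomic`; the `static_assert` checks the “if it fits” condition on the target platform
```cpp
#include <atomic>

struct Params { float gain; float pan; }; // 8 bytes, trivially copyable

// verify it is actually lock-free on this platform, i.e. that it "fits"
static_assert(std::atomic<Params>::is_always_lock_free);

std::atomic<Params> params;

// non hot path thread
void update(Params p) { params.store(p); }

// hot path thread
Params snapshot() { return params.load(); }
```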
- I/O
- with other threads: use single-producer single-consumer queue
- with other processes: shared memory
- with hardware: direct memory access
- exceptions: don't throw; use error codes, `std::optional`, or `std::expected` instead
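- e.g. error handling with `std::expected` instead of exceptions (my sketch; assumes C++23. an enum error code keeps it allocation-free)
```cpp
#include <expected>

enum class math_error { divide_by_zero };

// reports failure through the return type instead of throwing
std::expected<double, math_error> safe_divide(double a, double b) {
    if (b == 0.0) return std::unexpected(math_error::divide_by_zero);
    return a / b;
}

void caller() {
    if (auto r = safe_divide(1.0, 3.0)) { /* *r holds the value */ }
    else { /* r.error() holds the error code */ }
}
```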
- context switching (user/kernel space): use thread priority
- syscalls: avoid them
- calling into unknown code
- loops without an upper bound on the number of iterations