The outer function, which is the one called from user code, does little more than own the atomic size_t write index, which it initializes to zero. It then calls the first inner function, which parallelizes the work further, and finally returns an iterator to one past the last element written:
template <typename SrcIt, typename DstIt, typename Pred>
auto par_copy_if_sync(SrcIt first, SrcIt last, DstIt dst,
                      Pred p, size_t chunk_sz) {
  auto dst_write_idx = std::atomic_size_t{0};
  _inner_par_copy_if_sync(first, last, dst, dst_write_idx, p, chunk_sz);
  return std::next(dst, dst_write_idx);
}
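The inner function is not shown in this passage, so the following is only a minimal sketch of how such a function could look. It assumes a divide-and-conquer split driven by std::async and a fetch_add on the shared index for every accepted element. The names sketch_inner_copy_if_sync, sketch_par_copy_if_sync, and sketch_demo are hypothetical and chosen so they do not clash with the listing above.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <future>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical sketch of an inner function: not the original listing.
template <typename SrcIt, typename DstIt, typename Pred>
void sketch_inner_copy_if_sync(SrcIt first, SrcIt last, DstIt dst,
                               std::atomic_size_t& dst_write_idx,
                               Pred p, std::size_t chunk_sz) {
  using SrcDiff = typename std::iterator_traits<SrcIt>::difference_type;
  using DstDiff = typename std::iterator_traits<DstIt>::difference_type;
  const auto n = static_cast<std::size_t>(std::distance(first, last));
  // Small enough (or chunk_sz == 0 with at most one element): run sequentially.
  if (n <= std::max<std::size_t>(chunk_sz, 1)) {
    for (auto it = first; it != last; ++it) {
      if (p(*it)) {
        // Atomically reserve a unique slot in the destination range.
        const auto idx = dst_write_idx.fetch_add(1);
        *std::next(dst, static_cast<DstDiff>(idx)) = *it;
      }
    }
    return;
  }
  // Otherwise split in two; with the default launch policy the first half
  // may run on another thread or be deferred until wait() is called.
  const auto middle = std::next(first, static_cast<SrcDiff>(n / 2));
  auto fut = std::async([=, &dst_write_idx] {
    sketch_inner_copy_if_sync(first, middle, dst, dst_write_idx, p, chunk_sz);
  });
  sketch_inner_copy_if_sync(middle, last, dst, dst_write_idx, p, chunk_sz);
  fut.wait();
}

// Wrapper mirroring the outer function above, calling the sketch instead.
template <typename SrcIt, typename DstIt, typename Pred>
auto sketch_par_copy_if_sync(SrcIt first, SrcIt last, DstIt dst,
                             Pred p, std::size_t chunk_sz) {
  using DstDiff = typename std::iterator_traits<DstIt>::difference_type;
  auto dst_write_idx = std::atomic_size_t{0};
  sketch_inner_copy_if_sync(first, last, dst, dst_write_idx, p, chunk_sz);
  return std::next(dst, static_cast<DstDiff>(dst_write_idx.load()));
}

// Usage sketch: copy the even numbers out of 0..999.
inline std::size_t sketch_demo() {
  auto src = std::vector<int>(1000);
  std::iota(src.begin(), src.end(), 0);
  auto dst = std::vector<int>(src.size());
  auto new_end = sketch_par_copy_if_sync(
      src.begin(), src.end(), dst.begin(),
      [](int v) { return v % 2 == 0; }, 64);
  const auto copied =
      static_cast<std::size_t>(std::distance(dst.begin(), new_end));
  assert(copied == 500);  // 0, 2, ..., 998
  return copied;
}
```

In this sketch every accepted element costs one atomic increment, and because slots are handed out in whatever order the tasks reach fetch_add, the relative order of the copied elements is not guaranteed to match the source. The destination range must be at least as large as the number of elements that can pass the predicate.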