layout	title
post	第二十期

C++ 中文周刊第20期

从reddit/hackernews/lobsters/meetingcpp摘抄一些c++动态。

每周更新

周刊项目地址 github，在线地址｜知乎专栏

欢迎投稿，推荐或自荐文章/软件/资源等，请提交 issue

资讯

编译器信息最新动态推荐关注hellogcc公众号

本周周报github直达

cppcheck发布新版本https://sourceforge.net/p/cppcheck/news/2021/07/cppcheck-25/

文章

C++ folly库解读（三）Synchronized —— 比std::lock_guard/std::unique_lock更易用、功能更强大的同步机制

这个思路在第三期介绍过介绍过，就是libguarded 的设计

/// libguarded
plain_guarded<int> data;
...
int getter() const {
  auto p = data.lock();
  return *p;
}
/// folly
class RequestHandler {
  ...
  Synchronized<RequestQueue> requestQueue_;
  Synchronized<std::map<std::string, Endpoint>> requestEndpoints_;
  Synchronized<HandlerState> workState_;
  ...
};

void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  requestQueue_.wlock()->push_back(request);
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}

用法基本一致。不太清楚这种设计谁先谁后

Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator

tcmalloc出了一篇论文

重复造轮子的喜悦：从零开始用C++写一个最基础的BitTorrent客户端

还是挺有意思的，基本的调api

c++ tip of week 233 Did you know that C++20 made typename more optional?

c++20之后typename可以省一些，~~不过对于新人理解后面的代码不知道是不是一种障碍~~

template<class T> /*typename*/T::type return_type();   // okay
template<class T> void void_parameter(/*typename*/T::type); // error: variable or field 'parameter' declared void
template<class T> auto auto_parameter(/*typename*/T::type); // okay

template<class T>
struct traits {
  using type = /*typename*/ T::type; // okay
};

可以这里把玩一下

Queens！编译期八皇后综合应用初探C++模板元编程《八》

他写的这个系列都挺好的，了解一些模版元编程的支持。~~虽然用处不大~~

Doc Tools for C++ Libraries

列了几个c++文档生成方案

hyde. 比较好用，可以集成到cmake里

standardese 比较好用，也可以集成到cmake里

还有Doxygen + Breathe + Exhale + Sphinx, 这个比较难用

How a Bug(?) in the Linux CRC-32 Checksum Turned out not to be a Bug

这个背景知识是CRC算法，原理这里就不展开了，可以看这个和这个

这里涉及到一个算式，用的参数，使用不一样居然没什么影响

0x1DB710640 camp:

Rust

Intel with a guide and code

Other “clean” implementations

0x1DB710641 camp:

Linux

Zlib

Golang

GNUPG

TODO:这里的数学我没研究明白，比如Barrett reduction算法

VLA are... problematic

弹性数组，我劝你别用

安全问题，破坏栈空间

代码维护一坨屎

编译器也要为了vla修补各种坑

弹性数组的弹性数组，想想是不是想吐了

sizeof适配vla

vla和goto混在一起，想想就不想活了

用malloc害能咋的！

Be Wise, Sanitize - Keeping Your C++ Code Free From Bugs

还是讲的usan asan那些东西

Transrangers in Rust: a C++/Rust comparison

一个性能对比，push style的range，也就是transrange库，和对应的rust实现pushgen，对比普通的循环/range，在不同编译器下的表现，基本上吊锤range-v3。rust还优化了一些，效果更好一点 ~~当前range 优化较差，甚至不如for循环~~

push的设计看起来是是一种趋势。~~数据库query engine也有pull和push区分。传统的是pull拉取行。演化上向push filter来发展~~

测试场景

Test 1: filter|transform over 1M integers (Rust filter.map.)

Test 2: concat|take(1.5M)|filter|transform over two vectors of 1M integers each (Rust chain.take.filter.map.)

Test 3: unique|filter over 100k integers (Rust dedup.filter.)

Test 4: join|unique|filter|transform over a collection of 10 vectors of 100k integers each (Rust flatten.dedup.filter.map.)

Test 5: transform(unique)|join|filter|transform over a collection of 10 vectors of 100k integers each. (Rust map(dedup).flatten.filter.map.)

Test 6: zip(·,·|transform)|transform(sum)|filter over two vectors of 1M integers each (Rust zip(.map).map(sum).filter.)

测试时间数据(ms)

Test	GCC 11.1 transrangers	GCC 11.1 Range-v3	Clang 12.0 transrangers	Clang 12.0 Range-v3	Rust pushgen	Rust iterators `for_each`	Rust iterators `try_for_each`	Rust iterators `for` loop
Test 1	211	512	176	511	174	171	829	422
Test 2	1091	4075	1056	7231	923	2743	2834	2911
Test 3	24	73	28	88	24	93	38	65
Test 4	752	603	249	896	304	465	1022	1901
Test 5	286	997	270	954	306	324	730	675
Test 6	931	1016	345	1199	356	353	1089	751

lean tricks

这是个老博客了，偶尔发现的，把这里介绍的小点子直接转载过来了，有几个曾经介绍过，剩下的不清楚在现在是否还有优势

计算对齐

template<typename U>
 static inline char* align_for(char* ptr)
 {
     const std::size_t alignment = std::alignment_of<U>::value;
     return ptr + (alignment - (reinterpret_cast<std::uintptr_t>(ptr) % alignment)) % alignment;
 }

获得最接近且不小于当前数的二次幂

这个是获得比最小的不小于x的2n

。

template<typename T>
 static inline T ceil_to_pow_2(T x)
 {
     static_assert(std::is_integral<T>::value && !std::numeric_limits<T>::is_signed, "ceil_to_pow_2 is intended to be used only with unsigned integer types");

     // Adapted from http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
     --x;
     x |= x >> 1;
     x |= x >> 2;
     x |= x >> 4;
     for (std::size_t i = 1; i < sizeof(T); i <<= 1) {
         x |= x >> (i << 3);
     }
     ++x;
     return x;
 }

三个数获得最小值或最大值

避免判断分支的stall，于是直接用标志位来计算了。出处见pbrt。

int a[3];
int bits = ((a[0] < a[1]) << 2) + ((a[0] < a[2]) << 1) + (a[1] < a[2] 
const int smallest[8] =  [2, 1, 2, 1, 2, 2, 0, 0]
int smallest_v = a[smallest[bits]]
int bits = ((a[0] > a[1]) << 2) + ((a[0] > a[2]) << 1) + (a[1] > a[2] 
const int biggest[8] = [2, 1, 2, 1, 2, 2, 0, 0]
int biggest_v = a[biggest[bits]]

整数常量除法优化

Intel Haswell 架构的 DIV 32位除法指令的延迟（latency）是 28 个周期，吞吐率是 10 个周期。作为比较，同一架构下 MUL 32位乘法指令的延迟只是 4 个周期，吞吐率只是半个周期。所以对于常见的除以10的计算更好的方法是转化为乘法。

int a = 3728463;
int b = a/10;
int64_t c = 3435973837;
int d = static_cast<int32_t>((static_cast<int64_t>(a) * c)>>35)
assert( b == d)

这里的b == d之所以会成立是因为c>>35 = 0.10000000058 ，在精度范围内没有误差。编译器内部维护了很多小整数的除法优化表，对于常量整数的除法都可以转换为64位乘法然后右移的方式。所以下面的这段代码的两个除法热点就可以被优化了：

uint32_t p1 = /*...*/;
int kappa = 10;
uint32_t div = 1000000000;
while (kappa > 0) {
 d = p1 / div; // 第一个除法
 if (d || *len)
     buffer[(*len)++] = static_cast<char>('0' + static_cast<char>(d));
 p1 %= div;
 kappa--;
 div /= 10;    // 第二个除法
 // ...
}

这里的div并不是常量，在循环过程中会变，转化为常量之后就可以很大程度的提高性能。

while (kappa > 0) {
 uint32_t d = 0;
 switch (kappa) {
     case  9: d = p1 /  100000000; p1 %=  100000000; break;
     case  8: d = p1 /   10000000; p1 %=   10000000; break;
     case  7: d = p1 /    1000000; p1 %=    1000000; break;
     case  6: d = p1 /     100000; p1 %=     100000; break;
     case  5: d = p1 /      10000; p1 %=      10000; break;
     case  4: d = p1 /       1000; p1 %=       1000; break;
     case  3: d = p1 /        100; p1 %=        100; break;
     case  2: d = p1 /         10; p1 %=         10; break;
     case  1: d = p1;              p1 =           0; break;
     default:;
 }
 if (d || *len)
     buffer[(*len)++] = static_cast<char>('0' + static_cast<char>(d));
 kappa--;
 // ...
}

这里不需要担心case里的两个除法，现代的编译器针对取模运算和除法运算右边的数相同时，会优化为一条指令。相关资料来自于Milo 的blog

utf8的parse优化

utf8 的字节编码是变长的，实际使用过程中经常需要parse之后转变为定长的来处理。这里的热点就是扫描分割边长编码，这里我们为了简单起见只处理最长为4字节的变长编码。最简单的实现是这样的：

     vector<uint32_t> utf8_to_uint(const string& text) const
     {
         unsigned char u, v, w, x, y, z;
         vector<uint32_t> utf8_result;
         int num_chars = 0;
         uint32_t num_bytes = text.length();
         long iii = 0;
         while (iii < num_bytes)
         {
             uint32_t cur_utf8_char = 0;
             z = text[iii];
             if (z <= 127)
             {
                 cur_utf8_char = z;
             }
             if (z >= 192 && z <= 223)
             {
                 iii++;
                 y = text[iii];
                 cur_utf8_char = (z - 192) * 64 + (y - 128);
             }
             if (z >= 224 && z <= 239)
             {
                 iii++; y = text[iii];
                 iii++; x = text[iii];
                 cur_utf8_char = (z - 224) * 4096 + (y - 128) * 64 + (x - 128);
             }
             if ((240 <= z)
             {
                 iii++; y = text[iii];
                 iii++; x = text[iii];
                 iii++; w = text[iii];
                 cur_utf8_char = (z - 240) * 262144 + (y - 128) * 4096 + (x - 128) * 64 + (w - 128);
             }
             utf8_result.push_back(cur_utf8_char);
             iii++;
         }
         return utf8_result;
     }

下面有一些比较好的代码可以参考：

do {
 c = pStr[0]; if (globals::s_parse_flags[c] & 1)
     { ++pStr; break; }
 c = pStr[1]; if (globals::s_parse_flags[c] & 1)
     { pStr += 2; break; }
 c = pStr[2]; if (globals::s_parse_flags[c] & 1)
     { pStr += 3; break; }
 c = pStr[3];
 pStr += 4;
} while (!(globals::s_parse_flags[c] & 1));

这段代码粗看起来很诡异，事实上他利用了循环展开，直接让cpu有了一次性装载四个字节并判断的能力。对于第一个版本的改动就是对z <= 127进行循环展开，也试图判断四个字节。

空白字符的parse优化

在很多lex程序中都需要跳过空白字符，例如\t \n \r这四个。简单的实现会涉及到很多的branch，例如下面的代码:

template<typename InputStream>
void SkipWhitespace(InputStream& is) {
 internal::StreamLocalCopy<InputStream> copy(is);
 InputStream& s(copy.s);
while (s.Peek() == ' '  ||
        s.Peek() == '\n' ||
        s.Peek() == '\r' ||
        s.Peek() == '\t')
 {
     s.Take();
 }
}

此时可以利用intel sse4.2 pcmpistrm指令来加速比较，这个指令可以一次对一组16个字符与另外一组字符做比较，最多支持1616，虽然对于空白字符来说我们只有16 4, 不过也算加速很大了。具体实现如下:

inline const char *SkipWhitespace_SIMD(const char* p) {
 // ... 非对齐处理
static const char whitespace[16] = " \n\r\t";
 const __m128i w = _mm_load_si128((const __m128i *)&whitespace[0]);
for (;; p += 16) {
     const __m128i s = _mm_load_si128((const __m128i *)p);
     const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s, 
         _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
         _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
if (r != 0) {   // some of characters is non-whitespace
#ifdef _MSC_VER         // Find the index of first non-whitespace
         unsigned long offset;
         _BitScanForward(&offset, r);
         return p + offset;
#else
         return p + __builtin_ffs(r) - 1;
#endif
}

这里还有一个可以优化的地方，就是第一个字符串已经是非空白字符了，此时再去调用sse其实消耗很大，所以可以优化第一个字节的判断：

inline const char *SkipWhitespace_SIMD(const char* p) {
 // Fast return for single non-whitespace
 if (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t')
     ++p;
 else
     return p;
// ...
}

获取一个整数10进制字符数量

核心思想是生成二进制位对应的表，然后根据这个数字的leading zero count 来读表判断多少位。

array<array<uint64_t, 2>, 64> generate_delimit()
// 生成分隔数组
{
 array<array<uint64_t, 2>, 64> result = {};
 int _pre = 1;
 uint64_t temp_2 = 0;
 uint64_t temp_10 = 10;
 for (int i = 0; i < 64; i++)
 {
     temp_2 = (temp_2 << 1) + 1;
     uint64_t current_bit = ceil(log10(temp_2));
     if (current_bit <= _pre)
     {
         result[i][0] = temp_2;
         result[i][1] = _pre;
     }
     else
     {
         result[i][0] = temp_10 - 1;
         result[i][1] = _pre;
         _pre = current_bit;
         temp_10 = temp_10 * 10;
     }
 }

 return result;

};
uint32_t digits10(uint64_t v)
{
 const static array<array<uint64_t, 2>, 64> bit_table = generate_delimit();
 if (v == 0)
 {
     return 1;
 }
 auto cur_index = 63 - __lzcnt64(v);// 这行指令是msvc特有的 其他平台的名字不同
 return bit_table[cur_index][1] + (v > bit_table[cur_index][0]);

}

2 ways to use static with functions in C++

static bool isWineColour(const std::string& iWineCoulour) {
  static const std::array<std::string, 3> wineCoulours{ "white", "red", "rose" };
  return std::find(wineCoulours.begin(), wineCoulours.end(), iWineCoulour)
         != wineCoulours.end();
}

作者不懂static是修饰函数的。。。并且讨论了一波。这static用法经典面试题了。

Writing a simple logging function in C++20

enum class log_level : char
{
   Info = 'I',
   Warning = 'W',
   Error = 'E'
};

auto as_local(std::chrono::system_clock::time_point const tp)
{
   return std::chrono::zoned_time{ std::chrono::current_zone(), tp };
}

std::string to_string(auto tp)
{
   return std::format("{:%F %T %Z}", tp);
}

std::string to_string(std::source_location const source)
{
   return std::format("{}:{}:{}", 
      std::filesystem::path(source.file_name()).filename().string(),
      source.function_name(),
      source.line());
}

void log(log_level const level, 
         std::string_view const message, 
         std::source_location const source = std::source_location::current())
{
   std::cout
      << std::format("[{}] {} | {} | {}", 
                     static_cast<char>(level), 
                     to_string(as_local(std::chrono::system_clock::now())), 
                     to_string(source), 
                     message)
      << '\n';
}

void execute(int, double)
{
   log(log_level::Error, "Error in execute!");
}
int main()
{
   log(log_level::Info, "Logging from main!");
   execute(0, 0);
}

我试了，fmt/timed_zone各种不支持，看个乐

视频

C++ Weekly - Ep 279 - Quick Perf Tip: Use The Right Iterator Comparison

手把手教你看汇编，找慢的原因

C++Now YT - Using Concepts: C++ Design in a Concept World (part 1 of 2) - Jeff Garland - CppNow 2021] 介绍concept
C++Now YT - The Worst Best Practices - Jason Turner - CppNow 2021] 讨论从前的一些实践的好坏。回忆环节

项目

https://github.com/foonathan/lexy 一个lexer库，比较活跃。boost.splirt确实比较难用。看到这个图笑出声
https://github.com/openppl-public 高洋他们一直做的hpc库开源了。图像处理hpc。
https://github.com/ligurio/unreliablefs 一个故障注入用的fuse 文件系统
https://github.com/robertmuth/Cwerg 一个轻量的编译器后端
https://github.com/vtereshkov/umka-lang 一个用c实现的类似go语法的嵌入式脚本语言
https://github.com/soasis/text 实现了text encoding
https://github.com/robn/yoctochat 一个用io-uring实现的聊天室
https://github.com/MarcDirven/cpp-lazy 一个lazy evaluation库

本文永久链接

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

020.md

020.md

C++ 中文周刊第20期

资讯

编译器信息最新动态推荐关注hellogcc公众号

文章

计算对齐

获得最接近且不小于当前数的二次幂

三个数获得最小值或最大值

整数常量除法优化

utf8的parse优化

空白字符的parse优化

获取一个整数10进制字符数量

视频

项目

Files

020.md

Latest commit

History

020.md

File metadata and controls

C++ 中文周刊 第20期

资讯

编译器信息最新动态推荐关注hellogcc公众号

文章

计算对齐

获得最接近且不小于当前数的二次幂

三个数获得最小值或最大值

整数常量除法优化

utf8的parse优化

空白字符的parse优化

获取一个整数10进制字符数量

视频

项目

C++ 中文周刊第20期