Using Compiler and ELF to Avoid Ifdef and Flags Hell for Selecting Runtime Features

I was digging into the redis source recently and came to realise that some lesser-known features
of gcc/clang need more publicity, so I will write about them here.

The problem: you want to get the best out of the CPU/hardware features, but the usual ways to do this
are either to build for specific hardware or to use runtime checks, and both have drawbacks.
The first needs a new build for every new hardware and results in code like this:


#ifdef ARCH_OR_CPU_HAVE_FEATURE
do_some_cpu_optimised_function();
#else
do_some_function();
#endif

The second adds a branch on every call, which is a bad thing in a hot path on a modern superscalar CPU:

if (cpu_have_feature(...))
	do_cpu_optimised_function();
else
	do_regular_function();

Neither looks good, and this is the oversimplified variant.

There are semi-cures for this, like Linux kernel static keys
and an untested, just-for-reference user space static key implementation.
The idea of static keys is to patch the code so that the check becomes a no-op or an unconditional jump, and the pipeline is not disturbed.
One BIG advantage of static keys is that they can be changed at runtime.

ELF to the rescue.

__attribute__((ifunc("resolver"))) is available in both gcc and clang.

It marks a function whose address should be resolved by calling a resolver function. The resolver runs once, AT DYNAMIC LINK TIME, when the loader processes the relocation for that symbol, not on every call.
This requires an ELF target with reasonably recent binutils and compilers. Some arches come with quirks, so be careful if you are not using
mainstream x86_64/Linux.
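
As a minimal sketch of the mechanism, assuming Linux with glibc (the names add_generic, add_tuned, resolve_add and add are made up for illustration and do not come from redis):

```c
/* Two interchangeable implementations; in real code the second one would
 * be built for a specific CPU feature. */
static int add_generic(int a, int b) { return a + b; }
static int add_tuned(int a, int b)   { return a + b; }

/* The resolver returns a pointer to the implementation to bind to.
 * It runs very early, so keep it simple and avoid calling into libc. */
static int (*resolve_add(void))(int, int)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();   /* needed before __builtin_cpu_supports in a resolver */
    if (__builtin_cpu_supports("sse4.2"))
        return add_tuned;
#endif
    return add_generic;
}

/* Callers see a normal function; the symbol is bound via the resolver. */
int add(int a, int b) __attribute__((ifunc("resolve_add")));
```

Callers simply write add(2, 3); there is no pointer variable to check or initialise in the calling code.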

What does a resolver function do? It finds the correct function to use for the given runtime environment
(see the example below) and returns its address, which is used for all later calls. This also saves using a function pointer, but beware of the quirks. The clang documentation says:
“Windows target supports it on AArch64, but with different semantics: the ifunc is replaced with a global function pointer, and the call is replaced with an indirect call. The function pointer is initialized by a constructor that calls the resolver.”
So the gain on Windows is not complete.
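
For comparison, the function pointer plus constructor scheme described in that quote can be written by hand roughly like this. It is only a sketch of the idea, not what any compiler actually emits, and all names (count_bits and friends) are hypothetical:

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none are left. */
static uint32_t count_bits_generic(uint64_t v)
{
    uint32_t n = 0;
    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) || defined(__i386__)
/* Same result, but the compiler may emit a single popcnt here. */
__attribute__((target("popcnt")))
static uint32_t count_bits_popcnt(uint64_t v)
{
    return (uint32_t)__builtin_popcountll(v);
}
#endif

/* The global function pointer that a real ifunc makes unnecessary. */
static uint32_t (*count_bits_ptr)(uint64_t) = count_bits_generic;

/* Runs before main() and plays the role of the resolver. */
__attribute__((constructor))
static void count_bits_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        count_bits_ptr = count_bits_popcnt;
#endif
}

/* Every call loads the pointer and makes an indirect call. */
uint32_t count_bits(uint64_t v)
{
    return count_bits_ptr(v);
}
```

It works everywhere a constructor attribute works, but every call pays for the pointer load and the indirect call, which is the part of the gain that is lost.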

One more thing to note in the example below: __attribute__ ((target ("sse4.2")))
If you compile the example without -march=native or an equivalent flag that enables the popcnt instruction,
the compiler will not generate it, and such a flag is exactly the hardware-specific build we want to avoid.
By combining __attribute__ ((target ("sse4.2"))) (which enables the instruction generation for that one function)
and __attribute__((ifunc("resolver"))) we can get the best for our runtime environment
with minimal to no overhead.

The example exposes one drawback: gcc's __builtin_popcountll()
can not be used as a function pointer, so it can not be returned from the resolver function.
To solve this, a little code duplication is necessary, but it is acceptable given the gain.
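
To illustrate the limitation in isolation, here is a small sketch (popcount_fn, popcount_wrapper and pick_popcount are hypothetical names): the builtin can not be assigned to a pointer, but a one-line wrapper around it can.

```c
typedef int (*popcount_fn)(unsigned long long);

/* popcount_fn p = __builtin_popcountll;
 *   ^ rejected by the compiler: this builtin must be called directly */

/* A wrapper is an ordinary function, so it has an address. */
static int popcount_wrapper(unsigned long long v)
{
    return __builtin_popcountll(v);
}

/* Something a resolver-like function is allowed to return. */
static popcount_fn pick_popcount(void)
{
    return popcount_wrapper;
}
```

Dispatching at the level of a single popcount like this would turn every count into a call that can not be inlined into the loop, which is why the example duplicates the whole loop instead.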

I’ve left a comment here: https://github.com/redis/redis/pull/13962#issuecomment-3036620776

What about the approach below?

Using the compiler's ifunc(s), which are resolved at ELF (dynamic) link time: the gain is no function pointers and no extra branches at runtime. A minus in that case is that __builtin functions can not be used as pointers, so a little code duplication is required, but that would help with inlining popcount64 when the CPU does not support popcnt.

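#include <stdint.h>

/* popcount64() is assumed to be the software popcount from the surrounding redis source. */

/* Binary vectors distance. */
float vectors_distance_bin_native(const uint64_t *x, const uint64_t *y, uint32_t dim)
  __attribute__ ((target ("sse4.2"))); /* gcc/clang enable popcnt together with sse4.2, so this enables its generation */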

float vectors_distance_bin_native(const uint64_t *x, const uint64_t *y, uint32_t dim)
{
    uint32_t len = (dim+63)/64;
    uint32_t opposite = 0;
    for (uint32_t j = 0; j < len; j++) {
        uint64_t xor = x[j]^y[j];
        opposite += __builtin_popcountll(xor); /* becomes popcnt here; the builtin itself can not be returned from the resolver */
    }
    return (float)opposite*2/dim;
}

float vectors_distance_bin_own(const uint64_t *x, const uint64_t *y, uint32_t dim)
{
    uint32_t len = (dim+63)/64;
    uint32_t opposite = 0;
    for (uint32_t j = 0; j < len; j++) {
        uint64_t xor = x[j]^y[j];
        opposite += popcount64(xor);
    }
    return (float)opposite*2/dim;
}

typedef float (*vectors_distance_bin_t)(const uint64_t *x, const uint64_t *y, uint32_t dim);

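/* Runs once at load time, before constructors, so the CPU feature data must be initialised first. */
static vectors_distance_bin_t hnsw_vdb_resolver(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        return vectors_distance_bin_native;

    return vectors_distance_bin_own;
}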

float vectors_distance_bin(const uint64_t *x, const uint64_t *y, uint32_t dim) __attribute__((ifunc("hnsw_vdb_resolver")));
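
The snippet above relies on popcount64() being defined elsewhere. To try it outside of redis, a stand-in is needed; the following is one possible portable version, named popcount64_sw here to make clear that it is an assumption and not the redis implementation:

```c
#include <stdint.h>

/* Classic SWAR bit count: sum bits in pairs, then nibbles, then bytes,
 * and finally add all byte sums together with one multiplication. */
static inline uint32_t popcount64_sw(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (uint32_t)((x * 0x0101010101010101ULL) >> 56);
}
```

With such a fallback in place, two 64-bit vectors that differ in 8 bits give opposite = 8, so the distance formula yields 8*2/64 = 0.25 whichever implementation the resolver picked.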

References:
https://gcc.gnu.org/wiki/FunctionMultiVersioning
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html