Monday, June 18, 2018

GCC 8: link time and interprocedural optimization

GCC 8.1 was released and Linux distros are starting to pick it up. This week it hit openSUSE Tumbleweed with a full rebuild (thanks to Martin Liška). Thus it is a good time to look back at what changed in my favourite areas of the compiler. Here I will focus on inter-procedural optimization and link-time optimization (LTO). I hope to find time to follow up with some of the x86 tuning changes, which were quite a lot of fun in the GCC 8 development cycle as well.

In 2015 I wrote about link time and inter-procedural optimization improvements in GCC 5 and in 2014 a three-part series on link-time optimization (brief history, building Firefox with LTO, and building Libreoffice). I have been quiet since then. The good news is that we did not stop working on these areas - I was just busy with coding and real life.

In this article I review the changes roughly in reverse chronological order (starting with GCC 8 and going back to GCC 6 and 5). I will write more (including benchmarks) later on; it took me too long to write this one anyway - I intended it for the GCC 8 release date ;)

Early debug info

Back in 2015 LTO was basically production ready. It was able to build real-world applications like Firefox and Libreoffice, do so in acceptable time and with acceptable memory use (this can always improve, of course), and produce a working binary which was noticeably smaller and faster. Andi Kleen also did a lot of work to get the Linux kernel working with LTO. See a recent article on this.

Just about the last missing feature needed to make LTO acceptable for official builds was debug information. While it was possible to build with LTO and -g and debug the resulting binary, the debug information looked like kind of messed-up C instead of debug info corresponding to the language the program was originally written in. This is finally solved, so we are also working towards building openSUSE with LTO by default. I will write about this project later.

The problem with debug info was the internal organization of the compiler, which kept a number of language details in a language-specific way and turned them into debug output (usually DWARF) at the end of compilation by means of front-end hooks. With link-time optimization one can mix multiple languages together, and the front-ends are not available at link time when the debug info is produced. Also, preserving all the high-level information potentially useful for debugging is costly and cost us quite a lot of disk space and link-time memory use.

For several years, the quality of debug output has been considered very important even for optimized code. Significant effort has been put into making optimization output agnostic of the -g setting (so you can rebuild a binary with debug info after your program core dumped and use it to debug the core dump) and into making debug info meaningful even in the presence of optimizations. The DWARF generator itself is about 30k lines of code and that is just part of the infrastructure. Variable tracking, used to generate debug info for optimized code (where the location a variable is stored in depends on the position in the program), is another 10k lines of code, and that is just part of the implementation.

After long discussions in 2013-2014, Aldy Hernandez and Richard Biener started to work on the necessary debug info changes. The main idea is to produce DWARF early during compilation, store it into the object files, and at link time just copy the necessary fragments into the final object files without the compiler having to parse and update it. Richard continued implementing the LTO parts in 2015 (here is the project's wiki and Richard's presentation at the GNU Tools Cauldron 2016). For GCC 8 this work was merged to mainline and debugging is supposed to work well.
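As a quick way to see the difference, one can build a trivial program with LTO and debug info and inspect it in GDB; with GCC 8 the debugger should present the original C-level view rather than the old LTO pseudo-C. This is just an illustrative sketch (file and function names are made up):

/* t.c - build with something like: gcc -O2 -flto -g t.c -o t
   Setting a breakpoint on 'compute' in GDB should now show the struct
   and parameter as written in the source rather than a compiler-
   generated approximation.  */
struct point
{
  int x, y;
};

static int __attribute__ ((noinline))
compute (struct point p)
{
  return p.x * p.y;
}

int
main (void)
{
  struct point p = { 6, 7 };
  return compute (p) != 42;
}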

(Just for comparison, LLVM, being designed from the ground up with this in mind, has a different design which formalizes more how the debug info is attached to the IL. For C/C++ this is done by basically attaching DWARF fragments to the IL by a set of intrinsic functions. Because these annotations are kept on the side, they seem to be expensive for large translation units.)

Remaining issues with early debug

The implementation scales very well (because the link-time optimizer does not need to care about debug information) and will allow subsequent cleanups of the LTO streaming which should result in faster link times and a smaller memory footprint (I just started working on this for GCC 9 last week).

While debugging with GDB seems to work well, there are problems with other consumers of DWARF. For example, libbacktrace, which is used to print backtraces for GCC crashes, will refuse to annotate backtraces with function names (see PR78063). I suppose it will take some time until all common DWARF consumers are fully updated. Volunteers are welcome!

Another problem is that currently only ELF targets are supported.

Profile representation rewrite

Another rather big infrastructural change is the rewrite of the way GCC maintains profile information. The profile is information about expected branch probabilities, basic block frequencies, call frequencies and other related data. It is used to drive optimizations that need to make tradeoffs - about code size versus speed (as in the inliner or loop unroller), or about speeding up one path at the cost of slowing down another (as in the spill code placement of the register allocator).

GCC supports three types of profile:
  1. the profile read via profile feedback (-fprofile-use). This is the most precise profile info, gathered by training an instrumented binary (built with -fprofile-generate); see the sketch after this list. Use of profile feedback results in very nice speedups and code size reductions, but it is rather difficult to integrate into build machinery. For this reason only a few packages are built with profile feedback by default.
  2. the statically guessed profile (-fguess-branch-probabilities). This profile info is based on "educated guesses" of the compiler and is used by default at all compilation levels.
  3. the profile collected by low-overhead profiling (auto-fdo). This is rather experimental code which makes GCC use perf profile data.
Each of these profiles has different precision and thus is handled by some optimizers slightly differently (for example, the statically guessed profile is almost useless for estimating the expected number of iterations of a loop, but it still works very well for spill placement).
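To make the first item concrete, here is a minimal sketch of the usual profile-feedback workflow (the file name and code are made up for illustration):

/* hot.c - the standard three-step build:
     gcc -O2 -fprofile-generate hot.c -o hot    (instrumented build)
     ./hot                                      (training run, writes a .gcda profile)
     gcc -O2 -fprofile-use hot.c -o hot         (rebuild using the measured profile)
   After training, GCC knows the error path below is essentially never
   taken and can lay out code and make inlining decisions accordingly.  */
#include <stdio.h>
#include <stdlib.h>

static int
process (int i)
{
  if (i < 0)                    /* training shows this branch is never taken */
    {
      fprintf (stderr, "negative input\n");
      abort ();
    }
  return i * 2;
}

int
main (void)
{
  long sum = 0;
  for (int i = 0; i < 1000000; i++)
    sum += process (i);
  printf ("%ld\n", sum);
  return 0;
}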

Since the initial implementation, the profile was represented by fixed-point edge probabilities, basic block frequencies (scaled into the range 0...10000) and by execution counts read from profile feedback. In addition, a per-function flag was set to let optimizers know what type of profile is available (if any).

This approach worked for well over a decade, but had several limitations:
  1. There was no way to tell whether a basic-block frequency is 0 because someone forgot to set it up, or because the block really is very unlikely to be executed.
  2. After a code transformation (such as loop vectorization) the profile needs to be updated. Because counts basically duplicate information present in probabilities and frequencies, the updating code had to be duplicated, and often small bugs crept in. Profile quality could not be automatically verified by internal checks of the compiler. This led to profile maintenance bugs which stayed in the compiler for a long time until they were noticed either by code audit or when they popped up as an important performance issue in some benchmark.
  3. The code was intended to get information about hot regions of programs, but it was not good at proving that some regions of a program are cold. Determining cold regions is becoming increasingly important to keep code size under control.
  4. Things became difficult when a function was inlined and caller/callee had different profiles (say, one had profile feedback and the other did not). This scenario recently became much more common because of LTO and also C++ COMDAT functions.
To solve these problems GCC now features new types profile_count and profile_probability which represent the profile in a more precise and robust way. They also propagate information about the quality of the profile.

A new static cold-region detection was implemented which is able to tell that, for example, code leading to abort or code reachable only by an exception is cold. On i386 we now enable function splitting by default even without profile feedback, and we can offload cold parts of functions into the cold section. I plan to proceed with enabling function splitting on other targets in GCC 9.

As a result, the profile is now maintained a lot more reliably and most importantly the hot/cold partitioning no longer makes frequent mistakes.

On i386 you can see the effect of this on code like:
void test (int a)
{
  if (a == 5)
    abort ();
}

This will now compile to:

        .text
        .globl  test
        .type   test, @function
test:
        subq    $8, %rsp
        cmpl    $5, %edi
        je      .L3
        addq    $8, %rsp
        ret

        .section        .text.unlikely
        .type   test.cold.0, @function
test.cold.0:
.L3:
        call    abort
     
So the call to abort is offloaded into the cold text section. How much of the code size can be offloaded this way depends on the application. Without profile feedback it is usually less than 5%, however for example for GCC itself with sanity checking enabled it is about 15%.

GCC also newly distinguishes between a function that is declared noreturn and a function that is declared with both the noreturn and cold attributes. The first fits, for example, the function exit, while the second fits abort, which always terminates an unlikely run of the program.
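In source code the distinction looks roughly as follows (a minimal sketch; the function names are hypothetical):

/* A clean-shutdown path is noreturn but still expected to run on a
   normal execution of the program.  */
extern void shutdown_and_exit (int status)
  __attribute__ ((noreturn));

/* A fatal-error path is both noreturn and cold, so every path leading
   to it is treated as unlikely.  */
extern void fatal_error (const char *message)
  __attribute__ ((noreturn, cold));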

There is a new -Wsuggest-attribute=cold warning which will point out functions exported from a compilation unit that can be proven to be cold. Annotating them in the headers allows GCC to detect more cold code in other translation units even when link-time optimization is not enabled.
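For example, a function like the following (a made-up example) ends by calling abort on every path, so GCC can prove it is cold; building it with -O2 -Wsuggest-attribute=cold may then suggest adding the cold attribute to its declaration:

#include <stdio.h>
#include <stdlib.h>

/* Every path through this function ends in abort, so GCC can prove the
   whole function cold; annotating its declaration in a header with
   __attribute__ ((cold)) lets other units benefit even without LTO.  */
void
die (const char *message)
{
  fprintf (stderr, "fatal: %s\n", message);
  abort ();
}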

malloc attribute discovery

While GCC has been auto-detecting a couple of function attributes for a while (such as noreturn, const or pure), it now also auto-detects the malloc attribute. This was implemented by Prathamesh Kulkarni and improves code generation because GCC knows that memory returned by a function with the malloc attribute is new and cannot be pointed to by other pointers.

Again, there is also -Wsuggest-attribute=malloc where GCC will output suggestions for functions that can be annotated in header files to improve cross-module optimization without LTO.

A common case where the discovery is useful is simple malloc wrappers. For example, in the following code GCC will now determine that main always returns 0.
#include <stdlib.h>
int var;
int *varptr = &var;

/* GCC detects that my_malloc can be given the malloc attribute: the
   pointer it returns is freshly allocated and cannot alias var.  */
__attribute__ ((noinline))
void *my_malloc (size_t size)
{
  void *ptr = malloc (size);
  if (!ptr)
    abort ();
  return ptr;
}

int
main(void)
{
  int *mem = (int *)my_malloc (sizeof (int));
  var = 0;
  *mem = 1;   /* cannot change var, so the return value folds to 0 */
  return var;
}

Inter-procedural value range and bitwise constant propagation

One of the bigger user-visible changes in GCC 7, implemented by Kugan Vivekanandarajah, is inter-procedural bitwise and value-range propagation. Probably the most interesting special case of this is the propagation of the information that a given pointer is non-NULL. For example, in the following code the test will be optimized away even if the function is not inlined:
#include <stdlib.h>

__attribute__ ((noinline))
static void test (int *ptr)
{
  if (!ptr)
    abort ();
}

void call(void)
{
  int a;
  test (&a);
}
This is a pretty common pattern in C++ programs, where virtual function calls implicitly generate a non-NULL test, while the values passed are often known to be non-NULL from the inter-procedural context.

New type merging during link-time optimization

While GCC 7 had relatively few visible new features in the LTO and inter-procedural optimization area (the focus was rather on improving compile-time performance and fixing bugs), GCC 6 saw multiple changes in the basic infrastructure.

One of the most time-consuming changes for me was the revamp of type merging, that is, deciding which types are interoperable between different units. This is a difficult area, because we support multiple languages and every language has different interoperability rules. Probably the craziest, if you consider one language alone, is the C standard, because its rules were retrofitted into a language that originally did not speak about cross-module interoperability. So, for example, one can interoperate unions with fields defined in a different order and other unexpected things.
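To illustrate the union rule just mentioned, the following two declarations live in separate translation units and are still compatible, because C does not require matching member order for unions declared in different units:

/* a.c */
union value { int i; float f; };

/* b.c */
union value { float f; int i; };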

However, the situation becomes more difficult when mixing different languages. All the other language standards describe their interoperability with C, and each handles the situation a bit differently. For this reason GCC throws away a lot of information, such as structure/field names which may differ when mixing C and Fortran programs, just to be standards compliant.

A new user-visible feature is a warning about type mismatches across units. For example, linking:
struct a {int a;} a={1};
with
extern int a;

int
main()
{
  return a;
}
is undefined by the C standard, because the type of a is a structure in one unit and an integer in the other. Without link-time optimization such code will usually work because both data types have the same size, but with LTO this is dangerous because accesses in the different units will not alias each other. Compiling with GCC 7+ and -flto you get the following warning:

b.c:1:12: warning: type of ‘a’ does not match original declaration [-Wlto-type-mismatch]
 extern int a;
            ^
a.c:1:19: note: type ‘struct a’ should match type ‘int’
 struct a {int a;} a={1};
                   ^
a.c:1:19: note: ‘a’ was previously declared here
a.c:1:19: note: code may be misoptimized unless -fno-strict-aliasing is used
While rebuilding the openSUSE distro with LTO we found quite a few occurrences of this bug in the real world, so hopefully it will become a useful tool, just like the C++ One Definition Rule violation warnings implemented for LTO a few releases back, which have in the meantime been used by Firefox, Libreoffice and others to fix quite interesting bugs.

An important effect of the type merging rewrite was an increase in the effectiveness of the type-based alias analysis. For the GCC build itself it is about a tenfold increase in the number of disambiguations.

I will show some performance numbers later, but improving the alias analysis at link-time helped to solve some performance regressions we saw compared to non-LTO builds especially on non-x86 targets.

Support for syntactic aliases (FORTIFY_SOURCE)

FORTIFY_SOURCE is a glibc security feature which is supposed to catch buffer overflows. The problem with it is that it is implemented by a somewhat crazy set of GCC extensions which annotate function declarations with warning and error attributes. These annotations are intended to be evaluated only after the code is optimized, to catch cases where a possible buffer overflow can happen.
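To give a rough idea of the mechanism (a simplified, made-up sketch, not the actual glibc implementation), the pattern looks like this; as with the real FORTIFY_SOURCE, it only works as intended with optimization enabled:

#include <stddef.h>

/* The call to this helper is diagnosed only if it survives optimization,
   so the check costs nothing when the compiler can prove the copy fits.  */
extern void __my_chk_fail (void)
  __attribute__ ((error ("buffer overflow detected at compile time")));

static inline void
copy4 (char *dest, const char *src)
{
  if (__builtin_object_size (dest, 0) != (size_t) -1
      && __builtin_object_size (dest, 0) < 4)
    __my_chk_fail ();                /* only reached on a provable overflow */
  __builtin_memcpy (dest, src, 4);
}

int
main (void)
{
  char buf[8];
  copy4 (buf, "abc");                /* fits: with -O2 the check folds away */
  return 0;
}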

Supporting this with LTO was a bit of a pain, because if some sources are fortified while others are not, the global declarations exist in multiple semantically different variants. The symbol table was built around the assumption that there is a 1-1 correspondence between symbols and declarations, and there was simply no place to preserve this kind of information. As a result we needed to perform rather involved surgery to make the symbol table support so-called syntactic aliases. This is also useful for handling some additional corner cases, but most importantly it makes programs which use fortification, like Firefox, compile correctly.

Incremental linking support

Incremental linking merges multiple object files into one object file. It is used, for instance, by the Linux kernel to improve the parallelism of the build (by incrementally linking parts of the kernel together so the final link has less work to do). With LTO incremental linking is a touchy issue - what is supposed to happen? Is the resulting object file supposed to contain IL for further link-time optimization, or final assembly?

Andi Kleen implemented basic incremental linking for LTO in 2015. This linking is however done by simply merging all intermediate language sections into one object file, and thus it does not lead to any performance benefits; the same can be done by ld -r. Until GCC 6, gcc -r produced random invalid code (because the symbol table in GCC did not understand what incremental linking is). Since GCC 6 it leads to a valid incremental link producing a non-LTO object file. Finally, the full IL merging was implemented for GCC 9. I hope it will be useful to speed up linking of bigger applications like Firefox or Libreoffice the same way as it speeds up the kernel on non-LTO builds.

... And many little improvements, bugfixes and cleanups

The list of changes I mentioned is definitely not complete. It contains all the "big" changes I can think of, but a lot of the daily work on improving the compiler is about small details which are too many to list. Please try GCC 8, report problems and help us make it better :)

5 comments:

  1. I've noticed LTO builds taking a lot longer to complete with gcc7 vs gcc5. This looks to be due to reduced parallelism. Has that situation improved for gcc8 and is there a way to get gcc7 to approach the performance of gcc5?

    Replies
    1. It is hard to say without extra information. I keep watching the build time of Firefox, which has been slowly improving since gcc5, but there may be regressions elsewhere. It may be interesting to look at -ftime-report for the link. It will tell how much time is spent by the serial WPA stage and which passes take the most.

    2. ... and there is at least one bug in the area I fixed recently which made GCC sometimes produce very many very small partitions which took ages to compile. This happened due to a silly overflow in the partitioning code. I have fixed it in the GCC 7 and 6 branches as well, so I think the last minor release of GCC 7 and possibly 6 already picked up the fix, too.

  2. Thanks for the fantastic write up, and the hard work making GCC better!

  3. Very cool writeup. Your optimization efforts are appreciated by thousands of developers around the world!
