Thursday, May 2, 2019

GCC 9: Link-time and inter-procedural optimization improvements

GCC 9 was branched last Friday (see a partial list of changes) and if all goes well, it will be released tomorrow. It is thus time for me to write an update on recent developments in my favourite areas: the inter-procedural analysis (IPA) and link-time optimization (IPO).

This continues the series I wrote on GCC 8, GCC 6GCC 5, and GCC 4.9 (part 1 and 2). (Sorry, no GCC 7 blog post at all. I was lazy.)

Big picture

GCC was originally optimizing at a function basis (with a simple inliner being the only inter-procedural optimization done by the compiler). Inter-procedural framework dates back to 2003 starting as a simple hack to solve compile time explosion caused by template heavy C++ code which started to appear at that time. Since this framework is fully transparent to the users, it was enabled by default for many years.

Link-time optimization branch was created in 2005 and merged into mainline in 2009. It took many more years to turn LTO from a "benchmark toy" to something that can build significant portion of programs we run today.

At a time of my first report, in 2014 GCC 4.9 was able to link-time optimize real world applications (I am testing Firefox quite regularly since 2010 and keey my eye on other programs including Libreoffice, Chromium and Linux kernel). Resulting binaries were better than non-LTO builds but still the widespread adoption of LTO was blocked by many "small" issues including debug info quality, linking times, memory use and various bugs showing up in more complicated projects. It took a while, but basically all those issues was addressed.

The main LTO related change of GCC 8 was the completion of early debug infrastructure. This fixed the last real showstopper and enabled number of cleanups in the implementation by properly separating the intermediate language used by GCC (Gimple) from the debug information (which is now turned into DWARF early and, in large part, copied through into final executable).

In GCC 9 we was finally able to fine-tune the infrastructure by dropping a lot of data previously used for debug-info generation from the LTO streams. We plan to make LTO more mainstream now.

Enabling LTO by default for OpenSUSE Tumbleweed

LTO in GCC was always intended to become default mode of compilation for release builds (rather than being limited to high performance computing and benchmarks). We believe that things are ready now (and basically since GCC 8 already).

At SUSE we work on turning LTO by default for OpenSUSE Tumbleweed. Martin Liška did a lot of work to get working staging build with LTO and plan to do the switch after GCC 9 lands Tumbleweed which should not be too long after the release. Initial tests looks promising - only about 150 packages needs LTO disabled so far.

The resulting Linux distro works and passes testing. So far we just gathered data on the size of individual binaries in the system.

The overall size of installed binaries decreases by 5% and debug info size by 17%. Note that these numbers include also packages not built with LTO (for example go that uses its own compiler) and packages that use LTO by default already (like LibreOffice). Thus the real savings for enabling LTO are higher than this.



Histogram of binary size changes comparing GCC 9 without LTO to GCC 9 with LTO.
To summarize:
  • 65% binaries (counted as portion of overall size, not number of files) decrease in size by more than 1%. Some of them quite substantially.
  • 12% of binaries does not change at all.  This may be due to fact that they are built without LTO, built by other compiler than GCC or there is a configure problem. I have removed the five biggest offenders before producing the histogram: go, postgress, libqt-declarative, boost-extra and skopeo.
  • 8% of binaries increase in size by more than 1%. Some of them because of aggressive optimization flags, but there are cases which seems like build machinery issues. For example all binaries in python-qt packge increase 27 times!
To avoid confusion, I should stress that prior every major release of GCC Tumbleweed is rebuilt and open-QA tested with the new compiler. Same testing is also done by RedHat and others. This has very important effect on the quality of the compiler - building 10k packages is more of testing than bootstrapping compiler itself and running the testsuite. This time we (well, mainly Martin Liška) did it, for the first time, also with LTO.

Similar experiments are also done by Gentoo. Mandriva also switched from GCC non-LTO to LLVM LTO for their builds, but there seems to be no quality data available. It would be very interesting to get some comparison done.

What is new in GCC 9?

    Changes motivated by Firefox benchmarking

    Around Christmas I've spent quite a lot of time testing Firefox with GCC and comparing it to Clang builds using the official Firefox benchmark servers. My findings are discussed in this post.

    Overall GCC 8 seemed to compare well to GCC 6, GCC 7, Clang 6,7 and 8 builds. Most importantly leading to significantly smaller and measurably faster binaries than Clang 7 when profile feedback and LTO is used. This became default for some distributions including RedHat. Funnily enough, SUSE still does not build with profile feedback but LTO was enabled. What needs to be resolved is to get Firefox train run working in the containers used by SUSE's build service.

    Many interesting issues was uncovered during this project including:
    1. GCC inliner settings are too restrictive for large C++ code-bases with LTO. Increasing the limits leads to signficant runtime improvements
    2. GCC's -O2 does not auto-inline that seems to be important limiting factor for modern C++ code.
      For example Firefox's has function isHTMLWhitespace which is called very many times during parsing. It is not inline and GCC built binaries are thus slower unless -O3 or profile feedback is used. See PR88702.
    3. Firefox is using Skia for rendering some of more fancy primitives. Skia contains SSE and AVX optimized rendering routines which are written using Clang only vector extensions. It seems it is not hard to get them ported to GNU vector extensions (supported by both compilers) and some work in this direction was already done (for example, Jakub Jelínek added support for builtins for vector conversions). So far i did not however have chance to work on this more.
    4. Some bugs in GCC was fixed. Most importantly a nasty wrong code issue with devirtualizing a calls in thunks.
    All correctness issues made it into GCC 9 and was also backported to GCC 8 and GCC 7. I also squeezed in some late changes to inliner to improve code quality. However -O2 and bigger inliner retuning will be only done for GCC 10.

    Results of this retuning can be seen here (takes while to load). Probably most notable improvement is 13% speedup of tp5o responsiveness benchmark. This benchmark tests the response time while rendering 50 most common pages from 2011. GCC built Firefox binary is also significantly smaller as I show below.

    LTO streaming cleanups and scalability improvements

    Separating debug info from the intermediate language permits a lot of cleanups. We went through all streamed data and verified it is still relevant. The changes are too numerous to list all, but let me describe the most important changes.

    Type simplification

    Probably most surprising improvement was accomplished by simplifying types prior streaming. GCC merges identical types at link-time, but it turned out that there was many duplicated caused by merging, for example:
    struct a {int a;};
    struct b {struct a *a;};
    foo (struct b) { }
    
    with
    struct a;
    struct b {struct a *a;};
    extern foo (struct b);
    bar () {foo (NULL);}
    
    Here in the first compile unit the struct b contains pointer to struct a which is complete. In the second unit the representation of struct b has pointer to incomplete tyep struct a. This prevents the two types to be merged.

    While this seems like a minor problem, it is not. Large projects contains complicated data structure and the bigger the structure is, more pointer it contains, and thus the higher is the chance that the representations will end up being different. For example, in GCC, the core structure gimple, representing our intermediate language. It ended up duplicated 320 times. As a consequence all declarations, constants and other objects having type gimple ended up being duplicated, too. This chain reaction turned out to be surprisingly expensive.

    To solve this problem, GCC free lang data pass was extended to produce incomplete variant of every pointed-to type and replace pointers to complete structures by pointers to incomplete structures.

    Improved scalability of partitioning

    Simplifying types enabled further improvements. GCC link-time optimization supports parallel (and distributed) link-time optimization (with -flto=n where n is number of threads or with -flto=jobserv). This is implemented by:
    1. reading all global information (types, declarations, symbol table and optimizations summaries) into the link-time optimizer (known as WPA, or whole program analysis, stage) and performing inter-procedural optimization (unreachable code removal, inlining, identical code folding, devirtualization, constant propagation, cloning and more).

      Once inter-procedural optimization is done program is partitioned into a fixed number of partitions and streamed into temporary object files.
    2. Optimizing every partition independently applying the global decisions done at the WPA stage (this is known as ltrans, or local transformation, stage)
    Clearly the serial stage is a main concern when it comes to scalability to higher number of cores. However until GCC 9 we also had important problems with increasing number of partitions. This number needs to be fixed across build hosts (so it can not depend on actual parallelism available on the given machine) and is controlled by --param lto-partitions. I have set this parameter to 32 in 2010. My testing machine had 16 threads and I though we should set the bar bit higher. While in 2010 you hardly had CPU with more threads than that, this is definitely not true in 2019.

    We managed to get two digit improvements on the streaming performance each major release for several years, but it was still impossible to increase default number of partitions by default without significantly penalizing build times on hosts with few cores.
    Fortunately it has turned out that the type simplification and few other cleanups almost completely solved this for GCC 9.

     I have omitted sizes for gcc 8 with larger number of partitions to avoid re-scaling graph and keeping gcc 9 data visible.

    As can be seen on the chart, increasing number of partitions from 32 to 128 in GCC 8 almost doubled the amount of streamed data. With GCC 9 it grows only by 18% and even on my 16 thread testing machine I now get speedups for increasing the default number of partitions past 16 because of improved memory locality during the ltrans compilation.

    I have decided to set new default to 128 but I do not think it would be problem to go higher. In particular it should be manageable to keep partitions corresponding to the original object files and cache the files between LTO builds to improve compile/edit cycle.

    Faster WPA stage

    As an effect the mentioned (and more) changes the WPA (whole program analysis) stage got significantly faster and more memory efficient. Comparing GCC 8 and 9 linking Firefox libxul.so with profile feedback I get:
    1. Overall WPA time reduced from 128 to 103 seconds (24% improvement) .
    2. Time needed to stream in types and declaration reduced from 69 to 59 seconds (16% improvement) and uses 15% less memory. This is mostly because of the described streaming improvements enabled by early debug work in GCC 8.
    3. Inliner heuristics time reduced from 24.4 seconds to 20.4 secons (17% improvement). This is mostly because of Martin Liška work on faster was to attach optimization summaries to symbol table.
    4. Time needed to stream out partitions reduced from 11 seconds to 7 (36% improvement). Despite the fact that now 128 partitions rather than 32 are produced. The main win here is due caused by type simplification.
    WPA stage still remains a bottleneck. Fortunately there is still relatively low hanging fruit here. In particular we plan to work on speeding up the identical code folding pass (that take about 15% of WPAtime), inliner heuristics, parallelizing the stream in stage and improving stream out stage by using threads rather than fork. Hopefully GCC 10 will see significant improvements here again.

    Benchmarks

    LTO link-times and memory use

    This is a brief sumary of Firefox build times with link-time optimization and profile guided optimization (PGO) on my somewhat aged testing machine (an 8 core Buldozer from 2014). The first thing to worry about when switching to LTO is the linking time and link-time memory use. There is noticeable improvements from GCC 8 to GCC 9:



    While compile times look very good compared to Clang (and in general I think the compile times of both compilers are now largely comparable), GCC still consumes more memory. Partly this is due to more ambitious implementation of LTO which does more of whole program analysis compared to Clang's thin LTO.

    Here is a graph showing memory and CPU use during the link. It clearly separates the WPA serial stage from parallel ltrans. Note that Clang thinLTO chooses to use 8 threads rahter than 16. Increasing parallelism does not help build times because overall user time grows as well.
    GCC 8 CPU and memory LTO optimizing Firefox libxul
    GCC 9 shows memory use improvements by about 30%.
    I am not at all happy about the peak at the end of serial stage which bumps overall memory use from 5GB to 7.5GB. This is caused by the streaming parallelism and can be controlled by --param lto-streaming-parallelism. This pass did not used to be problem since memory use was dominated by local transformation. For GCC 10 I will work on eliminating this issue and perhaps backport it to GCC 9.2. In a bigger issue it is not a disaster, because Firefox build consumes up to 10GB elsewhere. For this reason I decided to not rush with a solution for GCC 9.1.
    Clang 8 with ThinLTO does fewer whole program decisions saving the serial stage of build.
    Overall I am very curious how the compile time and memory use of the two different LTO implementation in GCC and LLVM will continue evolving. It would make sense to implement thin-LTO layer atop of our WHOPR LTO compilation model. Until now I however think the development effort is better spend in optimizing WPA stage still.

    Overall build times and code quality

    I have done some performance comparison of various builds of Firefox at Christmas, see my previous blog post on it.

    In summary GCC built binaries performs better than Clang 8 but one needs to work around problems with Skia having vector optimizer graphics rendering routines only in Clang builds. Difference between GCC 8 and GCC 9 is not huge, but still noticeable. Most importantly it turned out that GCC 8 inliner was tuned bit too low and in some benchmark the LTO binaries ended up being slower than non-LTO. GCC 9 increases limits and gets more consistent. This however also lead to some code size increases:



    GCC with LTO produces significantly smaller binaries than Clang. This is consistent across bigger applications I tested, not only limited to Firefox (note that in my testing I used default optimization flags used by Firefox, which is a combination of -O2 and -O3).  I think there are two main reasons:
    1. With LTO I believe the main reason is the fact that GCC does inlining decisions at whole program basis, while Clang's thinLTO decides locally. This makes it possible for GCC to do better code size/performance tradeoffs.
    2.  With profile feedback another factor is the fact that GCC optimizes cold regions for size. It seems that LLVM folks had worked on similar feature recently.
    As not so good news, non-PGO code segment increase by 7% between GCC 8 and GCC 9. Increasing code size, of course, also makes compiler bit slower. This is a result of the inliner re-tuning and was overall painful decision to make, but I think necessary. It makes to sense for LTO binaries to be considerably slower than non-LTO and across compilers I tested, GCC LTO implementation is most code size aware. These parameters was never seriously tuned except for SPEC benchmarks which are too small to show the issues.


    Testing was done on my 8 core 16 thread AMD Buldozer machine. Both GCC and Clang was built with LTO and profile feedback. As in the past years, the compile times are not very different. GCC tends to win without debug info and lose with debug info. Part of the reason is that GCC produces significantly more detailed debug info compared to clang and it thus takes longer to generate it and assemble together. GCC 8 produce 2.3GB of debug info sections, GCC 9 2.5GB and LLVM7 1.8GB. It is hard to qualitatively compare debug info, but this blog has some statistics.

    SPEC 2017 benchmarks

    In case you did not have enough numbers to look at, here is performance of SPEC2017 normalized to GCC 8 with -Ofast -march=native this time run on more current Zen based CPU. As more common with SPEC, bigger numbers are better.
    This is based on benchmarks collected by Martin Jambor. Full LTO was used for Clang.


    Overall enabling LTO improves integer part of SPEC2017 suite by about 4%, profile guided optimization by about 2% and together one gets 6.5% improvement. GCC 9 outperforms Clang 8 by 6-10%. We used AOCC 1.3 to compile exchange that is the only Fortran benchmark in the suite.

    For floating point part LTO brings about 3% improvements, same does PGO and together one can expect about 6% improvements. Some benchmarks are missing with Clang 8 because they does not build (mostly due to use of Fortran). Again we used AOCC 1.3 for bwaves, cam4 and roms.

    Once we set up the new Fortran frontends for LLVM, we may be able to fill in the blanks. I am aware that PGI compare GCC to Flang at their slides. I believe the comparison is flawed because -march=native was not used for GCC. Without this flag SSE is used for floating point code generation and this is significantly slower than AVX2. Also Flang comes with its own math library that used to be faster than glibc and provide vectorized math functions. Glibc was improved significantly recently and vector functions was added. For serious math performance thus glibc update is greatly recommended :)

    This is a summary how SPEC numbers developed in last 3 years giving little bit extra context. Again the exchange benchmark for Clang 8 was replaced by one built by AOCC.





    If you are not a compiler developer single digit SPEC differences may look small, but the benchmark suite is generally memory bound and changes reported here are important. Overall LTO (and even more so LTO+PGO) can be big win for complex programs with flat profiles (such as Perl, GCC in the SPEC testsuite). It is less important for projects where internal loops are rather small and can be hand-optimized and put into a single file (like, say xz or exchange). Sometimes compiler just gets lucky - for example the 50% speedup of Povray is mostly due to one fortunately partial inlining decision. This is something which could have been easily fixed at source level if the original developers noticed the problem by profiling.

    In longer term I believe switching to LTO enabled build environment should make the code bases easier to maintain. Letting compiler to do more work reduces the pressure to hand optimize code placement (such as putting more code to inline functions in headers), add bit of extra robustness (such as ODR violation warnings) and therefore let developers to focus on more important issues. At the moment we compare performance of LTO optimized programs which were for ages tuned for non-LTO compilers. I would expect the importance of LTO to grow not only because new optimizations will be added but also because code bases will be re-tuned for the new environment.


    I believe GCC 9 is a solid release. With LTO framework mature enough I hope we can finally see the performance and code size gains from enabling it by default for distribution builds. With bit of luck this should happen for OpenSUSE Tumbleweed really soon and hopefully others will follow. It took only 14 years to implement that :)

    For GCC 10 I plan to continue working on improving WPA times and also again look into implementing new fancy optimizations.