This continues the series I wrote on GCC 8, GCC 6, GCC 5, and GCC 4.9 (part 1 and 2). (Sorry, no GCC 7 blog post at all. I was lazy.)
Big picture
GCC originally optimized on a per-function basis, with a simple inliner being the only inter-procedural optimization done by the compiler. The inter-procedural optimization framework dates back to 2003, starting as a simple hack to solve the compile time explosion caused by the template heavy C++ code which started to appear at that time. Since this framework is fully transparent to users, it has been enabled by default for many years. The link-time optimization branch was created in 2005 and merged into mainline in 2009. It took many more years to turn LTO from a "benchmark toy" into something that can build a significant portion of the programs we run today.
At the time of my first report, in 2014, GCC 4.9 was able to link-time optimize real world applications (I have been testing Firefox quite regularly since 2010 and keep an eye on other programs including LibreOffice, Chromium and the Linux kernel). The resulting binaries were better than non-LTO builds, but widespread adoption of LTO was still blocked by many "small" issues including debug info quality, linking times, memory use and various bugs showing up in more complicated projects. It took a while, but basically all of those issues have been addressed.
The main LTO related change in GCC 8 was the completion of the early debug infrastructure. This fixed the last real showstopper and enabled a number of cleanups in the implementation by properly separating the intermediate language used by GCC (Gimple) from the debug information (which is now turned into DWARF early and, in large part, copied through into the final executable).
In GCC 9 we were finally able to fine-tune the infrastructure by dropping a lot of data previously used for debug-info generation from the LTO streams. We plan to make LTO more mainstream now.
Enabling LTO by default for OpenSUSE Tumbleweed
LTO in GCC was always intended to become the default mode of compilation for release builds (rather than being limited to high performance computing and benchmarks). We believe that things are ready now (and basically have been since GCC 8 already). At SUSE we are working on turning LTO on by default for OpenSUSE Tumbleweed. Martin Liška did a lot of work to get a working staging build with LTO, and we plan to do the switch after GCC 9 lands in Tumbleweed, which should not be too long after the release. Initial tests look promising - only about 150 packages need LTO disabled so far.
The resulting Linux distro works and passes testing. So far we just gathered data on the size of individual binaries in the system.
The overall size of installed binaries decreases by 5% and debug info size by 17%. Note that these numbers also include packages not built with LTO (for example go, which uses its own compiler) and packages that already use LTO by default (like LibreOffice). Thus the real savings from enabling LTO are higher than this.
Histogram of binary size changes comparing GCC 9 without LTO to GCC 9 with LTO.
- 65% of binaries (counted as a portion of the overall size, not the number of files) decrease in size by more than 1%, some of them quite substantially.
- 12% of binaries do not change at all. This may be because they are built without LTO, built by a compiler other than GCC, or there is a configure problem. I removed the five biggest offenders before producing the histogram: go, postgresql, libqt-declarative, boost-extra and skopeo.
- 8% of binaries increase in size by more than 1%, some of them because of aggressive optimization flags, but there are cases which look like build machinery issues. For example, all binaries in the python-qt package increase 27 times!
Similar experiments have also been done by Gentoo. Mandriva also switched from non-LTO GCC to LLVM LTO for their builds, but there seems to be no quality data available. It would be very interesting to get some comparison done.
What is new in GCC 9?
Changes motivated by Firefox benchmarking
Around Christmas I spent quite a lot of time testing Firefox built with GCC and comparing it to Clang builds using the official Firefox benchmark servers. My findings are discussed in this post. Overall GCC 8 seemed to compare well to GCC 6, GCC 7, Clang 6, 7 and 8 builds, most importantly leading to significantly smaller and measurably faster binaries than Clang 7 when profile feedback and LTO are used. This became the default for some distributions including Red Hat. Funnily enough, SUSE still does not build with profile feedback, though LTO was enabled. What remains to be resolved is getting the Firefox train run working in the containers used by SUSE's build service.
Many interesting issues were uncovered during this project, including:
- GCC inliner settings are too restrictive for large C++ code-bases with LTO. Increasing the limits leads to significant runtime improvements.
- GCC's -O2 does not auto-inline, which seems to be an important limiting factor for modern C++ code. For example, Firefox has a function IsHTMLWhitespace which is called very many times during parsing. It is not declared inline and GCC built binaries are thus slower unless -O3 or profile feedback is used. See PR88702.
- Firefox uses Skia for rendering some of the more fancy primitives. Skia contains SSE and AVX optimized rendering routines which are written using Clang-only vector extensions. It seems it is not hard to get them ported to GNU vector extensions (supported by both compilers; a minimal sketch of the syntax follows after this list), and some work in this direction was already done (for example, Jakub Jelínek added support for builtins for vector conversions). So far I have not had the chance to work on this more, however.
- Some bugs in GCC were fixed, most importantly a nasty wrong code issue with devirtualization of calls in thunks.
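To show what porting to GNU vector extensions means in practice, here is a minimal sketch of the syntax both GCC and Clang accept (an illustration only; the names and the routine are made up, this is not Skia code):

typedef float f32x4 __attribute__ ((vector_size (16)));  /* four packed floats */

f32x4
scale_add (f32x4 a, f32x4 b, float s)
{
  f32x4 vs = { s, s, s, s };   /* broadcast the scalar into a vector */
  return a + b * vs;           /* element-wise multiply and add */
}

Both compilers map such arithmetic onto SSE or AVX instructions depending on -march, so a single source file can replace the Clang-specific variants.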
The results of this retuning can be seen here (it takes a while to load). Probably the most notable improvement is a 13% speedup of the tp5o responsiveness benchmark. This benchmark tests response times while rendering the 50 most common pages from 2011. The GCC built Firefox binary is also significantly smaller, as I show below.
LTO streaming cleanups and scalability improvements
Separating debug info from the intermediate language permits a lot of cleanups. We went through all the streamed data and verified that it is still relevant. The changes are too numerous to list them all, but let me describe the most important ones.
Type simplification
Probably the most surprising improvement was accomplished by simplifying types prior to streaming. GCC merges identical types at link time, but it turned out that there were many duplicates the merging could not remove. Consider, for example:

struct a {int a;};
struct b {struct a *a;};
foo (struct b) { }

together with:

struct a;
struct b {struct a *a;};
extern foo (struct b);
bar () {foo (NULL);}

Here in the first compilation unit struct b contains a pointer to struct a which is complete, while in the second unit the representation of struct b has a pointer to an incomplete struct a. This prevents the two types from being merged.
While this seems like a minor problem, it is not. Large projects contain complicated data structures; the bigger a structure is, the more pointers it contains, and thus the higher the chance that the representations will end up being different. For example, in GCC the core structure representing our intermediate language, gimple, ended up being duplicated 320 times. As a consequence all declarations, constants and other objects having type gimple ended up being duplicated, too. This chain reaction turned out to be surprisingly expensive.
To solve this problem, GCC's free_lang_data pass was extended to produce an incomplete variant of every pointed-to type and to replace pointers to complete structures by pointers to the incomplete variants.
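Roughly speaking (this is just a hand-written illustration of the effect, not actual pass output), after this change both units stream the same shape of struct b, as if the field had always pointed to a forward-declared struct a:

struct a;                    /* only the incomplete variant is referenced by the pointer */
struct b { struct a *a; };   /* now identical in both units, so it merges at link time */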
Improved scalability of partitioning
Simplifying types enabled further improvements. GCC link-time optimization supports parallel (and distributed) link-time optimization (with -flto=n where n is the number of threads, or with -flto=jobserver). This is implemented by:
- reading all global information (types, declarations, the symbol table and optimization summaries) into the link-time optimizer (known as the WPA, or whole program analysis, stage) and performing inter-procedural optimization (unreachable code removal, inlining, identical code folding, devirtualization, constant propagation, cloning and more); once inter-procedural optimization is done, the program is partitioned into a fixed number of partitions and streamed into temporary object files, and
- optimizing every partition independently, applying the global decisions made at the WPA stage (this is known as the ltrans, or local transformation, stage).
We managed to get two digit improvements in streaming performance with each major release for several years, but it was still impossible to increase the default number of partitions without significantly penalizing build times on hosts with few cores.
Fortunately, it turned out that the type simplification and a few other cleanups almost completely solved this problem for GCC 9.
I have omitted the sizes for GCC 8 with larger numbers of partitions to avoid re-scaling the graph and to keep the GCC 9 data visible.
As can be seen in the chart, increasing the number of partitions from 32 to 128 in GCC 8 almost doubled the amount of streamed data. With GCC 9 it grows only by 18%, and even on my 16 thread testing machine I now get speedups from increasing the number of partitions past 16 because of improved memory locality during the ltrans compilation.
I have decided to set the new default to 128, but I do not think it would be a problem to go higher. In particular, it should be manageable to keep partitions corresponding to the original object files and cache the files between LTO builds to improve the compile/edit cycle.
Faster WPA stage
As an effect of the mentioned (and other) changes, the WPA (whole program analysis) stage got significantly faster and more memory efficient. Comparing GCC 8 and GCC 9 linking Firefox's libxul.so with profile feedback, I get:
- Overall WPA time reduced from 128 to 103 seconds (24% improvement).
- Time needed to stream in types and declarations reduced from 69 to 59 seconds (16% improvement), using 15% less memory. This is mostly because of the described streaming improvements enabled by the early debug work in GCC 8.
- Inliner heuristics time reduced from 24.4 seconds to 20.4 seconds (17% improvement). This is mostly because of Martin Liška's work on a faster way to attach optimization summaries to the symbol table.
- Time needed to stream out partitions reduced from 11 seconds to 7 (36% improvement), despite the fact that 128 partitions rather than 32 are now produced. The main win here is caused by the type simplification.
Benchmarks
LTO link-times and memory use
This is a brief summary of Firefox build times with link-time optimization and profile guided optimization (PGO) on my somewhat aged testing machine (an 8 core Bulldozer from 2014). The first thing to worry about when switching to LTO is the linking time and link-time memory use. There are noticeable improvements from GCC 8 to GCC 9. While compile times look very good compared to Clang (and in general I think the compile times of both compilers are now largely comparable), GCC still consumes more memory. Partly this is due to the more ambitious implementation of LTO, which does more whole program analysis compared to Clang's ThinLTO.
Here is a graph showing memory and CPU use during the link. It clearly separates the serial WPA stage from the parallel ltrans stage. Note that Clang's ThinLTO chooses to use 8 threads rather than 16; increasing parallelism does not help its build times because overall user time grows as well.
CPU and memory use of GCC 8 LTO optimizing Firefox libxul.
GCC 9 shows memory use improvements of about 30%.
Clang 8 with ThinLTO makes fewer whole program decisions, saving time in the serial stage of the build.
Overall build times and code quality
I did some performance comparisons of various builds of Firefox at Christmas; see my previous blog post on it. In summary, GCC built binaries perform better than Clang 8 built ones, but one needs to work around the problem that Skia has vector-optimized graphics rendering routines only in Clang builds. The difference between GCC 8 and GCC 9 is not huge, but still noticeable. Most importantly, it turned out that the GCC 8 inliner was tuned a bit too low, and in some benchmarks the LTO binaries ended up being slower than non-LTO ones. GCC 9 increases the limits and gets more consistent results. This however also leads to some code size increases:
GCC with LTO produces significantly smaller binaries than Clang. This is consistent across the bigger applications I tested and is not limited to Firefox (note that in my testing I used the default optimization flags used by Firefox, which are a combination of -O2 and -O3). I think there are two main reasons:
- With LTO I believe the main reason is the fact that GCC makes inlining decisions on a whole-program basis, while Clang's ThinLTO decides locally. This makes it possible for GCC to make better code size/performance tradeoffs.
- With profile feedback another factor is the fact that GCC optimizes cold regions for size (a small hand-written illustration follows below). It seems that the LLVM folks have worked on a similar feature recently.
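To illustrate the cold-region point, here is a minimal sketch (my own example, not Firefox code): without profile feedback one can mark cold paths by hand, while with -fprofile-use GCC detects them automatically and optimizes them for size.

#include <stdio.h>

__attribute__ ((cold)) static void
report_error (const char *msg)
{
  /* Marked cold: optimized for size and placed into the unlikely-executed text section.  */
  fprintf (stderr, "error: %s\n", msg);
}

int
parse (const char *p)
{
  if (__builtin_expect (p == NULL, 0))   /* hint that this branch is rarely taken */
    {
      report_error ("null input");
      return -1;
    }
  /* hot path ... */
  return 0;
}

With profile feedback no annotations are needed; functions and regions that the training run shows as cold get the size optimization automatically.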
Testing was done on my 8 core, 16 thread AMD Bulldozer machine. Both GCC and Clang were built with LTO and profile feedback. As in past years, the compile times are not very different. GCC tends to win without debug info and lose with debug info. Part of the reason is that GCC produces significantly more detailed debug info than Clang, and it thus takes longer to generate and assemble it. GCC 8 produces 2.3GB of debug info sections, GCC 9 2.5GB and LLVM 7 1.8GB. It is hard to compare debug info quality, but this blog has some statistics.
SPEC 2017 benchmarks
In case you did not have enough numbers to look at, here is the performance of SPEC 2017 normalized to GCC 8 with -Ofast -march=native, this time run on a more current Zen based CPU. As is usual with SPEC, bigger numbers are better. This is based on benchmarks collected by Martin Jambor. Full LTO was used for Clang.
Overall, enabling LTO improves the integer part of the SPEC 2017 suite by about 4% and profile guided optimization by about 2%; together one gets a 6.5% improvement. GCC 9 outperforms Clang 8 by 6-10%. We used AOCC 1.3 to compile exchange, which is the only Fortran benchmark in the suite.
For the floating point part, LTO brings about a 3% improvement, as does PGO, and together one can expect about a 6% improvement. Some benchmarks are missing for Clang 8 because they do not build (mostly due to the use of Fortran). Again we used AOCC 1.3 for bwaves, cam4 and roms.
Once the new Fortran frontends for LLVM are set up, we may be able to fill in the blanks. I am aware that PGI compares GCC to Flang in their slides. I believe the comparison is flawed because -march=native was not used for GCC. Without this flag SSE is used for floating point code generation, which is significantly slower than AVX2. Also, Flang comes with its own math library that used to be faster than glibc and provides vectorized math functions. Glibc was improved significantly recently and vector functions were added. For serious math performance a glibc update is thus greatly recommended :)
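As an illustration of what the vectorized math functions buy you (my own minimal example; the exact flags and function names depend on the glibc version and target), on x86_64 with a recent glibc a loop like this built with -Ofast -march=native lets GCC vectorize the sin calls into glibc's SIMD variants:

#include <math.h>

void
apply_sin (double *restrict out, const double *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = sin (in[i]);   /* vectorized into a libmvec call such as _ZGVdN4v_sin with AVX2 */
}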
This is a summary of how the SPEC numbers developed over the last 3 years, giving a little bit of extra context. Again, the exchange benchmark for Clang 8 was replaced by one built by AOCC.
If you are not a compiler developer, single digit SPEC differences may look small, but the benchmark suite is generally memory bound and the changes reported here are important. Overall, LTO (and even more so LTO+PGO) can be a big win for complex programs with flat profiles (such as Perl or GCC in the SPEC testsuite). It is less important for projects where the internal loops are rather small and can be hand-optimized and put into a single file (like, say, xz or exchange). Sometimes the compiler just gets lucky - for example the 50% speedup of Povray is mostly due to one fortunate partial inlining decision. This is something which could have been easily fixed at the source level had the original developers noticed the problem while profiling.
In the longer term I believe switching to an LTO enabled build environment should make code bases easier to maintain. Letting the compiler do more work reduces the pressure to hand optimize code placement (such as putting more code into inline functions in headers), adds a bit of extra robustness (such as ODR violation warnings) and therefore lets developers focus on more important issues. At the moment we compare the performance of LTO optimized programs which were for ages tuned for non-LTO compilers. I would expect the importance of LTO to grow not only because new optimizations will be added but also because code bases will be re-tuned for the new environment.
I believe GCC 9 is a solid release. With the LTO framework mature enough, I hope we can finally see the performance and code size gains from enabling it by default for distribution builds. With a bit of luck this should happen for OpenSUSE Tumbleweed really soon, and hopefully others will follow. It took only 14 years to implement that :)
For GCC 10 I plan to continue working on improving WPA times and also again look into implementing new fancy optimizations.
Isn't there some function inlining enabled in gcc with -Os? Would that be enough for -O2 to handle the worst issues with not inlining?
In any case it is Firefox's fault for not declaring the small function inline ;)
With -Os GCC will inline when it thinks that code will shrink. Some guesswork is needed here because you need to predict how well code will be optimized.
With -O2 we inline all functions that would be inlined with -Os and also functions declared inline, up to a quite generous limit.
For example the function IsHTMLWhitespace which slows down Firefox (and yes, I think this one should be inline) is implemented as:
int IsHTMLWhitespace(int aChar) {
return aChar == 0x0009 || aChar == 0x000A ||
aChar == 0x000C || aChar == 0x000D ||
aChar == 0x0020;
}
Here GCC thinks that inlining the function will expand the code.
It seems to be a lost battle to make C++ code use the inline keyword wisely. A lot of code gets an implicit inline keyword because it appears in a header file, and a lot of other code omits it.
So trusting the inline keyword less and making the compiler auto-inline to some extent is probably a good way to go. It clearly needs a lot of benchmarking.
I really thought inline is not really useful for optimization purposes anyway cause it gets ignored (or merely increases the inlining threshold slightly in clang's case).
Also, adding to that list of yours, some people don't understand inline at all and blindly use it, even on huge functions.
One issue I found with LTO is that assembly annotations of functions like symbol versioning and alias names (using .symver and .global) seem to get lost (dropped because unused?), causing linking to fail. Is there any solution or workaround for this?
Can you build glibc for example with LTO enabled?
Toplevel asm statements are an issue because the compiler has no way of knowing what exactly they do. There is a PR for adding a symbol versioning attribute, which I plan to implement for GCC 10. For Tumbleweed we simply disable LTO for those packages, which indeed sucks.
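For reference, the pattern in question looks roughly like this (an illustrative sketch, not taken from any particular package); since the asm is opaque to the compiler, LTO's reachability analysis may decide the symbol it mentions is unused:

int foo_v1 (void) { return 1; }
__asm__ (".symver foo_v1, foo@@VERS_1.0");   /* invisible to the compiler's reachability analysis */

Until the attribute exists, the workaround is the one described above: compile the affected files or packages without LTO (for example with -fno-lto).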
With glibc the additional issue is that it is a runtime library, and to support LTO on it one needs additional magic because the compiler may decide to output runtime library calls (such as memcpy/memset) later in the optimization pipeline or to optimize one call (such as printf) into another (such as puts). If those special functions are part of the LTO translation units, this gets in the way of unreachable code removal and other optimizations. So at the moment glibc is not on the menu for LTO. However, it is also unlikely to benefit much from LTO since its nature is to be quite thoroughly hand optimized.
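To illustrate the kind of late transformation meant here (my own minimal example, not glibc code), GCC may introduce or rewrite library calls that never appear in the source:

#include <stdio.h>

struct big { char data[256]; };

void
copy (struct big *dst, const struct big *src)
{
  *dst = *src;           /* may be emitted as a call to memcpy */
}

int
main (void)
{
  printf ("hello\n");    /* commonly optimized into puts ("hello") */
  return 0;
}

If memcpy and puts themselves were translation units of the same LTO build, such late-introduced references would conflict with unreachable code removal, which is exactly the problem described above.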
Is there any way you could embed the graphs as images (SVG?)? My work blocks Google Docs.