Friday, March 18, 2016

Building libreoffice with GCC 6 and LTO


New GCC is just around the corner. For me, as a GCC developer, this means a period of bugfixing, benchmarking and fine-tuning. Two years ago I wrote about my experiment of building libreoffice with link time optimization (LTO). At that time LTO had just got into shape to build such large applications. Since then I have kept my eye on this and tried to make sure LTO keeps improving. I did not have time to publish any tests for GCC 5. Here is an update for the current trunk of GCC 6 compared to GCC 4.9.0, GCC 5.3, LLVM 3.5 and trunk.



Prerequisites

Building libreoffice with LTO has become relatively easy - Libreoffice's build machinery provides --enable-lto, which does all the necessary magic. Since two years ago it has been extended to handle parallelization and it also takes care of the linker plugin magic. One only needs to double-check that binutils is recent enough and built with plugin support (hopefully all main distros are doing that now). This can be verified by:
$ ld --help | grep plugin
  -plugin PLUGIN              Load named plugin
  -plugin-opt ARG             Send arg to last-loaded plugin
For clang it is necessary to use gold as the main linker and to follow the official instructions to build LLVM's gold plugin, which needs to be installed by hand into the search path of binutils.
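The manual installation step amounts to something like this (a sketch; /usr/lib/bfd-plugins is the conventional binutils plugin search directory, but check your distribution, and the LLVMgold.so path depends on where LLVM was built):

```shell
# Build LLVM with -DLLVM_BINUTILS_INCDIR pointing at binutils'
# plugin-api.h, then install the resulting plugin where ar, nm,
# ranlib and gold look for it.
sudo mkdir -p /usr/lib/bfd-plugins
sudo cp llvm-build/lib/LLVMgold.so /usr/lib/bfd-plugins/
```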

Setup

I use Debian 8.3 on an 8-core AMD box (AMD Opteron(TM) Processor 6272, 1.4GHz) with 64GB RAM. The hardware is the same as in my 2014 report; the OS was updated.

GCC is configured with:
configure --enable-languages=c,c++,fortran --enable-checking=release --disable-werror --disable-multilib --with-build-config=bootstrap-lto --disable-plugin
and built using make profiledbootstrap. This setup is used by some distros and delivers a somewhat smaller and faster binary than a normal non-LTO and non-FDO bootstrap.
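Put together, the bootstrap looks roughly like this (a sketch; the source and build directory paths are made up):

```shell
mkdir gcc-build && cd gcc-build
../gcc/configure --enable-languages=c,c++,fortran --enable-checking=release \
    --disable-werror --disable-multilib --with-build-config=bootstrap-lto \
    --disable-plugin
# profiledbootstrap builds an instrumented compiler, trains it by
# compiling GCC itself, and rebuilds with the collected profile (FDO).
make profiledbootstrap -j16 && make install
```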

LLVM 3.5 is taken from Debian's package. LLVM 3.8 is the Debian 8 binary at http://llvm.org/releases/download.html#3.8.0. LLVM trunk (rev 263221) is built with:
cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=no -DLLVM_BINUTILS_INCDIR=/aux/hubicka/binutils-install/include/
I rebuilt LLVM by itself after obtaining the first binary built by GCC. I did not find an official way to build LLVM with FDO. According to this post, building LLVM by GCC with FDO can actually improve LLVM's performance by 16% over my setup. The build I made follows the official release build instructions.

Libreoffice is from Mar 3 2016 and is configured (for all compilers) with:
autogen.sh --enable-mergelibs  --disable-symbols --without-doxygen --enable-python=no --disable-gstreamer-1-0 --enable-lto --disable-firebird-sdbc --with-system-cairo
Some features are disabled because they did not build for me (broken dependencies, syntax errors, etc.). Of course, non-LTO builds use --disable-lto.

Binutils

I used the Debian-provided GNU ld for all the builds except for LLVM's LTO binaries, which were built with gold (only then did I find out that GNU ld won't work for LLVM's LTO).

Compile time

I measured compile time both as wall time and as system+user time of running make -j16.
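Both numbers can be collected in one go along these lines (a sketch, assuming GNU time is installed as /usr/bin/time):

```shell
# %e is elapsed (wall) time; %U and %S are user and system CPU time,
# summed over all the parallel compiler processes.
/usr/bin/time -f "wall %es  user %Us  sys %Ss" make -j16
```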

These results were quite a surprise for me, so I re-ran everything to be sure they are correct. While GCC got about 6% faster since 2014, LLVM got about 24% slower. As a result, GCC now builds faster than LLVM.

Update: There is some discussion on compile times of clang+llvm with different versions of LLVM and GCC at http://lists.llvm.org/pipermail/llvm-dev/2016-March/096488.html.

GCC releases have traditionally been getting slower and slower. It is good to see that this trend has been reversed (at least for Libreoffice, but SUSE's CSiBE benchmark bot confirms a similar situation: while GCC 5 is slightly slower than GCC 4.9, the current trunk is faster than both).

Note that GCC 6 changes the default C++ language standard from GNU++98 to GNU++14. For many codebases this change means a much bigger overhead of the libstdc++ include files, which naturally leads to slower compile times (up to a twofold slowdown is seen when compiling xalancbmk; see the July change in https://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/483_xalancbmk-time_big.png). Libreoffice seems little affected, probably because it sets the standard manually for most of the files compiled.

Later during my testing I decided to build more binaries with -O3 -flto. Those are not included in the charts because some of the unit tests had to be disabled and thus the numbers are not directly comparable.


Wall times are especially interesting for LTO. While LLVM's LTO adds almost no extra user time, the serial linking of libmerged.so results in a large wall-time increase. GCC gets around this by parallelizing the linking step.

Note that Google folks work on parallel LTO for LLVM, too, which is inspired by their LIPO in GCC.

Compile time memory use

Non-LTO memory use

Memory usage for GCC 4.9, non-LTO. The peak of 11GB is quite large, but keep in mind that this happens only because of the large parallelism of my build. Libreoffice builds with make -j4 on my 8GB laptop without problems.
Memory usage for trunk shows a small improvement, reducing the peak from 11GB to 10GB.
LLVM 3.9 takes less memory, about 6GB at peak. This is about the same as LLVM 3.5, so I did not include the plot.

LTO memory use

Peak memory use with GCC 4.9 and LTO enabled. It is clearly visible how the linking of libmerged.so starts (at about 3100s) and how memory skyrockets to 18GB. Again, reducing parallelism would reduce the memory use, but it is definitely quite large. Two years ago I reported 12GB with the same compiler; Libreoffice has grown.
GCC 5.0 reduces the peak to about 12GB. Compile time is mostly unchanged, but the link times improve.
Trunk needs 10GB and the compilation stage got noticeably faster (one can see the libmerged.so peak appearing earlier). Peak memory use is now about the same as without LTO.
LLVM's LTO shows a clear split between the stage when the parallel compilation is done and the serial link, each consuming about half of the compilation process. Memory use is about 7GB, which is less than for GCC, but GCC would use less memory if parallelism was disabled or reduced.

Binary size

Binary sizes tell quite an interesting story. With GCC, LTO reduces binary size. The largest reduction is with GCC 5.3, the smallest with GCC 4.9. GCC 6.1 somewhat regressed. The reason is that GCC 5.3 enables identical code folding, which is surprisingly effective on Libreoffice. In GCC 6.1 the type-based alias analysis was improved and became more fine-grained. This unfortunately disabled some identical code folding, because the folding happens early and thus needs to match all the metadata that may be used by optimizers later. I hope to track this problem for the next release. Still, -flto reduces code size by 13%. It is also interesting to observe that the code size cost of -O3 -flto is minimal, and it enables GCC to do more optimizations.

LLVM produces a slightly smaller binary without LTO. The main reason is semantic interposition. GCC assumes that symbols in shared libraries may be overwritten at dynamic linking time, which represents an optimization barrier. LLVM does not assume that and happily propagates across calls to those interposable symbols. For this reason I added the -fsemantic-interposition flag to GCC 5.1. Semantic interposition is also one of the reasons for the difference in the "rest" section, which includes EH and unwind tables: with interposition expected, the compiler cannot propagate the fact that a given function is not going to throw just because it sees no throw in the body available at compile/link time.

There is a very nice code size reduction between LLVM 3.5 and 3.8 in the non-LTO build. LLVM's LTO increases code size, and the code segment is about 29% bigger than GCC's.

Dynamic linking


Dynamic linking used to be one of the major issues of large C++ applications (Libreoffice and KDE). Due to a number of improvements to the dynamic linker (such as adding precomputed hashes) and optimizations made by the developers of these applications (such as merging libraries into bigger blobs or starting applications by loading a dynamic library) this is no longer so critical.

The number of relocations resolved by the dynamic linker can be collected with the LD_DEBUG=statistics environment variable. It is nice to see that compilers are improving at optimizing out the relocations. The table shows the number of relocations resolved by the dynamic linker and the number of relocations resolved from the cache when loading the largest library of libreoffice needed for a command-line conversion from ODT to PDF.
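For reference, the statistics can be gathered along these lines (a sketch; document.odt is a placeholder). glibc's dynamic linker prints its counters to stderr for every process it starts:

```shell
# ld.so reports "runtime linker statistics", including the total
# number of relocations processed and how many were served from
# its relocation cache.
LD_DEBUG=statistics instdir/program/soffice --headless \
    --convert-to pdf --outdir /tmp document.odt 2>&1 \
  | grep -A6 'runtime linker statistics'
```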

I must say that I am somewhat disappointed that GCC's LTO seems to help just a little, given the number of optimizations explicitly developed for this. Perhaps it helps more on a different binary, or perhaps there is a bug in the implementation - I need to check that. There is, however, noticeable progress. It would be nice to know why LTO for LLVM seems to bring more improvement even though the actual linking overhead is bigger on LLVM binaries.

I tried to check whether these differences translate to an improvement in libreoffice startup by measuring the time needed to print --help, but the differences are within the noise. Most of the startup cost is now elsewhere.

Runtime

Like two years ago, I had a hard time finding a test that would show anything but noise. The following is the test I did, converting the OpenDocument standard by:
instdir/program/soffice --headless --convert-to pdf --outdir /tmp ~/OpenDocument-v1.2-os.odt


Even when run with perf, at runtime priority, with caches cleaned and then filled by a dry run, over 100 repetitions, this test shows basically no off-noise difference between compilers. The only off-noise information is that -O3 -flto binaries are faster than the others (for both GCC and LLVM).
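The measurement loop, roughly, looked like this (a sketch under the assumptions that perf and chrt are available and the commands run as root for the cache drop; doc.odt is a placeholder):

```shell
sync && echo 3 > /proc/sys/vm/drop_caches     # start with cold caches
# dry run to warm the caches back up deterministically
instdir/program/soffice --headless --convert-to pdf --outdir /tmp doc.odt
# 100 timed repetitions at real-time FIFO priority
chrt -f 99 perf stat -r 100 -- instdir/program/soffice \
    --headless --convert-to pdf --outdir /tmp doc.odt
```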

I tried the benchmarks from https://gerrit.libreoffice.org/gitweb?p=benchmark.git;a=tree (and regretted that I did not build Python scripting into my libreoffice binaries). Most benchmarks show similarly even results except for one test (StocksPrice_time_correlation.xls):

which is interesting because LLVM's -flto -O2 wins over GCC's -flto -fno-semantic-interposition -O2. Profiling the test shows that the inner loop is dominated by the function ScFormulaCell::NeedsInterpret, which is inlined by LLVM at -O2 -flto but not by GCC unless -O3 -flto is used. The code size is estimated to grow (the function is not completely trivial), and GCC at -O2 inlines only functions explicitly declared inline or those where code size is expected to shrink. It seems that the -O2/-O3 split may need to be reconsidered for LTO builds. Clearly -O3 expands code only minimally with LTO, which is not the case for non-LTO builds.

From the profiles it is clear that the spreadsheet benchmarks are of a microbenchmarking nature (developed to test GPU acceleration efforts) and they are not a very good indicator of compiler optimization, because all that matters is how well a few tight inner loops are optimized. It may be an interesting project to build a more elaborate Libreoffice benchmark.

I plan to write later on Firefox which has more elaborate benchmark suite.

Summary

For the first time in my tests, GCC 5+ compiles faster than LLVM 3.8+. While this may possibly be reversed by optimizing LLVM with GCC's FDO (I did not check that), the compile-time performance of both compilers (on the Libreoffice build) is within the range of micro-optimizations.

LLVM still consumes less memory than GCC (6GB compared to 10GB with make -j16) during non-LTO compilation (GCC reduced its use by about 1GB in two years).

More progress is visible with GCC LTO. Memory use was reduced from 18GB to 10GB (the same as a non-LTO build). -flto still makes the compilation slower, but because Libreoffice was updated to use -flto=n, the overall slowdown is quite small. In fact, GCC with LTO builds Libreoffice on my setup faster than LLVM 3.8 without LTO.

The binary sizes improved. GCC with LTO produces the smallest binary. Comparing binary sizes between GCC and LLVM needs to be done with care, because LLVM ignores semantic interposition, which makes a noticeable difference. It may make sense to add -fno-semantic-interposition to Libreoffice's set of default flags. GCC 5 saw dramatic code size reductions with LTO due to the identical code folding optimization. Sadly, some of these benefits are gone in GCC 6 because of the improvements in type-based alias analysis. Hopefully in the next stage1 the identical code folding will be updated to handle this more carefully. There is enough code duplication that it may make sense for Libreoffice developers to look into it, reduce it at the source level, and experiment with gold's ICF feature. The duplicated functions can be fished out of the -fdump-ipa-icf dump, but it may make sense to add a user-facing warning.
GCC produces binaries which need fewer non-cached lookups during dynamic linking. LLVM seems to derive more benefit from LTO (I would like to understand why).

It is hard to get any off-noise benchmark results out of Libreoffice. The limited results show no significant difference between compilers or releases. -O3 -flto binaries are faster than the default builds, and small performance improvements are also seen with LTO. Given that GCC's -O3 -flto binary is significantly smaller than the -O2 binary and just a tiny bit bigger than the -O2 -flto one, it seems to make sense to reconsider -O2 optimizations with LTO during the next stage1: given that the whole program (well, a DSO at best) is available, the compiler may be able to make more informed decisions. This seems better than trying to convince users to switch to -O3 with LTO.

LLVM's non-LTO binaries got smaller between LLVM 3.5 and LLVM 3.8 and the code segment sizes are now comparable with GCC's. LLVM's data segments are smaller, because GCC aligns more, partly to be backward compatible with its own bug where overly large alignments were assumed for externally defined static vars. LLVM also produces smaller EH and unwind tables. These are not loaded into memory until an exception occurs, but it would still be nice to keep them small. I believe the size difference is mostly due to the fact that GCC has for some releases now used push/pop instructions by default (to reduce code size) and those need more unwind descriptions. LLVM also does not produce precise unwind info for function prologues and epilogues (this violates the x86_64 ABI).

LLVM's LTO increases code size. According to my limited profiling, LLVM now seems to be more aggressive than GCC with code-expanding inlining at -O2. GCC's -O2 -flto binary has a 29% smaller code segment than LLVM's, but it is slightly slower in one of the spreadsheet calculation benchmarks because of a missed inline in the inner loop.

What remains a problem is the debug info with LTO. While GCC works with -flto -g, the debug info basically corresponds to the C language, and debugging C++ programs may be a challenge. A major rewrite (by Richard Biener) is supposed to land in GCC 7. LLVM 3.9 still more than doubles the linking time and increases peak memory use to 16GB (for the non-parallelized linking stage).

In 2014 I was able to hack the build machinery to produce binaries with profile feedback. This no longer works. I will try to debug that, but I decided not to include it in this report, so it stops snowballing.

9 comments:

  1. Hi. Is there a chance to bring these tests to the next build of official L. O., if they are not there already?
    Another thing: can you explain why, when I use L. O. headless to convert to PDF, I always get smaller files (for a LibreOffice brochure, I got 6MB against 6.3MB from the viewer interface! All defaults and no perceptible difference in image rendering)?

    $ /opt/libreoffice5.1/program/soffice --headless --convert-to pdf /home/morvan/curso/LibreOffice/apostila/LibreOffice.org.Modular.Apostila.Hist.Calc.Writer.2016.r07.odt --outdir /tmp

    And, finally, congratulations on your great work.

  2. Optimizations in GCC are nice, but better is musl libc, being already there in a sense. I would like the LibreOffice codebase to compile for musl without patches. See current patches at Alpine Linux, a very popular distro for Docker users. The lead developer was in fact hired by Docker, and works there now.

    Thanks for all your work on LibreOffice.

  3. Looks like they just added PGO builds for you ;)
    http://reviews.llvm.org/rL263834

  4. It seems to me scanning the graphs roughly (and looking at CPU utilization) that "compiling proper" is still about 75% of the time of the equivalent stage for GCC, and that GCC's big win is in linking. My point is that

    - for LLVM developers, the takeaway should be that their single highest impact improvement right now is getting their linking to be as impressive as GCC's.
    - there's a noticeable dead period (no CPU utilization) for both compilers at the point I assume is just after compilation proper is done; and I assume this is time spent pulling in binaries from disk into RAM (libraries and other code built at some other time?).
    There seems to be scope here for the build system to improve things by shifting this work to preloading from disk that overlaps with the earlier CPU-intense stage of the processing. For GCC this dead time looks like around 5% or so, which is still a nice speedup in the overall build, and for LLVM it looks a lot larger, like 10 to 15%.
    (I could imagine that in the past this sort of "optimization" would be a terrible idea because it would result in the disk head bouncing around between the position of the libraries and the position of text files being pulled in for compiling; but with SSDs it might be time to revisit that assumption.)

    Do both these observations make sense? Or have I completely misunderstood what the graphs are telling us?

    Replies
    1. The red line on the CPU usage graphs represents 1 CPU (there are 16 threads on my setup). In the LLVM LTO build, the end of the compilation stage is at about 3500s, where CPU usage drops to 1 CPU and the single-threaded link happens. I think libmerged.so finishes linking at 4500s, judging from the drop in memory use, followed by other smaller libraries.

      In the GCC 6 graph the CPU usage drops at 2800s, so the compile stage is also faster. The memory and CPU use of a GCC LTO link is always a single-threaded WPA stage (visible in the graph) followed by parallel ltrans stages. You can clearly see the link of libmerged.so finish at about 3100s.

      I think the differences are explained by the different organizations of LTO in the two compilers. While LLVM basically runs the complete optimization queue at compile time and re-runs it at link time, GCC splits optimization into early and late passes. Early optimizations are done at compile time and are intended to be win-win ones only (i.e. no trading of code size for speed). At link time the optimization queue proceeds from that point without redoing optimizations.

      The advantage of this model is faster compile time, and it makes it possible to decide on code-expanding optimizations in a more informed manner at link time. The disadvantage is that the LTO-optimized code may easily be slower than non-LTO code if the link-time optimizers get confused by all the cross-module possibilities.

    2. OK, thanks. Seeing the graphs in their small form on the page, I thought they were all equally scaled in time (because some are slightly wider than others). I didn't realize they had different scales.

      Thanks for the discussion regarding the split optimizations of GCC (obviously a smart optimization), and the discussion of the single-threaded stages.

  5. How did you collect/plot the CPU time/memory usage? Is there a script you can share? Thanks!

  6. Are there still plans to include improved debug info in LTO-compiled binaries with the upcoming release of gcc7? I did a quick search, but nothing concrete turned up.

    Replies
    1. It has slipped one release, but it is finally in GCC 8.
