Sunday, December 30, 2018

Even more fun with building and benchmarking Firefox with GCC and Clang

The recent switch of official Firefox builds to Clang on all platforms has triggered some questions about the quality of code produced by GCC compared to Clang. I use Firefox as an occasional testcase for GCC optimization and thus I am somewhat familiar with its build procedure and benchmarking. In my previous post I ran benchmarks of Firefox 64, comparing a GCC 8 built binary to the official builds using Clang, which turned out to look quite good for GCC.

My post triggered a lot of useful discussion and thus I can write an update. Thanks to Nathan Froyd I obtained level 1 access to the Mozilla repository and Mozilla's try server, which allows me to run tests on the official setup. Mozilla's infrastructure features some very cool benchmarking tools that reliably separate useful data from the noise, and I ran more tests than I did last time.

I am also grateful to Jakub Jelínek, who spent a lot of time bisecting a nasty misoptimization of the spell checker initialization. I also did more joint work with Jakub, Martin Liška and Martin Stránský on enabling LTO and PGO for the openSUSE and Fedora packages (based on an LTO-enabled git branch Martin Liška maintained for a while).

      Treeherder

      Treeherder, the build system used by Mozilla developers, seems surprisingly flexible and kind of fun to use. I had the impression that it is a major pain to change its configuration and update to a newer toolchain. In fact it is as easy as one can hope for. I was surprised that with minimal rights I could do it myself. With an outline from Nathan I decided to do the following:
      1. revert builds back to GCC 6,
      2. update build system to GCC 8, 
      3. enable link-time-optimization (LTO) and compare performance to official builds 
      to see if there are code quality issues or bugs I can fix.

      It took three days to get the configuration updated and on Christmas Eve I got my first working LTO build with GCC 8. The build metrics looked great, but there was indeed a number of performance regressions (unlike in the tests I did two weeks ago, where basically all benchmarks looked good).

      I had to watch a lot of dancing cat animations, sacrifice a few cows to St. Ignucius and replace 5 object files by precompiled Clang assembly (these contain hand-optimized AVX vectorized rendering routines written in Clang-only extensions of the GNU vector extensions, which I explain later). Four days later I had resolved most of the performance problems (and enjoyed the festivities as well).

      In this post I discuss what I learnt. I focus on builds with link-time optimization (LTO) and profile-guided optimization (PGO) only, because that is what official builds use now and because it makes it easier to get apples-to-apples comparisons. I plan to also write about performance with -O2 and -O2 -flto; there the fact that the two compilers differ in their interpretation of optimization levels shows up.

      Try server benchmarks: Talos comparison of GCC 8 and Clang 6 binaries built with link-time optimization (LTO) and profile-guided optimization (PGO)
      This dancing cat is overseeing Firefox's benchmarks.
      If you do some Firefox performance work, you had better be a cat person.

      Treeherder has the best benchmarking infrastructure I have worked with so far. Being able to run benchmarks in a controlled and reproducible environment is very important, and I really like the ability to click on individual benchmarks and see the history and noise of the Firefox mainline. It is also cute that one can mark individual regressions and link them to Bugzilla, and that there are performance sheriffs doing so.

      GCC benchmarking has for years been ruled by simple scripts and gnuplot. Martin Liška recently switched to LNT, which has a lot of nice features, but I guess LNT could borrow some ideas :)
      This is a screenshot of the dancing cat page comparing binary sizes and other build metrics of the GCC 8 LTO+PGO binary to Clang 6 LTO+PGO. The live version may take a while to load and will consider all changes insignificant because it does not have data from multiple builds. The screenshot actually compares my build to trunk as of 28th December 2018, which is not far from my branchpoint.
      The "section sizes" values show the reduction of the binaries. The largest is libxul.so, which goes down from 111MB to 82MB, a 26% reduction. The installer size shrinks by 9.5%. What is not reported, but important, is that the code segment shrinks by 33%. Build time differences are within noise.

      The number of warnings is in red, but I guess more warnings are a good thing when it comes to a compiler comparison.
      A screenshot of the dancing cat comparison of my GCC 8 LTO+PGO build with the official Clang 6 LTO+PGO build from the point I branched. The dancing cat page gives you a lot of extra information: you can look at sub-tests, and it shows tests where changes are not considered important or are within noise. You can also click through to a graph and see progress over time.

      The following benchmarks see important improvements:
      1. tp5o responsiveness (11.45%) tracks responsiveness of the browser on the tp5o page set. This is a set of 51 of the most popular webpages from 2011 according to Alexa, with 3 noisy ones removed. The list of webpages can be seen in the subtests of the tp5o benchmark.

        There is an interesting discussion about it in Bugzilla. This is a complex test and I would like to believe that the win is caused by the careful balancing of code size versus performance, but I have no real proof of that.
      2. tps (5.09%) is a tab switching benchmark on the tp5o pageset.
      3. dromaeo_dom (4.74%) is a synthetic benchmark testing manipulation of DOM trees. It consists of 4 subtests and thus the profile is quite flat. See the subtests. You can run it from the official Dromaeo page.
      4. sessionrestore (3.57%), as the name suggests, measures the time to restore a session. Again, it looks like quite an interesting benchmark, exercising quite a large part of Firefox.
      5. sessionrestore_no_auto_restore (3.05%) seems similar to the previous benchmark.
      6. dromaeo_css (2.99%) is a synthetic benchmark testing CSS. It consists of 6 subtests and thus the profile is quite flat. See the subtests. You can run it from the official Dromaeo page.
      7. tp5o (2.43%) is a benchmark loading the tp5o webpages. I think this is a really nice overall performance test. See the subtests, which also list the pages. The improvements are quite uniform across the spectrum.

      The following 3 benchmarks see important regressions:
      1. displaylist_mutate (6.5%) measures the time taken to render a page after changing its display list. It looks like a good benchmark because its profile is very flat, which also made it hard for me to quickly pinpoint the problem. One thing I noticed is that the GCC-built binary has some processes showing up in the profile that do not show up in Clang's, so it may be some kind of configuration difference.
        You can run it yourself.
      2. cpstartup (2.87%) tests the time to open a new tab (which starts a content process, I think) and get ready to load a page. This looks like an interesting benchmark, but since the regression is small I did not run it locally. It may be just the fact that the train run does not really open/close many tabs and thus a lot of code is considered cold.
      3. rasterflood_svg (2.7%) tests the speed of rendering square patterns. It spends significant time in hand-optimized vector rendering loops. Since the profile is simple, I analysed the benchmark and reduced the regression as described below. I did not look at the remaining 2% difference. Run it yourself.

      There are some additional changes considered unimportant but off-noise:


      Improvements: 
      1. tpaint (4.77%)
      2. tp5o_webext (2.59%) 
      3. tp6_facebook (2.53%)
      4. tabpaint (2.43%)
      5. about_preferences_basic (2.23%)
      6. tp6_google (1.66%)
      7. a11yr (1.66%)
      Regressions:
      1. tsvgx (1.28%)
      2. tart (0.67%)

      All these tests are described on the Talos webpage.

      Out of 40 performance tests, 20 had off-noise changes, all except 5 in favour of GCC.
      You can compare this with the report on the benefits of the switch from GCC 6 PGO to Clang 6 LTO+PGO in bug 1481721. Note that speedometer is no longer run as part of the Talos benchmarks. I ran it locally and the improvement over Clang was 5.3%. Clearly, for both compilers LTO has an important effect on performance.

      It is interesting to see that the official Firefox tests mix rather simple micro-benchmarks with more complex tests. This makes it a bit more difficult to actually understand the overall performance metrics.

      Overall I see nothing fundamentally inferior in GCC's code generation and link-time optimization capabilities compared to Clang. In fact, GCC's implementation of scalable LTO (originally called WHOPR, see also here) is more aggressive about whole program analysis (that is, it makes almost all inter-procedural decisions at whole program scope) than Clang's ThinLTO (which by design does as much as possible at translation unit scope, where translation workers may pick up some code from other translation units as instructed by the thin linker). The ThinLTO design is inspired by the fact that almost all code quality benefits from LTO in today's compilers originate from unreachable code removal and inlining. On the other hand, optimizing at whole program scope makes it possible to better balance code size and performance and to implement more transforms. I have spent a lot of time optimizing the compiler to make WHOPR scalable (which, of course, helped to clean up the middle-end in general). I am happy that so far the build times with GCC look very competitive and we have more room for experimenting with advanced LTO transformations.

      The performance regressions turned out to be mostly compiler tuning issues that are easy to solve. An important exception is the problem with Clang-only extensions, which affects rasterflood_gradient and some tsvg subtests, explained in the section about Skia. Making the Skia vector code GCC compatible should not be terribly hard, as described later.

      Update: I gave a second chance to displaylist_mutate and found it is actually a missed inline. The GCC inliner is a bit tuned down for Firefox and can trade some more size for speed. Using --param inline-unit-growth=40 --param early-inlining-insns=20 fixes the regression and brings some really good improvements across the spectrum, while the binary is still 22% smaller than the Clang build. If I increase the limits even more, I get even more improvements. I will now celebrate the end of the year and next year I will analyse this and write more.

      I am in the process of fine-tuning the inliner for GCC 9, so I will take Firefox as an additional testcase.

      Getting GCC 8 LTO+PGO builds to work

      Following Nathan's outline, it was actually easy to update the configuration to fetch GCC 8 and build with it instead of GCC 6.

      I enabled LTO the same way as for the Clang build and added:
      export AR="$topsrcdir/gcc/bin/gcc-ar"
      export NM="$topsrcdir/gcc/bin/gcc-nm"
      export RANLIB="$topsrcdir/gcc/bin/gcc-ranlib"
      to the build configuration in build/unix/mozconfig.unix. This is needed to get LTO static libraries working correctly. Firefox already has the same defines for llvm-ar, llvm-nm and llvm-ranlib. Without this change one gets undefined symbols at link time.

      I added a patch to disable the watchdog to get profile data collected correctly. This is a problem I noticed previously; it is now bug 1516081 and was my first experience with the Firefox patch submission procedure (Phabricator), which I found particularly entertaining because it required me to sacrifice a few games from my phone in order to install an authenticating app.

      The next problem to solve was an undefined symbol in the sandbox. This is fixed by the following patch, taken from Martin Liška's Firefox RPM:
      diff --git a/security/sandbox/linux/moz.build b/security/sandbox/linux/moz.build
      --- a/security/sandbox/linux/moz.build
      +++ b/security/sandbox/linux/moz.build
      @@ -99,9 +99,8 @@ if CONFIG['CC_TYPE'] in ('clang', 'gcc')
       # gcc lto likes to put the top level asm in syscall.cc in a different partition
       # from the function using it which breaks the build.  Work around that by
       # forcing there to be only one partition.
      -for f in CONFIG['OS_CXXFLAGS']:
      -    if f.startswith('-flto') and CONFIG['CC_TYPE'] != 'clang':
      -        LDFLAGS += ['--param lto-partitions=1']
      +if CONFIG['CC_TYPE'] != 'clang':
      +    LDFLAGS += ['--param', 'lto-partitions=1']

       DEFINES['NS_NO_XPCOM'] = True
       DisableStlWrapping()
      The code to add the necessary --param lto-partitions=1 already exists, but somehow it is not enabled correctly. I guess it was not updated for the new --enable-lto. The problem here is that the sandbox contains a toplevel asm statement defining symbols. This is not supported with LTO (because there is no way to tell the compiler that the symbol exists) and it is recommended to simply disable LTO in such cases. This is now bug 1516803.

      Silencing new warnings

      The official build uses -Werror, so compilation fails when warnings are produced. I had to disable a few warnings where GCC complains and Clang is happy.

      I ended up disabling:
      1. -Wodr. This is the C++ One Definition Rule violation detector I wrote 5 years ago. It reports real bugs, even though some of them may be innocent in practice.

        In short, the C++ One Definition Rule (ODR) says that you should not have more than one definition of the same name. This is very hard to maintain in a program of Firefox's size unless you are very consistent with namespaces. An ODR violation can lead to surprises where, for example, a virtual method ends up being dispatched to a virtual method of a completely different class which happens to clash in name mangling. This is particularly dangerous when, as Firefox does, you link multiple versions of the same library into one binary.

        These warnings are detected only with LTO. I started to look into fixes and found that GCC 8 is a bit too verbose. For example, it outputs an ODR violation on the class itself and then on every single method the class has (because their this pointer parameter is mismatched). I silenced some of the warnings for GCC 9. GCC 9 now finds 23 violations, which are reported as bug 1516758. GCC 8 reported 97.
      2. -Wlto-type-mismatch. This warns about mismatched declarations across compilation units, such as when you declare a variable int in one unit but unsigned int in another. These are real bugs and should be fixed. Again I reduced the verbosity of this warning for GCC 9 so things are easier to analyse. Reported as bug 1516793.
      3. -Walloc-size-larger-than=. This produces warnings when you try to allocate very large arrays. In the case of Firefox the size is pretty gigantic.

        GCC produces 20 warnings on Firefox and they do not seem particularly enlightening here.
        audio_multi_vector_unittest.cc:36:68: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]

        array_interleaved_ = new int16_t[num_channels_ * array_length()];
        What it says is that the function was inlined and array_length ended up being a compile-time constant of 18446744073709551615 = 0xFFFFFFFFFFFFFFFF. It is a question why such a context exists.
      4. -Wfree-nonheap-objects. As the name suggests, this warns when you call free on something which is clearly not on the heap, such as an automatic variable. It reported 4 places where this happens across quite large inline stacks, so I did not debug them yet.
      5. -Wstringop-overflow=. This warns when a string operation would overrun its target. It detects 8 places which may or may not be possible buffer overflows. I did not analyse them.
      6. -Wformat-overflow. This warns when, for example, a sprintf format string can lead to output larger than the destination buffer. It outputs diagnostics like:
        video_capture_linux.cc:151:21: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]

        sprintf(device, "/dev/video%d", (int) _deviceId);
        Here someone apparently forgot about the 0 terminating the string. It triggers 15 times.
        Martin Liška and Jakub Jelínek have patches which I hope will land upstream soon.

      Supplying old libstdc++ to pass ABI compatibility test

      After getting the first binary, I ran into the problem that GCC 8 built binaries require a new libstdc++, which was not accepted by the ABI checks; and if you disable those tests, the benchmarking server will not run the binary.

      Normally one can install multiple versions of GCC and use -I and -L to link against an older libstdc++. Since I did not want to spend too much time figuring out how to get the official build system to install two GCC versions at once, I simply made my own tarball of GCC 8.2 where I replaced libstdc++ by the one from GCC 6.4.

      Working around the two GCC bugs

      The first bug I wrote about previously; it is related to the way GCC remembers optimization options from individual compilation commands and combines them during link-time optimization. There was an omission in the transformation which merges static constructors and destructors together that made it pick random flags. By bad luck those flags happened to include -mavx2, and thus the binaries crashed on the Bulldozer machine I use for remote builds (they would probably still work in Talos). It is easy to work around this by adding -fdisable-ipa-cdtor to LDFLAGS.

      The second bug reproduces with GCC 8 and PGO builds only. Here GCC decides to inline into a thunk, which in turn triggers an omission in the analysis pass of devirtualization. Normally C++ methods take a pointer to the corresponding object as the this pointer. Thunks are special because they take a pointer adjusted by some offset. It needs a lot of bad luck to trigger this and get wrong code, and I am very grateful to Jakub Jelínek, who spent his afternoon isolating a testcase.

      I use:
      diff --git a/extensions/spellcheck/src/moz.build b/extensions/spellcheck/src/moz.build
      --- a/extensions/spellcheck/src/moz.build
      +++ b/extensions/spellcheck/src/moz.build
      @@ -28,3 +28,8 @@ EXPORTS.mozilla += [

       if CONFIG['CC_TYPE'] in ('clang', 'gcc'):
           CXXFLAGS += ['-Wno-error=shadow']
      +
      +# spell checker triggers bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88561
      +# in gcc 7 and 8. It will be fixed in GCC 7.5 and 8.3
      +if CONFIG['CC_TYPE'] in ('gcc'):
      +    CXXFLAGS += ['-fno-devirtualize']
      I am not happy that GCC bugs were triggered (both of them mine), but they are clearly rare enough that they were never reported before.

      Performance analysis

      My first run of benchmarks showed some notable regressions (note that this was with only one run of the tests, so smaller changes are considered noise).

      Important regressions were:
      1. rasterflood_gradient (75%). This renders some pink-orange blobs. Run it yourself.
      2. rasterflood_svg 13%. This renders rotating squares (with squary hairs).
      3. tsvg_static 18%. This consists of 4 subtests, with regressions in rendering transparent copies of Kyoto and rendering rotated copies of Kyoto.
      4. tsvgx 36%. This again consists of 7 subtests. A massive 493% regression is on a test drawing blue-orange blobs. The second regressing test renders nothing in my version of Firefox (the official Tumbleweed RPM), but it renders weird looking green-blue jigsaw pieces in Chrome. The last test which regresses renders a green map.

        I have filed bug 1516791, but it may be caused by the fact that I needed to add some JavaScript into the test to get it running outside of Talos, and I am no JavaScript expert.

        Update: as explained in the bug, it was my mistake. The test needs an additional file smallcats.gif, and that strange jigsaw puzzle is actually the broken image icon. So indeed my mistake.
      5. displaylist_mutate 8%. I could not analyse this benchmark easily. It consists of sub-tests that all look alike and have a flat profile.
      Fortunately, all these benchmarks except for displaylist_mutate are of micro-benchmark nature and it is easy to analyse them. Overall I think this is a good sign for GCC 8 built performance out of the box: if you do not depend on the rendering performance of shader animations, a GCC 8 built binary will probably perform well for you.

      Skia (improving rasterflood_gradient and tsvgx)

      Skia is a graphics rendering library shared by Firefox and Chrome. It is responsible for the performance of two benchmarks: rasterflood_gradient and the massively regressing tsvgx subtest. I like the pink-orange blobs better and thus looked into rasterflood_gradient. The official build's profile was:
      Samples: 155K of event 'cycles:uppp', Event count (approx.): 98151072755
      Overhead  Command          Shared Object                  Symbol
        13.32%  PaintThread      libxul.so                      [.] hsw::lowp::gradient
         7.88%  PaintThread      libxul.so                      [.] S32A_Blend_BlitRow32_SSE2
         5.20%  PaintThread      libxul.so                      [.] hsw::lowp::xy_to_radius
         4.14%  PaintThread      libxul.so                      [.] hsw::lowp::matrix_scale_translate
         3.97%  PaintThread      libxul.so                      [.] hsw::lowp::store_bgra
         3.77%  PaintThread      libxul.so                      [.] hsw::lowp::seed_shader
      while GCC profile was:
      Samples: 151K of event 'cycles:uppp', Event count (approx.): 101825662252
      Overhead  Command          Shared Object               Symbol
        17.64%  PaintThread      libxul.so                   [.] hsw::lowp::gradient
         6.51%  PaintThread      libxul.so                   [.] hsw::lowp::store_bgra
         6.36%  PaintThread      libxul.so                   [.] hsw::lowp::xy_to_radius
         5.40%  PaintThread      libxul.so                   [.] S32A_Blend_BlitRow32_SSE2
         4.73%  PaintThread      libxul.so                   [.] hsw::lowp::matrix_scale_translate
         4.53%  PaintThread      libxul.so                   [.] hsw::lowp::seed_shader
      So only a few percent difference on my setup (as opposed to 75% on the try server), but clearly related to hsw::lowp::gradient, which promptly pointed me to the Skia library. Hsw actually stands for Haswell; it is hand-optimized vector rendering code which is also used for my Skylake CPU.

      #if defined(__clang__) issues

      I looked into the sources and noticed two funny hacks. Someone enabled the always_inline attribute only for Clang, which I fixed by this patch. And there was an apparently leftover hack in the Firefox copy of Skia (which was never part of official Skia) disabling AVX optimization on all non-Clang compilers. Fixed by this patch. That patch also fixes another disabled always_inline with a comment about a compile time regression with GCC, which did not reproduce for me. I also experimented with setting -mtune=haswell on those files, since I suspected that AVX vectorization with generic tuning may be off; it had not occurred to me to test this combination before, since I expected people to use -march=<cpu> in this case.

      I was entertained to notice that Clang actually defines __GNUC__. Should GCC also define __clang__?

      With this, the rasterflood_gradient regression was reduced from 75% to 39% and the tsvgx subtest from 493% to 75%.

      AVX256 versus SSE2 code

      From the profiles I worked out that the rest of the difference is again caused by #ifdef machinery. This time it is, however, not so easy to undo. The Firefox version of Skia contains two implementations of the internal loops. One is Clang-only, using vector extensions, while the other, used for GCC and MSVC, uses Intel's official ?mmintrin.h API. The second version was never ported to AVX, and the avx/hsw loops still use 128bit SSE vectors and API, just compiled with the new ISA enabled.

      I downloaded the upstream Skia sources and realized that a few weeks ago the MSVC and GCC path was dropped completely; Skia now defaults to a scalar implementation on those compilers.

      Its webpage mentions:

      A note on software backend performance

      A number of routines in Skia’s software backend have been written to run fastest when compiled by Clang. If you depend on software rasterization, image decoding, or color space conversion and compile Skia with GCC, MSVC or another compiler, you will see dramatically worse performance than if you use Clang.
      This choice was only a matter of prioritization; there is nothing fundamentally wrong with non-Clang compilers. So if this is a serious issue for you, please let us know on the mailing list.
      Clang's vector extensions are in fact GNU vector extensions, and thus I concluded that it should not be that hard to port Skia to GCC again and decided to give it a try. It is about 3000 lines of code. I got it to compile with GCC in an hour or two, but it turned out that more work would be necessary. It does not make sense to spend too much time on it if it cannot be upstreamed, so I plan to discuss it with the Skia developers.

      The problem is in:
      template <typename T> using V = T __attribute__((ext_vector_type(size)));
      
      This is not understood by GCC. ext_vector_type is a Clang-only extension of the GNU vector extensions, of which I found a brief mention in the Clang Language Extensions manual. There they are called "OpenCL vectors".

      I also noticed that replacing ext_vector_type by the GNU equivalent vector_size makes GCC unhappy about attributes on the right hand side, for which I filed PR88600, and Alexander advised me to use instead:
      template <typename T> using V [[gnu::vector_size (sizeof(T)*8)]] = T;
      
      This gives an equivalent vector type, but unfortunately not equivalent semantics. Clang accepts different kinds of code depending on the attribute (ext_vector_type or vector_size). In particular, the following constructs are rejected for gnu::vector_size by both GCC and Clang but accepted for ext_vector_type:
      typedef __attribute__ ((ext_vector_type (4))) int int_vector;
      typedef __attribute__ ((vector_size (16))) float float_vector;
      
      int_vector a;
      float_vector b;
      
      int
      test()
      {
        a=1;
        a=b;
      }
      
      Here I declare one OpenCL vector variable a consisting of 4 integers and one GNU vector extension variable b consisting of 4 floats (4*4=16 is the size of the vector type in bytes). The code has the semantics of writing the integer vector [1,1,1,1] to a and then moving b into a without any conversion, reinterpreting the bits rather than converting the values (thus getting funny values).

      Replacing the OpenCL vector by a GNU vector makes both compilers reject both statements. But one can fix it as:
      typedef __attribute__ ((vector_size (16))) int int_vector;
      typedef __attribute__ ((vector_size (16))) float float_vector;
      
      int_vector a;
      float_vector b;
      
      int
      test()
      {
        a=(int_vector){} + 1;
        a=(int_vector)b;
      }
      
      The construct (int_vector){} + 1 was recommended to me by Jakub Jelínek; the first part builds a zero vector, and then 1 is added to it, a vector-scalar addition that is accepted by the GNU vector extensions.

      The explicit casts are required intentionally, because the semantics contradict the usual C meaning (without the vector attribute, an integer to float conversion would happen) and that is why users are required to write them by hand. It is funny that Clang actually also requires the cast when both vectors are OpenCL or both are GNU; it only accepts a=b if one vector is GNU and the other OpenCL. Skia uses these casts often in places where it calls ?mmintrin.h functions.

      I thus ended up with a longish patch adding all the conversions. I also had to work around the non-existence of __builtin_convertvector in GCC, which would be a reasonable builtin to have (and there is PR85052 for that). I ended up with code that compiles and renders some things correctly but others incorrectly.

      I therefore decided to discuss this with the upstream maintainers first and made only a truly non-upstreamable hack: I replaced the source files SkOpts_avx.cpp, SkOpts_hsw.cpp, SkOpts_sse41.cpp, SkOpts_sse42.cpp and SkOpts_ssse3.cpp by precompiled Clang assembly. This, of course, solved the two regressions, but there is work ahead.

      I also filed PR88602 to track the lack of ext_vector_type in GCC, but I am not quite sure whether it is desirable to have. I hope there is some kind of specification and/or design rationale somewhere.

      2d rendering performance (improving rasterflood_svg)

      The performance regression seems to be caused by the fact that GCC honors always_inline across translation units. I ended up disabling always_inline in mfbt/Attributes.h, which solved most of the regression. I will identify the wrong use and submit a patch later.

      The rest of the performance difference turned out to be a CPU tuning issue. Compiling:
      #include <emmintrin.h>
      __m128i test (short a,short b,short c,short d,short e,short f,short g)
      {
        return _mm_set_epi16(a,b,c,d,e,f,g,255);
      }
      
      
      GCC 8 with generic tuning compiles this into a rather long sequence of integer operations, stores and vector loads, while Clang uses direct integer-to-SSE moves. This is because GCC 8 still optimizes for Bulldozer in its generic tuning model, and integer-to-vector moves are expensive there. I did a pretty detailed re-tuning of the generic settings for GCC 8 and I remember that I decided to keep this flag because it was not hurting new CPUs much and was important for Bulldozer, but I missed its effect on hand-vectorized code. I will change the tuning flag for GCC 9.

      I added -mtune-ctrl=inter_unit_moves_to_vec to the compilation flags to flip the bit. Clearly the benchmarking servers are not using Bulldozer CPUs, where the GCC version indeed runs about 6% faster.

      Tweaking train run

      While looking into the performance problems of SVG rendering, I noticed that the code is optimized for size because it was never executed during the train run. Because Clang does not optimize cold regions for size, this problem is not very visible in the Clang benchmarks. I have added displaylist_mutate.html, rasterflood_svg.html and hixie-007.html into the train run. These are the same pages as used by Talos, just modified to run longer and outside of the Talos scripting machinery. I thus did not file a bug report for this yet.

      I have tested this and it seems to have only an in-noise effect on Clang, but the same issue was hit years ago with MSVC, as reported in bug 977658.

      It seems there are some more instances where the train run can be improved, and probably those tests could be combined into one so as to not increase build times too much. Candidates are the two regressions in the performance micro-benchmarks which I have verified to indeed execute cold code.

      I have also noticed that training of hardware-specific loops is done only on the architecture used by the build server. This is now bug 1516809.

      -fdata-sections -ffunction-sections

      These two flags put every function and variable into a separate linker section. This lets the linker manipulate them independently, which is used to remove unreachable code and also for identical code folding in gold. However, they also add overhead, by inserting extra alignment padding and preventing the assembler from using the short form of branch instructions.

      Without LTO, the linker optimization saves about 5MB of binary size for the GCC build and 8MB for Clang. With LTO it does not make much sense, however, because the compiler can do these transforms itself. While GCC may not merge all functions with identical assembly that the linker can detect, and vice versa, it is a win to disable those flags: about 1MB of binary size and better link times. This is now bug 1516842.

      -fno-semantic-interposition

      Clang ignores the fact that in ELF libraries symbols can be interposed by a different implementation. When comparing the performance of PIC code between compilers, it is always good to use -fno-semantic-interposition in GCC to get an apples-to-apples comparison. The effect on Firefox is not great (about 100KB of binary size difference) because it declares many symbols as hidden, but it prevents further performance surprises.

      Implementation monoculture

I consider Firefox an amazing test-case for link-time optimization development because:
1. it has a huge and dynamically changing code-base where a large part is in modern C++,
2. it encompasses different projects with divergent coding styles,
3. it has a lot of low level code and optimization hacks which use random extensions,
4. it has decent benchmarks, some of which have been hand optimized for a while,
5. it can be benchmarked from the command line with reasonable confidence.
This makes it possible to do tests that cannot be done by running the usual SPEC benchmarks, which are an order of magnitude smaller, sort of standard compliant, and were written many years ago. Real world testing is essential both to make individual features (such as LTO or PGO) production ready and to fine tune the configuration of the toolchain for practical use.

For this reason I am a bit sad about the switch to a single compiler. It is not clear to me whether Firefox will stay portable next year. Nathan wrote an interesting blog post, "when an implementation monoculture might be the right thing". I understand his points, but on the other hand it seems that no super-human effort was needed to get decent performance out of GCC builds. From my personal experience, maintaining a portable codebase may not always be fun, but it pays off in the long term.

Note that while Chromium switched to Clang-only builds some time ago, it still builds with GCC (GCC is used to build Chromium for the Tumbleweed RPM package, for example), so there is some hope that the community will maintain compatibility, but I would not bet my shoes on it.

It would make sense to me to add both GCC and Clang builds into Mozilla's nightly testing. This should:
1. Uncover portability issues and other bugs thanks to different warnings and C++ implementations in the two compilers.
2. Make the code-base easier to maintain and port in the long term.
3. Keep GCC and LLVM toolchain developers interested in optimizing the Firefox codebase by providing more real-world benchmarks than SPEC, Polyhedron and Phoronix (which are way too small for LTO and often far removed from reality).
4. Benefit from the fact that toolchain bugs will be identified and fixed before new releases.
At least in my short experiment I was easily able to identify problems in all three projects (and fix some of them).

      Communication with maintainers of Firefox packages in individual Linux distros

It seems to me that it would be good to communicate performance-critical settings better to the authors of packages used by individual distros. I think most of us do not download Firefox by hand and simply use the one provided by the particular Linux distribution. Judging from the communication I had with both Martins concerning SUSE's and Red Hat's packages, it seems very hard for packagers to reproduce the Firefox PGO build setup, which is critical for a good binary. One thing that would improve the situation is to make the build system fail if the train run fails, which I filed as bug 1516872.

It would be nice to set up a page listing the things that are important for a quality build, with links to benchmarks checking that the distribution-provided Firefox is comparable to the official one.

It may sound funny, but even though I have been looking at Firefox performance since 2010, until this December it did not cross my mind that I could simply download the official package and benchmark against it. It is unusual that reproducing a quality build needs such effort, but it is a consequence of the complexity of the code-base.

      Future plans

I already did some useful work on GCC 9 during the last two weeks:
1. Fix for the devirtualization bug
2. Reduced compile-time memory use when LTO linking large programs with profile feedback
3. Reduced memory use when linking large LTO programs with profile instrumentation, second part
4. Re-enabled speculative devirtualization that was accidentally disabled for LTO
5. Fixed some cases where GCC threw away profile feedback for no good reason
6. Made function summaries more precise after merging profiles of C++ comdat functions
7. Fixed a buffer overflow on mismatched profiles
8. ... and more I am testing.
I plan to continue working on this by updating my setup to GCC 9 and making sure it does well on Firefox once released. I will also look deeper into performance with -O2 and the remaining displaylist_mutate regression, and try to find time to write another update.

      Saturday, December 15, 2018

      Firefox 64 built with GCC and Clang

One of my tasks (as the GCC maintainer of the inter-procedural optimization framework) is to make sure that it works well with real world programs, as opposed to smaller scale benchmarks like SPEC. I am slowly getting ready for my annual attempt to verify that GCC 9 is doing well optimizing large applications, but because of the recent switch of the official and Fedora builds from GCC to Clang (FESCo ticket), I decided to take a quick look at how GCC 8 built binaries compare to the official release builds for Linux.

      Update: see also followup post

Clang-built Firefox is claimed to outperform the GCC build, but it is hard to get actual numbers. Firefox builds switched from GCC 6 (released in 2016) with profile guided optimization (PGO) to Clang 7 (the latest release), which in addition enables link time optimization (LTO). Link-time optimization can have an important performance and code size impact.

Martin Stránský (the Red Hat maintainer of the Firefox package) compared a GCC 8 built binary with Clang 7, but his setup was not completely comparable either, because of the stack protector settings and other security centric system-wide defaults of Fedora. Moreover GCC's -O2 defaults (in my opinion unfortunately) still do not enable vectorization and unrolling, which may have noticeable effects on benchmarks. This is a historical decision partly motivated by the compile-time pressure Clang put on us, which has since changed. I am in the process of evaluating what can be done about this default.

For my testing, I built LTO and PGO enabled Firefox using GCC 8 (sadly not an official release, since I had to fix one bug) and Clang 7 and compared them to two official Firefox binaries:
1. the last release built with GCC: Firefox 63, and
2. the first release built with Clang: Firefox 64.
      Any comments and corrections to my methods are welcome.

      Summary

To summarize my findings: the watchdog in Firefox kills the training run before it has time to stream profile data to disk. This bug in the Firefox build system hurts performance, because the compiler gets only partial profile data. Fixing the issue leads to faster binaries. GCC is a lot more careful about binary size (Clang builds have a 48% bigger code section!), but for tasks covered by the train run the GCC 8 LTO+PGO binary performs well (and wins in the benchmarks I tried). It is probably possible to find benchmarks testing things not covered by the train run, where GCC code will likely perform slower than Clang's because it is a lot smaller. However, the benchmarks I found discussed in the Fedora ticket above all favor GCC. Clang currently builds Firefox about 10% faster; I hope to reverse the sides with GCC 9 :)

Comparing compiler performance on a project as complex as Firefox is a delicate job. I plan to get Firefox's benchmark infrastructure (Talos) running again and do a more detailed analysis which also compares GCC 9.

You can compare this with my earlier tests of Firefox.

      Try it

I have uploaded a binary built with GCC 8, with link-time optimization and profile feedback. If your curiosity exceeds your fear of running random binaries from the net, you are welcome to try it out. It is built from the Firefox 64 release. You can compare it to the official build and the build provided by your favourite distro. (Is there a stable link to the Firefox 64 official Linux binary?)

I am typing this post using it and it seems to work on both Debian and Tumbleweed.

      Profile data collection problem

PGO in the Firefox build system is automated. If enabled in mozconfig, the build first produces a binary with profile instrumentation, then starts a local webserver and trains the application on a few things (there is Speedometer, SunSpider and a few other things I did not immediately recognize), and proceeds by building the final binary.

One problem is that if the train run fails, the build system will not inform you. For example, if you do not have an X server available, your binary will be good at printing an error message but not much else. Sometimes the train run crashes and then code quality goes completely off. I learnt the habit of using Xvnc and observing the train run remotely to check that it does what it should. I am surprised that Firefox developers have not added a check that each of the tasks finished successfully. It would be useful as a regression suite, too.

The issue I ran into this time is a bit subtle. It is not visible during testing, but it can be seen as the following message in the build log:
      MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /aux/hubicka/firefox-2018/release/toolkit/components/terminator/nsTerminator.cpp:219
This message (hidden among 57958 others) basically means that the worker thread was killed during exit, and thus the profile data collected from the actual training benchmark was never saved to disk (and so is invisible to the compiler). It turns out this is a timeout in Firefox's internal watchdog, which kills individual subprocesses if they seem to have hung during exit. GCC's profiling runtime streams all data from a global destructor in the libgcov library, and for Firefox this may take some time. GCC profile data is organized into multiple files (one for every object file), while Clang's is one big file which is later handled by a specialized tool. I suppose one large file may be faster to write than 8k smaller ones.

      I use this patch to increase the timeout for training runs.

Update: Thanks to Nathan Froyd I have set myself up as a Firefox developer and tried to produce a cleaner patch for review: https://bugzilla.mozilla.org/show_bug.cgi?id=1516081

      GCC LTO bug

Fixing the profile collection problem finally got me an optimized binary; this time it however did not start. The problem is caused by a long-standing bug in GCC command line option handling, where options were incorrectly merged for the function wrapping global constructors. This led to enabling AVX, and since the global constructor now got some code auto-vectorized, the binary crashed on an invalid instruction during the build (my testing machine has no AVX).

Update: As pointed out at ycombinator, the invalid instruction was actually AVX2; Bulldozer supports AVX.

I have now fixed this for GCC 8 and mainline and plan to backport it to GCC 7. So to reproduce my builds, you either need a recent GCC 8 snapshot or you can work around it by disabling the cdtor merging pass using -fdisable-ipa-cdtor.

This optimization combines static constructors and destructors from individual translation units into a single function. It was in fact also motivated by Firefox, which used to have many constructors (remember, each time you include iostreams you implicitly get one), and running them caused a noticeable lag during startup by touching many parts of the code segment. It seems this problem was fixed by hand over time, and thus the optimization is not very important for Firefox anymore.

This bug affects link-time optimized builds only, where correct behaviour with respect to the optimization options passed at compile time is quite challenging. Several factors are needed to trigger it: at least one file built with AVX codegen enabled, object files passed in the right order, a global constructor which is auto-vectorizable, and execution of the final binary on a CPU without AVX support. I suspect this may also be the origin of the problem with Firefox crashing at startup seen by the Red Hat guys recently.

LTO is an intrusive change to the whole toolchain, and unfortunately, because LTO adoption is still pretty low, surprises happen. At SUSE we now work actively on enabling link-time optimization by default after switching to GCC 9. Hopefully this will hammer out similar issues. At the moment only about 500 out of 11k packages fail to build with link-time optimization, some for good reasons. We will concentrate on fixing the issues prior to the GCC 9 release.

      File size


While I did not manage to 100% match the official Firefox builds, it seems that my Clang 7 build is close enough to make the comparison meaningful.

A 48% code segment size increase for switching compilers is a little bit surprising. I think there are two factors affecting this.
1. GCC is more aggressive about optimizing for size the regions that were not trained.
2. Traditional LTO, where the whole program is loaded back into the compiler and run through the back-end as if it were one compilation unit, is too slow in practice. Both GCC and Clang use a more scalable model (you can still get traditional LTO with -flto-partition=none for GCC and -flto=full for Clang).

The faster LTO modes necessarily trade some code quality for build speed. This is where the two compilers differ. GCC's LTO was designed to run the whole inter-procedural optimization queue on summaries and later dispatch local optimization into multiple worker processes, while Clang's ThinLTO is built around the assumption that GCC's approach will not scale enough, and works differently. The thin linker makes just part of the inter-procedural optimization decisions and the rest of the translation is per-file.

Time will tell which of the approaches scales better. I find it an interesting challenge to get GCC build times on par with or better than Clang's even for a project of Firefox's size. It would always be possible to combine both approaches if linking applications bigger than Firefox becomes important. So far I do not have a testcase to play with.
Another interesting observation is that my LTO build is about the same size as the official build of Firefox 63. Usually LTO builds are noticeably smaller. It may be that the official build suffered from the same loss of profile data as observed on my builds. I plan to build a GCC 6 binary and look into it later.

      Understanding performance of builds with PGO enabled

The train run of Firefox covers about 15% of the whole binary. Bear in mind that benchmarking code that was not trained at all during the build will make you measure the performance of code optimized for size (and it won't be very good). This can be handled by improving the train run coverage in Firefox or by disabling PGO for modules where that cannot be done (such as hardware-specific video decoders).

This makes a direct GCC to Clang comparison a bit harder, because Clang seems not to optimize cold regions for size, or does so a lot less aggressively.

I believe GCC's default is the correct one, because the size of binaries matters in practice, but it may not be the best one in all scenarios. If it seems useful, I can provide a command line option for GCC that disables the aggressive code size optimizations for cold parts of the program.

GCC and Clang also differ in the way they interpret -Os. For GCC it means "do everything possible to get a smaller binary", while for Clang it seems to be more like "disable some of the most code-size-expensive optimizations". Clang provides -Oz, which is closer to what GCC's -Os does. For a while I have been thinking that such a less aggressive size optimization would also be useful in GCC, especially in scenarios where GCC auto-guesses cold regions and there is a chance it guessed wrong. This is not hard to implement and may be something to do next year.

      Benchmarks

I started with the SunSpider and Speedometer benchmarks, which I noticed are part of the default train run. This makes for the most apples-to-apples comparison of the code generators' abilities, without being affected by the choice of code quality for cold regions of the binary. Of course in practice binaries are never perfectly trained, and thus I also include other benchmarks further down.

      Sometimes more is better and sometimes it is the opposite. I always ordered data from best run to worst for easy orientation.



SunSpider is a now somewhat historical JavaScript benchmark that seems to have been superseded by JetStream.

Update: I got some feedback that an old server CPU may not be the most representative for testing a desktop application, so I will try to re-run some of the benchmarks on my Skylake notebook to verify that they reproduce. I will add them in red. I do not plan to re-run everything. I am not completely happy about the reproducibility of SunSpider here, but I have disabled powersave, deleted my .mozilla directory and switched to IceWM. I skipped my Clang build but added the default Tumbleweed Firefox for extra fun.



Speedometer is closer to the noise factor, but shows a difference at least between the GCC 8 binary and Firefox 63. It measures responsiveness of the browser.
Update: the local run seems to have less noise.



Dromaeo DOM is the first benchmark which is not part of the train run. It tests DOM and CSS queries. I show the results of a run using http://dromaeo.com/?dom|jslib|cssquery. This is a subset of the full suite that is not very centric to JavaScript JIT performance, and I have earlier observed it to be more sensitive to compiler-generated code. I ran the default set of benchmarks earlier, also observing a difference, which was about 1.5% (comparing the GCC 8 build with the Firefox 64 official binary). They run for a while; I will find time to re-run them later. From perf profiles I know that the JavaScript benchmark tests a lot of JIT-generated code, some of the JIT performance itself, and simple C routines such as UTF conversions.

You can check a detailed comparison of the individual runs (the order is the same as I ran them: GCC 8, Firefox 63, Clang 7, Firefox 64). I am a bit surprised by the difference between my Clang build and the Firefox 64 release.



MotionMark is a fun-to-watch benchmark testing rendering speed. Eventually I got bored, however, and produced fan-art.
      Hope they will stay friends :)
Results are very close to the noise factor and may be affected by the fact that I ran the tests in Xvnc. It is a bit strange that my Clang build differs from the official one. I will try to find time (and a suitable machine) to run this test locally and see whether the results are more trustworthy. Similar issues were seen with BaseMark, which I eventually gave up on.

Update: I have re-run this on my Skylake notebook.



JetStream tests the performance of the most advanced web applications. It seems you are better off building with old GCC in this case!

ARES-6 tests ECMAScript 6 applications. I do not know what that is, but I am sure it is important.

      Runtime memory use


This I measured by letting Firefox start, observing resources in top and waiting a few seconds for the numbers to settle down. I was hoping for more interesting numbers here, because GCC with LTO has a code section reordering algorithm developed by Martin Liška that was motivated by Firefox. It works by measuring the average time of first execution of every function and then ordering the code segment so that during startup just a tiny portion of it is touched and execution proceeds in increasing order.
It does not show much, and the resource usage seems fully justified by the code segment size. At some point Firefox started to mmap the whole binary to prevent demand paging from seeking too much. This seems to still be the case today. I wonder how that works on SSD disks?

      Build time and built-time memory usage

This is the memory and CPU use graph I collected from the builds.
      Memory and CPU use during GCC 8 build
Memory and CPU use during the LLVM build
The Clang build time is about 9% better (91 minutes compared to 100). I will give GCC 9 a try, because I spent a good part of this year speeding LTO up. One aspect where Clang wins hands down is memory use during the build. This is partly caused by the technical differences between the LTO implementations discussed earlier. Another aspect is GCC's use of a garbage collector. Fortunately the peak of memory use is during link time, when parallel builds are performed. This issue is solved in GCC 9, which should fit here under 10GB. I will write about that later.

Note that reducing parallelism from 16 down will get you smaller peak memory use, so Firefox should build with GCC on boxes with 10GB.

Update: Building Firefox with a current snapshot of GCC 9 takes 93 minutes, a 7% improvement and only 2% slower than my Clang build. Memory use is down to a 15GB peak (still twice Clang's), but more importantly the link-time optimization part should fit on an 8GB box (GCC has a garbage collector, so if you have less memory than the 64GB I use for testing, it will trade some memory for compile time).
      Memory use of GCC 9 snapshot (Dec 16th 2018)


Update: According to a reddit post my Clang build procedure could be improved, because the default training run is very small. Instructions are here and here. I will give it a try.

      Details of my setup

For my tests I use an AMD Bulldozer based machine (AMD Opteron 6272) with 8 cores and 16 threads running Debian 9.6. My other machine is a ThinkPad X260 notebook with an Intel(R) Core(TM) i7-6600U CPU.

      I built GCC 8 from current SVN trunk configured with
      ../configure --with-build-config=bootstrap-lto --disable-multilib  
      --disable-werror --disable-plugin
      and build with
      make profiledbootstrap
I also tested GCC trunk (which is in early stage 3, heading to the GCC 9 release) with the same configuration.

Since the LLVM webpage no longer has official binaries for Debian, I downloaded the LLVM 7 release and built it myself following the bootstrap and LTO+PGO procedure. For that one needs to first build LLVM+clang+lld+runtime with GCC, then build a version collecting profile data, and finally build an LTO-optimized binary with profile feedback. To gather profile data I used:
      /usr/bin/cmake -C ../llvm-7.0.0.src/tools/clang/cmake/caches/PGO.cmake \
        ../llvm-7.0.0.src -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=/aux/hubicka/llvm7-install-fdo \
        -DLLVM_TARGETS_TO_BUILD=X86 \
        -DLLVM_BINUTILS_INCDIR=/aux/hubicka/binutils-install/include/ -G Ninja
      
      And to obtain final build I use:
      /usr/bin/cmake -DCMAKE_C_COMPILER=/aux/hubicka/llvm7-install-fdo/bin/clang \
        -DCMAKE_CXX_COMPILER=/aux/hubicka/llvm7-install-fdo/bin/clang++ \
        -DLLVM_PROFDATA_FILE=/aux/hubicka/./build2/tools/clang/stage2-instrumented-bins/tools/clang/utils/perf-training/clang.profdata \
        ../llvm-7.0.0.src -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=/aux/hubicka/llvm7-install-fdo \
        -DLLVM_TARGETS_TO_BUILD=X86 \
        -DLLVM_BINUTILS_INCDIR=/aux/hubicka/binutils-install/include/ \
        -G Ninja -DLLVM_ENABLE_LTO=Thin \
        -DCMAKE_RANLIB=/aux/hubicka/llvm7-install-fdo/bin/llvm-ranlib \
        -DCMAKE_AR=/aux/hubicka/llvm7-install-fdo/bin/llvm-ar
      
This is the first time I have attempted to build a PGO-optimized Clang, so I hope I did it correctly.


Finally, I use a pretty basic mozconfig:
      mk_add_options MOZ_MAKE_FLAGS="-j16"
      CC=/aux/hubicka/trunk-install/bin/gcc
      CXX=/aux/hubicka/trunk-install/bin/g++
      export PATH=/aux/hubicka/trunk-install/bin/:$PATH
      MYFLAGS="-O3"
      mk_add_options OS_CFLAGS="$MYFLAGS"
      mk_add_options OS_CXXFLAGS="$MYFLAGS"
      mk_add_options OS_LDFLAGS="$MYFLAGS"
      ac_add_options --enable-optimize=-O3
        
      ac_add_options --enable-application=browser
      ac_add_options --enable-debug-symbols
        
      ac_add_options --disable-valgrind
      ac_add_options --enable-lto
      ac_add_options --enable-tests
      ac_add_options MOZ_PGO=1
        
      export moz_telemetry_reporting=1
      export mozilla_official=1
        
      ac_add_options --enable-linker=gold
        
      export CFLAGS="$MYFLAGS"
      export CXXFLAGS="$MYFLAGS -fpermissive"
      export LDFLAGS="$MYFLAGS"
      mk_add_options MOZ_OBJDIR=<mydir>

I thus use gold for both the GCC and Clang builds. For Clang I additionally need
      ulimit -n 10240
because it runs out of file descriptors during linking. This does not happen with lld, but then elfhack fails.

I have set the power-saving governor to performance for my testing.