Sunday, December 30, 2018

Even more fun with building and benchmarking Firefox with GCC and Clang

Recent switch of official Firefox builds to Clang on all platforms has triggered some questions about the quality of code produced by GCC compared to Clang. I use Firefox as an ocassional testcase for GCC optimization and thus I am somewhat familiar with its build procedure and benchmarking. In my previous post I ran benchmarks of Firefox64 comparing GCC 8 built binary to official builds using Clang which turned out to look quite good for GCC.

My post triggered a lot of useful discussion and thus I can write an update. Thanks to Nathan Froyd I obtained level 1 access to Mozilla repository and Mozilla's try server which allows me to do tests on the official setup. Mozilla's infrastructure features some very cool benchmarking tools that reliably separates useful data from the noise and run more tests than I did last time

I am also grateful to Jakub Jelínek, who spent a lot of time bisecting a nasty misoptimization of spell checker initialization. I also did more joint work with Jakub, Martin Liška and Martin Stránský on enabling LTO and PGO on OpenSUSE and Fedora packages. (Based on LTO enabled git branch Martin Liška maintained for a while.)

      Treeherder

      Treeherder, the build system used by Mozilla developers seems surprisingly flexible and kind of fun to use. I had impression that it is major pain to change its configuration and update to newer toolchain. In fact it is as easy as one can hope for. I was surprised that with minimal rights I can do it myself. With an outline from Nathan I have decided to do the following:
      1. revert builds back to GCC 6,
      2. update build system to GCC 8, 
      3. enable link-time-optimization (LTO) and compare performance to official builds 
      to see if there are code quality issues or bugs I can fix.

      It took three days to get configuration updated and at Christmas eve I got first working LTO build with GCC 8. The build metrics looked great, but indeed was number of performance regressions (unlike in tests I did two weeks ago where basically all benchmarks looked good).

      I had to see a lot of cat dancing animations, sacrifice few cows to Sr. Ignucius and replace 5 object files by precompiled Clang assembler (these contains hand optimized AVX vectorized rendering routines written in the Clang only extensions of GNU vector extensions which I explain later). 4 days later I resolved most of performance problems (and enjoyed the festivities as well).

      In this post I discuss what I learnt. I focus on builds with link-time-optimization (LTO) and profile-guided optimization (PGO) only because it is what official builds use now and because it makes it easier to get apple-to-apple comparisons. I plan to also write about performance with -O2 and -O2 -flto. Here the fact that both compilers differs in interpretation of optimizations levels shows up.

      Try server benchmarks: Talos comparison of GCC 8 and Clang 6 binaries build with link-time-optimization (LTO) and profile guided optimization (PGO)
      This dancing cat is overseeing Firefox's benchmarks.
      If you do some Firefox performance work you are better to be a cat person.
      .

      Treeherder has so far the best benchmarking infrastructure I worked with. Being able to run benchmarks in controlled and reproducible environment is very important and I really like to ability to click on individual benchmarks and see history and noise of Firefox mainline. It is also cute that one can mark individual regressions, link them with bugzilla and there are performance serifs doing so.

      GCC benchmarking has been for years ruled by simple scripts and GNU plot. Martin Liška recently switched to LNT which has a lot of nice environments but I guess LNT can borrow some ideas :)
      This is a screenshot of the dancing cat page comparing binary sizes of GCC 8 LTO+PGO binary to Clang 6 LTO+PGO and other build metrics. Real version may take a while to load and will consider all changes as insignificant because it does not have data from multiple builds. Screenshot actually compares my build to trunk of 28th December 2018 which is not far from my branchpoint.
      The "section sizes" values are reduction of binaries. Here largest is libxul.so that goes down from 111MB to 82, 26% reduction. Installer size reduces by 9.5%. What is not reported but is important is that code segment reduces by 33%. Build time difference are within noise.

      Number of warnings is red, but I guess more warnings are good it comes to compiler comparison.
      Screenshot of dancing cat comparsion of my GCC 8 LTO+PGO build with Clang 6 LTO+PGO to the official build from the point I branched. Dancing cat will give you a lot of extra information: you can look at sub-tests and it shows tests where changes are not considered important or within noise. You can also click to graph and see progress over time.

      The following benchmarks sees important improvements:
      1. tp5o responsiveness (11.45%) tracks reponsiveness of the browser on the tp5o page set. This is 51 most popular webpages from 2011 accroding to Alexa with 3 noisy ones removed. List of the webpages is seen in the subtests of tp5o benchmark

        There is interesting discussion about it in bugzila. This is complex test and I would like to believe that the win is caused by the careful choice of code size wrt performance, but I have no real proof for that.
      2. tps (5.09%) is a tab switching benchmark on the tp5o pageset.
      3. dromaeo_dom (4.74%) is a synthetic benchmark testing manipulations with DOM trees. It consists of 4 subtests and thus the profile is quite flat. See subtests. Run it from official dromaeo page.
      4. sessionrestore (3.57%), as name suggests, measures time to restore a session. Again it looks like quite interesting benchmark training quite large part of the Firefox.
      5. sessionrestore_no_auto_restore (3.05%) seems similar to previous benchmark. 
      6. dromaeo_css (2.99%) is synthetic benchmark testing CSS. It consists of 6 subtests and thus the profile is quite flat. See subtests. Run it from official dromaeo page
      7. tp5o (2.43%) is benchmark loading tp5o webpages. This is really nice overall performance test i think. See subtests which also lists the page. The improvements are quite uniform across the spectra.

      The following 3 benchmarks sees important regressions:
      1. displaylist_mutate (6.5%) is measuring time taking to render page after changing display list. It looks like a good benchmark because its profile is very flat which also made it hard for me to quickly pinpoint the problem. One thing I noticed is that GCC build binary has some processes showing up in profile that does not show in clang's so it may be some kind of configuration difference.
        You can run it yourself.
      2. cpstartup (2.87%) is testing time opening new tab (which starts component process I think) and getting ready to load page. This looks like an interesting benchmark but since it is just 2.96% I did not run it locally. It may be just the fact that train run does not really open/close many tabs and thus a lot of code is considered cold
      3. rasterflood_svg (2.7%) is testing speed of rendering square patterns. It spends significant time in hand optimized vector rendering loops. I analysed the benchmarks and reduced regressions as described below since profile is simle. I did not look at the remaining 2% difference. Run it yourself.

      There are some additional changes considered unimportant but off-noise:


      Improvements: 
      1. tpaint (4.77%)
      2. tp5o_webext (2.59%) 
      3. tp6_facebook (2.53%)
      4. tabpaint (2.43%)
      5. about_preferences_basic (2.23%)
      6. tp6_google (1.66%)
      7. a11r (1.66%)
      Regressions:
      1. tsvgx (1.28%)
      2. tart (0.67%)

      All these tests are described in Talos webpage.

      Out of 40 performance tests, had 20 off-noise changes and except 5 in favour of GCC.
      You can compare it with report on the benefits for switch from GCC 6 PGO to Clang 6 LTO+PGO in this bug 1481721. Note that speedometer is no longer run as part of the Talos benchmarks. I ran it locally and improvement over Clang was 5.3%. Clearly for both compilers LTO does have important effect on the performance.

      It is interesting to see that Firefox official tests mix rather simple micro-benchmarks with more complex tests. This makes it bit more difficult to actually understand the overall performance metrics.

      Overall I see nothing fundamentally inferior in GCC's code generation and link-time optimization capabilities compared to Clang. In fact GCC implemetnation of scalable LTO (originally called WHOPR, see also here) is more aggressive about whole program analysis (that is, it does almost all inter-procedural decision on whole program scope) than Clang's ThinLTO (which by design makes as much as possible on translation unit scope where translation workers may pick some code from other translation units as instructed by this linker). ThinLTO design is inspired by the fact that almost all code quality benefits from LTO in today compilers originate from unreachable code removal and inlining. On other other hand, optimizing at whole program scope makes it possible to better balance code size and performance and implement more transforms. I have spent a lot of time on optimizing compiler to get WHOPR scalable (which, of course, helped to cleanup the middle-end in general). I am happy that so far the build times with GCC looks very competitive and we have more room for experimenting with advanced LTO transformations.

      Performance regressions turned out to be mostly compiler tuning issues that are easy to solve. Important exception is the problem with Clang only extensions which affects rasterflood_gradiend and some tsvg subtest explained in section about Skia. Making Skia vector code GCC compatible should not be terribly hard as described later.

      Update: I gave second chance to displaylist_mutate and found it is actually missed inline. GCC inliner is bit tuned down for Firefox and and can trade some more size for speed. Using --param inline-unit-growth=40 --param ealry-inlining-insns=20 fixes the regression and brings some really good improvements over the spectra. While binary is still 22% smaller than Clang build.  If I increase limits even more, I get even more improvements. I will now celebrate end of year and once next year I will analyse this and writemore.

      I am in process of fine-tuning inlined for GCC 9 so I will take Firefox as additional testcase.

      Getting GCC 8 LTO+PGO builds to work.

      Following Nathan's outline it was actually easy to update configuration to fetch GCC8 and build it instead of GCC6.

      I enabled LTO same way as for Clang build and added:
      export AR="$topsrcdir/gcc/bin/gcc-ar"
      export NM="$topsrcdir/gcc/bin/gcc-nm"
      export RANLIB="$topsrcdir/gcc/bin/gcc-ranlib"
      to build configuration in build/unix/mozconfig.unix. This is needed to get LTO static libraries working correcly. Firefox already has same defines for llvm-ar, llvm-nm and llvm-ranlib. Without this change one gets undefined symbols at link-time.

      I added patch to disable watchdog to get profile data collected correctly. This is problem I noticed previously and is now bug 1516081 which is my first experiment with Firefox patch submission procedure (Phabricator) which I found particularly entertaining by requiring me to sacrifice few games from my phone in order to install some autentificating app.

      Next problem to solve was undefined symbol in sandbox. This is fixed by the following patch taken from Martin Liška's Firefox RPM:
      diff --git a/security/sandbox/linux/moz.build b/security/sandbox/linux/moz.build
      --- a/security/sandbox/linux/moz.build
      +++ b/security/sandbox/linux/moz.build
      @@ -99,9 +99,8 @@ if CONFIG['CC_TYPE'] in ('clang', 'gcc')
       # gcc lto likes to put the top level asm in syscall.cc in a different partition
       # from the function using it which breaks the build.  Work around that by
       # forcing there to be only one partition.
      -for f in CONFIG['OS_CXXFLAGS']:
      -    if f.startswith('-flto') and CONFIG['CC_TYPE'] != 'clang':
      -        LDFLAGS += ['--param lto-partitions=1']
      +if CONFIG['CC_TYPE'] != 'clang':
      +    LDFLAGS += ['--param', 'lto-partitions=1']

       DEFINES['NS_NO_XPCOM'] = True
       DisableStlWrapping()
      The code to add necessary --param lto-partitions=1 already exists, but somehow it is not enabled correctly. I guess it was not updated for new --enable-lto. The problem here is that sandbox contains toplevel asm statement defining symbols. This is not supported for LTO (because there is no way to tell compiler that the symbol exists) and it is recommended to simply disable LTO in such cases. This is now bug 1516803.

      Silencing new warnings

      Official build uses -Werror so compilation fails when warnings are produced.I had to disable some warnings few where GCC complains and Clang is happy:

      I ended up disabling:
      1. -Wodr. This is C++ One Definition Rule violation detector I wrote 5 years ago. It reports real bugs even though some of them may be innocent in practice.

        In short C++ Definition Rule (ODR) says that you should not have more than one definition of same name. This is very hard to keep in program of size of Firefox unless you are very consistent with namespaces. ODR violation can leads to surprises where, for example, virtual method ends up being dispatched to virtual method of completely different class which happens to clash in name mangling. This is particularly dangerous when, as Firefox does, you link multiple versions of same library into one binary.

        These warnings are detected only with LTO. I started to look into fixes and found that GCC 8 is bit too verbose. For example it outputs ODR violation on the class itself and then on every single method the class has (because its this pointer parameter is mismatched). I silenced some of wranings for GCC 9. GCC 9 now finds 23 violations which are reported as bug 1516758. GCC 8 reported 97.
      2. -Wlto-type-mismatch. This is warning about mismatched declarations across compilation units such as when you declare variable int in one unit but unsigned int in another. Those are real bugs and should be fixed. Again I reduced verbosity of this warning for GCC 9 so things are easier to analyse. Reported as bug 1516793.
      3. -Walloc-size-larger-than=. This produces warnings when you try to allocate very large arrays. In case of Firefox the size is pretty gigantic.

        GCC produces 20 warnings on Firefox and they do not seem particularly enlightening here.
        audio_multi_vector_unittest.cc:36:68: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]

        array_interleaved_ = new int16_t[num_channels_ * array_length()];
        What is says is that the function was inlined and array_length ended up being compile time constant of 18446744073709551615=FFFFFFFFFFFFFFFF. It is a question why such context exists
      4. -Wfree-nonheap-objects. As name suggests this warns when you call free on something which is clearly not on heap, such as automatic variable. It reported 4 places where this happens across quite large inline stacks so I did not debug them yet.
      5. -Wstringop-overflow=. This warns when string operation would overrun its target. This detects 8 places which may or may not be possible buffer overflows. I did not analyse them.
      6. -Wformat-overflow. This warns when i.e. sprintf formatting string can lead to output larger than is the destination buffer. It outputs diagnostics like:
        video_capture_linux.cc:151:21: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]

        sprintf(device, "/dev/video%d", (int) _deviceId);
        Where someone apparently forgot about the 0 terminating the string. It trigger 15 times.
         Martin Liška and Jakub Jelínek have patches which I hope will get upstream soon.

      Supplying old libstdc++ to pass ABI compatibility test

      After getting first binary, I ran into problem that GCC 8 built binaries require new libstdc++ which was not accepted by ABI checks and if you disable those tests the benchmarking server will not run the binary.

      Normally one can install multiple versions of GCC and use -I and -L to link against older libstdc++. I did not want to spent too much time on figuring out how to get official build system to install two GCC versions at once, I simply made my own tarball of GCC 8.2 where I replaced libstdc++ by one from GCC 6.4.

      Working around the two GCC bugs

      First bug I wrote about previously and is related to the way GCC remembers optimization options from individual compilation commands and combines them together during link-time optimization. There was an omission in the transformation which merges static constructors and destructors together which made it to pick random flags. By bad luck those flags happened to include -mavx2 and thus binaries crashed on Bulldozer machine I use for remote builds (they would probably still work in Talos). It is easy to work this around by adding -fdisable-ipa-cdtor to LDFLAGS.

      Second bug reproduces with GCC 8 and PGO builds only. Here GCC decides to inline into thunk which in turn triggers an omission in the analysis pass of devritualization. Normally C++ methods take pointer to the corresponding objects as this pointer. Thunks are special because they take pointer adjusted by some offset. It needs a lot of bad luck to trigger this and get wrong code and I am very grateful to Jakub Jelínek who spent his afternoon by isolating a testcase.

      I use:
      diff --git a/extensions/spellcheck/src/moz.build b/extensions/spellcheck/src/moz.build
      --- a/extensions/spellcheck/src/moz.build
      +++ b/extensions/spellcheck/src/moz.build
      @@ -28,3 +28,8 @@ EXPORTS.mozilla += [

       if CONFIG['CC_TYPE'] in ('clang', 'gcc'):
           CXXFLAGS += ['-Wno-error=shadow']
      +
      +# spell checker triggers bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88561
      +# in gcc 7 and 8. It will be fixed in GCC 7.5 and 8.3
      +if CONFIG['CC_TYPE'] in ('gcc'):
      +    CXXFLAGS += ['-fno-devirtualize']
      I am not happy GCC bugs was triggered (and both mine) but they are clearly rare enough that they was never reported before.

      Performance analysis

      My first run of benchmarks has shown some notable regressions (note that this is with only one run of the tests, so smaller changes are considered noise).

      Important regressions were:
      1. rasterflood_gradient (75%). This renders some pink-orange blobs. Run it yourself.
      2. rasterflood_svg 13%. This renders rotating squares (with squary hairs).
      3. tsvg_static  18%.This consists of 4 subtests with regressions: rendering transparent copies of Kyoto and rendering rotated copies of Kyoto.
      4. tsvgx 36%. This again consist of 7 subtests. Massive 493% ression is on test drawing blue-orange blobs. Second test renders nothing in my version of Firefox (official tumbleweed RPM) but it renders weird looking green-blue jigsaw pieces for Chrome. Last test which regresses renders green map.

        I have filled bug 1516791 but it may be caused by fact that I need to add some java-script into the test to get it running out of Talos and I am no javascript expert.

        Update: as explained in the bug, it was my mistake. This test needs additional file smallcats.gif and that strange jigsaw puzzle is actually broken image icon. So indeed my mistake.
      5. displaylist_mutate 8%.. This benchmark I did not analyse easily. It consists of sub-tests that all look alike and have flat profile.
      Fortunately all those benchmarks except for diplaylist_mutate are of micro-benchmark nature and it is easy to analyse them. Overall I think it is good sign about GCC 8 built performance out of the box: if you do not need on rendering performance of shader animations, GCC 8 built binary will probably perform well for you.

      Skia (improving rasterflood_gradient and tsvgx)

      Skia is a graphic rendering library which is shared by Firefox and Chrome. It is responsible for performance in two benchmarks: rasterflood_gradient and the massively regressing tsvgx subtest. I like pink-orrange blobs better and thus looked into rasterflood_gradient. Official build profile was:
      Samples: 155K of event 'cycles:uppp', Event count (approx.): 98151072755
      Overhead  Command          Shared Object                  Symbol
        13.32%  PaintThread      libxul.so                      [.] hsw::lowp::gradient
         7.88%  PaintThread      libxul.so                      [.] S32A_Blend_BlitRow32_SSE2
         5.20%  PaintThread      libxul.so                      [.] hsw::lowp::xy_to_radius
         4.14%  PaintThread      libxul.so                      [.] hsw::lowp::matrix_scale_translate
         3.97%  PaintThread      libxul.so                      [.] hsw::lowp::store_bgra
         3.77%  PaintThread      libxul.so                      [.] hsw::lowp::seed_shader
      while GCC profile was:
      Samples: 151K of event 'cycles:uppp', Event count (approx.): 101825662252
      Overhead  Command          Shared Object               Symbol
        17.64%  PaintThread      libxul.so                   [.] hsw::lowp::gradient
         6.51%  PaintThread      libxul.so                   [.] hsw::lowp::store_bgra
         6.36%  PaintThread      libxul.so                   [.] hsw::lowp::xy_to_radius
         5.40%  PaintThread      libxul.so                   [.] S32A_Blend_BlitRow32_SSE2
         4.73%  PaintThread      libxul.so                   [.] hsw::lowp::matrix_scale_translate
         4.53%  PaintThread      libxul.so                   [.] hsw::lowp::seed_shader
      So only few percent difference on my setup (as opposed 75% on the try server) but clearly related to hsw::lowp::gradient which promptly pointed me to the Skia library. Hsw actually stands for haswell and it is hand optimized vector rendering code which is used for my Skylake CPU.

      #if defined(__clang__) issues

      I looked into sources and noticed two funny hacks. Someone enabled always_inline attribute only for Clang which I fixed by this patch. And there was apparently leftover hack in the Firefox copy of Skia (which was never part of official Skia) disabling AVX optimization on all non-Clang compilers. Fixed by this patch. That patch also fixes another disabled always_inline with comment about compile time regression with GCC. Those did not reproduce to me. I also experimented with setting -mtune=haswell on those files since I suspected that AVX vectorization on generic tuning may be off - I never got an idea to test it since I expected people to use -march=<cpu> in this case.

      I was entertained to noice that Clang actually defines __GNUC__. Shall also GCC define __clang__?

      With this rasterflood_mutate regression reduced from 75% to 39% and the tsvgx subtest from 493% to 75%.

      AVX256 versus SSE2 code

      From profiles I worked out that the rest of difference is again caused by #ifdef machinery. This time is however not so easy to undo. Firefox version of Skia contains two implementations of the internal loops. One is Clang only using vector extensions while other is used for GCC and MSVC using the official Intel's ?mmintrin.h API. The second version was never ported to AVX and the avx/hsw loops still used 128bit SSE vector and API just compiled with new ISA enabled.

      I have downloaded upstream Skia sources and realized that few weeks ago the MSVC and GCC path was dropped completely and Skia now defaults to scalar implementation on those compilers.

      Its webpage mentions:

      A note on software backend performance

      A number of routines in Skia’s software backend have been written to run fastest when compiled by Clang. If you depend on software rasterization, image decoding, or color space conversion and compile Skia with GCC, MSVC or another compiler, you will see dramatically worse performance than if you use Clang.
      This choice was only a matter of prioritization; there is nothing fundamentally wrong with non-Clang compilers. So if this is a serious issue for you, please let us know on the mailing list.
      Clang's vector extensions are in fact GNU vector extensions and thus I concluded that it should not be that hard to port Skia to GCC again and to give it a try. It is about 3000 lines of code. I got it to compile with GCC in an hour or two, but it turned out that more work would be necessary. It does not make sense to spend too much time on it if it can not be upstreamed so I plan to discuss it with the Skia developers.

      The problem is in:
      template <typename T> using V = T __attribute__((ext_vector_type(size)));
      
      This is not understood by GCC. ext_vector_type is Clang only extension of GNU vector extensions for which I found brief mention of in the Clang Langugage Extensions manual. Here they are called "OpenCL vectors".

      I also noticed that replacing ext_vector_type by GNU equivalent vector_size makes GCC unhappy about using attributes on right hand side for which I filled PR88600 and Alexander advised me to use instead:
      template <typename T> using V [[gnu::vector_size (sizeof(T)*8)]] = T;
      
      Which gives equivalent vector type, but unfortunately not equivalent semantics. Clang, depending on the attribute (ext_vector_type or vector_size), accepts different kind of code. In particular following constructions are rejected for gnu::vector_size by both GCC and Clang but accepted for ext_vector_type:
      typedef __attribute__ ((ext_vector_type (4))) int int_vector;
      typedef __attribute__ ((vector_size (16))) float float_vector;
      
      int_vector a;
      float_vector b;
      
      int
      test()
      {
        a=1;
        a=b;
      }
      
      Here I declare one openCL variable a consisting of 4 integers and one GNU vector extension variable b consisting of 4 floats (4*4=16 is the size of vector type in bytes). The code has semantics of writing integer vector [1,1,1,1] to a and then moving it into float vector without any conversion (thus getting funny floating point values).

      Replacing openCL vector by GNU vector makes both compilers reject both statements.  But one can fix it as:
      typedef __attribute__ ((vector_size (16))) int int_vector;
      typedef __attribute__ ((vector_size (16))) float float_vector;
      
      int_vector a;
      float_vector b;
      
      int
      test()
      {
        a=(int_vector){} + 1;
        a=(int_vector)b;
      }
      
      Construct (int_vector){} + 1 was recommended to me by Jakub Jelínek and the first part builds vector zero and then adds 1 which is an vector-scalar addition that is accepted by GNU vector extensions.

      The explicit casts are required intentionally because semantics contradicts the usual C meaning (removing vector attribute, integer to float conversion would happen) and that is why users are required to write it by hand. It is funny that Clang actually also requires the cast when both vectors are OpenCL or both are GNU. It only accepts a=b if one vector is GNU and other OpenCL. Skia uses these casts often on places it calls ?mmintrin.h functions.

      I thus ended up with longish patch adding all the conversions. I had to also work around non-existence of __builtin_convertvector which would be reasonable builtin to have (and there is PR85052 for that). I ended up with code that compiles and renders some stuff correctly but incorrectly other.

      I therefore decided to discuss this with upstream maintainers first and made only truly non-upstreamable hack. I replaced the four source files SkOpts_avx.cppSkOpts_hsw.cpp, SkOpts_sse41.cppSkOpts_sse42.cppSkOpts_ssse3.cpp. This, of course, solved the two regressions but there is work ahead.

      I also filled PR88602 to track the lack of ext_vector_size in GCC, but I am not quite sure whether it is desirable to have. I hope there is some kind of specification and/or design rationale somewhere.

      2d rendering performance (improving rasterflood_svg)

      The performance seems is caused by fact that GCC honors always_inline across translation units. I ended up disabling always_inline in mfbt/Attributes.h which solved most the regression. I will identify the wrong use and submit patch later.

      The rest of performance difference turned out to be CPU tuning compiling:
      #include <emmintrin.h>
      __m128i test (short a,short b,short c,short d,short e,short f,short g)
      {
        return _mm_set_epi16(a,b,c,d,e,f,g,255);
      }
      
      
      GCC 8 with generic compiles this into rather long sequence of integer operations, store and vector loads, while Clang uses integer to SSE stores. This is because GCC 8 still optimizes for Bulldozer in its generic tuning model and integer to vector stores are expensive there. I did pretty detailed re-tuning of generic setting for GCC 8 and I remember that I decided to keep this flag as it was not hurting much new CPUs and was important for Bulldozer, but I have missed effect to hand vectorized code. I will change the tuning flag for GCC 9.

      I added -mtune-ctrl=inter_unit_moves_to_vec to compilation flags to change the bits. Clearly the benchmarking servers are not using Bulldozer where indeed GCC version runs about 6% faster.

      Tweaking train run

      While looking into the performance problems of svg rendering I noticed that the code is optimized for size because it was never executed during the train run. Because Clang does not optimize for size cold regions, this problem is not very visible for Clang benchmarks, but I have displaylist_mutate.html, rasterflood_svg.html and hixie-007.html into the train run. These are same as used by Talos, just modified to run longer and out of talos scripting machinery. I thus did not fill bug report for this yet.

      I have tested and it seems to have only in-noise effect on Clang, but same issues was hit by hears ago with MSVC  as reported in bug 977658.

      It seems there are some more instances where train run can be improved and probably those tests could be combined into one to not increase build times too much. A candidates are those two regressions in perf micro benchmarks which I have verified to indeed execute cold code.

      I have also noticed that training of hardware specific loops is done only on the architecture which is used by build server. This is now bug 1516809.

      -fdata-sections -ffunction-sections

      These two flags puts every function and variable into separate linker section. This lets linker to manipulate with them independently which is used to remove unreachable code and also for identical code folding in Gold.They however also have overhead by adding extra alignment padding and preventing assembler from using short form of branch instructions.

      For GCC without LTO the linker optimization saves about 5MB of binary size for GCC build and 8MB for Clang. With LTO it however does not make much sense because compiler can do these transforms itself. While GCC may not merge all functions with identical assembly linker can detect and vice versa, it is a win to disable those flags, by about 1MB and better link-times. This is now bug 1516842.

      -fno-semantic-interposition

      Clang ignores the fact that in ELF libraries symbols can be interposed by different implementation. When comparing perofmrance of PIC code between compiler, it is always good to use -fno-semantic-interposition in GCC to get apple-to-apple comparison. Effect on Firefox is not great (about 100Kb of binarry size differnece) because it declares many symbols as hidden, but it prevents more perofrmance surprises.

      Implementation monoculture

      I consider Firefox an amazing test-case for link-time optimization development because::
      1. it has huge and dynamically changing code-base where large part is in modern C++,
      2. it encompasses different projects with divergent coding styles,
      3. it has a lot of low level code and optimization hacks which use random extensions,
      4. it has decent benchmarks where some of them has been hand optimized for a while,
      5. it can be benchmarked from command line with reasonable confidence. 
      This makes it possible to do tests that can not be done by running usual SPEC benchmarks which are order of magnitude smaller, sort of standard compliant, and written many years ago. Real world testing is essential both to make individual features (such as LTO or PGO) production ready and to fine tune configuration of the toolchain for practical use.

      For this reason I am bit sad about the switch to single compiler. It is not clear to me whether Firefox will stay portable next year. Nathan wrote interesting blog "when an implementation monoculture might be the right thing". I understand his points, but on the other hand it seems that no super-human efforts were needed to get decent performance out of GCC builds. From my personal experience maitaining a portable codebase may not be always fun but pays back in long term.

      Note that Chromium had switched to Clang only builds compiler some time ago it still builds with GCC (GCC is used to build Chromium for Tumbleweed RPM package, for example), so there is some hope that community will maintain compatibility, but I would not bet my shoes on it.

      It would make sense to me to add both GCC and Clang builds into Mozilla nightly testing. This should:
      1. Uncover portability issues and other bugs due to different warnings and C++ implementations in both compilers.
      2. Make the code-base easier to maintain and port in long term
      3. Keep GCC and LLVM toolchains developers interested in optimizing Firefox codebase by providing more real-world benchmarks than SPEC, Polyhedron and Phoronix benchmarks (which are way too small for LTO and often departed from reality).
      4. Benefit from fact that toolchain bugs will be identified and fixed before new releases
      At least in my short experiment I was easily able to identify problems in all three project (and fix some of them).

      Communication with maintainers of Firefox packages in individual Linux distros

      It seems to me that it would be good to communicate better performance settings with authors of packages used by individual distros. I think most of us do not download Firefox by hand and simply use one provided by the particular Linux distribution. Seeing communication I had with Martins concerning SUSE and RedHat's packages, it seems very hard for packagers to reproduce Firefox PGO build setup which is critical to get good binary. One of things that would improve situation is to make build system fail if train run fails, which I filled as bug 1516872.

      It would be nice to set up some page which lists things that are important for a quality build and which provides links to benchmarks checking that the distribution provided Firefox is comparable to official one.

      It may sound funny, but even though I look at Firefox performance since 2010 until this December it did not cross my mind that I can actually download official package and benchmark against it. It is not usual that reproducing quality build would need such effort, but it is a consequence of complexity of the code-base.

      Future plans

      I already did some useful work on GCC 9 during last two weeks:
      1. Fix for the devirtualization bug 
      2. Reduced compile time memory use when LTO linking large programs with profile feedback 
      3. Reduced memory use when linking large LTO programs with profile instrumentation, second part
      4. Re-enabled speculative devirtualization that was accidentally disabled for LTO
      5. Fixed some cases where GCC has thrown away profile feedback for no good reason
      6. Got function summaries more precise after meriging profiles of C++ Comdat functions
      7. Fixed buffer overflow on mismatched profiles
      8. ... and more I am testing.
      I plan to continue working on this by updating my setup to GCC 9 and making sure it will do well on Firefox once released. I will also look deeper into performance with -O2 and the remaining displaylist_mutate regression and try to find time to write another update.

      8 comments:

      1. Hi! I'm the upstream Skia maintainer for all that vector code. You've hit on all the important points here that led us to make that vectorized software rendering code Clang-only. Even though they look pretty similar, Clang's vector extensions are just quite a bit more ergonomic to use than GCC's, for exactly all the reasons you've filed GCC bugs. :)

        That said, if there's interest I'd be happy to work with you to make the code support both Clang and GCC vector extensions. We've done that in another similar piece of code that handles color management,

        https://skia.googlesource.com/skcms/+/master/src/Transform_inl.h

        I think every awkward part of GCC's extensions can be worked around except for not being able to splat a scalar to a vector directly. It's so weird to read "x = F()+k" when you mean "x = k".

        ReplyDelete
        Replies
        1. Great, I went through the prototype and I think I can prepare patches after some discussion on ML early next year.

          We could make GNU extensions more convenient in longer term, but that would be only for GCC10 that is more than year away.

          Delete
        2. Also for extending vector extensions in GCC, it would be great to have some design documentation for what Clang does. As I have found out experimentally (and show in testcase) some behaviour seems odd.

          Delete
      2. Hi Jan, thanks for the thoughtful analysis.

        If you are curious, ext_vectors are documented here:
        https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors

        They have specifically been designed to make vector code more ergonomic. In addition to the extensions that provide, they support very nice swizzling syntax, things like vec.xxxx to splat out element zero.

        ReplyDelete
        Replies
        1. Thanks, Chris! I found this document, but it is not completely clear to me what kind of operations are allowed on ext_vector_type that are not with vector_size and how these are intended to mix together.

          It would be useful to discuss that and get things more compatible.

          And happy new year!

          Delete
      3. Thanks for your work on enhancing GCC!

        ReplyDelete
      4. Great work and very interesting post, thanks!

        ReplyDelete
      5. Note that the GCC support for Skia in Firefox is essentially a week of part time work while I was otherwise majorly busy with other tasks.

        I had to come up with a solution that would both support GCC and MSVC during our build transition, and so I opted against using GCC's native vector extensions as they are a bit difficult to get a compatible syntax with Clang's while minimizing the invasive changes in Skia to ease our long-term maintenance burden.

        So, for better or worse, the hack I came up with was to misappropriate the existing SkNx vector abstraction (to save me the trouble of having to reimplement one myself) that would have a more compatible syntax to Clang's vector extension, and also work with both GCC and MSVC as a bonus. The downside is that SkNx is not optimized for AVX2, and I did not want to go through the trouble of implement AVX2 support for SkNx for this. Thus the level of optimization really only takes advantage of SSE2 and which is also why the AVX path is disabled with GCC given there are very few cases where it would help much without AVX2.

        However, it seems to me that the essence of that hack, to simply wrap the difference away between GCC and Clang vector extensions by shoving both in wrapper classes would be a more happy middle, if those were the only two compilers we need to support anymore, with maybe some token support for MSVC via the native intrinsics (SkNx or otherwise). That would seem more reasonable to me than trying to solve this at the compiler level itself.

        ReplyDelete