Friday, April 11, 2014

Devirtualization in C++, part 5 (feedback driven devirtualization)

After a break, I finally have time to return to the devirtualization series. Last time I wrote about speculative devirtualization performed by static analysis of the type inheritance graph. This time I plan to cover simple (but very effective) feedback directed devirtualization. While GCC has implemented basic feedback directed devirtualization since 2003, GCC 4.9 newly makes it work across module boundaries. I also finally show some statistics on how effective the described optimizations are.



What is feedback directed optimization? (FDO)

Feedback directed optimization (or equivalently profile directed optimization, PDO) allows the compiler to collect information about the runtime behavior of the program and use it during static compilation. This usually means compiling the program once with instrumentation (-fprofile-generate in the case of GCC), training it on "typical" program behavior, and then recompiling with profile feedback (-fprofile-use).

Feedback directed optimization is a well established technique, popular in performance critical applications (and benchmarking) since the 1980s. FDO is instrumental in getting better inlining and code/data layout, but it also helps many other optimizations, including loop optimization, vectorization, and register allocation. Many optimizations need to make tradeoffs between optimizing for size or speed, or make one code path slower while optimizing other code paths. All of those can consult the profile feedback to get the decision right.

FDO implementations

The GCC implementation grew out of the gcov tool, written in 1990 by James E. Wilson, whose original primary purpose was to annotate source code with execution counts on a line-by-line basis.

Eventually the information collected by gcov was used to help the compiler produce branch prediction hints for IA-64. Because the implementation was really ad-hoc, in 2002-2003 Zdeněk Dvořák, Pavel Nejedlý, Josef Zlomek and myself worked on the school project Infrastructure for Profile Driven Optimizations in GCC Compiler, whose purpose was to get profile feedback maintained and used throughout the whole compilation pipeline. Our main goal was actually to reorganize the GCC backend around a common control flow graph data structure, but the profile feedback was a nice marketing tool to demonstrate the importance of this work.

FDO brings significant performance gains. We reported a 14% improvement on SPEC2000 back then, where about half of the gains were also achieved by static prediction techniques (where the compiler simply guesses the profile). Never in my life have I managed to improve the SPEC score so much. The gains depend a lot on the particular application, but generally they are higher than those obtained by switching from -O2 to -O3, and without the code size costs. (See also this GCC summit paper.)

The drawback of FDO, however, is the more complicated build process. Most developers do not want to care about training the program and rebuilding with feedback. While I was assured in the 90s that FDO is extensively used by database vendors, even a decade later I found FDO relatively little used in the free software world, where reproducible and simple builds are essential.

In case you are interested in experimenting with FDO, building many non-interactive packages with FDO basically means doing the following:
CFLAGS="... -fprofile-generate" make
make check <or any better training>
make clean
CFLAGS="...-fprofile-use" make
For example, gzip's performance improves by about 6% and bzip2's by about 10%, and training these is trivial.

The popularity of FDO is however improving. These days, for example, Google, Firefox and various language interpreters are heavy users of feedback directed optimizations. This has motivated quite lively development in the area recently. Google's LIPO takes an interesting approach of using profile feedback to avoid the expense of full link time optimization, while AutoFDO allows the use of low overhead profiles instead of feedback from special purpose instrumentation. (These two features are still off-tree, but I hope they will be merged for 4.10.) Martin Liška implemented new feedback directed function reordering based on an idea of Firefox developer Taras Glek.

In LLVM, feedback directed optimization has recently become a focus and similar features are being implemented there, too. Of course, Open64 implements decent feedback directed optimization as well.

How feedback is collected in GCC

To enable feedback directed optimizations, the user first needs to build with -fprofile-generate. In this mode every compilation unit gains a set of counters measuring various events of interest. Additionally, every object file gains a static constructor which calls __gcov_init and registers the counters with the gcov runtime.
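
Conceptually, the per-object registration looks like the following simplified sketch (my own illustration; the real gcov_info structure and the generated constructor are more involved):

/* Sketch of what -fprofile-generate conceptually adds to each object
   file: a static constructor that hands the object's counter descriptor
   to the gcov runtime. */
struct gcov_info;                          /* opaque counter descriptor */
extern void __gcov_init (struct gcov_info *);

extern struct gcov_info this_object_info;  /* emitted by the compiler (hypothetical name) */

static void __attribute__ ((constructor))
gcov_register_this_object (void)
{
  __gcov_init (&this_object_info);
}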

The gcov runtime is a small library called libgcov implementing several helper functions and __gcov_dump, which is responsible for streaming the collected data and merging it with previous runs. This library is transparently added to the linker command line when the program is linked through the gcc wrapper with the -fprofile-generate switch.

The profile streaming happens from an atexit hook, and a few other dirty tricks are done. For example, calls to fork, vfork or thread creation are wrapped and redirected to libgcov to avoid double accounting when both child and parent exit. So when profiling just part of your program, you should make sure the functions calling these are compiled with -fprofile-generate.

The on-disk profile information is saved per object file, in files named <objectfilename>.gcda. These files can be read and merged by the runtime, used by GCC, and read by external tools - gcov, gcov-dump and gcov-tool (the last of which will be merged in 4.10).

The basic information collected is the so-called edge profile, where the execution count of every basic block and every edge in the control flow graph is measured. This is done using a nice (and classical) instrumentation algorithm that avoids redundancies by computing a minimal spanning tree of the graph: only edges outside the tree need real counters, and the remaining counts are derived by flow conservation. (See also the interesting alternative of path profiling, which is however not implemented in GCC, because the code quality benefits are not clear.)
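
To illustrate the spanning tree trick (my own sketch, not GCC's actual instrumentation code), consider a simple if/else. Once the entry edge and the then edge have counters, every other count follows from the rule that the flow into each basic block equals the flow out of it:

/* Sketch: manual edge instrumentation of a diamond shaped CFG.
   Only two counters are needed; everything else is derived. */
long long entry_count;  /* counter on the function entry edge */
long long then_count;   /* counter on the then edge */

void
example (int cond)
{
  entry_count++;
  if (cond)
    {
      then_count++;
      /* ... then block ... */
    }
  /* Derived without instrumentation:
       else_count = entry_count - then_count;
       exit_count = entry_count;  */
}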

The profile feedback infrastructure also allows so-called value profiling, implemented by Zdeněk Dvořák in 2002-2003. Value profiling allows GCC to determine the values of various variables in the program that may be used for further optimization. One of the first uses was to optimize divisions. For a/b it may happen that b is often a nice constant, and then one can optimize the sequence into something like:
if (b == 2)
  a >>= 1;
else
  a = a / b;
This simple trick often helps, for example with silly hash function implementations computing a%b where b is usually the initial hash table size, say 257.
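
The same speculation applies to the modulo in the hash example. A sketch of the transformed code (my illustration of the idea, not actual GCC output):

/* If value profiling shows that b is 257 most of the time, the generic
   modulo runs only on the rare path; the constant modulo can further be
   expanded into multiply and shift code. */
if (b == 257)
  hash = a % 257;
else
  hash = a % b;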

Feedback directed devirtualization in GCC

Feedback directed devirtualization builds on the value profiling infrastructure. At every indirect call statement (there is no difference between a virtual call and a classical indirect call in this case) we try to determine the common destination of the indirect call and possibly speculatively inline the called function.

Getting the common value of a variable cheaply

The value profiling infrastructure implements various types of counters; one of them is the one-value profiler. You may see it in action if you compile the following with -O2 -fprofile-generate:
int
m (int a, int b)
{
  return a / b;
}
You will get:
m (int a, int b)
{
  __gcov0.m[0]++; 
  __gcov_one_value_profiler (&__gcov3.m[0], b);
  return a / b;
}
Here the __gcov0.m increment is the edge profile counter, which is of no interest to us. The usual value of b is determined via the __gcov_one_value_profiler call, which is implemented in libgcov as follows:
void
__gcov_one_value_profiler (gcov_type *counters, gcov_type value)
{
  if (value == counters[0])
    counters[1]++;
  else if (counters[1] == 0)
    {
      counters[1] = 1;
      counters[0] = value;
    }
  else
    counters[1]--;
  counters[2]++;
}
Three counters are used. counters[0] holds the usual value of b. counters[1] is increased when the current value of b matches the usual value and is decreased otherwise. counters[2] counts all executions of the given statement. (This is the classical majority-vote scheme: a value that occurs more than half of the time is guaranteed to end up in counters[0].)

If b has a constant value c for more than 50% of the time, you will get counters[0]==c, and counters[1] gives a conservative estimate of how common the value actually is. GCC then simply checks counters[1] > counters[2]/2 and, if the check passes, it assumes that b==c most of the time.
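
To see how conservative the estimate is, here is a small standalone driver (my own illustration, not part of libgcov) feeding a sample value stream to a copy of the profiler:

#include <stdio.h>

typedef long long gcov_type;

/* Same logic as __gcov_one_value_profiler above. */
static void
one_value_profiler (gcov_type *counters, gcov_type value)
{
  if (value == counters[0])
    counters[1]++;
  else if (counters[1] == 0)
    {
      counters[1] = 1;
      counters[0] = value;
    }
  else
    counters[1]--;
  counters[2]++;
}

int
main (void)
{
  gcov_type counters[3] = { 0, 0, 0 };
  gcov_type stream[] = { 2, 2, 7, 2, 2, 3, 2 };
  for (int i = 0; i < 7; i++)
    one_value_profiler (counters, stream[i]);
  /* Prints value=2 count=3 all=7: the value 2 occurred 5 times out of 7,
     but the vote counter only reports 3, so this stream would not pass
     the counters[1] > counters[2]/2 test. */
  printf ("value=%lld count=%lld all=%lld\n",
          counters[0], counters[1], counters[2]);
  return 0;
}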

From common value to function identifiers

Indirect call profiling was introduced by Zdeněk Dvořák in 2003, and after more than a decade I finally reworked it for GCC 4.9 to be inter-module with LTO. Here I describe the new implementation.

Getting the common value of the function pointer being called is useless information without a way to translate it back to a function identifier known to the compiler. Doing so directly would require saving the symbol table of the instrumented binary and dealing with the fact that one object file may link into many binaries, and each binary may be loaded at a different address each time. Instead, GCC introduces the concept of profile-ids, which are stable identifiers that can be easily mapped back to the symbol name within the compiler. These are calculated as a CRC of the function's symbol name for external functions, or as a combination of the symbol name, source line and first global ID in the program for internal functions. This makes them unique in the vast majority of cases.
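
A minimal sketch of the external-function case (my own illustration of the idea; GCC's actual implementation lives elsewhere and may differ in CRC variant and details):

#include <stdint.h>

/* Bitwise CRC-32; good enough to demonstrate deriving a stable id
   from the (mangled) symbol name. */
static uint32_t
crc32_string (const char *s)
{
  uint32_t crc = ~0u;
  for (; *s; s++)
    {
      crc ^= (uint8_t) *s;
      for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xedb88320u & -(crc & 1u));
    }
  return ~crc;
}

/* For an external function the id depends only on the symbol name, so
   every object file computes the same id independently. */
uint32_t
profile_id_for_external (const char *symbol_name)
{
  return crc32_string (symbol_name);
}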

Let's consider a simple testcase with an indirect call:
#include "stdio.h"
int
b(void)
{
  return 1;..
}

int a(int (*b)(void))
{
  return b();
}

m()
{
  int i,j=0;
  for (i=0; i<1000000;i++)
   j+=a(b);
  printf ("%i\n",j);
}
Most compilers, with optimization enabled, will simplify m into:
m()
{
  printf ("%i\n", 1000000);
}
But this is not the case when you compile it into a shared library. This is because ELF interposition rules actually allow you to replace the function a by something completely different, for example via the LD_PRELOAD mechanism.
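
For example, assuming the testcase above is built into a shared library, a one-line shared object injected via LD_PRELOAD (file and program names here are hypothetical) replaces b for the whole process:

/* interpose.c - a drop-in replacement for b.  Build and inject with:
     gcc -O2 -fpic -shared interpose.c -o interpose.so
     LD_PRELOAD=./interpose.so ./program
   The dynamic linker now resolves every call to b to this definition,
   which is why GCC may not hard-code the call target in -fpic code. */
int
b (void)
{
  return 2;
}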

GCC therefore correctly disables inlining of all functions not declared inline that may be interposed when compiling with -O2 -fpic. To my amusement, I just noticed that clang still optimizes the code; I filed PR19405 for that.

With -O2 -fpic -fprofile-generate -fdump-tree-optimized one can see the instrumentation in the following form:
a (int (*<T599>) (void) b)
{
  int _4;

  __gcov_indirect_call_profiler_v2 (872192950, a);
  __gcov_indirect_call_callee = 0B;
  __gcov0.a[0]++;
  __gcov_time_profiler (&__gcov8.a[0]);
  __gcov_indirect_call_counters = &__gcov5.a[0];
  __gcov_indirect_call_callee = b_2(D);
  _4 = b();
  __gcov0.a[1]++;
  return _4;
}

b ()
{
  __gcov_indirect_call_profiler_v2 (1092076645, b);
  __gcov_indirect_call_callee = 0B;
  __gcov0.b[0]++;
  __gcov_time_profiler (&__gcov8.b[0]);
  return 1;
}

m ()
{
  int i, j;

  __gcov_indirect_call_profiler_v2 (2089875789, m);
  __gcov_indirect_call_callee = 0B;
  __gcov_time_profiler (&__gcov8.m[0]);

  j = 0;

L3:
  __gcov0.m[1]++;
  j += a (b);
  i++;
  __gcov0.m[0]++;
  if (i < 1000000)
    goto L3;
  __gcov0.m[2]++;
  printf ("%i\n", i_30);
  __gcov0.m[3]++;
  c ();
  __gcov0.m[4]++;
  return;
}

There are quite a few __gcov0 increments implementing the edge profile - in fact, more of them than one would expect. Most of the time, __gcov0.a[0] will be the same as __gcov0.a[1]. This is because GCC has to assume that each function call whose target it does not see (and by interposition this holds even for the calls to a and b) may terminate the program, invoke longjmp, throw an exception, or be otherwise evil. Therefore a new counter is necessary after every function call that is not const/pure and nothrow.


What is important for us, however, is the call to __gcov_indirect_call_profiler_v2. This call is added at the beginning of every function that may be called indirectly. The profiler takes two parameters: the first is the profile-id, the second is the address of the current function.

Additionally, at the indirect call site in a(), two global variables are set:
  __gcov_indirect_call_counters = &__gcov5.a[0];
  __gcov_indirect_call_callee = b_2(D);
These are thread-local global variables used by the indirect call profiler (version 2, introduced in GCC 4.9). The profiler is implemented in libgcov as follows:

void
__gcov_indirect_call_profiler_v2 (gcov_type value, void* cur_func)
{
   if (cur_func == __gcov_indirect_call_callee)
    __gcov_one_value_profiler (__gcov_indirect_call_counters, value);
}

The indirect call sequence then works as follows:
  1. The caller sets __gcov_indirect_call_counters and __gcov_indirect_call_callee.
  2. The called function calls __gcov_indirect_call_profiler_v2.
  3. __gcov_indirect_call_profiler_v2 verifies that __gcov_indirect_call_callee corresponds to the real callee, and if so, it records the profile-id into the counter selected by the caller (a simulation sketch follows below).

    The verification is necessary to avoid confusion when the call lands in some function in the middle that is not instrumented for profiling.

    While writing this blog, I noticed that in this case we ought to also increase counters[2]. This would prevent us from devirtualizing function calls that mostly land in an uninstrumented function. I will fix that in the next release.
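
To make the handshake concrete, here is a self-contained simulation of the three steps (my own sketch: real libgcov uses thread-local variables registered with the gcov runtime, and the counters live in the per-object __gcov* arrays):

#include <stdio.h>

typedef long long gcov_type;

/* Stand-ins for libgcov's thread-local globals. */
static void *indirect_call_callee;
static gcov_type *indirect_call_counters;

static void
one_value_profiler (gcov_type *counters, gcov_type value)
{
  if (value == counters[0])
    counters[1]++;
  else if (counters[1] == 0)
    {
      counters[1] = 1;
      counters[0] = value;
    }
  else
    counters[1]--;
  counters[2]++;
}

/* Steps 2+3: called at function entry; records the callee's profile-id
   only if the caller really announced this function. */
static void
indirect_call_profiler (gcov_type profile_id, void *cur_func)
{
  if (cur_func == indirect_call_callee)
    one_value_profiler (indirect_call_counters, profile_id);
}

static gcov_type call_site_counters[3];

static int
b (void)
{
  indirect_call_profiler (1092076645, (void *) b);  /* id as in the dump */
  return 1;
}

static int
a (int (*fn) (void))
{
  /* Step 1: the caller publishes the counter slot and the expected callee. */
  indirect_call_counters = call_site_counters;
  indirect_call_callee = (void *) fn;
  return fn ();
}

int
main (void)
{
  for (int i = 0; i < 1000000; i++)
    a (b);
  /* Prints id=1092076645 match=1000000 all=1000000, matching the
     ipa-profile dump shown below. */
  printf ("id=%lld match=%lld all=%lld\n", call_site_counters[0],
          call_site_counters[1], call_site_counters[2]);
  return 0;
}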

Getting stable profile-id

One of the problems with profile-ids is that some symbol names may actually be random. In the C++ ABI there exist public symbols that need to be unique across the whole program, for example so they can be collected by the linker into the static initialization lists. They are also used for anonymous namespaces and some other corner cases. Sadly, there is no safe way to invent a global symbol that cannot clash with anything else. GCC uses a combination of the first global symbol in the unit, if one is present, and a random number otherwise, to get something like _GLOBAL__I_65535_0_nsStaticXULComponents.o (ask c++filt what that means).
Currently I am ignoring this problem in the profile-id calculation, assuming that these functions are not called indirectly.

Profiling multiple targets

One optimization GCC can't do yet is to speculate one call into multiple targets. According to Xinliang David Li (who has a lot of experience with profile feedback at Google), in some programs this makes an important difference, as one polymorphic call typically has very few possible targets. This will need an extension of the one-value profiler to support multiple values, something Google has promised to submit for the next GCC release.
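
Here is one way such a multi-value counter could look (my speculation about the design, not Google's actual patch): generalize the majority-vote scheme to track two candidate values at once:

typedef long long gcov_type;

/* Two-candidate variant of the one-value profiler.  counters[0]/[1] and
   counters[2]/[3] are (value, vote) pairs; counters[4] counts all
   executions.  Any value occurring in more than a third of the
   executions is guaranteed to survive in one of the slots. */
void
two_value_profiler (gcov_type *counters, gcov_type value)
{
  if (value == counters[0])
    counters[1]++;
  else if (value == counters[2])
    counters[3]++;
  else if (counters[1] == 0)
    {
      counters[0] = value;
      counters[1] = 1;
    }
  else if (counters[3] == 0)
    {
      counters[2] = value;
      counters[3] = 1;
    }
  else
    {
      counters[1]--;
      counters[3]--;
    }
  counters[4]++;
}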

Reading and using profile feedback

To actually use the profile, GCC goes through several steps.
  1. At compile time, the profile stored in <objectfile>.gcda is read in. One can check the contents in the -fdump-ipa-profile dump:
    Indirect call value:1092076645 match:1000000 all:1000000.
    Indirect call -> direct call b_2(D)=> b transformation on insn postponned to ipa-profile: _4 = b_2(D) ();
    This means that the indirect call was annotated with a value histogram to be used later.

    An alternative way to explore gcda files is to use the gcov-dump tool.
  2. Either at compile time or at link time (with LTO), the ipa-profile pass is responsible for introducing the speculative call. This can be checked with -fdump-ipa-profile_estimate:
    Histogram:
      1000001: time:2 (5.56) size:2 (11.76)
      1000000: time:34 (100.00) size:11 (76.47)
      1: time:14 (100.00) size:4 (100.00)
    Overall time: 36000016
    GCOV min count: 1000000 Time:100.00% Size:76.47%
    Determined min count: 1000000 Time:100.00% Size:76.47%
    Indirect call -> direct call from other module a/12 => b/11, prob 1.00
    Introduced new external node (b.localalias.0/15).
    Indirect call -> speculative call a/12 => b.localalias.0/15

    The first part of the dump is about determining where the program spends 99% of its execution time and deciding that every statement with fewer than 1000000 executions is not interesting (only the internal loop matters).

    The second part is about turning the indirect call into a speculative call. Here a new symbol b.localalias is introduced. This is a static symbol that can be used to refer to the body of b in a way that cannot be altered by ELF dynamic linking. If the user decides to interpose b later, the speculative sequence simply will not match.
  3. The inliner inlines the call (as seen in -fdump-ipa-inline):
    Considering b/11 with 3 size
     to be inlined into a/12 in t.c:10
     Estimated badness is -1073741828, frequency 1.00.
     Called 1000000x
     Inlined into a which now has time 2 and size 7,net change of -4.
    Sadly the inliner won't inline the call from m to a, because it knows that a may be interposed by a different implementation. In 4.10 I plan to introduce speculative calls in this case, too.
  4. After all inline decisions are made, the inliner removes speculative calls that seem unprofitable.
  5. Local optimizations are performed. These hopefully benefit from the inlined code sequences.
The resulting code looks as follows:
int
b(void)
{
  return 1;
}

static int b.localalias __attribute__ ((alias("b")));

int a(int (*b)(void))
{

  if (b == b.localalias)
    return 1;
  else
    return b();
}

m()
{
  int i,j=0;
  for (i=0; i<1000000;i++)
   j+=a(b);
  printf ("%i\n",j);
}
This is clearly not an optimal result. A lot better would be:
m()
{
  int i,j=0;

  if (a == a.localalias && b == b.localalias)
    j = 1000000;
  else
    for (i=0; i<1000000;i++)
      j+=a(b);
  printf ("%i\n",j);
}

This is something that I will implement for 4.10 :).

I designed the testcase intentionally to play with interposition and show the limits of current optimization of -fPIC code. In less crappy cases, you will get pretty nice code. Still, avoiding the indirect call and inlining the constant alone gives about a 30% runtime improvement. The main improvements, however, come when later optimizations can do a better job on the speculated sequence. Consider a related testcase:
#include <stdio.h>

int
b(void)
{
  return 1;
}

__attribute__ ((noinline))
void a(int (*fn)(void))
{
  int i, j = 0;
  for (i = 0; i < 1000000; i++)
    j += fn();
  printf ("%i\n", j);
}

int
main()
{
  a(b);
  return 0;
}
GCC (without -fPIC) will go all the way, optimizing a as:
__attribute__ ((noinline))
void a(int (*fn)(void))
{
  int i, j = 0;
  if (fn == b)
    j = 1000000;
  else
    for (i = 0; i < 1000000; i++)
      j += fn();
  printf ("%i\n", j);
}
Here the important optimization is enabled by loop unswitching, which splits the loop into two variants, one with the known value of fn and one with an unknown value; the first variant is subsequently optimized into a constant store. Needless to say, a() now executes much faster.

That's it; all the magic behind feedback directed devirtualization.

Comparing feedback directed speculation to static methods on Firefox

Firefox these days builds with link time optimization (it still needs a small patchset) as well as with profile feedback. This makes it easy to get some data about the effectiveness of the optimizations described.

Update: The original results I presented in this section were wrong; my Firefox crashed on startup during training.

First let's look at what -fdump-ipa-profile_estimate tells us while linking libxul.so:

Histogram:
  39481368: time:3 (0.40) size:2 (0.00)
  22085009: time:2 (0.55) size:2 (0.00)
  22004637: time:10 (1.30) size:10 (0.00)
  22004631: time:10 (2.04) size:10 (0.00)
  21705795: time:3 (2.27) size:3 (0.00)
...
  114: time:9482 (99.89) size:4365 (7.45)
  113: time:2649 (99.89) size:1278 (7.46)
  112: time:5479 (99.89) size:2658 (7.50)
  111: time:2105 (99.89) size:993 (7.51)
  110: time:6411 (99.90) size:2819 (7.54)
  109: time:2463 (99.90) size:1279 (7.56)
  108: time:3762 (99.90) size:1893 (7.58)
  107: time:2062 (99.90) size:1017 (7.59)
  106: time:2868 (99.90) size:1364 (7.61)
  105: time:1161 (99.90) size:610 (7.62)
  104: time:5711 (99.90) size:2896 (7.65)
  103: time:1751 (99.90) size:825 (7.66)

...
  3: time:111016 (100.00) size:50936 (13.92)
  2: time:108097 (100.00) size:51390 (14.54)
  1: time:236334 (100.00) size:106379 (15.85)
  0: time:14095802 (100.00) size:6873694 (100.00)
Overall time: 29481332351
GCOV min count: 112 Time:99.89% Size:7.50%
Determined min count: 104 Time:99.90% Size:7.65%
In the first part GCC uses the knowledge of the whole program to determine where it spends 99.9% of its time. The training run used by Firefox consumes 29481332351 time units in GCC's estimate. The distribution in the histogram suggests that statements executed over 104 times should be considered hot (so that the hot code accounts for 99.9% of the execution time), and such code is 7.65% of the overall Firefox binary size.

It actually seems that the profiling coverage of Firefox is quite low - only 15% of the code is executed at all, compared to about 48% executed during GCC's training.
It is however quite common that FDO pays off even with quite crappy training. Firefox is a difficult program to train due to its interactive nature. Life would be a lot easier if the CLI still ruled the world.

In the following part of the dump, GCC complains that some object files were not built with -fprofile-generate. That seems like a bug in the build machinery that forgets to pass profiling flags to some files (I even saw some files in Firefox that are built with optimization off):
Node getDynamicClassID/13759998 has no profile-id (profile feedback missing?)
Node getStaticClassID/13759997 has no profile-id (profile feedback missing?)

...
Node GetIsDebugBuild/45006 has no profile-id (profile feedback missing?)
Node Release/45001 has no profile-id (profile feedback missing?)
Node AddRef/45000 has no profile-id (profile feedback missing?)
Next the actual conversions to speculative calls happen:
Indirect call -> direct call from other module tryStatement/13609244 => bindLet/13609141, prob 1.00
Indirect call -> speculative call tryStatement/13609244 => bindLet/13609141
Indirect call -> direct call from other module variables/13609249 => bindVarOrConst/13609295, prob 0.60

....
Indirect call -> speculative call GetBaseFile/11663 => AddRef/91845
Indirect call -> direct call from other module __base_dtor /11645 => Release/91848, prob 1.00
Indirect call -> speculative call __base_dtor /11645 => Release/91848
29162 indirect calls trained.
17227 (59.07%) have common target.
0 (0.00%) targets was not found.
1632 (5.60%) speculations seems useless.
15595 (53.48%) speculations produced.
Last week's checkout of Firefox contains over 100000 polymorphic calls (not counting the usual indirect calls), so about 29% of the indirect calls are executed and we end up speculating 53% of those. Not bad. It is interesting to observe that the checkout of Firefox I analyzed in February had only about 60000 polymorphic calls. A 60% increase in 2 months; Firefox developers are busy!

Next we can cross-check with the ipa-devirt data I reported last time, in -fdump-ipa-devirt, and we get:
109280 polymorphic calls, 0 devirtualized, 3184 speculatively devirtualized, 85942 cold
13975 have multiple targets, 0 overwritable, 11247 already speculated (11247 agree, 0 disagree), 0 external, 0 not defined, 0 artificial

Compared to the results from last time, the speculative devirtualization pass has saved some work (and code size) by skipping 85942 polymorphic calls that are cold according to the profile. Out of the trained calls, it is able to determine the common destination in 11247 of the 15595 calls (72%) that FDO speculation can handle. It also introduces 3184 speculative devirtualizations that FDO did not handle: those are in the unprofiled code. The good news is that all speculations made by the static analysis actually agree with the runtime data!

As I explained last time, speculative calls do not always survive into the final binary. The GCC inliner has the ability to undo a speculation when it seems useless. This process can be watched in the -fdump-ipa-inline dump:
Deciding on functions to be inlined into all callers and removing useless speculations:
Removing speculative call operator!=/13963171 => _ZNK6icu_5217RuleBasedTimeZoneeqERKNS_8TimeZoneE.localalias.164/13766021
Removing speculative call resolveConstructor/13963002 => GenericCreateConstructor/13369305
Removing speculative call defaultValue/13961679 => XPC_WN_Shared_Convert/5769960
Removing speculative call defaultValue/13961283 => XPC_WN_Shared_Convert/5769960
Removing speculative call defaultValue/13961260 => XPC_WN_Shared_Convert/5769960

...
A speculation is typically removed when the called function is too large to be inlined, or when the caller was inlined and the offline copy became uninteresting from a performance point of view. This removal happens 9916 times. Mind that many of these speculative calls were in the meantime duplicated as a result of function cloning or inlining, so this number cannot easily be compared with the numbers above. In fact the inliner informs you in a brief summary:
Unit growth for small function inlining: 7232256->9220597 (27%)
Inlined 183353 calls, eliminated 54652 functions
It somewhat surprises me that the 7% of Firefox's code base considered hot bloats the code size by 27% after inlining; I will double check why that happens. It seems that the "hot" part of Firefox managed to grow 4 times!

I hacked the inliner dump files to output statistics about individual calls weighted by their execution counts. In the charts below, the percentages mean the probability that a call from an unoptimized libxul execution (or more precisely, libxul optimized only by early optimizations, which are performed per compilation unit and include limited inlining, constant propagation and devirtualization) will be handled in a given way in the optimized binary.



It means that 68.1% of all calls will disappear through (late, feedback directed) inlining. 27.1% of calls will not be inlined for whatever reason. The small remaining sections describe indirect calls, which account for less than 5% of all calls executed during the train run. (Therefore Firefox's train run is probably not a good benchmark to demonstrate the benefits of inlining and devirtualization.)

To make effects more visible, here is a graph omitting the direct calls:



32% of all calls can be inlined via static analysis. 10% are handled only speculatively, by the algorithms described last time. Profile feedback enables inlining of 21.9% extra calls. About 50% of the calls remain uninlined. This is not a bad result - both for feedback directed devirtualization and for static speculative devirtualization.

In the last graph I collected some data on why GCC decided not to inline a given call. It seems that the majority of the remaining calls are external, into other libraries.



UPDATE: these numbers still need revisiting.

How many speculations then survive into the final code? This is easy to check in -fdump-tree-optimized. Every speculatively inlined function sequence starts with code like the following:
  PROF_18 = this_3(D)->5;
  if (PROF_18 == __deleting_dtor )
    goto <bb 5>;
  else
    goto <bb 10>;
 "grep "PROF.* == " *optimized* | wc -l" tells that there are 5519 such functions.

Further, I hacked GCC to annotate the dumps with execution counts. This shows that indirect calls executed 2038439 times in the train run got inlined, while 74189 polymorphic calls survived into the optimized binary and are executed 316492 times. This means that GCC speculatively inlined 7% of all indirect calls (statically), which got rid of 86% of the indirect call executions in Firefox's train run.

With the FDO devirtualization disabled, the static devirtualization predicts correctly 3622 calls that are executed 857979 times.

I should note that these numbers are not 100% reliable, since they depend on GCC's profile updating (the profile is read early and then maintained throughout the optimizations), but they ought not to be too far from reality. It may be interesting to confirm these results with valgrind and/or performance counters to get data from a real execution.

GCC LTO results

While GCC does not really use any polymorphic calls, it is easy enough to profile, so I made similar graphs as for Firefox:






And runtime improvements?

All I was able to try this week was the Dromaeo benchmark, comparing the default build, the LTO build and the LTO build with FDO. The 7-8% improvement for LTO is actually very good, given that Dromaeo is dominated by JIT-generated code and thus is a poor compiler benchmark.

Update 1: The 14% speedup for FDO and LTO together is actually good, too. My initial data looked bad because the train run crashed and the majority of Firefox was size-optimized. I will get -Os data, too.

Update 2: I also ran the Dromaeo benchmarks comparing an FDO+LTO build with link-time devirtualization disabled against the usual FDO+LTO build. The results can be seen here. It seems that the devirtualization techniques described here account for about a 2.7% speedup. This seems pretty good given how few indirect calls are really executed.

I am also looking for other good candidates for benchmarking. If you know of applications that are benchmarkable, buildable with FDO and in need of devirtualization, I would like to know about them!

I will definitely write a followup if I find some interesting benchmarks.

Next time I plan to write a bit about experiments giving a feedback to users to improve devirtualization in their programs by annotating the source files.

6 comments:

  1. This is one place where we have an advantage in OpenJDK. We first compile with a quick 'n dirty non-optimizing compiler which generates code that gathers statistics. When the optimizing compiler kicks in later, every edge has a frequency attached to it. I wonder if GCC could have a framework like that.

    1. Indeed, JIT has a great advantage here and there are things static compilers will never be able to do alone.

      For static compilation, FDO is the usual way to go. In the longer term, I suppose we will see more integration of static compilation and JITing. I always found it a bit sad that LLVM started in this direction but seems to be rather developing in the direction of a static compiler or special purpose JITs.

      For instance, current developments around using GPUs in generic code (such as HSA) expect that hot parts of statically compiled programs will be JITed to the particular hardware.

  2. Jan, thanks for the awesome articles. Any way I could follow you on Twitter?

      Thanks, I do not have a Twitter account, but Blogger has some options for blogs to be followed.

  3. This comment has been removed by the author.

  4. This comment has been removed by the author.
