Monday, February 17, 2014

Devirtualization in C++, part 4 (analyzing the type inheritance graph for fun and profit)

Last time I described the construction of the type inheritance graph, a main datastructure used by special purpose devirtualization algorithms. Now it is time for some real optimization fun! In this post I will review the most elementary transformations and will show some real world data.


What you need to know from previous post

Type inheritance graph (or class hirearchy graph, or class hirearchy diagram) is a directed acyclic graph where vertices are types (classes) and there are oriented edges representing direct bases of a type.

I described the new implementation of the type inheritance graph builder in GCC in a detail. Here is a summary:

Optimizers are generally interested in a list of possible targets of a virtual call.  To obtain it, one needs to fill in the polymorphic_call_context that contains:
  1. an outer_type, this is the type where start the walk. If the program analysis determined that the virtual call is contained in a bigger object, this is the type of bigger object in our example it is b,
  2. offset in bits specifying where the class whose method is called is located within the outer_type,
  3. a flag maybe_in_construction specifying whether we want to include the base types, and,
  4. a flag maybe_derived_type specifying whether we want to include derivations.
Having that done, one can call possible_polymorphic_call_targets that will return the list.

The function takes two extra parameters - the type of the call and token that is an index into the virtual table where the function is located.

The lookup algorithm (covered in a detail last time, too) consists of a doubly recursive walk filling in the list of possible targets:
  1. First it looks into outer_type and finds a base of type at offset (if type = outer_type it takes the outer_type itself) and performs the virtual table lookup. The resulting method is added to the list.
  2. If maybe_derived_type is true it walks to all known derivations. For each type found it walks recursively all its bases and on all appearances of outer_type it looks for its base of type at position offset. For each such base found it performs virtual table lookup and records the target to the list.
  3. If it is asked to walk bases, it also looks for all bases of outer_type which contains type at position offset.  For each appearance of type found it performs the lookup in the corresponding virtual table and again records the target.

Maybe this cheesy looking illustration showing an lookup will help:

Inheritance tree walk.
If both maybe_derived_type and maybe_in_constructions are false, the lookup will happen only in the virtual tables associated with outer_type. maybe_in_construction will add targets in all blue nodes (that are on the path to type) and maybe_derived_type will include targets in red types. Note also that it does not really matter where the virtual function was introduced. The lookup always care only about the type of the polymorphic call (that is type of the instance or pointer that was used to invoke it) and its derivations.

The resulting list is either complete or incomplete. When maybe_derived_types is set, the list typically will be incomplete. One always has to take into an account that other unit may introduce new derivations that are not visible to the compiler. It is possible to turn incomplete results to complete by final keyword or by use of the anonymous namespace. Sadly those features of C++ standard are not very consistently used in practice.

Determining the context

It is always safe to put outer_type to be the same as type, offset to 0 and maybe_derived_types to true. This however will result in unnecessary large lists of possible targets.

By simple analysis of function bodies one can often do better. Consider a testcase:
struct A final
{
  virtual int foo (void) {return 42;}
};
int test(void)
{
  struct A a, *b=&a;
  return b->foo();
}
Here compiler can derive that b is actually an instance of A. In this case outer_type will be A, offset 0 and maybe_derived_type false.

Determining the context is a simple backward walk of the SSA graph starting from the instance pointer used in the virtual call sequence and keeping track of the constant offset (GCC implementation does not care about the virtual inheritance at the moment) until the origin of the pointer is found. Origin may be address of a variable, function parameter, load from memory, return value of function call and seeral other cases.

Based on this origin it may be possible to determine the outer_type and the two flags. Currently GCC implements the following rules:
  1. If the origin is an address of static variable or automatic variable, then the outer_type is set to the type of the variable. maybe_in_construction is set to true because the constructor/destructor may be inlined and maybe_derived_type is false.
  2. If the pointer points to a function parameter passed by value (well, actually because it is non-POD, it is passed by an invisible reference), the outer_type is set to the parameter's type and maybe_in_construction/maybe_derived_type is set to false.
  3. If the pointer is this pointer, then the outer_type is set to the method belongs to. maybe_derived_type is set true for normal methods and false for constructors and destructors. maybe_in_construction is set to true for constructors and destructors and false for normal methods.
This logic is very simplistic and clearly insufficient. It is however a case where we bring in knowledge that is completely invisible to the generic low level optimizers otherwise. I plan to improve it for 4.10 in several ways:
  1. We are very conservative about types in construction/destruction. It is easy to track if a constructor was inlined and GCC already contains code that tries to disprove the fact that type is in construction by looking on virtual table sets. It is just not used here.
  2. We do not make any attempt to track heap memory (that is instances born via new). I have experimental path for this that simply looks into function calls that may construct the instance and determine method by it.
  3. C++ standard makes many promises that are lost on the way to low-level GIMPLE representation. For example, libstdc++ developers complain in PR46507 about devirtualization being missed in the following simple testcase:
    #include <tuple>
    #include <utility>
    
    struct A
    {
        virtual void f () const;
    };
    
    void arg_tuple_test (const std::pair<A,A> &t)
    {
        std::get<0>(t).f ();
    }
     Here the context analysis sees following low-level version:
    void arg_tuple_test(const struct pair & t)
    {
      int (*__vtbl_ptr_type) () fnptr;
    
      fnptr = *t->first._vptr;
      fnptr (&t->first); 
      return;
    }
    By C++ standard the dereference t->first imply that pointer t must point to a structure pf known type and thus the type of field first is known. This information is long lost in the translation. For GIMPLE the accesses to fields are really just pointer dereferences with an offset and they guarantee nothing about type or structure of the memory object pointer points to.

Basic optimizations making use of type inheritance graph analysis

This time I describe tree most elementary uses of type inheritance analysis.

Basic devirtualization

Devirtualizing is easy. If the list of possible targets is complete and it contain one call, we may turn the polymorphic call into a direct call.

Interesting is also an empty complete list. In this case we can safely turn the call into __builtin_unreachable() letting the compiler know that this code path has undefined effect. This actually happens in perfectly valid programs: It is a common practice to have base type A with multiple derived types (B, C). C may define a virtual method foo. Now we can have function taking pointer to A, checking that it indeed sees an instance of C and call method of C. Now as a result of code duplication and inlining, the function may end up on a code path that is handling instance of B. At runtime the code path will never be taken, but compiler is not aware of that.

Inliner in GCC takes hints from the indirect and polymorphic call optimization code and try hard to inline functions if doing so leads to removal of indirect calls. Obviously inlining hard to optimize unreachable code is a nonsense and thus the heuristic disable itself it he call target is noreturn. Both __builtin_unreachable and cxa_pure_virtual (function terminating program in the case you call pure virtual method) are declared noreturn.

Speculative devirtualization

More interesting optimization based on the new infrastructure is speculative devirtualization. It applies to the polymorphic call in the case the (complete or incomplete) contain only one method that seems likely.  Here likely method must not be noreturn, for reasons discussed, nor declared with cold attribute. Also methods reachable only for types in construction or destruction are considered unlikely.

In the case one likely target is found, the speculative devirtualization turns:
ptr->foo();
Into:
if (ptr->foo == A::foo)
  A::foo ();
else
  ptr->foo ();
(where A::foo is the single likely target of the call)

This transformation enables other optimizations, in particular inlining, to optimize the direct call path. In the case we are lucky and the object is not of a derived type produced from other compilation unit, the resulting code may run a lot faster even with the runtime check in it. (about 3-5 times faster for simple functions).

In 90's the speculative devirtualization was a win even without inlining, since the call was predicted by the hardware branch target predictor. These days most CPUs implement branch target predictors even for the indirect calls (see, for example Agner Fog: The microarchitecture of Intel, AMD and VIA CPUs). So if no optimization happens, the speculative devirtualization will just consume code space and branch prediction buffers.

For this reason I added speculative call support into the basic interprocedural optimization framework in GCC: after creating a speculative call, the function body is not immediately updated and only extra direct edge with speculative flag is paired with the existing indirect call edge in the call-graph. Speculative calls go mostly transparently through the optimization queue until after inlining when their feasibility is evaulated and if the speculation seems useless the speculation is removed. At the moment I declare speculation useless if:
  1. the call was determined to be cold by profile analysis
  2. the call target is cold or noreturn
  3. the call was not inlined and we did not gather anything interesting about the call target. Here the definition of interesting info is the pure or const flag that may be auto-detected by compiler.

Early unreachable code removal

C++ programs contains elaborate headers that define tons of inline functions and datastructures. Eliminating dead code as early as possible is essential for the compiler's performance.

Without the polymorhic call analysis it is hard to tell about a given virtual method if it may or may not become reachable as optimization progresses (by devirtualization via store to load propagation or by some other means). It may seem that all the methods must be reachable from virtual tables that are reachable from code. Even this is wrong. Methods are often keyed - that means that their offline copies are produced only in compilation units that define the first non-inline method of the class. Those methods may become reachable via devirtualization machinery even if there are no obvious references to them in the current unit. So on the one hand it a good idea to remove virtual methods keyed to other units as soon as possible, while on the other hand one wants to hold them when it is possible that later optimization passes will devirtualize to them. Polymorphic call analysis makes it possible to remove most of keyed methods early by proving that they may not be reached from polymorphic calls within current compilation unit.

It would also be a possibility to delay optimization of these methods until a direct call is found. This however complicate organization of the compiler and it may prevent removal of other objects just because they are used by the so-far unoptimized method bodies.

A pleasant suprise

I did not anticipate that, but early removal turned out a lot more important with the link-time optimization; here the removal can happen at compile time saving the streaming bandwidth and reducing symbol tables substantially. GCC 4.8 was keeping all virtual method bodies until link time (removing the unreachable ones after inlining), while GCC 4.9 uses the polymorphic call analysis to remove them as early as possible. For firefox, this chage alone reduced memory footprint of link time optimizer from 14GB to 4GB with similar improvements in disk space needed to save LTO objects. It also reduced the link time by over 40%.

Three are not that many virtual functions in firefox. What I underestimated is that those functions refer more objects and thus the overall cost is much higher.

How the code quality is affected?

To see the effect of the type inheritance graph analysis driven devirtualization I collected data during the link-time optimization on few large real world programs. I will not include any runtime data this time, because I do not really know much about benchmarking them. I will try to learn more till next time.

Firefox's libxul.so

libxul.so is the main binary of firefox.

How large is libxul.so?

It is quite a monster! It consists of roughly 9000 object files (that are now pre-combined via #include to about 1000), 270000 functions (85000 of them virtual), 47000 static variables, 60000 virtual calls producing 30MB of code in 83MB binary. This is all after merging of duplicate inlines and removal of unreachable code that takes into the account the type inheritance graph analysis. I did not disabled it, since I do not really want to show data about optimizing dead calls.

Effect of type inheritance graph analysis

First thing to consider is how large lists one gets. For that I collected a simple histogram:





libxul has 66471 polymorphic calls after the link-time unreachable calls removal. Out of those 569 has precisely one possible target and therefore will get devirtualized. 14954 have only one target in their incomplete list of targets and therefore are good candidates for speculative devirtualization. Speculative devirtualization is working even harder by ignoring the unlikely calls. Here is a breakdown of ipa-devirt's changes to libxul.


I was definitely very surprised that such a simplistic static analysis can actually handle almost 50% of all virtual calls in Firefox. 1% of fully devirtualized calls is also not bad; every removed indirect call counts. It is good to observe that the lists are typically short. Only small percentage of call sites have more than 10 call targets.

To be honest, wast majority of these transformations happens even without the context analysis. All context analysis code improves speculative devirtualization count only by about 500 calls. It however plays important role in other tricks.

Now what happens when one enables the other devirtualization algorithms?

Enabling front-end devirtualization

In part 1 I described devirtualization done at the front-end level. Enabling it saves 347 calls. 15 of them appears in the histogram as those with one know target but incomplete list, 325 are identified correctly as having one target alone.

Enabling low-level devirtualization at compile time

The low-level devirtualization, described in part 2, handled additional 7 calls. This is because it relies on inlining happening of constructor happening at compile time and GCC limits inlining to bare minimum (only if it reduces code size) leaving more complex decisions to the link-time.

Enabling compile time type inheritance analysis

The type inheritance analysis can be done (and is done by GCC-4.9 by default) at the compile time, too. The devirtualization answers are about as good as at LTO time, so we devirtualize all 244 calls that are otherwise handled by LTO.

When link-time optimization is disabled, GCC will devirtualize specualitvely at compile time, too. This will result in more false positives, because it is more likely that a derived type is not visible to the compiler. So far I did not hear any complains about this being enabled by default. I will try to gather some statistics in the future.

Effect on the resulting code

The devirtualization happens just at the start of inter-procedural optimization phase. Afterwards the code goes through many transformations. Most noticeable are the inliner and constant propagation passes that results in duplicating a lot of code and removing other. Also after inlining unimportant speculative calls are removed, as described earlier. Finally during the local optimization performed just before producing the final assembly more of polymorphic and indirect calls are optimized into direct calls or completely removed from the program.

This is the breakdown of all calls in libxul.so at the end of the GIMPLE optimization queue.

Bit surprising observation is that many of the virtual methods in Firefox are actually very small and good candidates for inlining. A common user of virtual methods is the garbage collector (with calls to virtual method AddRef sitting on hot paths) and the JIT compiler where even methods to get number of operands are virtual.

Libreoffice 

To verify that I do not suffer from Firefox tunnel vision, I tested other applications too. Here is a summary.

Libreoffice seems to be decomposed into more libraries than Firefox or Chrome. Largest library in my build is libvcllo with 19908 functions and  4543 virtual calls. This is the histogram of polymorphic call targets:

The results with complete list are less interesting - 2 calls with 1 target and one call with 2.




While effects of devirtualization is very similar (and so is the distribution of call target lists), the effect on binary is less visible. This is mostly because libreoffice compiles itself with -O2 and most virtual methods are not declared inline and thus inlined only when size decreases. Size estimates include the extra cost of speculative call sequence and therefore the code is never expected to shrink. This is actually something we may want to revisit for future releases - enabling somewhat more inlining at -O2 seems to have good code size/performance ratios (often even decreasing code size by enabling more optimization).

Qt 5's webkit

My original plan was to include statistics on Chromium. I eventually gave up. My Debian setup is obviously not fit and I am not familiar with the build machinery to debug the issues. Instead I took Qt's WebKit. It is a lot smaller than the real thing, but still relevant. It has 75000 functions, 13995 polymorphic calls. One impressive property is that it actually has 3 polymorphic calls that have over 10000 targets!


There are two calls with complete list of 2 targets.


Again WebKit compile itself with -O2 and thus number of speculatively inlined calls is lower, about 11% of all polymorphic calls. Still very good, I would say.


While generic algorithms have serious trouble to devirtualize 1% of calls, the speculative devirtualization easily predicts 30-50%. This is a surprise to me. The devirtualization code in GCC consists currently of about 2000 lines of code. In my original plans, I definitely did not anticipated it to do useful work at this stage. My main motivation was to start slowly tracking the problem of computing may edges for the callgraph. This makes me wonder why so many virtual calls have only one possible target in the program. I plan to implement -fsuggest-final-classes and -fsuggest-final-methods flags that will help users to annotate headers and turn speculative virtual calls into direct calls. Also from the collected data it seems that the speculative devirtualization may almost double its effect if two likely targets are allowed.

Next time I will write about profile driven speculative devirtualization and will try to gather some runtime data, too.

11 comments:

  1. Honza, big thanks to you. As an app. developer, I find reading as this one extremely useful in order how we can help the compiler to optimize more successfully.

    ReplyDelete
  2. How about changing:

    if (ptr->foo == A::foo)

    into:

    if (ptr->_vptr == A::_vtable)

    Wouldn't that be beneficial? 1 vs 2 memory accesses.

    ReplyDelete
    Replies
    1. Multiple vtables may (and often will) point to single function.

      Having this as an alternative would be good idea - main implemntation challenge is to identify the actual vtable lookup in already optimized code and be sure it can be moved down just before the call. It is on my TODO list, but did not get chance to implement it yet.

      Delete
  3. None of the graphs work, they say "Google Drive - You need permission..."

    ReplyDelete
    Replies
    1. Sorry, it tok a while, but I fixed that now. Not sure when the permissions changed

      Delete
  4. >> This makes me wonder why so many virtual calls have only one possible target in the program. <<

    One reason may be the use of interface classes used to provide easy mocking for unit tests, like:

    class IServiceProvider
    {
    public:
    virtual ~IServiceProvider();
    virtual void DoIt() = 0;
    };

    class ServiceProvider final : public IServiceProvider
    {
    public:
    virtual void DoIt() override; // implemented in separate *.cpp
    };

    class ServiceUser
    {
    public:
    ServiceUser(IServiceProvider& serv) : m_serv(serv) {}
    IServiceProvider& m_serv;

    void UseIt()
    {
    m_serv.DoIt();
    }
    };

    (Note: An additional class Mock_IServiceProvider : public IServiceProvider used for unit testing is not visible to the compiler while building target code.)

    When I look at the assembler code generated and linked for ARM using arm-none-eabi-g++ 5.3 I find that the compiler is not able to resolve any virtual call via the interface class - although ServiceProvider is final, no other implementation exists and no other libraries at all are linked since it is an embedded system.

    Is there anything I can do to enable the compiler to devirtualize these calls?

    ReplyDelete
    Replies
    1. I think GCC should speculatively devirtualize here when it sees only one possible target. Can you send a full testcase? Perhaps the speculative devirtualization is undone later because it seems unprofitable.

      Delete
    2. RixiusM@minimax.deApril 11, 2016 at 3:13 AM

      "Seeing" may be the main problem - when compiling each single *.cpp the compiler does not see any implementations of the interface classes. It shouldn't have to; this is what abstraction is about.

      I created a small test case which actually compiles for my ARM device and shows the same problem as the much larger application (which I am not at liberty to release and also includes a real time kernel). Where may I send it?

      Delete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Hi,

    I would naively expect the case below to be at least speculatively devirtualized

    struct B /* final */ {
    virtual int foo() { return 3; }
    };

    struct C {
    B& b;

    __attribute__((noinline))
    C( B& b ) : b(b) {
    }

    int foo() {
    return b.foo();
    }
    };

    int main() {
    B b;
    C c(b);

    int res = c.foo();
    return res;
    }

    ReplyDelete
  7. Just rename main to something else. GCC knows that main is executed once and will optimize it for size.

    ReplyDelete