Sunday, January 26, 2014

Devirtualization in C++, part 2 (low-level middle-end devirtualization by forwarding stores to loads)

In the first part of the series we saw the low-level implementation of a polymorphic call. It is a common belief that devirtualization can be done well by generic middle-end optimization techniques: a combination of constant propagation, store-to-load forwarding and indirect call removal. (See, for example, this post on llvm-dev; GCC developers tend to share similar sentiments.) This time I show how this is implemented and why one needs special-purpose devirtualization code in the middle-end to really optimize well.



I will again try to make the presentation self-contained even though I need to touch on several quite specialized topics. I will use the same running example as last time:
struct A
{
  virtual int foo (void) {return 42;}
};
int test(void)
{
  struct A a, *b=&a;
  return b->foo();
}
To the middle-end it is presented in the following form (explained in detail in part 1):
void _ZN1AC2Ev (struct A * const this)
{
  this->_vptr.A = &_ZTV1A + 16;
}
int _ZN1A3fooEv (struct A * const this)
{
  return 42;
}
char * _ZTS1A = "1A";
struct typeinfo _ZTI1A = {&_ZTVN10__cxxabiv117__class_type_infoE + 16, &_ZTS1A};
struct vtable _ZTV1A = {0, &_ZTI1A, _ZN1A3fooEv};
int test() ()
{
  int (*__vtbl_ptr_type) () *tmp1;
  int (*__vtbl_ptr_type) () tmp2;
  struct A a;
  struct A * b;

  _ZN1AC2Ev (&a);
  b = &a;
  tmp1 = b->_vptr.A;
  tmp2 = *tmp1;
  return tmp2 (b);
}
There is nothing inherently magic about polymorphic calls that would prevent standard optimizers from working out their destination. This is also what happens in GCC with -O2 -fno-devirtualize (the second flag prevents the special devirtualization logic from triggering). Several optimizations however need to cooperate on the result:
  1. First, the inliner inlines the call to the constructor and brings the initialization of the virtual table pointer into the body of test itself:
    int test() ()
    {
      int (*__vtbl_ptr_type) () *tmp1;
      int (*__vtbl_ptr_type) () tmp2;
      struct A a;
      struct A * b;

      a._vptr.A = &_ZTV1A + 16;
      b = &a;
      tmp1 = b->_vptr.A;
      tmp2 = *tmp1;
      return tmp2 (b);
    }
  2. The constant propagation pass replaces all occurrences of b by &a. While &a is an address within the stack frame and thus not really a constant, it is considered to be one by optimizers that operate on a single function only.
    int test() ()
    {
      int (*__vtbl_ptr_type) () *tmp1;
      int (*__vtbl_ptr_type) () tmp2;
      struct A a;
      struct A * b;

      a._vptr.A = &_ZTV1A + 16;
      b = &a;
      tmp1 = a._vptr.A;
      tmp2 = *tmp1;
      return tmp2 (&a);
    }
  3. Global value numbering propagates the stored value to the load, then resolves the lookup in the virtual table and turns the polymorphic call into a direct call.

    int test() ()
    {
      int (*__vtbl_ptr_type) () *tmp1;
      int (*__vtbl_ptr_type) () tmp2;
      struct A a;
      struct A * b;

      a._vptr.A = &_ZTV1A + 16;
      b = &a;
      tmp1 = &_ZTV1A + 16;
      tmp2 = foo;
      return foo (&a);
    }
  4. The dead code removal pass removes the unnecessary leftovers of the polymorphic call machinery.
    int test() ()
    {
      struct A a;
       
      a._vptr.A = &_ZTV1A + 16;
      return foo (&a);
    }
  5. The inliner now sees the direct call and is ready to inline A::foo.
    int test() ()
    {
      struct A a;
       
      a._vptr.A = &_ZTV1A + 16;
      return 42;
    }
  6. Dead code removal can now remove the initialization of the virtual table and thus also the actual instance of A from the stack frame.
    int test() ()
    {

      return 42;
    }
All done!

Quite an effort to clean up such a simple sequence. Can it be done more easily? Well, partly, yes. For example, the constant propagation pass can be skipped, since value numbering is a complex pass capable of doing the propagation itself. One cannot, however, easily change the fact that devirtualization is the result of a quite complicated interplay of the basic optimization passes.

In Clang one can inspect the compilation process with -mllvm -print-after-all -O2. The sequence of events is basically the same:
  1. The replacement of b is done already by the SROA pass (Scalar Replacement of Aggregates);
  2. early Common Subexpression Elimination removes a redundancy in the memory accesses (there is an access to array index 0 of the virtual table; something GCC simplifies already in the front-end);
  3. the inliner brings in the constructor and with it the virtual table store;
  4. Global Value Numbering works out the value of a._vptr.A;
  5. Instruction Combining produces the direct call;
  6. the inliner inlines the call;
  7. SROA cleans up the now dead a.

Keyed classes and external virtual tables

For some reason, the approach described here did not work in GCC until 2010, when I fixed a bug that prevented global value numbering from propagating accesses to read-only variables. The C++ front-end used a slightly non-standard way to describe the contents of the virtual table, and the middle-end simply gave up on the virtual table lookup. After fixing this, I soon started to learn interesting things about how symbol visibility works in C++.

It is a kind of legacy from the C language that visibility (that is, static, global or external) is defined only for functions and variables (well, except for the rare case where one declares a type inside a function definition). This is because types in C do not by themselves produce any code or data structures in the compiler's output (except for debug info), and thus there is no real need to control their visibility.

In C++, however, classes also contain (inline) methods, class data, virtual tables, type information and other things that translate to real code or data structures. Unless you use an anonymous namespace (which is still not very popular), all types you define are actually shared between compilation units based on the type's unique name (I will say more on the One Definition Rule later).
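
For illustration, a minimal sketch of the difference (the names local and shared are made up for this example):
namespace
{
  struct local                           // visible only in this compilation unit;
  {                                      // its virtual table need not be shared
    virtual int foo (void) {return 42;}
  };
}
struct shared                            // identified by its mangled name; every unit
{                                        // defining it must agree (One Definition Rule)
  virtual int foo (void) {return 42;}
};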

Because the compiler works on a single compilation unit only, the standard approach is to place all class data and inline methods into special sections that are transparently unified by the linker (called COMDAT or linkonce sections). This is wasteful, because classes appearing in headers are compiled many times, which results in slow rebuild times and huge disk space requirements for object files.

There are several ways to overcome the redundant compilation. The most transparent one is the concept of keyed methods. If a class defines at least one method that is not inline, we know that there is exactly one compilation unit defining it. Therefore all class data can be located in that unit, as sketched below.
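
A minimal sketch (class B and its methods are made up for this illustration):
struct B
{
  virtual int key (void);               // non-inline: the keyed method of B
  virtual int foo (void) {return 42;}
};

// Exactly one compilation unit defines B::key, so the virtual table and
// type info of B are output there and nowhere else:
int B::key (void) {return 1;}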

There are more such mechanisms, such as explicit instantiations and the use of repositories. All these techniques boil down to a situation where the virtual table contents are known but the table is not output in the current compilation unit. Something like the following:
extern struct vtable _ZTV1A = {0, &_ZTI1A, _ZN1A3fooEv};
This is a concept quite uncommon to a C programmer, and the middle-end cowardly threw the information away. After fixing this, I ran into many interesting problem reports. They are all based on the same issue: we know the contents of the virtual table, but because the virtual table is not part of the current compilation unit, there may be weird reasons why we are not allowed to refer to it directly.

I will mention at least the two most important ones:

User defined visibilities

As an extension to C and C++, GCC allows specifying the ELF visibility of symbols. This can be done via the visibility attribute, by the -fvisibility= switch, or by a linker script. The visibility of a symbol can be either default, protected, hidden, or internal. Only default and hidden are commonly used.

When building a library, the default visibility makes everything exported from the library and moreover allows any other library to override the symbol with a different definition. This is a useful feature, but it inhibits optimization. The compiler can never be sure what a given function call is going to do, since the function body it sees may be overridden. It also makes dynamic linking expensive, because all references need to be processed through the dynamic linker. For this reason users are encouraged to carefully mark symbols that are not part of the library's interface as hidden (see, for example, Ulrich Drepper's paper How to Write Shared Libraries). Hidden symbols are visible only within the library itself and are not exported.
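
A minimal sketch of the attribute form (the names impl and api are made up for this illustration):
struct __attribute__ ((visibility ("hidden"))) impl
{
  virtual int foo (void) {return 42;}    // impl, its methods, its virtual table and
};                                       // its type info stay inside the library

extern int api (void) __attribute__ ((visibility ("default")));  // exported interface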

Unlike the C++ native visibilities (static/anonymous namespaces, global, COMDAT and extern), the ELF visibilities are not really part of the One Definition Rule. It is common for headers to behave differently when used inside the library and when used by an external program.

C++ does not really make a distinction between implementation details and the external API; they are all part of the library headers, and the public/private specifiers in classes are more of a syntactic sugar. For this reason the compiler needs to be careful about using a symbol it found in the constructor of an external variable. The knowledge it got from the library header may or may not be part of the intended API.

Construction virtual tables

Before the base types of an object are constructed, or after the object has been destructed, its virtual table pointer is replaced by a pointer to a different virtual table, called a construction virtual table, that points to the virtual methods of the base types (so their constructors can work).
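
The language rule these tables implement can be seen in a small sketch (the classes Base and Derived are made up): a virtual call made while a base constructor runs must dispatch according to the type constructed so far, so the constructor installs the appropriate table before executing its body:
struct Base
{
  Base () { init (); }                  // while Base() runs the object is still a Base,
  virtual void init () {}               // so this call must reach Base::init
};
struct Derived : Base
{
  virtual void init () {}               // must not be called before Derived's
  Derived () {}                         // own members are constructed
};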

Normally construction virtual tables are directly accessed only in the units where they are output. In PR54314, however, GCC managed to optimize memory accesses in a program so that it directly referred to a construction virtual table of the libstdc++ I/O streams. This is because construction virtual tables are, in the case of multiple inheritance, referred to by another data structure, the VTT (Virtual Table Table), which is global.

Libstdc++ manually controls the list of symbols exported from the library, and the construction virtual table was not on the list. You may actually look into the bug report itself to see how long a discussion it took to settle on the decision that it was GCC's bug to refer to those tables. Even though the construction virtual tables are exported from compilation units, their symbol names are not standardized by the ABI and they cannot be referenced externally.

Why are these symbols exported in the first place? Well, to allow the COMDAT mechanism to share them. It is a GCC extension to the C++ symbol name mangling that makes the symbol name unique. If two objects that contain a construction virtual table are both compiled with a compiler following this GCC extension, the linker will unify both copies. An alternative approach would be to introduce a COMDAT local (that is, a static symbol captured within the same COMDAT section) as the VTT. This would however prevent even code compiled within the compilation unit containing the VTT from referring to it.

Current "solution"

It seems that nothing really defines when one can take a symbol found in an external constructor and refer to it. In this case we can just guess what the user will do, or rather make realistic assumptions about the intended use and convince users to change their habits when something breaks.

I ended up implementing the can_refer_decl_in_current_unit_p predicate, which prohibits introducing new references to abstract symbols, to symbols with user-defined visibility (set via an attribute, not by the other methods), and to COMDAT or static objects whose body is no longer available (at some point the compiler must decide what will be output and throw away the rest).

To make devirtualization more reliable, I have also extended the representation of references in the symbol table to include references from constructors of external variables until inlining is finished. This ensures that COMDAT or static virtual functions are not thrown away too early.

Over time can_refer_decl_in_current_unit_p has grown into quite a beast (see gcc/trunk/gcc/gimple-fold.c), and with the introduction of COMDAT locals in GCC 4.9 it will need to become symbol specific rather than unit specific...

Limitations of the low-level approach

Certainly not all devirtualizable calls can be handled by the intra-procedural, low-level techniques described here. Here are the main issues.

Virtual table store is not always visible to the compiler

In our example it was always necessary to inline the constructor to make devirtualization happen. If our testcase is compiled with -O2 -fno-inline -fno-devirtualize (or, in the case of Clang, with -O2 -fno-inline), the call will not be optimized. This is because the inter-procedural optimizers are currently not clever enough to realize that after the constructor returns, the virtual table pointer has a known value. This limitation will hopefully be fixed after GCC 4.9 is released, by implementing aggregate return functions in the inter-procedural propagation.

Even with this implemented, these techniques will not cover cases where the construction is hidden behind the compiler's back. Consider:
struct A
{
  virtual int foo (void) {return 42;}
};
int test(struct A a)
{
  struct A *b=&a;
  return b->foo();
}
Here the constructed object comes from another compilation unit.
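
A hypothetical caller in a second compilation unit makes this concrete (a sketch; caller and the file name are made up):
/* other-unit.cc (hypothetical second compilation unit)  */
struct A
{
  virtual int foo (void) {return 42;}
};
int test (struct A a);    /* defined in the unit shown above          */

int caller (void)
{
  struct A a;             /* the constructor stores a._vptr here ...  */
  return test (a);        /* ... invisibly to the compilation of test */
}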

Alias analysis is too weak to track the type of an object over its lifetime

The virtual table pointer is part of the object's memory layout. To propagate its store (in the constructor) to its load (at the polymorphic call), the compiler needs to prove that the value did not change in the meantime. This is the job of the alias analysis module, which tries to prove that a given memory-altering instruction (such as a memory store, call or asm statement) will not modify a given memory location.

This is done by a series of techniques. One of the main concepts is tracking whether a given memory location escapes from the current unit (either a function or a compilation unit) or whether it stays local. For local memory locations the alias analysis algorithms can quite precisely track what pointers may point to the object, and this knowledge is often enough to prove that a store is not harmful. For escaping locations, however, the alias analysis algorithm needs to assume the worst.

Even quite innocent code can confuse alias analysis. For example:
struct A
{
  virtual int foo (void) {return 42;}
  void bar (void);
};
int test(void)
{
  struct A a, *b=&a;
  b->bar();
  return b->foo();
}
Here the optimizers will give up on devirtualizing, because the method bar is external to the unit. In the low-level code, the call b->bar() translates to _ZN1A3barEv (b). This makes the alias analysis module believe that the address of a escapes, and thus that every function call may change anything in its memory representation, including the virtual table pointer. It will also believe that all memory locations coming from the outer world may contain pointers to a.
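
For contrast, a sketch of a variant one would expect to still be devirtualized (an assumption about optimizer behaviour, not a verified dump): when bar is defined in the same unit and gets inlined, the address of a does not escape:
struct A
{
  virtual int foo (void) {return 42;}
  void bar (void) {}            /* body available; once inlined, &a no longer escapes */
};
int test (void)
{
  struct A a, *b = &a;
  b->bar ();
  return b->foo ();             /* the store to a._vptr can again be forwarded here */
}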

Because polymorphic calls are often quite far away from the constructors of the object, it is quite clear that low-level devirtualization is really more of a toy than a reliable tool.

Brute-force is not always a win

Compilers cannot take an unlimited amount of CPU time to complete, so in the following testcase:
struct A
{
  virtual int foo (void) {return 42;}
  virtual int foo1 (void) {return foo();}
  virtual int foo2 (void) {return foo1();}
  virtual int foo3 (void) {return foo2();}
  virtual int foo4 (void) {return foo3();}
  virtual int foo5 (void) {return foo4();}
  virtual int foo6 (void) {return foo5();}
  virtual int foo7 (void) {return foo6();}
  virtual int foo8 (void) {return foo7();}
  virtual int foo9 (void) {return foo8();}
  virtual int foo10 (void) {return foo9();}
};

int test(void)
{
  struct A a, *b=&a;
  return b->foo10();
}
you will see GCC with -fno-devirtualize give up after 2 "iterations" (GCC really has two inliners in it) and Clang after 4 iterations. A fair try, but the resulting code calling foo8 or foo6 is horrible, of course. An iterative approach was proposed for GCC in this patch; it however never reached the mainline. There is also a presentation about this approach.

What is missing in the GCC implementation

The local (intra-procedural) store-to-load propagation is doing a good job. I also think we are doing well on making information available to the middle-end, making it understand constructors of external variables, the One Definition Rule, COMDAT sharing (with the fact that all COMDAT sections of the same name should be semantically equivalent) and other details.

The main limitations lie on the alias analysis side. To devirtualize, one typically has to solve at least basic inter-procedural problems. GCC certainly should do better at auto-detecting the malloc attribute, which would allow it to track the lifetime of an object better. It also should get a working inter-procedural points-to analysis (the current implementation of -fipa-pta is not scalable and stable enough to be enabled by default, and it is quite limited). We should also auto-detect the nocapture (noescape) and noclobber attributes; for some reason the patch implementing the attributes apparently never made it into the mainline. Auto-detection should be a simple addition to the existing pure-const pass.
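
As a reminder of what the malloc attribute conveys (a sketch; my_pool_alloc is a made-up allocator):
/* The malloc attribute promises that the returned pointer is not aliased by
   anything else, which helps alias analysis keep track of the object (and
   its virtual table pointer) from allocation onwards.  */
extern void *my_pool_alloc (unsigned long size) __attribute__ ((malloc));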

Because of the way propagation through variable constructors is organized, GCC will currently not take into account the knowledge it has about construction virtual tables, simply because it knows it cannot refer to the virtual table itself. This is because propagation works step by step, and in one of the intermediate forms there would be an invalid reference to the table. This can probably be solved by allowing those references and later turning them back into a VTT lookup via the existing DECL_VALUE machinery.

Classical value numbering is only able to propagate one known value to each temporary. It does not take into account scenarios where an object can have multiple types, but all of the types use the same virtual function. This can probably be partly dealt with by a conditional value numbering algorithm.
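
A sketch of such a scenario (the classes and the flag p are made up for illustration):
struct A { virtual int foo (void) {return 42;} };
struct B : A { };               /* B does not override foo */

int test (int p)
{
  struct A a;
  struct B b;
  struct A *ptr = p ? &a : &b;  /* two possible virtual table pointers ... */
  return ptr->foo ();           /* ... yet both slots resolve to A::foo    */
}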


In the future I will write about the extra knowledge the C++ standard gives you about the value of the virtual table pointer. I wondered whether this could be hooked into the alias analysis module and whether store-to-load propagation could use the fact that at some places we know the virtual table pointer without actually seeing the store. It is typically not the job of those optimizers to understand this, and alias analysis in particular is a hard problem by itself, so it is perhaps better not to throw extra complexity into it.
The last important area is improving inter-procedural constant propagation, but I will write about that later.


I gave arguments why the middle-end needs to learn more about C++-specific semantics. Next time I will write about the type inheritance graph.


3 comments:

  1. Is there any thought of making C compilation be done using g++, such that source files with a .c extension would default to 'extern "C"' and all C++-only keywords would be context sensitive? That way a developer would be able to embed or use C++ in .c source files too. This would allow having only one front end for both C and C++.

    Replies
    1. Having one front-end for both C and C++ is definitely possible. Extending C with a way to embed C++ code into C sources is, however, a huge change with potential compatibility issues. There are some important differences between C and C++ (such as the one definition rule), so I do not think a compiler can safely support it without an official "blessing" from both standardization committees to cooperate on giving sane semantics to the corner cases.

      We solve similar issues when mixing languages in link-time optimization, and it is not a trivial problem at all.

  2. Excellent article! In your "Brute-force is not always a win" section, you mention that Clang will give up after 4 iterations. Do you have any more info about where this constant lives? It seems that there is no documentation regarding the number of iterations the passes will be run. Thanks!
