Thursday, September 18, 2014

Devirtualization in C++, part 7 (Enforcing One Definition Rule)

This time I am going to write about a feature that I implemented as a side effect of the devirtualization machinery: the verification of One Definition Rule on types.

One Definition Rule (ODR) is one of more interesting cases where C and C++ differs in very subtle way. Wikipedia describes it as follows:
In short, the ODR states that:
  1. In any translation unit, a template, type, function, or object can have no more than one definition. Some of these can have any number of declarations. A definition provides an instance.
  2. In the entire program, an object or non-inline function cannot have more than one definition; if an object or function is used, it must have exactly one definition. You can declare an object or function that is never used, in which case you don't have to provide a definition. In no event can there be more than one definition.
  3. Some things, like types, templates, and extern inline functions, can be defined in more than one translation unit. For a given entity, each definition must be the same. Non-extern objects and functions in different translation units are different entities, even if their names and types are the same.
Some violations of the ODR must be diagnosed by the compiler. Other violations, particularly those that span translation units, are not required to be diagnosed.[1]
This means that C++ type names are global and should not be re-used in different ways in different source files. (In C doing so is perfectly valid and common, types across units are considered equivalent if their structure is.)

One of motivations for ODR is to make name mangling possible. Type names are used, for example, as program wide identifiers to distinguish functions and methods of same name but taking different types of arguments (function overloading):
struct A {int a;};
int getval (struct A a) {return a.a;}
struct B {char a;};
int getval (struct B a) { return a.a;}
This compilation unit will result in exporting two function symbols:  _Z6getval1A and _Z6getval1B. A and B comes from name of the argument's type.

Now, obviously, if another unit defines completely different structure A and completely different getval taking it as argument, the function will also be called _Z6getval1A and things will fail miserably. General violations of ODR can not be diagnosed by compiler working on single translation unit in isolation. If one is lucky, ODR violation turns into linker error about duplicated symbol name. With presence of inline functions however this will not happen and one of the two functions will be eliminated from program. This may result in invocation of a function on wrong data crashing program in random ways.


-detect-odr-violations in gold

One ODR checking tool I am aware of is gold's -detect-odr-violations:
gold uses a heuristic to find potential ODR violations: if the same symbol is seen defined in two different input files, and the two symbols have different sizes, then gold looks at the debugging information in the input objects. If the debugging information suggests that the symbols were defined in different source files, gold reports a potential ODR violation. This approach has both false negatives and false positives. However, it is reasonably reliable at detecting problems when linking unoptimized code. It is much easier to find these problems at link time than to debug cases where the wrong symbol.

Ian Lance Taylor: New ELF linker, Proceedings of the GCC Developer's Summit 2008
This is very simplistic analysis, but linker can hardly do a lot better; it lacks knowledge of types. Moreover there are some valid cases where the difference can happen and thus the warning can get false positives.

-Wodr: detecting ODR violations in GCC

As discussed in earlier post about construction of type inheritance graph, establishing type equivalency based on their names is important step in building cross-compilation-unit type inheritance graph. It is tricky to do for templates and thus in GCC 4.8 I used simple trick of identifying types based on mangled names of their associated virtual tables. This works only on types with virtual methods and thus with help of our C++ maintainer, Jason Merill, I extended the mehanizm to all class and union types: Before streaming out LTO data I simply invoke C++ frontend langhook to compute mangled name of a type and save it into the LTO object file. At linktime I can use the names to establish equivalence of all named types. This way middle-end does not need to know details of C++ namespaces and templates with the exception of being able to identify types in anonymous namespace and disable merging for those.

This streaming is not free, it seems to add about 2% of extra memory use and comile time for the mangled names. One day I plan to use the mechanism to strengthen the type based alias analysis for C++ and also to reduce redundancies in debug info produced (the second is actually done by LLVM to help fight memory usage explosion).

Once ODR based type equivalency is established, I can actually compare the types that become equivalent as part of LTO linking process. Doing so is a simple recursive walk over the GCC's type representation (that is sufficiently close to parse tree of declarations) and compare it. This step is not completely trivial because one needs to get past some differences that are considered valid. In particular complete and incomplete types of the same name are equivalent and there are few cases that we want to tolerate - such as use of attributes and perhaps differences caused by command line options, such as -fsigned-char/-funsigned-char even if those are ABI breaking and could lead to wrong code.

Making the warning useful

My first attempt resulted in plenty of warning looking as follows:
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:38:16: warning: type ‘struct ZipCentral_’ violates one definition rule [-Wodr]
 typedef struct ZipCentral_
                ^
../../dist/include/zipstruct.h:38:0: note: a different type is defined in another translation unit
 typedef struct ZipCentral_
 ^
My best guess of average Joe's attempt to understand the message is "those types are the same! Stupid GCC!". It is a common case that types declared same way and in the same header are actually different because the difference is carried in from earlier includes. Without being able to report why the types are different, the warning would probably be ignored and declared bogus by most developers. For this reason I ended up adding about 200 lines of code just doing kind of structural diff and trying to output useful information about the difference.

In this particular cases it winds down to:
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:38:16: warning: type ‘struct ZipCentral_’ violates one definition rule [-Wodr]
 typedef struct ZipCentral_
                ^
../../dist/include/zipstruct.h:38:0: note: a different type is defined in another translation unit
 typedef struct ZipCentral_
 ^
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:47:25: note: the first difference of corresponding definitions is field ‘MOZ_Z_crc32’
   unsigned char crc32 [4];
                         ^
../../dist/include/zipstruct.h:47:0: note: a field with different name is defined in another translation unit
   unsigned char crc32 [4];
 ^
A fair try, but still not perfect :). Luckily, the problem is now not hard to guess. There are two zlib implementations linked into the Firefox and one of them redefine crc32 to MOZ_Z_crc32 (apparently to avoid ODR conflicts identified earlier by Gold). I thus decided that such warnings are verbose enough and did not add further information into this particular case. I expect to improve the diagnostic as we get more experience about individual errors reported.

Limitations of ODR detection

Even my ODR verifier is not a complete ODR nazi - it verifies only violations for types that survive into actual program and only for types where name matters (i.e. is part of type mangling). For example, if one unit define
typedef int mytype;
and the other:
typedef char mytype;
the violation will not be reported. This is because typedefs are ignored for type mangling purposes and thus I have no way to figure out that those two types are intended to be equivalent.

Other problem is, of course, that detection is limited to one binary. More conflicts can be introduced when linking with non-LTO library either statically, dynamically or at runtime via dlopen.

Detecting virtual table ODR violations

ODR relates not only to types, but also to functions and methods. While it is very hard to compare two methods coming from different unit because the body may differ in representation (because each is compiled with different flags or optimized in different context) but it may have the same semanticsI can however easily do is to compare virtual table. Mismatches in those are especially deadly because they may result in dispatch to a wrong virtual function.

Here is an example of ODR violation in Firefox:
../../dist/include/mozilla/a11y/DocAccessible.h:40:0: warning: type ïstruct DocAccessibleï violates one definition rule [enabled by default]
 class DocAccessible : public HyperTextAccessibleWrap,
 ^
/aux/hubicka/firefox/accessible/src/generic/DocAccessible.h:40:0: note: a type with the same name but different virtual table is defined in another translation unit
 class DocAccessible : public HyperTextAccessibleWrap,
 ^
/aux/hubicka/firefox/accessible/src/generic/DocAccessible.cpp:1232:0: note: the first different method is ïHandleAccEventï
 DocAccessible::HandleAccEvent(AccEvent* aEvent)
 ^
/aux/hubicka/firefox/accessible/src/atk/AccessibleWrap.cpp:956:0: note: a conflicting method is ïHandleAccEventï
 AccessibleWrap::HandleAccEvent(AccEvent* aEvent)
 ^

How common are ODR violations?

Fairly common :) Out of 3 programs (GCC, Firefox and Libreoffice) I tried to build with ODR checking, each of them had type ODR violations one of them (Firefox) had violation in virtual table. Today I leant that LLVM team has fixed ODR violations based on new warning, too. What are the most common causes?

Random type name clashes

Each time you create type with short name in global namespace (say hash_t), you risk clash with other units doing the same for completely unrelated reason. This is a good motivation to use namespaces and anonymous namespaces (though the anonymous namespaces are causing problems with GDB debugging, see PR16874). A firefox example looks as follows:
/aux/hubicka/firefox4/rdf/base/nsInMemoryDataSource.cpp:149:0: warning: type ‘struct Entry’ violates one definition rule [-Wodr]
 struct Entry {
 ^
/aux/hubicka/firefox4/gfx/skia/trunk/src/core/SkFlattenable.cpp:66:0: note: a different type is defined in another translation unit
 struct Entry {
 ^
/aux/hubicka/firefox4/rdf/base/nsInMemoryDataSource.cpp:150:0: note: the first difference of corresponding definitions is field ‘mHdr’
     PLDHashEntryHdr mHdr;
 ^
/aux/hubicka/firefox4/gfx/skia/trunk/src/core/SkFlattenable.cpp:67:0: note: a field with different name is defined in another translation unit
     const char*             fName;
 ^

Some examples come in this thread: http://nabble.documentfoundation.org/Re-Performance-samples-for-LibreOffice-td4119856.html.

Cut&paste

Copying code from one unit to another and adjusting the datastructures for new purpose is a common way to introduce name clashes. For example in GCC we have plenty of structures expr and occr that all originate from gcse.c that implemented the first dataflow based global optimizer.

Again, a Firefox example:
/aux/hubicka/firefox4/netwerk/sctp/datachannel/DataChannel.h:57:0: warning: type ‘struct BufferedMsg’ violates one definition rule [-Wodr]
 class BufferedMsg
 ^
../../../../dist/include/mozilla/net/DataChannel.h:57:0: note: a different type is defined in another translation unit
/aux/hubicka/firefox4/netwerk/sctp/datachannel/DataChannel.h:64:26: note: the first difference of corresponding definitions is field ‘mSpa’
   struct sctp_sendv_spa *mSpa;
                          ^
../../../../dist/include/mozilla/net/DataChannel.h:64:0: note: a field of same name but different type is defined in another translation unit
/aux/hubicka/firefox4/netwerk/sctp/datachannel/../src/usrsctp.h:198:8: note: type ‘struct sctp_sendv_spa’ should match type ‘struct sctp_sendv_spa’ but is defined in different namespace  
 struct sctp_sendv_spa {
        ^
../../../../dist/include/mozilla/net/DataChannel.h:60:0: note: the incompatible type is defined here
Here the type declaration is unchanged, but one of types it is built from has changed. This case shows also a namespace clash.

"Upgrading" codebase from C to C++

Because C does not have ODR, updating compiling code that was originally written as a C code and later upgraded to C++ needs audit for ODR violations. This is what the warning found for GCC: https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01402.html

Linking multiple versions of given library into the program

This is particularly common case for large programs. For example:
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/../FFmpegDecoderModule.h:18:0: note: a type with different memory representation is defined in another translation unit
 class FFmpegDecoderModule : public PlatformDecoderModule
 ^
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/include/libavcodec/avcodec.h:985:0: warning: type ‘struct AVFrame’ violates one definition rule [-Wodr]
 typedef struct AVFrame {
 ^
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav54/include/libavcodec/avcodec.h:989:0: note: a different type is defined in another translation unit
 typedef struct AVFrame {
 ^
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/include/libavcodec/avcodec.h:997:0: note: the first difference of corresponding definitions is field ‘data’
     uint8_t *data[AV_NUM_DATA_POINTERS];
 ^
is one of many warnings issued during Firefox build.

Conditional compilation

In some cases, that looked particularly dangerous for me, there are fields that are compiled conditionally (such #ifdef DEBUG) and the condition is not set consistently across the whole program.This is also quite common case in Firefox. It usually shows itself as a warning at same position but complaining about mismatch of field names.

Non-LTO and whole program ODR checker

It seems that ODR violations are very common bugs in large C++ programs. Majority of the ODR problems found are "safe" in a way that they will not result in wrong code: the types in question are never used in overloaded functions or do not have methods attached. There are was however few real bugs and many of those random clashes have a potential to become real bugs later.

Maintainers of the programs usually welcome the reported issues and fixed them quite promptly. Especially welcome response came from Michael Meeks:
Jan - wow - that is a nice error =) are there any other ODR issues ?
they habitually bite us hard so ... great to get libmerged debugged even
more. CC'ing the list too.
It seems to me it would make sense to implement ODR checking that is not dependent on LTO and do not suffer from limitations of gold in the following way:
  1. Extend DWARF info by mangled names of types.
  2. Implement ODR based debug info merging in linker and warn on mismatches.

    Merging of ODR types would likely reduce debug info size considerably (judging from GCC and LLVM memory footprint issues).

    Linker has also chance to examine shared libraries linked into the program as long as they do have the annotated debug info available. This is not possible in my LTO implementation.
  3. Implement dynamic library that will optionally verify types at dynamic linking time. This way one can report issues caused by dlopen.
I am quite sure more interesting cases would be found. Sadly I do not think I will have time to implement it :(

7 comments:

  1. Here's a situation I encountered recently that I think is a violation of the ODR; perhaps you could explain to me a bit more precisely than I understand what the problem is, and if your new ODR spotter would have got it.

    I was seeing, on a Linux build of a cross-platform program (did not see the issue on a windows build), a segFault. A segFault happening after main had finished, so I eventually traced it to a double destruction of a string.

    When I went looking, this string was defined as a static string in the header of a particular source file. This source file was part of the main executable, and also a library that was dynamically linked by the main executable at run time. I opened up the executable and the library to double check and, sure enough, identical symbols in each.

    When I watched this object carefully (thankfully, it got put into the same memory address on consecutive runs, so I could watch it crash once and then carefully watch that memory address on the next run) it was being constructed twice (bad and wrong but not fatal), and then destructed twice, with the second destruction essentially being a double-free on the same memory, causing the segFault.

    On a hunch, I then grepped the code looking for similar static strings that existed more than once, and found hundreds of them. I then took a break :) On a hunch, I then changed the build system so that the Linux version linked the library statically, at build-time, and the crash vanished (symptoms smothered, rather than problem fixed; thank you, Linux linker). Presumably, the different way that windows does its libraries had already smothered the symptoms over on the dynamically linked windows build).

    From what I've said, perhaps you can tell me if my understanding of the problem was correct (same object being constructed twice and destructed twice), and explain a little bit about how that happens? At startup, the executable creates it; when would the second construction happen? Would your clever ODR detectors above have got this one? Presumably they would have to be part of the linker and would have to go exploring the dynamic library.

    Hope I'm not asking too much.

    ReplyDelete
  2. The static construction/destruction is done by collecting all necessary constructors into ctors/dtors array. It depends how your string is declared. If you do static variable with construction:

    struct A {A();};
    static struct A a;

    You will get duplicate symbols (of same name) in library and main program, so double construction should not happen.

    If you forget about the "static" keyword, then you should get linker error with static linking complaining on duplicated symbol and silent double construction/destruction with shared library (because with PIC symbol interposition is allowed).

    Finally you can place the symbol into a comdat section:

    #include
    struct A {A() {printf("init\n");}};
    inline int t()
    {
    static struct A a;
    }
    main(){t();}

    In this case a global lock will be added and each unit using it will try to also invoke static constructor, but only one construction will happen.

    So I am not sure how to reach scenario you describe except for using explicit weak attribute on the string or by violating one definition rule and declaring it differently in each unit.

    ODR checking works per DSO only, so it will not have chance to catch the problem in the case clash happens between main program and a library. It should be able to tell something if the different declarations happens within one program.

    ReplyDelete
    Replies
    1. I will see if I can dig out the code and double check how exactly it was done. I seem to recall it was something like:

      static string error_one = "Error one";

      in a header file. That header file was then included in a cpp source file in the executable, and again in the dynamic library.

      I can't recall what the situation was with namespaces etc.

      Delete
    2. It would be nice to know the context where error was declared - I can definitely add some extra cases into the ODR checker if I see testcases.

      Delete
  3. Greetings,

    Problem:

    struct X
    {
    enum { vectorized_size = 128 }; // anything
    enum { alignment_size = 16 }; // anything

    typedef byte bytes[vectorized_size] __attribute__ ((aligned (alignment_size)));

    };

    // elsewhere

    typedef byte bytes[128];

    Generates an odr warning, even though these are two genuinely two different types.

    Am I missing something, or is this a type resolution issue stemming from the use of __attribute__ ?

    Thanks.

    ReplyDelete
    Replies
    1. Sorry for late reply. I missed your comment :(
      This is interesting. if the name mangling is the same, it should be the same type. Do you have a complete testcase which produce the warning?

      Delete