I did a quick test to see what the cost of different methods of function calling is. Lots of people say “Soandso is more expensive” but I rarely see anyone quantify what “more expensive” means. These tests cover only the call itself; any other overhead, such as stack modification for arguments, is ignored.
The test code is in C++ and is compiled with Microsoft Visual C++ 2010 Express using the default console project settings, in release mode with /O2 (Maximize Speed).
In my tests, I have the code broken out into main.cpp, test_impl.h and test_impl.cpp to enforce separation. I found that some functions were inlined even with __declspec(noinline) attached to them. To prevent the functions from being compiled out, I have a global ‘int test_value’ that each function simply increments.
If you’re interested in the actual timings and cost of the call, see Agner Fog’s excellent instruction tables document. I reference them for the AMD K8 processor.
Here’s the summary of the code:
__declspec(noinline) void test_function() { test_value += 1; }
void (*test_functionptr)() = test_function;

class test_class_novtable {
public:
    __declspec(noinline) void test_function() { test_value += 1; }
};

class test_class_abstract {
public:
    virtual ~test_class_abstract() {}
    virtual void test_function() = 0;
};

// Derives from test_class_abstract so the base pointer below is valid.
class test_class_abstract2 : public test_class_abstract {
public:
    virtual ~test_class_abstract2() {}
    __declspec(noinline) void test_function() { test_value += 1; }
};

test_class_abstract2 test_object_abstract2;
test_class_abstract *test_object_abstract = &test_object_abstract2;
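The summary above leaves out the loop that drives the calls. One way to get a rough wall-clock number is sketched below. It assumes a C++11 compiler with std::chrono (newer than Visual C++ 2010), and both the NOINLINE macro and the average_ns_per_call helper are hypothetical names introduced for illustration. The result includes loop overhead, so it is only useful for comparing call styles against each other.

```cpp
#include <chrono>

// Hypothetical portability macro; the original code uses __declspec(noinline) directly.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

int test_value = 0;  // global side effect so the call is not optimized away

NOINLINE void test_function() { test_value += 1; }

// Hypothetical helper: calls fn `iterations` times and returns the average
// wall-clock nanoseconds per call, loop overhead included.
template <typename Fn>
double average_ns_per_call(Fn fn, long iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling average_ns_per_call(test_function, 100000000L) for each call style gives numbers that can be compared side by side, although they will not map directly onto the cycle counts in the instruction tables.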
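The first test was a plain function call, which generated the expected assembly: a direct call to the mangled function name. Assuming I’m reading the tables for the CALL instruction correctly, it is 16-22 macro-ops and 23-32 cycles of latency. (The tables list near and far calls on separate rows, so it is worth checking which row applies to a call like this one.)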
call ?test_function@@YAXXZ
Next is the function pointer, which issues the call through a memory operand. This is slightly more expensive at 16-22 macro-ops and 24-33 cycles of latency:
call DWORD PTR ?test_functionptr@@3P6AXXZA
Next, the standard class member call, which is identical to the normal function call but with more mangling to identify the class name. The hidden ‘this’ parameter still has to be passed (MSVC’s 32-bit __thiscall convention puts it in ECX rather than on the stack), so even though the call is the same, the overall cost may not be:
call ?test_function@test_class_novtable@@QAEXXZ
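The reason a non-virtual member call looks like a normal call is that the target is known at compile time and the hidden ‘this’ is just one extra pointer argument. A minimal sketch of that equivalence follows; the calls counter and the test_function_explicit_this function are hypothetical additions for illustration, not part of the test code.

```cpp
int test_value = 0;

class test_class_novtable {
public:
    int calls;  // hypothetical per-object counter, added to make `this` observable
    test_class_novtable() : calls(0) {}
    void test_function() { calls += 1; test_value += 1; }
};

// Roughly what the member function amounts to: an ordinary function that
// receives the object's address as an explicit argument.
void test_function_explicit_this(test_class_novtable *self) {
    self->calls += 1;
    test_value += 1;
}
```

Both object.test_function() and test_function_explicit_this(&object) resolve to a fixed address, so neither needs a table lookup before the call.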
And now the virtual call through the abstract base class:
mov eax, DWORD PTR ?test_object_abstract2@@3Vtest_class_abstract2@@A
mov edx, DWORD PTR [eax+4]
mov ecx, OFFSET ?test_object_abstract2@@3Vtest_class_abstract2@@A
call edx
As it turns out, calling a virtual function is significantly more expensive, even assuming there are no cache misses. The reason is that the compiler can’t simply emit a call to a known address. It has to load the vtable pointer from the object, load the function address from the vtable, put the object’s address in ECX as ‘this’, and finally perform the indirect call.
Assuming the AMD K8 processor and no cache misses, calling a function through a virtual table costs 3 cycles of latency for each mov. That is an extra ~9 cycles per call, or roughly 28 to 39 percent on top of the 23-32 cycles quoted above for the call itself.
The vtable (and its calling cost) can be represented in C as below. The object is a pointer to a struct that holds a pointer to an array of function pointers:
struct test_vtobject {
    void (**vtable)();
};

void (*test_vtable[])() = { test_function };
test_vtobject test_vtobject_impl = { test_vtable };
test_vtobject *test_vtobject_ptr = &test_vtobject_impl;
Calling it with an integer (I reused test_value) to select the function looked like this:
test_vtobject_ptr->vtable[test_value]();
And resulted in this assembly:
mov eax, DWORD PTR ?test_value@@3HA
mov ecx, DWORD PTR ?test_vtobject_impl@@3Utest_vtobject@@A
mov edx, DWORD PTR [ecx+eax*4]
call edx
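To make the three loads easier to follow, here is a self-contained sketch of the same hand-rolled vtable with two slots. The second function, test_function_add10, and the call_slot helper are hypothetical names added for illustration; the struct and table follow the layout described above.

```cpp
int test_value = 0;

void test_function() { test_value += 1; }
void test_function_add10() { test_value += 10; }  // hypothetical second slot

struct test_vtobject {
    void (**vtable)();  // pointer to an array of function pointers
};

void (*test_vtable[])() = { test_function, test_function_add10 };
test_vtobject test_vtobject_impl = { test_vtable };
test_vtobject *test_vtobject_ptr = &test_vtobject_impl;

// Hypothetical helper: load the table pointer from the object, index into
// the table, then call the selected function indirectly.
void call_slot(test_vtobject *object, int slot) {
    object->vtable[slot]();
}
```

Each step in call_slot corresponds to one of the loads in the listing: the slot index, the table pointer, and the function address at table + slot * 4 on a 32-bit target.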
An alternative implementation is the traditional C object which ditches the array in favor of named function pointers.
struct test_cobject {
    void (*test_function0)();
    void (*test_function1)();
    void (*test_function2)();
    void (*test_function3)();
    // data
};

test_cobject test_cobject_impl = { test_fn_0, test_fn_1, test_fn_2, test_function };
test_cobject *test_cobject_ptr = &test_cobject_impl;
This implementation resulted in a plain function-pointer call:
call DWORD PTR ?test_cobject_impl@@3Utest_cobject@@A+12
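The +12 in that operand is the offset of the fourth pointer in the struct (three 4-byte pointers precede it on a 32-bit target), which is why no separate table load is needed. A runnable sketch of the named-pointer layout is below; the bodies of test_fn_0, test_fn_1 and test_fn_2 are not shown in the summary, so the ones here are hypothetical placeholders.

```cpp
int test_value = 0;

// Hypothetical bodies; only test_function matches the original test code.
void test_fn_0() { test_value += 10; }
void test_fn_1() { test_value += 20; }
void test_fn_2() { test_value += 30; }
void test_function() { test_value += 1; }

struct test_cobject {
    void (*test_function0)();
    void (*test_function1)();
    void (*test_function2)();
    void (*test_function3)();
    // data
};

test_cobject test_cobject_impl = { test_fn_0, test_fn_1, test_fn_2, test_function };
test_cobject *test_cobject_ptr = &test_cobject_impl;
```

A call such as test_cobject_ptr->test_function3() names its slot at compile time, so the compiler can address the pointer at a fixed offset instead of indexing a table at run time.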
I’ll keep playing around with it.