new
and delete
operators) then most likely you were faced with performance
degradation problem at multiprocessor systems. You program is running fine at computer with single CPU.
Then you try to run it at the system with 2 or more processors and instead of expected
increase of performance 2 or more times, you find out that you application runs ten or more times slower!
What is happening? Why adding more CPU significantly decrease speed of your application?
One of the possible answers is memory allocator. Standard system memory allocator use
mutexes to prevent concurrent access to allocator structures and preserve their consistency.
Locking mutex is not a problem when there are no lock conflicts - it is just few atomic assembler instructions.
But at multiprocessor platform, different threads concurrently invokes memory allocator and as a result
we have a large number of lock conflicts. So most of the time is now spent is locking/unlocking mutexes.
Even if threads in your application are working autonomously (thread is accessing objects created only by
itself) and so require no synchronization, there will be still a lot of conflicts between threads
caused by memory allocator. As a result instead of expected linear increase of performance with adding new
CPUs, you will have exponential degradation of performance.How to solve the problem? The obvious solution is to avoid creation of objects in dynamic memory at all and allocate objects on stack instead. It is almost the ideal solution fro performance point of view. Unfortunately it is not always possible or convenient (for example thread stack has also limited size so we can not create large number of objects in the stack). Another solution is to have separate memory allocator for each thread, so that each allocator can allocate memory without interaction with other allocators. The product contains implementation of such allocator.
malloc
. When allocator is requested to allocate memory, it rounds requested size
to nearest block size. If chain of block with this size is not empty, object is allocated by just unlinking
block from the chain. If the chain is empty, then allocator allocates new page (default size is 4Kb),
split it into blocks of requested size and add all these blocks to the correspondent chain.
Page is also allocated using standard malloc
. When object is deallocated, it is linked in
the correspondent chain.So you can see, that most frequently allocate and free operation requires no synchronization at all - them are just implemented as L1 list link/unlink. So in case when thread works locally (objects are deallocated from the same thread which allocates them), performance is very high. And adding extra processors can only speedup you application. But sometimes data exchange between threads is needed (so objects created in one thread are sent to another thread where them are deallocated). In this case it is not possible to avoid synchronization at all. But it is possible to reduce it overhead. Instead of using one centralized mutex, this algorithm creates one mutex per thread. When free method find out that object was allocated by some other thread, it lock the mutex of this thread and add this block to the pending requests list. Each thread checks this list at each allocate operation. When number of pending requests exceeds some specified threshold, thread lock the mutex, store head of the list, reset the list, unlocks the mutex and then deallocate all these elements. So time of holding mutex is minimal, number of locks is significantly reduces and locking can block only one thread (instead of centralized lock which can block all running threads).
When thread is terminated, algorithm checks counters of live objects at each page and deallocate pages with zero counter of used objects. Other pages will be deallocated later by other threads which will deallocate these living objects.
void* thread_alloc(size_t size); void* thread_realloc(void* addr, size_t size); void thread_free(void* addr);You can use them instead of standard
malloc, free
and realloc
in your C application.
Just include "threadalloc.h"
header file and "threadalloc.obj"
file
in the linker list for your project.
In C++ application you can redefine default new
and delete
operators if you
compile your application with included everywhere "threadalloc.h"
header file
and defined REDEFINE_DEFAULT_NEW_OPERATOR
macro.
This allocator can be build using make utility. Makefile for Microsoft Visual C++ (makefile.mvc
)
and GCC (makefile
) are provided. Command file make.bat
can be used
to invoke Microsoft nmake
utility for code>makefile.mvc.
Make will build allocator object file and four tests: stdalloc_test1, stdalloc_test2,
threadalloc_test1, threadalloc_test2
. Test 1 implements model where thread works in isolation from
each other (allocates and deallocates objects locally). In this case performance benefit
comparing with standard allocator are most meaningful. Test 2 implements model of communication
between two threads: object are created in one thread (producer) and deleted in another thread (consumer).
The table below summarize results for single and multiprocess systems (time of test execution in seconds).
The tests were performed under Scientific Linux 2.6.15.5 for x86_64.
Test | Description of test | Intel(R) Pentium(R) 4 CPU 3.00GHz [Hyperthreading=On] | Intel(R) Pentium(R) 4 CPU 3.00GHz [Hyperthreading=Off] | Dual Intel(R) Xeon(TM) CPU 3.00GHz [Hyperthreading=On] | Dual Intel(R) Xeon(TM) CPU 3.00GHz [Hyperthreading=Off] |
---|---|---|---|---|---|
stdalloc_test1 | standard allocator with 4 threads working in isolation | 6.596 | 7.948 | 4.797 | 5.014 |
stdalloc_test2 | standard allocator with producer and consumer threads | 10.853 | 9.986 | 11.200 | 9.613 |
threadalloc_test1 | this allocator with 4 threads working in isolation | 2.711 | 1.690 | 1.079 | 0.837 |
threadalloc_test2 | this allocator with producer and consumer threads | 8.719 | 7.848 | 7.804 | 6.966 |
As you can see from these result, proposed allocator is 2-3 times faster than standard allocator at single processor system and almost 100 times faster at dual processor platform! Certainly there are many other factors which are limiting scalability of real application except memory allocator, but I hope that using this allocator can really help you in improving performance of your application.
I will provide e-mail support of this product.
Look for new version at my homepage | E-Mail me about bugs and problems