summaryrefslogtreecommitdiff
path: root/doc/scale.html
diff options
context:
space:
mode:
Diffstat (limited to 'doc/scale.html')
-rw-r--r--doc/scale.html210
1 files changed, 210 insertions, 0 deletions
diff --git a/doc/scale.html b/doc/scale.html
new file mode 100644
index 0000000..2e70148
--- /dev/null
+++ b/doc/scale.html
@@ -0,0 +1,210 @@
+<HTML>
+<HEAD>
+<TITLE>Garbage collector scalability</TITLE>
+</HEAD>
+<BODY>
+<H1>Garbage collector scalability</h1>
+In its default configuration, the Boehm-Demers-Weiser garbage collector
+is not thread-safe. It can be made thread-safe for a number of environments
+by building the collector with the appropriate
+<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation
+flag. This has primarily two effects:
+<OL>
+<LI> It causes the garbage collector to stop all other threads when
+it needs to see a consistent memory state.
+<LI> It causes the collector to acquire a lock around essentially all
+allocation and garbage collection activity.
+</ol>
+Since a single lock is used for all allocation-related activity, only one
+thread can be allocating or collecting at one point. This inherently
+limits performance of multi-threaded applications on multiprocessors.
+<P>
+On most platforms, the allocator/collector lock is implemented as a
+spin lock with exponential back-off. Longer wait times are implemented
+by yielding and/or sleeping. If a collection is in progress, the pure
+spinning stage is skipped. This has the advantage that uncontested and
+thus most uniprocessor lock acquisitions are very cheap. It has the
+disadvantage that the application may sleep for small periods of time
+even when there is work to be done. And threads may be unnecessarily
+woken up for short periods. Nonetheless, this scheme empirically
+outperforms native queue-based mutual exclusion implementations in most
+cases, sometimes drastically so.
+<H2>Options for enhanced scalability</h2>
+Version 6.0 of the collector adds two facilities to enhance collector
+scalability on multiprocessors. As of 6.0alpha1, these are supported
+only under Linux on X86 and IA64 processors, though ports to other
+otherwise supported Pthreads platforms should be straightforward.
+They are intended to be used together.
+<UL>
+<LI>
+Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
+run the mark phase in parallel in multiple threads, and thus on multiple
+processors. The mark phase typically consumes the large majority of the
+collection time. Thus this largely parallelizes the garbage collector
+itself, though not the allocation process. Currently the marking is
+performed by the thread that triggered the collection, together with
+<I>N</i>-1 dedicated
+threads, where <I>N</i> is the number of processors detected by the collector.
+The dedicated threads are created once at initialization time.
+<P>
+A second effect of this flag is to switch to a more concurrent
+implementation of <TT>GC_malloc_many</tt>, so that free lists can be
+built, and memory can be cleared, by more than one thread concurrently.
+<LI>
+Building the collector with -DTHREAD_LOCAL_ALLOC adds support for thread
+local allocation. It does not, by itself, cause thread local allocation
+to be used. It simply allows the use of the interface in
+<TT>gc_local_alloc.h</tt>.
+<P>
+Memory returned from thread-local allocators is completely interchangeable
+with that returned by the standard allocators. It may be used by other
+threads. The only difference is that, if the thread allocates enough
+memory of a certain kind, it will build a thread-local free list for
+objects of that kind, and allocate from that. This greatly reduces
+locking. The thread-local free lists are refilled using
+<TT>GC_malloc_many</tt>.
+<P>
+An important side effect of this flag is to replace the default
+spin-then-sleep lock to be replace by a spin-then-queue based implementation.
+This <I>reduces performance</i> for the standard allocation functions,
+though it usually improves performance when thread-local allocation is
+used heavily, and thus the number of short-duration lock acquisitions
+is greatly reduced.
+</ul>
+<P>
+The easiest way to switch an application to thread-local allocation is to
+<OL>
+<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
+and then include the <TT>gc.h</tt>
+header in each client source file.
+<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
+<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
+and/or <TT>GC_GCJ_MALLOC</tt>.
+</ol>
+<H2>The Parallel Marking Algorithm</h2>
+We use an algorithm similar to
+<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
+Endo, Taura, and Yonezawa</a> at the University of Tokyo.
+However, the data structures and implementation are different,
+and represent a smaller change to the original collector source,
+probably at the expense of extreme scalability. Some of
+the refinements they suggest, <I>e.g.</i> splitting large
+objects, were also incorporated into out approach.
+<P>
+The global mark stack is transformed into a global work queue.
+Unlike the usual case, it never shrinks during a mark phase.
+The mark threads remove objects from the queue by copying them to a
+local mark stack and changing the global descriptor to zero, indicating
+that there is no more work to be done for this entry.
+This removal
+is done with no synchronization. Thus it is possible for more than
+one worker to remove the same entry, resulting in some work duplication.
+<P>
+The global work queue grows only if a marker thread decides to
+return some of its local mark stack to the global one. This
+is done if the global queue appears to be running low, or if
+the local stack is in danger of overflowing. It does require
+synchronization, but should be relatively rare.
+<P>
+The sequential marking code is reused to process local mark stacks.
+Hence the amount of additional code required for parallel marking
+is minimal.
+<P>
+It should be possible to use generational collection in the presence of the
+parallel collector, by calling <TT>GC_enable_incremental()</tt>.
+This does not result in fully incremental collection, since parallel mark
+phases cannot currently be interrupted, and doing so may be too
+expensive.
+<P>
+Gcj-style mark descriptors do not currently mix with the combination
+of local allocation and incremental collection. They should work correctly
+with one or the other, but not both.
+<P>
+The number of marker threads is set on startup to the number of
+available processors (or to the value of the <TT>GC_NPROCS</tt>
+environment variable). If only a single processor is detected,
+parallel marking is disabled.
+<P>
+Note that setting GC_NPROCS to 1 also causes some lock acquisitions inside
+the collector to immediately yield the processor instead of busy waiting
+first. In the case of a multiprocessor and a client with multiple
+simultaneously runnable threads, this may have disastrous performance
+consequences (e.g. a factor of 10 slowdown).
+<H2>Performance</h2>
+We conducted some simple experiments with a version of
+<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
+run multiple concurrent client threads in the same address space.
+Each client thread does the same work as the original benchmark, but they share
+a heap.
+This benchmark involves very little work outside of memory allocation.
+This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
+under Linux 2.2.12.
+<P>
+Running with a thread-unsafe collector, the benchmark ran in 9
+seconds. With the simple thread-safe collector,
+built with <TT>-DLINUX_THREADS</tt>, the execution time
+increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
+(The times for the <TT>malloc</tt>/i<TT>free</tt> version
+with glibc <TT>malloc</tt>
+are 10.51 (standard library, pthreads not linked),
+20.90 (one thread, pthreads linked),
+and 24.55 seconds respectively. The benchmark favors a
+garbage collector, since most objects are small.)
+<P>
+The following table gives execution times for the collector built
+with parallel marking and thread-local allocation support
+(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
+the client using either one or two marker threads, and running
+one or two client threads. Note that the client uses thread local
+allocation exclusively. With -DTHREAD_LOCAL_ALLOC the collector
+switches to a locking strategy that is better tuned to less frequent
+lock acquisition. The standard allocation primitives thus peform
+slightly worse than without -DTHREAD_LOCAL_ALLOC, and should be
+avoided in time-critical code.
+<P>
+(The results using <TT>pthread_mutex_lock</tt>
+directly for allocation locking would have been worse still, at
+least for older versions of linuxthreads.
+With THREAD_LOCAL_ALLOC, we first repeatedly try to acquire the
+lock with pthread_mutex_try_lock(), busy_waiting between attempts.
+After a fixed number of attempts, we use pthread_mutex_lock().)
+<P>
+These measurements do not use incremental collection, nor was prefetching
+enabled in the marker. We used the C version of the benchmark.
+All measurements are in elapsed seconds on an unloaded machine.
+<P>
+<TABLE BORDER ALIGN="CENTER">
+<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th>
+<TH>2 marker threads (secs.)</th></tr>
+<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td>
+<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td>
+</table>
+<PP>
+The execution time for the single threaded case is slightly worse than with
+simple locking. However, even the single-threaded benchmark runs faster than
+even the thread-unsafe version if a second processor is available.
+The execution time for two clients with thread local allocation time is
+only 1.4 times the sequential execution time for a single thread in a
+thread-unsafe environment, even though it involves twice the client work.
+That represents close to a
+factor of 2 improvement over the 2 client case with the old collector.
+The old collector clearly
+still suffered from some contention overhead, in spite of the fact that the
+locking scheme had been fairly well tuned.
+<P>
+Full linear speedup (i.e. the same execution time for 1 client on one
+processor as 2 clients on 2 processors)
+is probably not achievable on this kind of
+hardware even with such a small number of processors,
+since the memory system is
+a major constraint for the garbage collector,
+the processors usually share a single memory bus, and thus
+the aggregate memory bandwidth does not increase in
+proportion to the number of processors.
+<P>
+These results are likely to be very sensitive to both hardware and OS
+issues. Preliminary experiments with an older Pentium Pro machine running
+an older kernel were far less encouraging.
+
+</body>
+</html>