<h1 id="system-numerics-vectors-are-now-accelerated-in-mono">System.Numerics.Vectors are now accelerated in Mono</h1>
<p><em>2016-12-20</em></p>
<p>While Mono has had support for SIMD instructions in the form of the
<a href="http://tirania.org/blog/archive/2008/Nov-03.html">Mono.SIMD API</a>,
this API was limited to x86 platforms.</p>
<p>.NET introduced the
<a href="https://msdn.microsoft.com/en-us/library/dn858218(v=vs.111).aspx">System.Numerics.Vectors</a>
API which sports a more general design that adapts to the SIMD
registers available on different platforms.</p>
<p>The <code>master</code> branch of Mono now treats the various Vector operations
as runtime intrinsics, so they are hardware accelerated. They are
supported both by the Mono JIT compiler on x86-64 platforms and via
LLVM’s optimizing compiler on x86-64 and every other Mono/LLVM-supported
platform.</p>
<p>We would love to see you try it and share your experience with us.</p>
<p><em>Miguel de Icaza</em></p>
<h1 id="future-of-monos-net-code-sharing">Future of Mono’s .NET Code Sharing</h1>
<p><em>2016-11-29</em></p>
<h1 id="monos-use-of-reference-source-code">Mono’s use of Reference Source Code</h1>
<p>Since the first release of the .NET open source code, the Mono project
has been integrating the published <a href="https://github.com/microsoft/referencesource">Reference
Source</a> code into Mono.</p>
<p>We have been using the Reference Source code instead of the
<a href="https://github.com/dotnet/corefx">CoreFX</a> code, as Mono
implements a larger API surface than what is exposed there: Mono
implements the full .NET Desktop API.</p>
<p>Integrating the code is sometimes easy: we replace the code in Mono
with the reference source code. When the code is not entirely
portable, we need to make it portable, either writing the missing code
or merging work that we had already done in Mono with the code that
existed in .NET. In some cases porting the code is just too
complicated, and we have not been able to do the work.</p>
<p>We keep track of the <a href="https://trello.com/b/vRPTMfdz/net-framework-integration-into-mono">major
items</a>
in Trello.</p>
<p>Originally, we <a href="https://github.com/mono/referencesource">forked the reference source
code</a> and kept a
<a href="https://github.com/mono/referencesource/tree/mono">branch</a> with our
code, but it was too large an external dependency and it was also a
mostly static module. So recently, we started copying the
code that we needed <a href="https://github.com/mono/mono/tree/master/mcs/class/referencesource">directly into
Mono</a>.</p>
<p>While this has worked fine for a while, the .NET Reference Source is
only updated when a major .NET release takes place, and it tracks the
version of .NET that ships along with Windows. This means that we are
missing out on many of the great optimizations and improvements that are
happening as part of <a href="https://www.microsoft.com/net/core">.NET Core</a>.</p>
<p>The optimizations and fine-tuning typically take place in two places:
work that goes into <code>mscorlib.dll</code> is maintained in the
<a href="https://github.com/dotnet/coreclr">coreclr</a> module, while the higher
level frameworks are maintained in the
<a href="https://github.com/dotnet/corefx">corefx</a> module.</p>
<h1 id="a-new-approach">A New Approach</h1>
<p>Many of the APIs that were originally removed from CoreFX are now
being added back, so we can start to consider switching away from
referencesource and into CoreFX.</p>
<p>After discussing with the .NET Core team, we came up with a better
approach for the long-term maintenance of the shared code.</p>
<p>The .NET team has now set up a new repository where the cleaned up and
optimized version of <code>mscorlib.dll</code> will live in the
<a href="https://github.com/dotnet/corert">corert</a> module.</p>
<p>What we will do is submodule both CoreRT and CoreFX and replace our
manually copied code from the Reference Source with code from CoreRT
and CoreFX.</p>
<p>The twist is that for scenarios where Mono’s API surface is larger, we
will contribute changes back to CoreRT/CoreFX where we either add
support for the larger API surface, or we make the API pluggable
(likely with a tasteful use of the <code>partial</code> modifier).</p>
<p>One open issue is that Mono has historically used a single set of
framework libraries (like mscorlib.dll, System.dll, etc.) that work
across Linux, macOS, Unix and Windows by dynamically detecting
how to behave based on the platform. This is useful in scenarios
where you want to bootstrap work on one platform by using another
one, as the framework libraries are identical.</p>
<p>CoreFX takes a different approach: the libraries are tied to a
particular platform flavor.</p>
<p>This means that some of the work that we will have to do will involve
either adjusting the CoreFX code to work in the way that Mono works,
or giving up on our tradition of having the same assemblies work across
all platforms.</p>
<p><em>Miguel de Icaza</em></p>
<h1 id="a-tale-of-an-impossible-bug-biglittle-and-caching">A tale of an impossible bug: big.LITTLE and caching</h1>
<p><em>2016-09-12</em></p>
<p>When someone says multi-core, we unconsciously think SMP. That worked out well for us until recently when ARM announced big.LITTLE.
ARM’s <a href="https://en.wikipedia.org/wiki/ARM_big.LITTLE">big.LITTLE architecture</a>
is the first mass produced
<a href="http://www.embedded.com/design/mcus-processors-and-socs/4429496/Multicore-basics">AMP architecture</a>
and as we’ll see next, it raises the bar for how hard multi-core programming is.</p>
<h1 id="a-tale-of-an-impossible-bug">A tale of an impossible bug</h1>
<p>It all started with a <a href="https://bugzilla.xamarin.com/show_bug.cgi?id=39859">bug report</a> against a phone with such a CPU, the Exynos chipset used on Samsung phones in Europe.
Apps created with our software were dying with <code>SIGILL</code> at completely random places.
Nothing could reasonably explain what was happening, and the crashes were occurring on valid instructions. This immediately made us suspect bad
instruction cache flushing.</p>
<p>After reviewing all JIT code around cache flushing we were sure that we were calling <code>__clear_cache</code> properly. That led us to look around
for how other
<a href="https://github.com/v8/v8/blob/fec99c689b8587b863df4a5c4793c601772ef663/src/arm64/cpu-arm64.cc#L40">virtual machines</a>
or
<a href="https://github.com/llvm-mirror/compiler-rt/blob/ff75f2a0260b1940436a483413091c5770427c04/lib/builtins/clear_cache.c#L146">compilers</a>
do cache flushing on ARM64, and we found out about some related
<a href="https://silver.arm.com/download/Unspecified/BX500-DA-10400-r0p0-08rel0/Cortex_A53_MPCore_Software_Developers_Errata_Notice_v18.pdf">errata on the Cortex A53</a>. ARM’s description of those issues
is both cryptic and vague, but we tried the workaround anyway. No luck there.</p>
<p>Next we went with the other usual suspects. A lying signal handler? Nope. Funky userspace CPU emulation? No.
Broken <code>libc</code> implementation? Nice try. Faulty hardware? We reproduced it on multiple devices. Bad luck or karma? Yes!</p>
<p>Some of us could not sleep with such an amazing puzzle in front of us and kept staring at
memory dumps around failure sites. And there was this funny thing: the fault address was always on the third or fourth line of the memory dumps.</p>
<p><a href="/images/2016-09-10-arm64-icache_hexdump.png"><img src="/images/2016-09-10-arm64-icache_hexdump.png" alt="hexdump" /></a></p>
<p>This was our only clue, and there are no coincidences when it comes to this sort of byzantine bug. Our memory dumps were of 16 bytes per line
and the <code>SIGILL</code> would <a href="https://gist.github.com/lewurm/97dff0a56929b56a0fc5ab49af06fd06">always happen to be somewhere between</a> <code>0x40-0x7f</code> or <code>0xc0-0xff</code>.
We aligned the memory dump to help verify whether the code allocator was doing something funky:</p>
<pre><code class="language-bash">$ grep SIGILL *.log
custom_01.log:E/mono (13964): SIGILL at ip=0x0000007f4f15e8d0
custom_02.log:E/mono (13088): SIGILL at ip=0x0000007f8ff76cc0
custom_03.log:E/mono (12824): SIGILL at ip=0x0000007f68e93c70
custom_04.log:E/mono (12876): SIGILL at ip=0x0000007f4b3d55f0
custom_05.log:E/mono (13008): SIGILL at ip=0x0000007f8df1e8d0
custom_06.log:E/mono (14093): SIGILL at ip=0x0000007f6c21edf0
[...]
</code></pre>
<p>With that we came to our first good hypothesis: Bad cache flushing was happening only on the upper 64 bytes of every 128-byte block.
Those numbers, if you deal with low level programming, immediately remind you of cache line sizes. And that is where it all started to make
sense.</p>
<p>Here is a pseudo version of how <code>libgcc</code> <a href="https://android.googlesource.com/toolchain/gcc/+/master/gcc-4.9/libgcc/config/aarch64/sync-cache.c#54">does cache flushing on arm64</a>:</p>
<pre><code class="language-c">void __clear_cache (char *address, size_t size)
{
static int cache_line_size = 0;
if (!cache_line_size)
cache_line_size = get_current_cpu_cache_line_size ();
for (int i = 0; i < size; i += cache_line_size)
flush_cache_line (address + i);
}
</code></pre>
<p>In the above pseudo-code <code>get_current_cpu_cache_line_size</code> is a CPU instruction that returns the line size of its caches, and <code>flush_cache_line</code>
flushes the cache line that contains the supplied address.</p>
<p>At that point we were using our own version of this function, so we instrumented it to print the cache line size as returned by the CPU and, lo and behold,
it printed both 128 and 64. We double-checked that this was indeed the case. So we went to see that particular CPU manual, and it turns out that the big core has
a 128-byte cache line, but the LITTLE core’s instruction cache line is only 64 bytes.</p>
<p>So what was happening is that <code>__clear_cache</code> would be called first on a big core and cache 128 as the instruction cache line size. Later it would be called on one
of the LITTLE cores and would skip every other cache line when flushing. It doesn’t get simpler than that. We removed the caching and it all worked.</p>
<h1 id="summary">Summary</h1>
<p>Some ARM big.LITTLE CPUs can have cores with different cache line sizes, and pretty much no code out there is ready to deal with this, as it
assumes all cores to be symmetrical.</p>
<p>Worse, not even the ARM ISA is ready for this. An astute reader might realize that querying the cache line size on every invocation
is not enough for user-space code:
a process can be scheduled onto a different CPU while executing
<code>__clear_cache</code>, so the line size it just read might no longer be
valid for the core it now runs on.
Therefore, we have to use a global minimum of the cache line sizes across all CPUs.
Here is our fix for Mono: <a href="https://github.com/mono/mono/pull/3549">Pull Request</a>.
Other projects have already adopted our fix as well: <a href="https://github.com/dolphin-emu/dolphin/pull/4204">Dolphin</a> and <a href="https://github.com/hrydgard/ppsspp/pull/8769">PPSSPP</a>.</p>
<p><em>Rodrigo Kumpera and Bernhard Urban</em></p>
<h1 id="profiler-stability-sampling-and-managed-allocators">Profiler Stability: Sampling and Managed Allocators</h1>
<p><em>2016-09-07</em></p>
<p>This is the first in a series of posts I’ll be writing on the work we’ve been
doing to improve the stability of Mono’s
<a href="/docs/debug+profile/profile/profiler/">log profiler</a>. All improvements
detailed in these blog posts are included in Mono 4.6.0, featuring version 1.0
of the profiler. Refer to the
<a href="/docs/about-mono/releases/4.6.0/">release notes</a> for the full list of changes
and fixes.</p>
<p>The problem we’ll be looking at today is a crash that arose when running the
profiler in sampling mode (i.e. <code>mono --profile=log:sample foo.exe</code>) together
with the <a href="/docs/advanced/garbage-collector/sgen/">SGen garbage collector</a>.</p>
<h2 id="the-problem">The Problem</h2>
<p>SGen uses so-called managed allocators to perform very fast allocations from
the nursery (generation 0). These managed allocators are generated by the Mono
runtime and contain specialized code for allocating particular kinds of
types (small objects, strings, arrays, etc). One important invariant that
managed allocators rely on is that a garbage collection absolutely cannot
proceed if any suspended thread was executing a managed allocator at the time
it was suspended. This allows the managed allocators to do their thing without
taking the global GC lock, which is a huge performance win for multithreaded
programs. Should this invariant be broken, however, the state of the managed
heap is essentially undefined and all sorts of bad things will happen when the
GC proceeds to scan and sweep.</p>
<p>Unfortunately, the way that sampling works caused this invariant to be broken.
When the profiler is running in sampling mode, it periodically sends out a
signal (e.g. <code>SIGPROF</code>) whose signal handler will collect a managed stack trace
for the target thread and write it as an event to the log file. There’s nothing
actually wrong with this. However, the way that SGen checks whether a thread is
currently executing a managed allocator is as follows (simplified a bit from
the actual source code):</p>
<pre><code class="language-c">static gboolean
is_ip_in_managed_allocator (MonoDomain *domain, gpointer ip)
{
/*
* ip is the instruction pointer of the thread, as obtained by the STW
* machinery when it temporarily suspends the thread.
*/
MonoJitInfo *ji = mono_jit_info_table_find_internal (domain, ip, FALSE, FALSE);
if (!ji)
return FALSE;
MonoMethod *m = mono_jit_info_get_method (ji);
return sgen_is_managed_allocator (m);
}
</code></pre>
<p>To understand why this code is problematic, we must first take a look at how
signal delivery works on POSIX systems. By default, a signal can be delivered
while another signal is still being handled. For example, if your program has
two separate signal handlers for <code>SIGUSR1</code> and <code>SIGUSR2</code>, both of which simply
do <code>while (1);</code> to spin forever, and you send <code>SIGUSR1</code> followed by <code>SIGUSR2</code>
to your program, you’ll see a stack trace looking something like this:</p>
<pre><code class="language-gdb">(gdb) bt
#0 0x0000000000400564 in sigusr2_signal_handler ()
#1 <signal handler called>
#2 0x0000000000400553 in sigusr1_signal_handler ()
#3 <signal handler called>
#4 0x00000000004005d3 in main ()
</code></pre>
<p>Unsurprisingly, if we print the instruction pointer, we’ll find that it’s
pointing into the most recently called signal handler:</p>
<pre><code class="language-gdb">(gdb) p $pc
$1 = (void (*)()) 0x400564 <sigusr2_signal_handler+15>
</code></pre>
<p>This matters because SGen uses a signal of its own to suspend threads in its
STW machinery. So that means we have two signals in play: The sampling signal
used by the profiler, and the suspend signal used by SGen. Both signals can
arrive at any point, including while the other is being handled. In an
allocation-heavy program, we could very easily see a stack looking something
like this:</p>
<pre><code class="language-gdb">#0 suspend_signal_handler ()
#1 <signal handler called>
#2 profiler_signal_handler ()
#3 <signal handler called>
#4 AllocSmall ()
...
</code></pre>
<p>(Here, <code>AllocSmall</code> is the managed allocator.)</p>
<p>Under normal (non-profiling) circumstances, it would look like this:</p>
<pre><code class="language-gdb">#0 suspend_signal_handler ()
#1 <signal handler called>
#2 AllocSmall ()
...
</code></pre>
<p>Mono installs signal handlers using the <code>sigaction</code> function with the
<code>SA_SIGINFO</code> flag. This means that the signal handler will receive a bunch of
information in its second and third arguments. One piece of that information
is the instruction pointer of the thread before the signal handler was invoked.
This is the instruction pointer that is passed to the
<code>is_ip_in_managed_allocator</code> function. So when we’re profiling, we pass an
instruction pointer that points into the middle of <code>profiler_signal_handler</code>,
while in the normal case, we pass an instruction pointer that points into
<code>AllocSmall</code> as expected.</p>
<p>So now the problem is clear: In the normal case, SGen detects that the thread
is executing a managed allocator, and therefore waits for it to finish
executing before truly suspending the thread. But in the profiling case, SGen
thinks that the thread is just executing any arbitrary, unimportant code that
it can go right ahead and suspend, even though we know for a fact (from
inspecting the stack in GDB) that it is actually in the middle of executing a
managed allocator.</p>
<p>When this situation arises, SGen will go right ahead and perform a full
garbage collection, even though it’ll see an inconsistent managed heap as a
result of the managed allocator being in the middle of modifying the
heap. Entire books could be written about the incorrect behavior that could
result from this under different circumstances, but long story short, your
program would crash in random ways.</p>
<h2 id="potential-solutions">Potential Solutions</h2>
<p>At first glance, this didn’t seem like a hard problem to solve. My initial
thought was to set up the signal handler for <code>SIGPROF</code> in such a way that the
GC suspend signal would be blocked while the handler is executing. Thing is,
this would’ve actually fixed the problem, but not in a general way. After all,
who’s to say that some other signal couldn’t come along and trigger this
problem? For example, a programmer might P/Invoke some native code that sets up
a signal handler for some arbitrary purpose. The user would then want to send
that signal to the Mono process and not have things break left and right. We
can’t reasonably ask every Mono user to block the GC suspend signal in any
signal handlers they might set up, directly or indirectly. Even if we could, it
isn’t a good idea, as we really want the GC suspend signal to arrive in a
timely fashion so that the STW process doesn’t take too long (resulting in long
pause times).</p>
<p>OK, so that idea wasn’t really gonna work. Another idea that we considered was
to unwind the stack of the suspended thread to see if any stack frame matches a
managed allocator. This would solve the problem and would work for any kind of
signal that might arrive before a GC suspend signal. Unfortunately, unwinding
the stack of a suspended thread is not particularly easy, and it can’t be done
in a portable way at all. Also, stack unwinding is not exactly a cheap affair,
and the last thing we want is to make the STW machinery slower; nobody wants
longer pause times.</p>
<p>Finally, there was the nuclear option: Disable managed allocators completely
when running the profiler in sampling mode. Sure, it would make this problem go
away, but as with the first idea, SGen would still be susceptible to this
problem with other kinds of signals when running normally. More importantly,
this would result in misleading profiles since the program would spend more
time in the native SGen allocation functions than it would have in the
managed allocators when not profiling.</p>
<p>None of the above are really viable solutions.</p>
<h2 id="the-fix">The Fix</h2>
<p>Conveniently, SGen has this feature called critical regions. They’re not quite
what you might think: they have nothing to do with mutexes. Let’s take a look
at how SGen allocates a <code>System.String</code> in the native allocation functions:</p>
<pre><code class="language-c">MonoString *
mono_gc_alloc_string (MonoVTable *vtable, size_t size, gint32 len)
{
TLAB_ACCESS_INIT;
ENTER_CRITICAL_REGION;
MonoString *str = sgen_try_alloc_obj_nolock (vtable, size);
if (str)
str->length = len;
EXIT_CRITICAL_REGION;
if (!str) {
LOCK_GC;
str = sgen_alloc_obj_nolock (vtable, size);
if (str)
str->length = len;
UNLOCK_GC;
}
return str;
}
</code></pre>
<p>The actual allocation logic (in <code>sgen_try_alloc_obj_nolock</code> and
<code>sgen_alloc_obj_nolock</code>) is unimportant. The important bits are
<code>TLAB_ACCESS_INIT</code>, <code>ENTER_CRITICAL_REGION</code>, and <code>EXIT_CRITICAL_REGION</code>. These
macros are defined as follows:</p>
<pre><code class="language-c">#define TLAB_ACCESS_INIT SgenThreadInfo *__thread_info__ = mono_native_tls_get_value (thread_info_key)
#define IN_CRITICAL_REGION (__thread_info__->client_info.in_critical_region)
#define ENTER_CRITICAL_REGION do { mono_atomic_store_acquire (&IN_CRITICAL_REGION, 1); } while (0)
#define EXIT_CRITICAL_REGION do { mono_atomic_store_release (&IN_CRITICAL_REGION, 0); } while (0)
</code></pre>
<p>As you can see, they simply set a variable on the current thread for the
duration of the attempted allocation. If this variable is set, the STW
machinery will refrain from suspending the thread in much the same way as it
would when checking the thread’s instruction pointer against the code ranges of
the managed allocators.</p>
<p>So the fix to this whole problem is actually very simple: We just set up a
critical region in the managed allocator, just like we do in the native SGen
functions. That is, we wrap all the code we emit in the managed allocator like
so:</p>
<pre><code class="language-c">static MonoMethod *
create_allocator (int atype, ManagedAllocatorVariant variant)
{
// ... snip ...
MonoMethodBuilder *mb = mono_mb_new (mono_defaults.object_class, name, MONO_WRAPPER_ALLOC);
int thread_var;
// ... snip ...
EMIT_TLS_ACCESS_VAR (mb, thread_var);
EMIT_TLS_ACCESS_IN_CRITICAL_REGION_ADDR (mb, thread_var);
mono_mb_emit_byte (mb, CEE_LDC_I4_1);
mono_mb_emit_byte (mb, MONO_CUSTOM_PREFIX);
mono_mb_emit_byte (mb, CEE_MONO_ATOMIC_STORE_I4);
mono_mb_emit_i4 (mb, MONO_MEMORY_BARRIER_NONE);
// ... snip: allocation logic ...
EMIT_TLS_ACCESS_IN_CRITICAL_REGION_ADDR (mb, thread_var);
mono_mb_emit_byte (mb, CEE_LDC_I4_0);
mono_mb_emit_byte (mb, MONO_CUSTOM_PREFIX);
mono_mb_emit_byte (mb, CEE_MONO_ATOMIC_STORE_I4);
mono_mb_emit_i4 (mb, MONO_MEMORY_BARRIER_REL);
// ... snip ...
MonoMethod *m = mono_mb_create (mb, csig, 8, info);
mono_mb_free (mb);
return m;
}
</code></pre>
<p>With that done, SGen will correctly detect that a managed allocator is
executing no matter how many signal handlers may be nested on the thread.
Checking the <code>in_critical_region</code> variable also happens to be quite a bit
cheaper than looking up JIT info for the managed allocators.</p>
<h2 id="performance-implications">Performance Implications</h2>
<p>I ran this small program before and after the changes:</p>
<pre><code class="language-csharp">using System;
class Program {
public static object o;
static void Main () {
for (var i = 0; i < 100000000; i++)
o = new object ();
}
}
</code></pre>
<p>Before:</p>
<pre><code class="language-bash">real 0m0.625s
user 0m0.652s
sys 0m0.032s
</code></pre>
<p>After:</p>
<pre><code class="language-bash">real 0m0.883s
user 0m0.948s
sys 0m0.012s
</code></pre>
<p>So we’re observing about a 40% slowdown from this change on a microbenchmark
that does nothing but allocate. The slowdown comes from the managed allocators
now having to do two atomic stores, one of which also carries with it a memory
barrier with release semantics.</p>
<p>So on one hand, the slowdown is not great. But on the other hand, there is no
point in being fast if Mono is crashy as a result. To put things in
perspective, this program is still way slower without managed allocators
(<code>MONO_GC_DEBUG=no-managed-allocator</code>):</p>
<pre><code class="language-bash">real 0m7.678s
user 0m8.529s
sys 0m0.024s
</code></pre>
<p>Managed allocators are still around 770% faster. The 40% slowdown doesn’t seem
like much when you consider these numbers, especially when it’s for the purpose
of fixing a crash.</p>
<p>More detailed benchmark results from the Xamarin benchmarking infrastructure
can be found
<a href="http://open.xamarin.com/benchmarker/front-end/pullrequest.html#id=337">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>By setting up a critical region in SGen’s managed allocator methods, we’ve
fixed a number of random crashes that would occur when using sample profiling
together with the SGen GC. A roughly 40% slowdown was observed on one
microbenchmark; however, managed allocators are still around 770% faster than
SGen’s native allocation functions, so it’s a small price to pay for a more
reliable Mono.</p>
<hr />
<p>In the next post, we’ll take a look at an issue that plagued profiler users on
OS X: A crash when sending sampling signals to threads that hadn’t yet finished
basic initialization in <code>libc</code>, leading to broken TLS access. This issue forced
us to rewrite the sampling infrastructure in the runtime.</p>
<p><em>Alex Rønne Petersen</em></p>
<h1 id="lock-free-gc-handles">Lock-free GC Handles</h1>
<p><em>2016-08-16</em></p>
<p>In Mono 4.4.0, we improved the performance of GC handles by changing to a lock-free implementation. In this post we’ll take a look at the original implementation and its limitations, and how the new implementation solved these problems.</p>
<h2 id="background">Background</h2>
<p>For those unfamiliar, the <a href="https://msdn.microsoft.com/en-us/library/system.runtime.interopservices.gchandle(v=vs.110).aspx"><code>GCHandle</code></a> API provides access to the low-level GC handle primitive used to implement several types of “handles” to managed objects:</p>
<ul>
<li>
<p><strong>Normal handles</strong> prevent an object from being collected, even if no managed references to the object exist.</p>
</li>
<li>
<p><strong>Pinned handles</strong> prevent an object from being moved in memory.</p>
</li>
<li>
<p><strong>Weak references</strong> reference an object <em>without</em> preventing it from being collected.</p>
</li>
</ul>
<p>In addition to programmers’ regular use of handles, Mono uses them in its implementation of the <a href="https://msdn.microsoft.com/en-us/library/system.threading.monitor(v=vs.110).aspx"><code>Monitor</code></a> class, the basis of synchronization using <a href="https://msdn.microsoft.com/en-us/library/de0542zz(v=vs.110).aspx"><code>Monitor.Enter</code></a> or the C♯ <code>lock</code> statement. As such, it’s important that accesses to GC handles be as fast as possible.</p>
<h2 id="original-implementation">Original Implementation</h2>
<p>A <code>GCHandle</code> object consists of a type and an index, packed together into a 32-bit unsigned integer. To get the value of a handle, say with <code>WeakReference.Target</code>, we first look up the array of handle data corresponding to its type, then look up the value at the given index; in pseudocode:</p>
<pre><code class="language-text">(type, index) = unpack(handle)
value = handles[type][index]
</code></pre>
<p>The original implementation of GC handles was based on a bitmap allocator. For each handle type, it stored a bitmap indicating the available slots for allocating new handles, and an array of pointers to their target objects:</p>
<pre><code class="language-text">bitmap = 11010…
pointers = [0xPPPPPPPP, 0xPPPPPPPP, NULL, 0xPPPPPPPP, NULL, …]
</code></pre>
<p>There’s an interesting constraint, though: when we unload an <code>AppDomain</code>, we want to be able to free all of the weak references that point to objects in that domain, because we know they’ll never be accessed again.</p>
<p>But if the weak reference has expired, we can’t tell what domain it came from, because we no longer have an object to look at! So for weak references, we kept a parallel array of domain pointers:</p>
<pre><code class="language-text">domains = [0xDDDDDDDD, 0xDDDDDDDD, NULL, 0xDDDDDDDD, NULL, …]
</code></pre>
<p>Unfortunately, this implementation was wasteful in a few ways:</p>
<ul>
<li>To synchronize access to the handle allocator, we would lock a mutex on every access to a GC handle—<a href="https://msdn.microsoft.com/en-us/library/1246yz8f(v=vs.110).aspx"><code>GCHandle.Alloc</code></a>, <a href="https://msdn.microsoft.com/en-us/library/system.runtime.interopservices.gchandle.target(v=vs.110).aspx"><code>GCHandle.Target</code></a>, and so on. This was especially expensive on OS X, where <code>pthread_mutex_lock</code> can be very costly.</li>
<li>
<p>For correctness, the domain-pointer array always had to be allocated for weak references, spending memory even though it was unused most of the time.</p>
</li>
<li>
<p>For historical reasons related to Mono’s support of the Boehm GC, much of this information was duplicated in a separate hash table, wasting even more memory.</p>
</li>
</ul>
<h2 id="new-implementation">New Implementation</h2>
<p>After removing the redundant hash table, the first step toward a new lock-free implementation was to use <em>one</em> array to store the information from the <em>three</em> arrays of the previous implementation. We did so using <em>tagged pointers:</em> because objects are aligned to multiples of 8 bytes, the lower 3 bits of any object reference are guaranteed to be zero—so we can store extra information in those bits.</p>
<p>We ended up with a single array of <em>slots</em> in the following bit format:</p>
<pre><code class="language-text">PPPPPPPP…0VX
</code></pre>
<p>Where <code>PPPP…</code> are pointer bits, <code>V</code> is the “valid” flag, and <code>X</code> is the “occupied” flag, packed together with bitwise OR:</p>
<pre><code class="language-text">slot = pointer | valid_bit | occupied_bit
</code></pre>
<p>If the “occupied” flag is clear, the slot is free to be claimed by <code>GCHandle.Alloc</code>. To allocate a handle, we use a CAS (“compare and swap”, also known as <a href="https://msdn.microsoft.com/en-us/library/system.threading.interlocked.compareexchange(v=vs.110).aspx"><code>Interlocked.CompareExchange</code></a>) to replace a null slot with a tagged pointer, where the “occupied” and “valid” flags are set:</p>
<pre><code class="language-text">cas(slot, tag(pointer), NULL)
00000000…000
↓
PPPPPPPP…011
</code></pre>
<p>If the CAS succeeds, we now own a valid handle. If it fails, it means that another thread happened to be allocating a handle at the same time, so we just try the next free slot until we can successfully claim one. Unless you have many threads allocating many handles, allocating a handle will almost always succeed on the first try, without waiting to take a lock. And setting the target of a handle, with <code>WeakReference.Target</code> for example, works similarly.</p>
<p>As for <code>AppDomain</code> unloading, we can observe that we only need to store a domain pointer for <em>expired</em> weak references. If the reference hasn’t expired, then we have a valid object, and we can inspect it to find out which domain it came from.</p>
<p>Therefore, when a weak reference expires, all we have to do is clear the “valid” flag and replace the object pointer with a domain pointer:</p>
<pre><code class="language-text">PPPPPPPP…011
↓
DDDDDDDD…001
</code></pre>
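<p>Sketched in illustrative C (hypothetical names; the actual runtime logic is more involved), the expiry transition is a single store that swaps in the domain pointer and drops the valid bit:</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define OCCUPIED_BIT ((uintptr_t)0x1)
#define VALID_BIT    ((uintptr_t)0x2)

/* Called when the slot's object has died: keep the slot occupied,
 * clear the valid bit, and remember the AppDomain so the handle can
 * still be reclaimed when that domain unloads. */
static void expire_slot(_Atomic uintptr_t *slot, void *domain) {
    atomic_store(slot, (uintptr_t)domain | OCCUPIED_BIT);
}
```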
<p>So these are the possible states of a slot:</p>
<table>
<thead>
<tr>
<th>Occupied?</th>
<th>Valid?</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>This slot is occupied and points to an object.</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>This slot is occupied, but its object is expired, so it points to an <code>AppDomain</code>.</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>This slot is free and null.</td>
</tr>
</tbody>
</table>
<p>Now that we have our representation of slots, how do we grow the handle array when we run out of slots? Because the original implementation locked the handle array, it was safe to simply allocate a new array, copy the old contents into it, and store the pointer to the new array. But without a lock, this wouldn’t be thread-safe! For example:</p>
<ul>
<li>Thread 1 sees that the handle array needs to grow.</li>
<li>Thread 1 allocates a new handle array and copies its current contents.</li>
<li>Thread 2 changes the <code>Target</code> property of some weak reference.</li>
<li>Thread 1 stores the pointer to the new handle array, <strong>discarding Thread 2’s change.</strong></li>
</ul>
<p>To solve this, instead of a single handle array, we use a handle <em>table</em> consisting of an array of <em>buckets</em>, each twice the size of the last:</p>
<pre><code class="language-text">[0] → xxxxxxxx
[1] → xxxxxxxxxxxxxxxx
[2] → xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[3] → NULL
…
</code></pre>
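<p>Because each bucket doubles the size of the previous one, mapping a global slot index to a (bucket, offset) pair is cheap. A sketch, assuming an illustrative base bucket size of 8:</p>

```c
#include <assert.h>
#include <stdint.h>

#define BASE_SIZE 8  /* size of bucket 0 (illustrative) */

/* With buckets of size 8, 16, 32, ..., the slots before bucket b
 * number BASE_SIZE * (2^b - 1), so finding a slot's bucket is a
 * short walk (at most ~32 steps for a 32-bit index). */
static void locate(uint32_t index, uint32_t *bucket, uint32_t *offset) {
    uint32_t b = 0;
    uint32_t start = 0, size = BASE_SIZE;
    while (index >= start + size) {
        start += size;
        size <<= 1;  /* next bucket is twice as large */
        b++;
    }
    *bucket = b;
    *offset = index - start;
}
```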
<p>Now we can grow the table in a thread-safe way:</p>
<ul>
<li>Thread 1 sees that the handle table needs to grow.</li>
<li>Thread 1 optimistically allocates a new empty <em>bucket</em>.</li>
<li>Thread 2 changes the <code>Target</code> property of some weak reference in an existing bucket.</li>
<li>Thread 1 uses a compare-and-swap to store the new bucket, <strong>leaving Thread 2’s change intact.</strong></li>
</ul>
<p>If the CAS fails, then another thread has already allocated a new bucket, so we can just free the extra one we allocated, because it won’t contain any data yet. And this will only happen if the handle table is highly contended, which is rare.</p>
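<p>A sketch of that growth path (illustrative names and C11 atomics; the real runtime code differs):</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_BUCKETS 32
#define BASE_SIZE   8

static uintptr_t *_Atomic buckets[MAX_BUCKETS];

/* Return bucket b, installing it first if necessary. If another
 * thread races us and installs it first, free our speculative
 * allocation (it cannot contain any data yet) and use theirs. */
static uintptr_t *ensure_bucket(int b) {
    uintptr_t *cur = atomic_load(&buckets[b]);
    if (cur)
        return cur;
    uintptr_t *fresh = calloc((size_t)BASE_SIZE << b, sizeof *fresh);
    uintptr_t *expected = NULL;
    if (atomic_compare_exchange_strong(&buckets[b], &expected, fresh))
        return fresh;      /* our CAS won: the new bucket is live */
    free(fresh);           /* lost the race */
    return expected;       /* the CAS left the winner's pointer here */
}
```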
<h2 id="performance-comparisons">Performance Comparisons</h2>
<p><code>sgen-weakref-stress</code>, in the Mono runtime test suite, is a microbenchmark that allocates weak references from many threads.</p>
<p>Before this change was implemented, these were the average timings over 5 runs:</p>
<pre><code class="language-bash">real 0m2.441s
user 0m1.591s
sys 0m0.959s
</code></pre>
<p>After the change:</p>
<pre><code class="language-bash">real 0m0.358s
user 0m0.406s
sys 0m0.063s
</code></pre>
<p>Cool! That’s roughly an 85% reduction in real time, nearly a 7× speedup.</p>
<p>Let’s look at <code>monitor-stress</code>, which stress-tests <code>Monitor</code> operations using the C♯ <code>lock</code> statement. Before the change, average of 5 runs:</p>
<pre><code class="language-bash">real 0m2.714s
user 0m6.963s
sys 0m0.244s
</code></pre>
<p>Now, with the change:</p>
<pre><code class="language-bash">real 0m2.681s
user 0m6.783s
sys 0m0.242s
</code></pre>
<p>It looks like these measurements are within error bounds, so we can’t claim any more than a modest improvement of 1–2%. The numbers are similar for our macrobenchmarks of <code>roslyn</code> and <code>fsharp</code>. On the bright side, we haven’t introduced any regressions.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Converting the GC handle code to a lock-free implementation let us delete a sizable chunk of old code, save memory, and dramatically improve the performance of weak references by avoiding expensive locking.</p>
<p>This optimization didn’t improve the performance of <code>Monitor</code> much; in the future, we’ll talk about how another optimization, <em>thin locks</em>, gave us much greater improvements to locking performance.</p>Jon PurdyIn Mono 4.4.0, we improved the performance of GC handles by changing to a lock-free implementation. In this post we’ll take a look at the original implementation and its limitations, and how the new implementation solved these problems.Mono Relicensed MIT2016-03-31T00:00:00+00:002016-03-31T00:00:00+00:00http://www.monobrasil.com.br/news/2016/03/31/mono-relicensed-mit<p>At Microsoft Build today, we announced that we are re-releasing Mono under the <a href="https://opensource.org/licenses/mit-license.html">MIT license</a> and have contributed it to the <a href="http://www.dotnetfoundation.org/">.NET Foundation</a>. This is major news for Mono developers and contributors, and I am incredibly excited about the opportunities that this will create for the Mono project, and for other projects that will be able to benefit from this.</p>
<h2 id="mono-runtime-released-under-mit-license">Mono Runtime Released under MIT License</h2>
<p>While Mono’s class libraries have always been available under the MIT license, the Mono runtime was dual-licensed. Most developers could run their apps on Windows, Linux, or Mac OS X using the LGPL version of the runtime, but we also offered Mono’s runtime under commercial terms for scenarios where the LGPL was not suitable.</p>
<p>Moving the Mono runtime to the MIT license removes barriers to the adoption of C# and .NET in a large number of scenarios, including embedded applications, such as embedding Mono as a scripting engine in game engines or other applications.</p>
<h2 id="open-sourcing-proprietary-mono-extensions">Open Sourcing Proprietary Mono Extensions</h2>
<p>Over the past 5 years, Xamarin has developed a number of proprietary extensions to Mono, including:</p>
<ul>
<li>ARM64 port of the Mono runtime</li>
<li>Workarounds for bugs in some ARM chips</li>
<li>Use of Apple’s CommonCrypto to implement the crypto classes in the .NET API</li>
<li>Integration with X509 certificates on Apple platforms</li>
<li>Support for “Native Types” on Apple platforms</li>
<li>Generic Value Type Sharing</li>
<li>Offset tool to maintain the cross compiler</li>
</ul>
<p>These have been integrated with the main Mono codebase, contributed along with Mono to the .NET Foundation, and are being released under the MIT license today.</p>Miguel de IcazaAt Microsoft Build today, we announced that we are re-releasing Mono under the MIT license and have contributed it to the .NET Foundation. This is major news for Mono developers and contributors, and I am incredibly excited about the opportunities that this will create for the Mono project, and for other projects that will be able to benefit from this.Mono 4.2 is out!2015-08-27T00:00:00+00:002015-08-27T00:00:00+00:00http://www.monobrasil.com.br/news/2015/08/27/mono-4-2-is-out<s>Mono 4.2 is available in the [alpha channel](/download/alpha).</s>
<p>Mono 4.2 is available in the <a href="/download/">stable channel</a>.</p>
<p>See the <a href="/docs/about-mono/releases/4.2.1/">release notes</a>
for details on what’s new in Mono 4.2.1.</p>
<p>This is the second Mono release that integrates large portions of the <a href="https://github.com/mono/referencesource">.NET
code open-sourced by Microsoft</a>.</p>
<p>This release was built from more than 2,338 commits since Mono 4.0.0 and, as
usual, it is our best release yet. Enjoy!</p>Miguel de IcazaMono 4.2 is available in the [alpha channel](/download/alpha). Mono 4.2 is available in the stable channelSelected Projects for Google Summer of Code 20152015-05-18T00:00:00+00:002015-05-18T00:00:00+00:00http://www.monobrasil.com.br/news/2015/05/18/gsoc-2015<p>Check out the <a href="http://monosoc.blogspot.com/2015/04/excited-for-this-years-summer-of-code.html">projects
that were accepted for the Google Summer of Code 2015</a>.</p>Miguel de IcazaCheck out the projects that were accepted for the Google Summer of Code 2015.Mono 4.0 is out!2015-05-04T00:00:00+00:002015-05-04T00:00:00+00:00http://www.monobrasil.com.br/news/2015/05/04/mono-4-0-is-out<p>Mono 4.0 is now available.</p>
<p>See the <a href="http://www.mono-project.com/docs/about-mono/releases/4.0.0/">release notes</a>
for details on what’s new in version 4.0.</p>
<p>This is the first Mono release that contains <a href="http://github.com/mono/referencesource">code from Microsoft’s open-sourced .NET</a>.
We are just getting started with this effort, and are rapidly <a href="https://trello.com/b/vRPTMfdz/net-framework-integration-into-mono">making progress</a> on replacing/porting much more code into mono/master.</p>
<p>This release is also the first to use C# 6.0 as the default. Learn all about C# 6.0 in eight minutes by watching <a href="http://channel9.msdn.com/Series/ConnectOn-Demand/211">this presentation</a>.</p>Miguel de IcazaMono 4.0 is now available.Mono and Google Summer of Code 20152015-03-09T00:00:00+00:002015-03-09T00:00:00+00:00http://www.monobrasil.com.br/news/2015/03/09/google-summer-of-code<p>Hey everyone! The Mono team is pleased to announce that <a href="https://www.google-melange.com/gsoc/org2/google/gsoc2015/mono">we are a
mentor organization in the Google Summer of Code
2015!</a>
This is the eleventh year of Summer of Code for us, and we’re really
excited to work with a new group of students.</p>
<p>This is a great opportunity to spend the summer with a great community
working on cutting edge open-source C# tools and frameworks. You can
hone your development skills by working on large and complex codebases
with experienced mentors, and get paid for your hard work too.</p>
<p>If you’re an <a href="https://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2015/help_page#2._Whos_eligible_to_participate_as_a">eligible
student</a>,
the application period runs from <a href="https://www.google-melange.com/gsoc/events/google/gsoc2015">March
15-27</a>. But
don’t let that stop you from starting on your proposals! Feel free to
introduce yourself to the community and mentors, talk about your
ideas, and do some preliminary research to make your proposal as
strong as it can be. If you’re feeling particularly ambitious, you
could even get started on some quick bugfixes and patches to show off
your skills; while this isn’t required, it is really helpful in seeing
how you work and getting your name out in the community. Show us how
excited you are about coding!</p>
<p>Same as last year, our project ideas and rules are available on our
<a href="http://mono-project.com/Gsoc">GSoC ideas page</a>, and we’ll be updating
the list as we come up with new ideas. Don’t let these ideas limit you
though; if you have your own idea for a great project for the summer,
put it in a proposal and send it our way. Or, if you can’t decide, you
can always submit multiple proposals. Keep in mind, though, quality is
better than quantity in this case.</p>
<p>Our project mailing lists should be your first stop for questions
about contributing to Mono. There are <a href="http://mono-project.com/Mailing_Lists">many
lists</a> for different topics,
but the main ones are
<a href="http://lists.ximian.com/mailman/listinfo/mono-list">mono</a>,
<a href="http://lists.ximian.com/mailman/listinfo/mono-devel-list">mono-devel</a>
and
<a href="http://lists.ximian.com/mailman/listinfo/monodevelop-devel-list">monodevelop-devel</a>. For
external projects, you should also contact the developers in their
project mailing lists.</p>
<p>And of course IRC is where you can find everyone online, on the
<a href="http://irc.gnome.org/">irc.gnome.org</a> server. There’s the #mono channel for general Mono
discussions, #monodev for Mono development, #monodevelop for
MonoDevelop and Xamarin Studio, and #monosoc for Summer of
Code-specific questions and saying “Hi” to your fellow students. Hang
around a while after asking a question - we have mentors in many
timezones so they may be asleep or busy when you visit.</p>
<p>If you’re not a student, you can participate in Summer of Code by
helping the students feel welcome in our community! Or, if you’re
interested in mentoring C# tools and libraries under the Mono
umbrella, send an email to the Mono GSoC administrator at
<a href="mailto:soc@xamarin.com">soc@xamarin.com</a>.</p>
<p>To stay up to date with the applications process and the work of our
students, follow us on <a href="https://twitter.com/monosoc">Twitter</a> and
<a href="https://plus.google.com/103975535372519150528/posts">Google+</a>. Good
luck, and here’s to another great summer of coding!</p>Miguel de IcazaHey everyone! The Mono team is pleased to announce that we are a mentor organization in the Google Summer of Code 2015! This is the eleventh year of Summer of Code for us, and we’re really excited to work with a new group of students.