Miguel de Icaza runtime

While Mono has had support for SIMD instructions in the form of the Mono.SIMD API, this API was limited to run on x86 platforms.

.NET introduced the System.Numerics.Vectors API, which sports a more general design that adapts to the SIMD registers available on different platforms.

The master branch of Mono now treats the various Vector operations as runtime intrinsics, so they are hardware accelerated. They are supported both by the Mono JIT compiler on x86-64 platforms and by LLVM’s optimizing compiler on x86-64 and every other platform where Mono uses LLVM.

We would love to see you try it and share your experience with us.


Miguel de Icaza plans

Mono’s use of Reference Source Code

Since the first release of the .NET open source code, the Mono project has been integrating the published Reference Source code into Mono.

We have been using the Reference Source code instead of CoreFX, as Mono implements a larger API surface than what is exposed there: Mono implements the .NET Desktop API.

Integrating the code is sometimes easy: we simply replace the code in Mono with the reference source code. When the code is not quite portable, we need to make it portable, either by writing the missing pieces or by integrating work we had already done in Mono with the code that existed in .NET. In some cases porting the code is just too complicated, and we have not been able to do the work.

We keep track of the major items in Trello.

Originally, we forked the reference source code and kept a branch with our changes, but it was too large an external dependency, and the module was quite static anyway. So recently we started copying the code that we needed directly into Mono.

While this has worked fine for a while, the .NET Reference Source is only updated when a major .NET release takes place, and it tracks the version of .NET that ships with Windows. This means that we are missing out on many of the great optimizations and improvements that are happening as part of .NET Core.

The optimizations and fine tuning typically take place in two places: work that goes into mscorlib.dll is maintained in the coreclr module, while the higher-level frameworks are maintained in the corefx module.

A New Approach

Many of the APIs that were originally removed from CoreFX are now being added back, so we can start to consider switching away from referencesource and into CoreFX.

After discussing with the .NET Core team, we came up with a better approach for the long-term maintenance of the shared code.

The .NET team has now set up a new repository where the cleaned-up and optimized version of mscorlib.dll will live: the corert module.

What we will do is submodule both CoreRT and CoreFX and replace our manually copied code from the Reference Source with code from CoreRT and CoreFX.

The twist is that for scenarios where Mono’s API surface is larger, we will contribute changes back to CoreRT/CoreFX where we either add support for the larger API surface, or we make the API pluggable (likely with a tasteful use of the partial modifier).

One open issue is that Mono has historically shipped a single set of framework libraries (mscorlib.dll, System.dll and so on) that work across Linux, macOS, Unix and Windows, detecting at runtime how to behave based on the platform. This is useful in scenarios where you want to bootstrap work on one platform by using another, as the framework libraries are identical.

CoreFX takes a different approach: the libraries are tied to a particular platform flavor.

This means that some of the work ahead will involve either adjusting the CoreFX code to work the way Mono does, or giving up on our tradition of having the same assemblies work across all platforms.


Rodrigo Kumpera and Bernhard Urban runtime

When someone says multi-core, we unconsciously think SMP. That worked out well for us until recently, when ARM announced big.LITTLE. ARM’s big.LITTLE architecture is the first mass-produced AMP architecture and, as we’ll see next, it raises the bar for how hard multi-core programming is.

A tale of an impossible bug

It all started with a bug report against a phone with such a CPU: the Exynos chipset used in Samsung phones in Europe. Apps created with our software were dying with SIGILL at completely random places. Nothing could reasonably explain what was happening, and the faulting instructions themselves were valid. This immediately made us suspect bad instruction cache flushing.

After reviewing all the JIT code around cache flushing, we were sure that we were calling __clear_cache properly. That led us to look at how other virtual machines and compilers do cache flushing on ARM64, and we found out about some related errata on the Cortex-A53. ARM’s description of those issues is both cryptic and vague, but we tried the workaround anyway. No luck there.

Next we went through the other usual suspects. A lying signal handler? Nope. Funky userspace CPU emulation? No. Broken libc implementation? Nice try. Faulty hardware? We reproduced it on multiple devices. Bad luck or karma? Yes!

Some of us could not sleep with such an amazing puzzle in front of us and kept staring at memory dumps around failure sites. And there was this funny thing: the fault address was always on the third or fourth line of the memory dumps.

[hexdump of memory around a failure site]

This was our only clue, and there are no coincidences when it comes to this sort of byzantine bug. Our memory dumps were 16 bytes per line, and the low byte of the SIGILL address would always fall somewhere in 0x40-0x7f or 0xc0-0xff. We aligned the memory dump to help verify whether the code allocator was doing something funky:

$ grep SIGILL *.log
custom_01.log:E/mono           (13964): SIGILL at ip=0x0000007f4f15e8d0
custom_02.log:E/mono           (13088): SIGILL at ip=0x0000007f8ff76cc0
custom_03.log:E/mono           (12824): SIGILL at ip=0x0000007f68e93c70
custom_04.log:E/mono           (12876): SIGILL at ip=0x0000007f4b3d55f0
custom_05.log:E/mono           (13008): SIGILL at ip=0x0000007f8df1e8d0
custom_06.log:E/mono           (14093): SIGILL at ip=0x0000007f6c21edf0
[...]

With that we came to our first good hypothesis: Bad cache flushing was happening only on the upper 64 bytes of every 128-byte block. Those numbers, if you deal with low level programming, immediately remind you of cache line sizes. And that is where it all started to make sense.

Here is a pseudo version of how libgcc does cache flushing on arm64:

void __clear_cache (char *address, size_t size)
{
    static int cache_line_size = 0;
    if (!cache_line_size)
        cache_line_size = get_current_cpu_cache_line_size ();

    for (int i = 0; i < size; i += cache_line_size)
        flush_cache_line (address + i);
}

In the above pseudo-code get_current_cpu_cache_line_size is a CPU instruction that returns the line size of its caches, and flush_cache_line flushes the cache line that contains the supplied address.

At that point we were using our own version of this function, so we instrumented it to print the cache line size as returned by the CPU and, lo and behold, it printed both 128 and 64. We double-checked that this was indeed the case. Then we went to that particular CPU’s manual, and it turns out that the big cores have a 128-byte cache line, while on the LITTLE cores the instruction cache line is only 64 bytes.

So what was happening is that __clear_cache would first be called on a big core and cache 128 as the instruction cache line size. Later it would be called on one of the LITTLE cores and would skip every other cache line when flushing. It doesn’t get simpler than that. We removed the caching and it all worked.

Summary

Some ARM big.LITTLE CPUs can have cores with different cache line sizes, and pretty much no code out there is ready to deal with this, because it assumes all cores are symmetrical.

Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line size on every invocation is not enough for user-space code: the process can be scheduled onto a different CPU while it is executing __clear_cache, so a line size that was correct at the start of the function may no longer be valid later on. Therefore, we have to try to figure out a global minimum of the cache line sizes across all CPUs. Here is our fix for Mono: Pull Request. Other projects have already adopted our fix as well: Dolphin and PPSSPP.
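In essence, the fix stops trusting any cached value and only ever flushes with the smallest line size observed so far. Here is a simplified sketch of the idea (illustrative C, not the exact code from the pull request; in real code the running minimums would be updated atomically):

#include <stddef.h>
#include <stdint.h>

static uintptr_t min_icache_line = UINTPTR_MAX;
static uintptr_t min_dcache_line = UINTPTR_MAX;

void
clear_cache_amp_safe (char *address, size_t size)
{
    uint64_t ctr;

    /* CTR_EL0 reports the line sizes of the core we are currently running on:
     * IminLine in bits [3:0], DminLine in bits [19:16], both as log2 of the
     * size in 4-byte words. */
    __asm__ __volatile__ ("mrs %0, ctr_el0" : "=r" (ctr));

    uintptr_t icache_line = (uintptr_t) 4 << (ctr & 0xf);
    uintptr_t dcache_line = (uintptr_t) 4 << ((ctr >> 16) & 0xf);

    /* Keep the smallest sizes ever seen; migrating to a core with larger
     * lines can then only make us flush more often, never too little.
     * (A real implementation would update these with atomic operations.) */
    if (icache_line < min_icache_line)
        min_icache_line = icache_line;
    if (dcache_line < min_dcache_line)
        min_dcache_line = dcache_line;

    uintptr_t start = (uintptr_t) address;
    uintptr_t end = start + size;
    uintptr_t p;

    /* Clean the data cache to the point of unification... */
    for (p = start & ~(min_dcache_line - 1); p < end; p += min_dcache_line)
        __asm__ __volatile__ ("dc cvau, %0" : : "r" (p) : "memory");
    __asm__ __volatile__ ("dsb ish" : : : "memory");

    /* ...then invalidate the instruction cache for the same range. */
    for (p = start & ~(min_icache_line - 1); p < end; p += min_icache_line)
        __asm__ __volatile__ ("ic ivau, %0" : : "r" (p) : "memory");
    __asm__ __volatile__ ("dsb ish" : : : "memory");
    __asm__ __volatile__ ("isb" : : : "memory");
}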


Alex Rønne Petersen profiler

This is the first in a series of posts I’ll be writing on the work we’ve been doing to improve the stability of Mono’s log profiler. All improvements detailed in these blog posts are included in Mono 4.6.0, featuring version 1.0 of the profiler. Refer to the release notes for the full list of changes and fixes.

The problem we’ll be looking at today is a crash that arose when running the profiler in sampling mode (i.e. mono --profile=log:sample foo.exe) together with the SGen garbage collector.

The Problem

SGen uses so-called managed allocators to perform very fast allocations from the nursery (generation 0). These managed allocators are generated by the Mono runtime and contain specialized code for allocating particular kinds of types (small objects, strings, arrays, etc.). One important invariant that managed allocators rely on is that a garbage collection absolutely cannot proceed if any suspended thread was executing a managed allocator at the time it was suspended. This allows the managed allocators to do their thing without taking the global GC lock, which is a huge performance win for multithreaded programs. Should this invariant be broken, however, the state of the managed heap is essentially undefined, and all sorts of bad things will happen when the GC proceeds to scanning and sweeping.
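To give a rough idea of why managed allocators are so fast: conceptually they boil down to bumping a pointer inside a thread-local allocation buffer (TLAB). Here is a hypothetical C sketch of that fast path, not the code Mono actually emits:

#include <stddef.h>

/* Hypothetical stand-in for the per-thread allocation state kept by SGen. */
typedef struct {
    char *tlab_next;      /* next free byte in this thread's nursery chunk */
    char *tlab_real_end;  /* end of the chunk */
} ThreadAllocState;

static void *
alloc_small_fast_path (ThreadAllocState *tlab, void *vtable, size_t size)
{
    char *p = tlab->tlab_next;

    /* Out of space in the TLAB: fall back to the slow path, which takes
     * the global GC lock and may trigger a collection. */
    if (size > (size_t) (tlab->tlab_real_end - p))
        return NULL;

    /* Bump the pointer and install the vtable; no locks are involved, which
     * is exactly why a GC must never run while this is in progress. */
    tlab->tlab_next = p + size;
    *(void **) p = vtable;

    return p;
}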

Unfortunately, the way that sampling works caused this invariant to be broken. When the profiler is running in sampling mode, it periodically sends out a signal (e.g. SIGPROF) whose signal handler will collect a managed stack trace for the target thread and write it as an event to the log file. There’s nothing actually wrong with this. However, the way that SGen checks whether a thread is currently executing a managed allocator is as follows (simplified a bit from the actual source code):

static gboolean
is_ip_in_managed_allocator (MonoDomain *domain, gpointer ip)
{
    /*
     * ip is the instruction pointer of the thread, as obtained by the STW
     * machinery when it temporarily suspends the thread.
     */

    MonoJitInfo *ji = mono_jit_info_table_find_internal (domain, ip, FALSE, FALSE);

    if (!ji)
        return FALSE;

    MonoMethod *m = mono_jit_info_get_method (ji);

    return sgen_is_managed_allocator (m);
}

To understand why this code is problematic, we must first take a look at how signal delivery works on POSIX systems. By default, a signal can be delivered while another signal is still being handled. For example, if your program has two separate signal handlers for SIGUSR1 and SIGUSR2, both of which simply do while (1); to spin forever, and you send SIGUSR1 followed by SIGUSR2 to your program, you’ll see a stack trace looking something like this:

(gdb) bt
#0  0x0000000000400564 in sigusr2_signal_handler ()
#1  <signal handler called>
#2  0x0000000000400553 in sigusr1_signal_handler ()
#3  <signal handler called>
#4  0x00000000004005d3 in main ()

Unsurprisingly, if we print the instruction pointer, we’ll find that it’s pointing into the most recently called signal handler:

(gdb) p $pc
$1 = (void (*)()) 0x400564 <sigusr2_signal_handler+15>
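For reference, the toy program behind a trace like the one above can be as simple as the following sketch (assuming Linux; the handler names match the backtrace, everything else is illustrative):

#include <signal.h>
#include <unistd.h>

static void sigusr1_signal_handler (int sig) { while (1); }
static void sigusr2_signal_handler (int sig) { while (1); }

int
main (void)
{
    /* With plain signal(), only the signal currently being handled is blocked
     * while its handler runs, so SIGUSR2 can interrupt the SIGUSR1 handler. */
    signal (SIGUSR1, sigusr1_signal_handler);
    signal (SIGUSR2, sigusr2_signal_handler);

    for (;;)
        pause ();
}

Sending the two signals from another shell (kill -USR1 <pid>, then kill -USR2 <pid>) reproduces the nested backtrace.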

This matters because SGen uses a signal of its own to suspend threads in its STW machinery. So that means we have two signals in play: The sampling signal used by the profiler, and the suspend signal used by SGen. Both signals can arrive at any point, including while the other is being handled. In an allocation-heavy program, we could very easily see a stack looking something like this:

#0 suspend_signal_handler ()
#1 <signal handler called>
#2 profiler_signal_handler ()
#3 <signal handler called>
#4 AllocSmall ()
...

(Here, AllocSmall is the managed allocator.)

Under normal (non-profiling) circumstances, it would look like this:

#0 suspend_signal_handler ()
#1 <signal handler called>
#2 AllocSmall ()
...

Mono installs signal handlers using the sigaction function with the SA_SIGINFO flag. This means that the signal handler will receive a bunch of information in its second and third arguments. One piece of that information is the instruction pointer of the thread before the signal handler was invoked. This is the instruction pointer that is passed to the is_ip_in_managed_allocator function. So when we’re profiling, we pass an instruction pointer that points into the middle of profiler_signal_handler, while in the normal case, we pass an instruction pointer that points into AllocSmall as expected.
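As a sketch, this is roughly how a SA_SIGINFO handler gets at that instruction pointer on Linux/x86-64 (the register name and ucontext layout differ per platform, and Mono hides this behind per-architecture code):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

static void
example_sigaction_handler (int sig, siginfo_t *info, void *context)
{
    ucontext_t *ctx = (ucontext_t *) context;

    /* The IP the thread was at when this signal arrived.  If another signal
     * handler was already running, this points into that handler, not into
     * the code it interrupted -- which is exactly the problem described above. */
    void *ip = (void *) (uintptr_t) ctx->uc_mcontext.gregs [REG_RIP];
    (void) ip;
}

static void
install_example_handler (int signum)
{
    struct sigaction sa;

    memset (&sa, 0, sizeof (sa));
    sa.sa_sigaction = example_sigaction_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset (&sa.sa_mask);
    sigaction (signum, &sa, NULL);
}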

So now the problem is clear: In the normal case, SGen detects that the thread is executing a managed allocator, and therefore waits for it to finish executing before truly suspending the thread. But in the profiling case, SGen thinks that the thread is just executing any arbitrary, unimportant code that it can go right ahead and suspend, even though we know for a fact (from inspecting the stack in GDB) that it is actually in the middle of executing a managed allocator.

When this situation arises, SGen will go right ahead and perform a full garbage collection, even though it’ll see an inconsistent managed heap as a result of the managed allocator being in the middle of modifying the heap. Entire books could be written about the incorrect behavior that could result from this under different circumstances, but long story short, your program would crash in random ways.

Potential Solutions

At first glance, this didn’t seem like a hard problem to solve. My initial thought was to set up the signal handler for SIGPROF in such a way that the GC suspend signal would be blocked while the handler is executing. Thing is, this would’ve actually fixed the problem, but not in a general way. After all, who’s to say that some other signal couldn’t come along and trigger this problem? For example, a programmer might P/Invoke some native code that sets up a signal handler for some arbitrary purpose. The user would then want to send that signal to the Mono process and not have things break left and right. We can’t reasonably ask every Mono user to block the GC suspend signal in any signal handlers they might set up, directly or indirectly. Even if we could, it isn’t a good idea, as we really want the GC suspend signal to arrive in a timely fashion so that the STW process doesn’t take too long (resulting in long pause times).
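For concreteness, that rejected approach would have looked roughly like this sketch (illustrative names; SUSPEND_SIGNAL stands in for whatever signal SGen uses to suspend threads):

#include <signal.h>
#include <string.h>

#define SUSPEND_SIGNAL SIGXCPU  /* stand-in for the GC suspend signal */

/* The profiler's sampling handler, declared elsewhere. */
extern void profiler_signal_handler (int sig, siginfo_t *info, void *context);

static void
install_sampling_handler (void)
{
    struct sigaction sa;

    memset (&sa, 0, sizeof (sa));
    sa.sa_sigaction = profiler_signal_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset (&sa.sa_mask);
    /* The GC suspend signal stays pending until the SIGPROF handler returns. */
    sigaddset (&sa.sa_mask, SUSPEND_SIGNAL);

    sigaction (SIGPROF, &sa, NULL);
}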

OK, so that idea wasn’t really gonna work. Another idea that we considered was to unwind the stack of the suspended thread to see if any stack frame matches a managed allocator. This would solve the problem and would work for any kind of signal that might arrive before a GC suspend signal. Unfortunately, unwinding the stack of a suspended thread is not particularly easy, and it can’t be done in a portable way at all. Also, stack unwinding is not exactly a cheap affair, and the last thing we want is to make the STW machinery slower - nobody wants longer pause times.

Finally, there was the nuclear option: Disable managed allocators completely when running the profiler in sampling mode. Sure, it would make this problem go away, but as with the first idea, SGen would still be susceptible to this problem with other kinds of signals when running normally. More importantly, this would result in misleading profiles since the program would spend more time in the native SGen allocation functions than it would have in the managed allocators when not profiling.

None of the above are really viable solutions.

The Fix

Conveniently, SGen has this feature called critical regions. They’re not quite what you might think - they have nothing to do with mutexes. Let’s take a look at how SGen allocates a System.String in the native allocation functions:

MonoString *
mono_gc_alloc_string (MonoVTable *vtable, size_t size, gint32 len)
{
    TLAB_ACCESS_INIT;

    ENTER_CRITICAL_REGION;

    MonoString *str = sgen_try_alloc_obj_nolock (vtable, size);

    if (str)
        str->length = len;

    EXIT_CRITICAL_REGION;

    if (!str) {
        LOCK_GC;

        str = sgen_alloc_obj_nolock (vtable, size);

        if (str)
            str->length = len;

        UNLOCK_GC;
    }

    return str;
}

The actual allocation logic (in sgen_try_alloc_obj_nolock and sgen_alloc_obj_nolock) is unimportant. The important bits are TLAB_ACCESS_INIT, ENTER_CRITICAL_REGION, and EXIT_CRITICAL_REGION. These macros are defined as follows:

#define TLAB_ACCESS_INIT SgenThreadInfo *__thread_info__ = mono_native_tls_get_value (thread_info_key)
#define IN_CRITICAL_REGION (__thread_info__->client_info.in_critical_region)
#define ENTER_CRITICAL_REGION do { mono_atomic_store_acquire (&IN_CRITICAL_REGION, 1); } while (0)
#define EXIT_CRITICAL_REGION do { mono_atomic_store_release (&IN_CRITICAL_REGION, 0); } while (0)

As you can see, they simply set a variable on the current thread for the duration of the attempted allocation. If this variable is set, the STW machinery will refrain from suspending the thread in much the same way as it would when checking the thread’s instruction pointer against the code ranges of the managed allocators.

So the fix to this whole problem is actually very simple: We just set up a critical region in the managed allocator, just like we do in the native SGen functions. That is, we wrap all the code we emit in the managed allocator like so:

static MonoMethod *
create_allocator (int atype, ManagedAllocatorVariant variant)
{
    // ... snip ...

    MonoMethodBuilder *mb = mono_mb_new (mono_defaults.object_class, name, MONO_WRAPPER_ALLOC);
    int thread_var;

    // ... snip ...

    EMIT_TLS_ACCESS_VAR (mb, thread_var);

    EMIT_TLS_ACCESS_IN_CRITICAL_REGION_ADDR (mb, thread_var);
    mono_mb_emit_byte (mb, CEE_LDC_I4_1);
    mono_mb_emit_byte (mb, MONO_CUSTOM_PREFIX);
    mono_mb_emit_byte (mb, CEE_MONO_ATOMIC_STORE_I4);
    mono_mb_emit_i4 (mb, MONO_MEMORY_BARRIER_NONE);

    // ... snip: allocation logic ...

    EMIT_TLS_ACCESS_IN_CRITICAL_REGION_ADDR (mb, thread_var);
    mono_mb_emit_byte (mb, CEE_LDC_I4_0);
    mono_mb_emit_byte (mb, MONO_CUSTOM_PREFIX);
    mono_mb_emit_byte (mb, CEE_MONO_ATOMIC_STORE_I4);
    mono_mb_emit_i4 (mb, MONO_MEMORY_BARRIER_REL);

    // ... snip ...

    MonoMethod *m = mono_mb_create (mb, csig, 8, info);
    mono_mb_free (mb);

    return m;
}

With that done, SGen will correctly detect that a managed allocator is executing no matter how many signal handlers may be nested on the thread. Checking the in_critical_region variable also happens to be quite a bit cheaper than looking up JIT info for the managed allocators.

Performance Implications

I ran this small program before and after the changes:

using System;

class Program {
    public static object o;

    static void Main ()
    {
        for (var i = 0; i < 100000000; i++)
            o = new object ();
    }
}

Before:

real    0m0.625s
user    0m0.652s
sys     0m0.032s

After:

real    0m0.883s
user    0m0.948s
sys     0m0.012s

So we’re observing about a 40% slowdown from this change on a microbenchmark that does nothing but allocate. The slowdown comes from the managed allocators now having to do two atomic stores, one of which also carries with it a memory barrier with release semantics.

So on one hand, the slowdown is not great. But on the other hand, there is no point in being fast if Mono is crashy as a result. To put things in perspective, this program is still way slower without managed allocators (MONO_GC_DEBUG=no-managed-allocator):

real    0m7.678s
user    0m8.529s
sys     0m0.024s

Managed allocators are still around 770% faster. The 40% slowdown doesn’t seem like much when you consider these numbers, especially when it’s for the purpose of fixing a crash.

More detailed benchmark results from the Xamarin benchmarking infrastructure can be found here.

Conclusion

By setting up a critical region in SGen’s managed allocator methods, we’ve fixed a number of random crashes that would occur when using sample profiling together with the SGen GC. A roughly 40% slowdown was observed on one microbenchmark, however, managed allocators are still around 770% faster than SGen’s native allocation functions, so it’s a small price to pay for a more reliable Mono.


In the next post, we’ll take a look at an issue that plagued profiler users on OS X: A crash when sending sampling signals to threads that hadn’t yet finished basic initialization in libc, leading to broken TLS access. This issue forced us to rewrite the sampling infrastructure in the runtime.


Jon Purdy performance

In Mono 4.4.0, we improved the performance of GC handles by changing to a lock-free implementation. In this post we’ll take a look at the original implementation and its limitations, and how the new implementation solved these problems.

Background

For those unfamiliar, the GCHandle API provides access to the low-level GC handle primitive used to implement several types of “handles” to managed objects:

  • Normal handles prevent an object from being collected, even if no managed references to the object exist.

  • Pinned handles prevent an object from being moved in memory.

  • Weak references reference an object without preventing it from being collected.

In addition to programmers’ regular use of handles, Mono uses them in its implementation of the Monitor class, the basis of synchronization using Monitor.Enter or the C♯ lock statement. As such, it’s important that accesses to GC handles be as fast as possible.
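For the curious, the same low-level primitive is also exposed to embedders through Mono’s C API; here is a minimal sketch of the three kinds of handles at that level (assuming an embedding context where obj is a live MonoObject*):

#include <stdint.h>
#include <mono/metadata/object.h>

static void
gc_handle_examples (MonoObject *obj)
{
    /* Normal handle: keeps obj alive; passing 1 for the second argument
     * would pin it instead. */
    uint32_t strong = mono_gchandle_new (obj, 0);

    /* Weak reference: does not keep obj alive. */
    uint32_t weak = mono_gchandle_new_weakref (obj, 0);

    /* Returns NULL once a weakly referenced object has been collected. */
    MonoObject *target = mono_gchandle_get_target (weak);
    (void) target;

    mono_gchandle_free (strong);
    mono_gchandle_free (weak);
}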

Original Implementation

A GCHandle object consists of a type and an index, packed together into a 32-bit unsigned integer. To get the value of a handle, say with WeakReference.Target, we first look up the array of handle data corresponding to its type, then look up the value at the given index; in pseudocode:

(type, index) = unpack(handle)
value = handles[type][index]

The original implementation of GC handles was based on a bitmap allocator. For each handle type, it stored a bitmap indicating the available slots for allocating new handles, and an array of pointers to their target objects:

bitmap   = 11010…
pointers = [0xPPPPPPPP, 0xPPPPPPPP, NULL, 0xPPPPPPPP, NULL, …]

There’s an interesting constraint, though: when we unload an AppDomain, we want to be able to free all of the weak references that point to objects in that domain, because we know they’ll never be accessed again.

But if the weak reference has expired, we can’t tell what domain it came from, because we no longer have an object to look at! So for weak references, we kept a parallel array of domain pointers:

domains  = [0xDDDDDDDD, 0xDDDDDDDD, NULL, 0xDDDDDDDD, NULL, …]

Unfortunately, this implementation was wasteful in a few ways:

  • To synchronize access to the handle allocator, we would lock a mutex on every access to a GC handle—GCHandle.Alloc, GCHandle.Target, and so on. This was especially expensive on OS X, where pthread_mutex_lock can be very costly.
  • For correctness, the domain-pointer array always had to be allocated for weak references, spending memory even though it was unused most of the time.

  • For historical reasons related to Mono’s support of the Boehm GC, much of this information was duplicated in a separate hash table, wasting even more memory.

New Implementation

After removing the redundant hash table, the first step toward a new lock-free implementation was to use one array to store the information from the three arrays of the previous implementation. We did so using tagged pointers: because objects are aligned to multiples of 8 bytes, the lower 3 bits of any object reference are guaranteed to be zero—so we can store extra information in those bits.

We ended up with a single array of slots in the following bit format:

PPPPPPPP…0VX

Where PPPP… are pointer bits, V is the “valid” flag, and X is the “occupied” flag, packed together with bitwise OR:

slot = pointer | valid_bit | occupied_bit

If the “occupied” flag is clear, the slot is free to be claimed by GCHandle.Alloc. To allocate a handle, we use a CAS (“compare and swap”, also known as Interlocked.CompareExchange) to replace a null slot with a tagged pointer, where the “occupied” and “valid” flags are set:

cas(slot, tag(pointer), NULL)

00000000…000
     ↓
PPPPPPPP…011

If the CAS succeeds, we now own a valid handle. If it fails, it means that another thread happened to be allocating a handle at the same time, so we just try the next free slot until we can successfully claim one. Unless you have many threads allocating many handles, allocating a handle will almost always succeed on the first try, without waiting to take a lock. And setting the target of a handle, with WeakReference.Target for example, works similarly.
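In C, the claim-a-slot loop boils down to something like the following sketch (names are illustrative, and __sync_val_compare_and_swap stands in for the runtime’s own CAS primitive):

#include <stddef.h>
#include <stdint.h>

#define OCCUPIED_BIT ((uintptr_t) 1)
#define VALID_BIT    ((uintptr_t) 2)

static inline void *
tag_pointer (void *obj)
{
    /* Objects are aligned to 8 bytes, so the low 3 bits of a reference are free. */
    return (void *) ((uintptr_t) obj | OCCUPIED_BIT | VALID_BIT);
}

/* Claim a free slot for obj and return its index, or (size_t) -1 if the
 * table is full and a new bucket is needed.  (The real implementation keeps
 * a search hint instead of always scanning from zero.) */
static size_t
alloc_handle_slot (void **slots, size_t capacity, void *obj)
{
    for (size_t i = 0; i < capacity; ++i) {
        if (slots [i])
            continue; /* occupied */

        /* CAS NULL -> tagged pointer; if another thread wins the race for
         * this slot, just move on to the next free one. */
        if (__sync_val_compare_and_swap (&slots [i], NULL, tag_pointer (obj)) == NULL)
            return i;
    }

    return (size_t) -1;
}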

As for AppDomain unloading, we can observe that we only need to store a domain pointer for expired weak references. If the reference hasn’t expired, then we have a valid object, and we can inspect it to find out which domain it came from.

Therefore, when a weak reference expires, all we have to do is clear the “valid” flag and replace the object pointer with a domain pointer:

PPPPPPPP…011
     ↓
DDDDDDDD…001

So these are the possible states of a slot:

Occupied?  Valid?  Description
1          1       This slot is occupied and points to an object.
1          0       This slot is occupied, but its object is expired, so it points to an AppDomain.
0          0       This slot is free and null.

Now that we have our representation of slots, how do we grow the handle array when we run out of slots? Because the original implementation locked the handle array, it was safe to simply allocate a new array, copy the old contents into it, and store the pointer to the new array. But without a lock, this wouldn’t be thread-safe! For example:

  • Thread 1 sees that the handle array needs to grow.
  • Thread 1 allocates a new handle array and copies its current contents.
  • Thread 2 changes the Target property of some weak reference.
  • Thread 1 stores the pointer to the new handle array, discarding Thread 2’s change.

To solve this, instead of a single handle array, we use a handle table consisting of an array of buckets, each twice the size of the last:

[0] → xxxxxxxx
[1] → xxxxxxxxxxxxxxxx
[2] → xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[3] → NULL
…

Now we can grow the table in a thread-safe way:

  • Thread 1 sees that the handle table needs to grow.
  • Thread 1 optimistically allocates a new empty bucket.
  • Thread 2 changes the Target property of some weak reference in an existing bucket.
  • Thread 1 uses a compare-and-swap to store the new bucket, leaving Thread 2’s change intact.

If the CAS fails, then another thread has already allocated a new bucket, so we can just free the extra one we allocated, because it won’t contain any data yet. And this will only happen if the handle table is highly contended, which is rare.
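A sketch of that growth step (illustrative; the bucket count, the initial size of 8 and all names are made up for the example):

#include <stdlib.h>

#define NUM_BUCKETS 32

static void **buckets [NUM_BUCKETS];

static size_t
bucket_size (size_t bucket)
{
    /* Each bucket is twice the size of the previous one: 8, 16, 32, ... */
    return (size_t) 8 << bucket;
}

static void
ensure_bucket_exists (size_t bucket)
{
    if (buckets [bucket])
        return;

    /* Optimistically allocate a new, empty bucket... */
    void **fresh = calloc (bucket_size (bucket), sizeof (void *));

    /* ...and install it with a CAS so a racing thread's writes to existing
     * buckets are never lost.  If another thread installed a bucket first,
     * ours is still empty and can simply be freed. */
    if (__sync_val_compare_and_swap (&buckets [bucket], NULL, fresh) != NULL)
        free (fresh);
}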

Performance Comparisons

sgen-weakref-stress, in the Mono runtime test suite, is a microbenchmark that allocates weak references from many threads.

Before this change was implemented, these were the average timings over 5 runs:

real    0m2.441s
user    0m1.591s
sys     0m0.959s

After the change:

real    0m0.358s
user    0m0.406s
sys     0m0.063s

Cool! We got about an 85% improvement.

Let’s look at monitor-stress, which stress-tests Monitor operations using the C♯ lock statement. Before the change, average of 5 runs:

real    0m2.714s
user    0m6.963s
sys     0m0.244s

Now, with the change:

real    0m2.681s
user    0m6.783s
sys     0m0.242s

It looks like these measurements are within error bounds, so we can’t claim any more than a modest improvement of 1–2%. The numbers are similar for our macrobenchmarks of roslyn and fsharp. On the bright side, we haven’t introduced any regressions.

Conclusions

Converting the GC handle code to a lock-free implementation let us delete a sizable chunk of old code, save memory, and dramatically improve the performance of weak references by avoiding expensive locking.

This optimization didn’t improve the performance of Monitor much; in the future, we’ll talk about how another optimization, thin locks, gave us much greater improvements to locking performance.