TMonitor redux

Rather embarrassingly, it appears the wait/pulse functionality of TMonitor is hopelessly broken. That’ll teach me for trying to demo something new, eh? In my (partial) defence, I had found serious issues before, but came to believe they must have arisen from how I was originally trying to use TMonitor. Obviously not though!

Frankly, if the wait/pulse functionality can’t be relied upon, TMonitor is pretty pointless in my view given its only other function replicates what TRTLCriticalSection/TCriticalSection already provided for. Even worse though, Eric Grange has now posted claiming TMonitor doesn’t work properly even as a TCriticalSection alternative – while exceptions don’t get raised, ‘you can quickly run into situations where everything gets serialized, even when there is no need to’. In the comments he goes further, claiming that with the test code presented, ‘A single thread runs in 70 ms on the quad-core here, two threads run in 140 ms, 3 threads in 210 ms, etc.’

The statement in the comments especially I found surprising, and given the test code used looked just a little too simple to me, I thought I’d do some tests myself. Obviously, if I had the brains for it I could write a sampling profiler or something, but I dashed off the following test program instead. It does pretty much the same thing as Eric’s code — i.e., it tests for when threads are uncontended — but in just a slightly fuller form:

program TMonitorVsTCriticalSection;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  Classes,
  Diagnostics,
  SyncObjs;

type
  TTestThread = class(TThread)
  strict private
    FThreadCounter: TCountdownEvent;
  public
    constructor Create(AThreadCounter: TCountdownEvent); virtual;
    destructor Destroy; override;
  end;

  TTestThreadClass = class of TTestThread;

constructor TTestThread.Create(AThreadCounter: TCountdownEvent);
begin
  inherited Create(True);
  FThreadCounter := AThreadCounter;
  FreeOnTerminate := True;
end;

destructor TTestThread.Destroy;
begin
  FThreadCounter.Signal;
  inherited Destroy;
end;

procedure RunTest(const TestName: string; ThreadCount: Integer;
  ThreadClass: TTestThreadClass);
var
  ThreadCounter: TCountdownEvent;
  I: Integer;
  Stopwatch: TStopwatch;
  Threads: array of TThread;
begin
  SetLength(Threads, ThreadCount);
  ThreadCounter := TCountdownEvent.Create(ThreadCount);
  try
    for I := 0 to ThreadCount - 1 do
      Threads[I] := ThreadClass.Create(ThreadCounter);
    Stopwatch := TStopwatch.StartNew;
    for I := 0 to ThreadCount - 1 do
      Threads[I].Start;
    ThreadCounter.WaitFor;
    Stopwatch.Stop;
    Writeln(TestName, ': ', ThreadCount, ' thread(s) took ',
      Stopwatch.ElapsedMilliseconds, 'ms');
  finally
    ThreadCounter.Free;
  end;
end;

const
  CountdownFrom = $FFFFFF; //increase if necessary...
  MaxThreads = 10;

type
  TCriticalSectionThread = class(TTestThread)
  strict private
    FCriticalSection: TCriticalSection;
  protected
    procedure Execute; override;
  public
    constructor Create(AThreadCounter: TCountdownEvent); override;
    destructor Destroy; override;
  end;

  TMonitorThread = class(TTestThread)
  protected
    procedure Execute; override;
  end;

constructor TCriticalSectionThread.Create(AThreadCounter: TCountdownEvent);
begin
  inherited Create(AThreadCounter);
  FCriticalSection := TCriticalSection.Create;
end;

destructor TCriticalSectionThread.Destroy;
begin
  FCriticalSection.Free;
  inherited Destroy;
end;

procedure TCriticalSectionThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    FCriticalSection.Enter;
    try
      Dec(Counter);
    finally
      FCriticalSection.Leave;
    end;
  until (Counter end;

procedure TMonitorThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    TMonitor.Enter(Self);
    try
      Dec(Counter);
    finally
      TMonitor.Exit(Self);
    end;
  until (Counter end;

var
  I, J: Integer;
begin
  for I := 1 to 3 do
  begin
    Writeln('*** ROUND ', I, ' ***');
    for J := 1 to MaxThreads do
    begin
      RunTest('TCriticalSection', J, TCriticalSectionThread);
      RunTest('TMonitor', J, TMonitorThread);
    end;
    WriteLn;
  end;
  Write('Press ENTER to exit...');
  ReadLn;
end.

When run on my quad core laptop (Win7 64 bit, Delphi XE, stock memory manager, default debug build), I get results like the following:

*** ROUND 1 ***
TCriticalSection: 1 thread(s) took 1011ms
TMonitor: 1 thread(s) took 1957ms
TCriticalSection: 2 thread(s) took 2674ms
TMonitor: 2 thread(s) took 8667ms
TCriticalSection: 3 thread(s) took 5562ms
TMonitor: 3 thread(s) took 8996ms
TCriticalSection: 4 thread(s) took 5829ms
TMonitor: 4 thread(s) took 9154ms
TCriticalSection: 5 thread(s) took 4953ms
TMonitor: 5 thread(s) took 8936ms
TCriticalSection: 6 thread(s) took 3652ms
TMonitor: 6 thread(s) took 10578ms
TCriticalSection: 7 thread(s) took 3004ms
TMonitor: 7 thread(s) took 6982ms
TCriticalSection: 8 thread(s) took 8471ms
TMonitor: 8 thread(s) took 6903ms
TCriticalSection: 9 thread(s) took 5844ms
TMonitor: 9 thread(s) took 6310ms
TCriticalSection: 10 thread(s) took 3496ms
TMonitor: 10 thread(s) took 7285ms

My (initial) conclusion? There probably is some issue here, but it isn’t the case that going from (say) 2 threads to 4 doubles the problem – you take the hit at thread 2 or 3, and it stays stable after that. Moreover, TCriticalSection appears to suffer from the *same* issue. However, TCriticalSection is fundamentally faster than TMonitor, given the situation in question (i.e., where locks are being acquired on different objects). [Or is it? See the updates below…]

Note that for slight variant, you might exchange

TMonitor.Enter;

for

if not TMonitor.TryEnter then
  raise EProgrammerNotFound.Create('Where on Earth is he?');

While the exception not being raised doesn’t prove there’s no thread contention where there shouldn’t be, it does make for a verification of the explicit contract made by TMonitor’s interface.

Update 1 – using spin counts

Andreas Hausladen in the comments suggests initialising the spin count of the critical section object to something other than its default value of 0, and indeed, that finally gets the expected behaviour (at least for me), i.e. the cost of the extra threads minus the startup time coming to effectively nil [see update 3 below though!]. Here’s the TTestThread descendant to ‘show’ this — you’ll need to add Windows to the uses clause too:

  TRTLCriticalSectionThread = class(TTestThread)
  strict private
    FCriticalSection: TRTLCriticalSection;
  protected
    procedure Execute; override;
  public
    constructor Create(AThreadCounter: TCountdownEvent); override;
    destructor Destroy; override;
  end;

constructor TRTLCriticalSectionThread.Create(AThreadCounter: TCountdownEvent);
begin
  inherited Create(AThreadCounter);
  InitializeCriticalSectionAndSpinCount(FCriticalSection, 4000);
end;

destructor TRTLCriticalSectionThread.Destroy;
begin
  DeleteCriticalSection(FCriticalSection);
  inherited Destroy;
end;

procedure TRTLCriticalSectionThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    FCriticalSection.Enter;
    try
      Dec(Counter);
    finally
      FCriticalSection.Leave;
    end;
  until (Counter end;

You can initialise the spin count of the TMonitor too by calling TMonitor.SetSpinCount before the first call to Enter. However, that doesn’t make a difference for me. On the other hand, I wouldn’t expect it to – after all, each thread in the example has its own monitor or critical section object, so where does the possibility for needing to spin or block arise in the first place? [It doesn’t – see below.]

Update 2 – comparing with C#/.NET 4

Here’s a quick C# translation (insofar as I can write C# ‘quickly’…) – as I’m using the CountdownEvent class (like I used TCountdownEvent in the Delphi version), it requires the 4.0 Framework:

using System;
using System.Diagnostics;
using System.Collections.Generic;
using System.Threading;

namespace ConsoleApplication1
{
    class Program
    {
        const int CountdownFrom = 0xFFFFFF;
        const int MaxThreads = 10;

        static void RunTest(int ThreadCount)
        {
            CountdownEvent ThreadCounter = new CountdownEvent(ThreadCount);
            List<Thread> Threads = new List<Thread>(ThreadCount);
            for (int I = 0; I < ThreadCount; I++)
                Threads.Add(new Thread( delegate() {
                    try
                    {
                        int Counter = CountdownFrom;
                        object Key = new object();
                        do
                        {
                            lock (Key)
                            {
                                Counter--;
                            }
                        }
                        while (Counter > 0);
                    }
                    finally
                    {
                        ThreadCounter.Signal();
                    }
                }));
            Stopwatch Stopwatch = Stopwatch.StartNew();
            for (int I = 0; I < ThreadCount; I++)
            {
                Threads[I].Start();
            }
            ThreadCounter.Wait();
            Stopwatch.Stop();
            Console.WriteLine("{0} thread(s) took {1}ms", ThreadCount,
                Stopwatch.ElapsedMilliseconds);
        }
        static void Main(string[] args)
        {
            for (int I = 1; I <= 3; I++)
            {
                Console.WriteLine("*** ROUND {0} ***", I);
                for (int J = 1; J <= MaxThreads; J++)
                {
                    RunTest(J);
                }
                Console.WriteLine();
            }
            Console.Write("Press ENTER to exit...");
            Console.ReadLine();
        }
    }
}

I find this beating even the TRTLCriticalSectionThread Delphi class so long as the number of threads doesn’t exceed the number of cores. Once it does, the times progressively increase, but still nowhere near the TMonitor ones. Go figure…

Update 3 – problem found…

Eric Grange, and equally importantly, his blog commentators have been back on the case and found the issue: TMonitor allocates a small block of memory internally, and given my test code initialises a series of monitors in quick succession, all have their internal memory block allocated in the same cache line. Use a different (read: typically less good) memory manager from the stock FastMM, and the problem can dissipate; e.g., it halves for me when using a custom memory manager that wraps the WinAPI’s HeapXXX functions. To make it totally disappear, the TMonitorThread class previously presented can be modified as thus (the Execute method stays the same):

  TMonitorThread = class(TTestThread)
  protected
    FDummyIntfs: array[0..32] of IInterface;
    procedure Execute; override;
  public
    constructor Create(AThreadCounter: TCountdownEvent); override;
  end;

constructor TMonitorThread.Create(AThreadCounter: TCountdownEvent);
var
  I: Integer;
begin
  inherited;
  //force our monitor to be initialised
  TMonitor.Enter(Self);
  TMonitor.Exit(Self);
  //cause some further dynamic memory allocations
  for I := Low(FDummyIntfs) to High(FDummyIntfs) do
    FDummyIntfs[I] := TInterfaceList.Create;
end;

On my laptop, TMonitor now easily beats TCriticalSection for spead, though still not TRTLCriticalSection. I’ve also discovered that enabling spinning on the latter was a red herring – just avoiding the class wrapper (even though said wrapper is extremely simple!) was the key thing. In fact, if TRTLCriticalSectionThread allocates its TRTLCriticalSection record dynamically using New, its advantage over TCriticalSectionThread melts away, and would be completely gone were TCriticalSection to map its Acquire and Release methods onto Enter and Leave rather than vice versa. I almost feel a follow up post coming on…

Advertisements

30 thoughts on “TMonitor redux

  1. Have you tried the TRTLCriticalSection with a spincount other than the default zero value? (TCriticalSection doesn allow you to set the spin count)

    • That fixes things, yes (using the MSDN semi-suggested value of 4000). Attempting the same with TMonitor doesn’t make a difference, though I don’t understand why it should in either case. Do you know…?

  2. weird is that in 8 threads, TMonitor took less time!?

    I am curious, how many times you run all of it before get that?
    Have you get an average time? or just the first run was good?

    • Well no, I didn’t bother to average the timings – I leave that as an exercise for the reader! 😉

      That said, it’s not particularly weird for odd iteration to be out of sync – it’s not like I was running on something like Server Core where there’s no background processes you don’t pretty much directly control.

  3. Interestingly, if you remove the critical section and monitor ENTER/EXIT calls from the thread execute methods (since they are no-ops as far as synch’ing is concerned, this has no material impact on the outcome) you still never-the-less get very similar results.

    Similarly, I modified your example to set the affinity mask for each thread to alternate CPU’s after they are created but before they are started (i.e. thread 0, 2, 4 etc on CPU 1, and thread 1, 3, 5 etc on CPU 2).

    I have only 2 CPU’s so this was the best explicit parallel scheduling I could achieve.

    Again – no material impact on the outcome !!

    I also removed the use of TCountdownEvent and replaced it with a much simpler WaitForMultipleObjects(), waiting on the thread handles themselves. I do not know what impact this had as TCountdownEvent is an XE class and I was using Delphi 2010 so I couldn’t do an A/B comparison in this case – I removed TCountdownEven because I had to more than anything else. 🙂

    But, whatever is causing the serialisation it does not appear to be the SYNCHRONisation mechanism.

    Of more concern to me is that TMonitor exacts such a high performance penalty over TCriticalSection – the sort of performance penalty that the convenience of having a TMonitor potentially attach to EVERY object, whether it wants or needs thread synchronization or not, just isn’t worth and we’d be better off not having it and simply declaring the synchronization objects we need when and where we need them.

    • Jolyon – yes, it’s probably a good idea to take TCountdownEvent out of the equation, so thanks for confirming it isn’t the issue. On the other hand, Andreas’ suggestion works for me – does it work for you too?

      That said, on my laptop, commenting out the enter/exit/leave calls entirely cuts the time down to about 70ms-90ms. Using Andreas’ suggested fix for the critical section case, by comparison, cuts the time down to about 1050ms, so there’s still a big cost even there – it’s just that this cost is no longer larger when there is more than one thread created.

      • Yes RTLCriticalSection gives the expected results.

        Also, I moved the stopwatch timing so that it starts measuring only after all threads have been started, rather than before it has started the first one. I’m not sure how much difference that would make, but it felt like the right thing to do. 🙂

        Still my biggest concern is the performance penalty exacted by TMonitor.

      • Before or after, that is the question… I probably would have put it after myself if it weren’t for the simpler test I was basing it on having the timing begin after the Start calls. Neither is completely ‘correct’ I suppose, since one or more of the worker threads may or may not have actually started before the Start loop finishes…

        “Still my biggest concern is the performance penalty exacted by TMonitor.”

        I don’t know, I’m still of the mind that TMonitor as a mere TCriticalSection/TRTLCriticalSection alternative is utterly pointless in itself, even if its performance was comparable. Or, put positively: it would be OK for TMonitor to be a bit less optimised than the Windows API’s own critical section primitive were its Wait/Pulse functionality (i.e., the thing it adds beyond just critical section functionality) not buggy too.

  4. Some new info:

    I created a “NoSync” thread with no enter/exits – this gave some puzzling results. Even with explicit thread processor affinity, a simple countdown in this thread class saw a proportional increase in execution time pro rata with the number of threads.

    That is, 2 thread took roughly 2x as long as 1 thread, 3 threads took 3x etc etc

    This didn’t tally with my previous observation, but I realised that a key difference was that previously I had simply commented out the Enter/Exit code in the other thread classes but I had left the try/finally construct in place. In my NoSync thread there was no try/finally.

    When I introduced a redundant try/finally two things happened:

    1) execution time jumped dramatically (from 13ms for 1 thread timing, to over 200)

    2) the execution times drew themselves back in line with what I would expect for threads explicitly scheduled on separate processors…. that is 1-2 threads roughly the same time, 3-4 threads roughly the same time but double that for 1-2 etc

    I’m now going to toss in two words that I don’t know very much about but which I seem to remember having seen mentioned in connection with try/finally constructs in the past:

    Stack frames ?

    • Interesting, thanks for the update. My average goes up to about 330ms when the try/finally is added back in.

  5. Don’t have access to Delphi right now, but given the timings, silly question, but did you select the “high performance” profile in Windows prior to starting the tests? Or manually disable speedstep/cool’n’quiet? Balanced profile won’t do.
    Even desktop have variable CPU freqs, and it can have weird performance effects that could explain your timings on its own.

    • Processor at full pelt, and I made sure things like that appalling WPF 4.0 optimiser thing weren’t running.

      From my point of view, the ‘weird’ timings were those you quoted in the comments to your own post which implied TMonitor’s cost went up for every new thread – I haven’t been able to replicate that all.

      • Your critical section timings are surprising, 2 threads should run in the same time as one, and they did in my tests, without adjusting spin count or anything. That in your timing they take more than twice the time is very weird.

      • I mean, your results look like what one would expect on a single core machine, that would get upgraded to multi-core in the middle of the test… I’ll have a look tomorrow 🙂

      • Oh dear, I’ve just translated the code to C# and found the .NET Monitor beating even the Win32 critical section object with the spin count optimisation so long as the number of threads doesn’t go beyond the number of cores. Once it does, the time progressively increases, but still nowhere near the Delphi TMonitor’s poor show.

    • Thanks. However, that doesn’t help to explain why the C# version is the quickest of the lot. All I can say is that it’s good that some sort of fix is being looked into…

  6. Maybe .Net “lock” is faster because of internal CAS lock? (instead of “heavier” critical section API)

  7. I suspect that using a Critical Section or a TMonitor just for one Dec(Count) instruction is not relevant.
    You should use InterlockedDecrement(Count) instead.
    In real world applications, your loop instruction will be MUCH more complex than this one OP instruction.
    What you’re timing here is the CriticalSection / TMonitor overhead, not the scalability of this technique in real world.

    • The dec is just a minimal workload, you can replace it with something else if that disturbs you. Good practice is to have whatever is in a critical section be as minimal as possible (otherwise there is no point in being multithreaded), and CriticalSection/TMonitor is thus very relevant to the scalability in the real world (the alternative being to go for lock-free approaches, which just aren’t as trivial as they may sound).

  8. Okay now there is something very weird happening, starting from your code, I initially got results similar to yours (with CriticalSection not benefitting from multiple threads).

    After a few tweaks of your code: replaced the countdown event with a simple loop on the thread’s WaitFor, manually freeing the threads between the runs rather than relying on FreeOnTerminate (to make sure they are thoroughly freed and gone from the memory), making sure there is no processor affinity, and that CPU speed is at max, my results are below, TMonitor is twice slower than a critical section, and only exhibits contention at 5 threads (on a quad-core). CS are rather rock-solid in terms of contention, with only the bus-lock showing through.

    TCriticalSection: 1 thread(s) took 477ms
    TMonitor: 1 thread(s) took 936ms
    TCriticalSection: 2 thread(s) took 484ms
    TMonitor: 2 thread(s) took 954ms
    TCriticalSection: 3 thread(s) took 489ms
    TMonitor: 3 thread(s) took 963ms
    TCriticalSection: 4 thread(s) took 506ms
    TMonitor: 4 thread(s) took 1004ms
    TCriticalSection: 5 thread(s) took 749ms
    TMonitor: 5 thread(s) took 2167ms
    TCriticalSection: 6 thread(s) took 797ms
    TMonitor: 6 thread(s) took 5675ms
    TCriticalSection: 7 thread(s) took 948ms
    TMonitor: 7 thread(s) took 4582ms
    TCriticalSection: 8 thread(s) took 1073ms
    TMonitor: 8 thread(s) took 3748ms
    TCriticalSection: 9 thread(s) took 1244ms
    TMonitor: 9 thread(s) took 5208ms
    TCriticalSection: 10 thread(s) took 1306ms
    TMonitor: 10 thread(s) took 4045ms

  9. I’ve just been replacing some of the TCriticalSection usages in our (large) code base with TMonitor based synchronisation.

    After replacing quite a few I reviewed the current set of code changes and then ran through some of our tests where I found a pretty disturbing reduction in performance. The reduction seems to be a result of the CPU running at ~10-15%, rather than 100% as it the usual case in our performance tests. The code is highly concurrent (many more threads than cores). The performance tests are run cold and then again once system and application caches are warmed up.

    Blocks of code covered by synchronisation locks range from “Add two numbers together” (for which there is not an x86 interlocked operation) to “Perform quite a bit of processing on a chunk of data”.

    A quick Google showed up this blog. I thought it useful to include my feedback in terms of running into (potentially) the same performance issue with TMonitor.

    I had been operating under the assumption that the TMonitor approach would be lighter and more performant than the OS Critical Section object (else there’s no point, right?).

    I have not yet started zeroing into exactly which TMonitor usage instances are giving me the performance hit (and to prove I have not done something silly). I appreciate lots of factors can conspire to impact system performance and more investigation is needed to prove the exact cause – still, it is suspicious!

    If there is an issue here hopefully it will be corrected in Delphi 2011. 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s