The tyranny of simple tests

Last TMonitor post, I promise (well, unless the ‘hotfix’ for TMonitor.Wait gets forgotten about…): it seems TMonitor’s apparent slowness in uncontended scenarios, as discussed in my previous post, is less of an issue than it first appeared. In fact, by tweaking my original test, I can now get it to perform faster than TCriticalSection so long as more than one thread is created. Credit for this discovery must go to commentator Krystian over at Eric Grange’s blog: the problem with TMonitor’s apparent performance was primarily a function of it needing to dynamically allocate a small bit of memory for its own state, which, when several TMonitors are initialised in quick succession, leads them to allocate that memory in the same processor cache line. Ensure the TMonitor is initialised up front though, along with something else getting allocated at the same time, and the problem goes away.
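For anyone wanting to apply the fix elsewhere, the ‘initialise up front’ idiom amounts to nothing more than acquiring and immediately releasing the monitor at a point of your choosing (the revised test below does exactly this in TMonitorThread.Create); here Obj stands for whatever long-lived object you intend to lock on:

//touching the monitor once forces its hidden per-object record to be
//allocated immediately, rather than lazily on the first real Enter
TMonitor.Enter(Obj);
TMonitor.Exit(Obj);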

While this now has TMonitor besting TCriticalSection, TRTLCriticalSection still comes out on top: quite simply, even though TCriticalSection could barely be any lighter a wrapper, the simple fact of it being a class ‘kills’ its performance. This is then aggravated by its Enter and Leave methods being inlined calls to the virtual Acquire and Release rather than vice versa. OK, other things being equal it should indeed be the other way round, but really…
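For reference, the relevant part of TCriticalSection’s declaration in SyncObjs looks roughly like the following (paraphrased from memory rather than quoted verbatim) – every Enter gets inlined into a virtual dispatch to Acquire, which only then calls through to the Windows API:

type
  TCriticalSection = class(TSynchroObject)
  protected
    FSection: TRTLCriticalSection;
  public
    procedure Acquire; override;   //virtual; wraps EnterCriticalSection
    procedure Release; override;   //virtual; wraps LeaveCriticalSection
    procedure Enter; inline;       //inlined… into a virtual call to Acquire
    procedure Leave; inline;       //ditto, into Release
  end;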

Anyhow, on to the revised test: since there were complaints about me using TCountdownEvent and TStopwatch, I’ve replaced them with direct API calls. And while TStringList and indeed the VCL would have been perfectly acceptable (?), I’ve nonetheless kept the test program as a console application. I’ve also taken memory deallocation out of the picture by freeing the thread objects explicitly after timing stops (memory allocation was already excluded, given I create the thread objects before timing starts). So, without further ado, here’s the new code:

program TMonitorVsTCriticalSectionV2;

{$APPTYPE CONSOLE}

uses
  Windows,
  SysUtils,
  Classes,
  SyncObjs;

type
  TTestThread = class(TThread)
  strict private
    FSomeHeapData: IInterface;
  public
    constructor Create; virtual;
  end;

  TTestThreadClass = class of TTestThread;

constructor TTestThread.Create;
begin
  inherited Create(True);
  FSomeHeapData := TInterfaceList.Create;
end;

procedure RunTest(const TestName: string; ThreadCount: Integer;
  ThreadClass: TTestThreadClass);
var
  I: Integer;
  Threads: array of TThread;
  ThreadHandles: array of THandle;
  StartCounts, EndCounts, CountsFreq: Int64;
begin
  SetLength(ThreadHandles, ThreadCount);
  SetLength(Threads, ThreadCount);
  for I := 0 to ThreadCount - 1 do
  begin
    Threads[I] := ThreadClass.Create;
    ThreadHandles[I] := Threads[I].Handle;
  end;
  QueryPerformanceCounter(StartCounts);
  for I := 0 to ThreadCount - 1 do
    Threads[I].Start;
  WaitForMultipleObjects(ThreadCount, @ThreadHandles[0], True, INFINITE);
  QueryPerformanceCounter(EndCounts);
  QueryPerformanceFrequency(CountsFreq);
  //free the threads only once timing has stopped, to take memory deallocation out of the equation
  for I := 0 to ThreadCount - 1 do
    Threads[I].Free;
  Writeln(TestName, ' ', ThreadCount, ' thread(s) took ',
    Round((EndCounts - StartCounts) * 1000 / CountsFreq), 'ms');
end;

const
  CountdownFrom = $FFFFFF; //increase if necessary...
  MaxThreads = 10;

type
  TCriticalSectionThread = class(TTestThread)
  protected
    FCriticalSection: TCriticalSection;
    procedure Execute; override;
  public
    constructor Create; override;
    destructor Destroy; override;
  end;

  TCriticalSectionThreadNoVirt = class(TCriticalSectionThread)
  protected
    procedure Execute; override;
  end;

  TMonitorThread = class(TTestThread)
  protected
    procedure Execute; override;
  public
    constructor Create; override;
  end;

  TRTLCriticalSectionThread = class(TTestThread)
  strict private
    FCriticalSection: TRTLCriticalSection;
  protected
    procedure Execute; override;
  public
    constructor Create; override;
    destructor Destroy; override;
  end;

  TRTLCriticalSectionThreadDynAlloc = class(TTestThread)
  strict private
    FCriticalSection: PRTLCriticalSection;
  protected
    procedure Execute; override;
  public
    constructor Create; override;
    destructor Destroy; override;
  end;

constructor TCriticalSectionThread.Create;
begin
  inherited Create;
  FCriticalSection := TCriticalSection.Create;
end;

destructor TCriticalSectionThread.Destroy;
begin
  FCriticalSection.Free;
  inherited Destroy;
end;

procedure TCriticalSectionThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    FCriticalSection.Enter;
    try
      Dec(Counter);
    finally
      FCriticalSection.Leave;
    end;
  until (Counter <= 0);
end;

type
  //'cracker' descendant, used purely to reach TCriticalSection's protected FSection field
  TCSAccess = class(TCriticalSection);

procedure TCriticalSectionThreadNoVirt.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    EnterCriticalSection(TCSAccess(FCriticalSection).FSection);
    try
      Dec(Counter);
    finally
      LeaveCriticalSection(TCSAccess(FCriticalSection).FSection);
    end;
  until (Counter <= 0);
end;

constructor TMonitorThread.Create;
begin
  inherited;
  //force our monitor to be initialised
  TMonitor.Enter(Self);
  TMonitor.Exit(Self);
end;

procedure TMonitorThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    TMonitor.Enter(Self);
    try
      Dec(Counter);
    finally
      TMonitor.Exit(Self);
    end;
  until (Counter <= 0);
end;

constructor TRTLCriticalSectionThread.Create;
begin
  inherited Create;
  InitializeCriticalSection(FCriticalSection);
end;

destructor TRTLCriticalSectionThread.Destroy;
begin
  DeleteCriticalSection(FCriticalSection);
  inherited Destroy;
end;

procedure TRTLCriticalSectionThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    EnterCriticalSection(FCriticalSection);
    try
      Dec(Counter);
    finally
      LeaveCriticalSection(FCriticalSection);
    end;
  until (Counter <= 0);
end;

constructor TRTLCriticalSectionThreadDynAlloc.Create;
begin
  inherited Create;
  New(FCriticalSection);
  InitializeCriticalSection(FCriticalSection^);
end;

destructor TRTLCriticalSectionThreadDynAlloc.Destroy;
begin
  DeleteCriticalSection(FCriticalSection^);
  Dispose(FCriticalSection);
  inherited Destroy;
end;

procedure TRTLCriticalSectionThreadDynAlloc.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    EnterCriticalSection(FCriticalSection^);
    try
      Dec(Counter);
    finally
      LeaveCriticalSection(FCriticalSection^);
    end;
  until (Counter <= 0);
end;

var
  I, J: Integer;
begin
  for I := 1 to 3 do
  begin
    Writeln('*** ROUND ', I, ' ***');
    for J := 1 to MaxThreads do
    begin
      RunTest('TMonitor                               ',
        J, TMonitorThread);
      RunTest('TCriticalSection                       ',
        J, TCriticalSectionThread);
      RunTest('TCriticalSection (avoid virtual calls) ',
        J, TCriticalSectionThreadNoVirt);
      RunTest('TRTLCriticalSection (New/Dispose)      ',
        J, TRTLCriticalSectionThreadDynAlloc);
      RunTest('TRTLCriticalSection                    ',
        J, TRTLCriticalSectionThread);
      WriteLn;
    end;
  end;
  Write('Press ENTER to exit...');
  ReadLn;
end.

As I said, for two or more threads, I now consistently find TMonitor both outperforms TCriticalSection and no longer exhibits the weirdness it did in my original test – instead, said weirdness is transposed onto TCriticalSection. As before, however, using TRTLCriticalSection directly performs best by far.

Since everyone likes definite conclusions, I guess I can only conclude by saying this: you should avoid virtual method calls at the very least, and preferably classes too. Indeed, even dynamic memory allocations should be verboten – stick to locals and opaque (or at least semi-opaque) records whose data is allocated for you by the Windows API. Sounds about right, eh?

11 thoughts on “The tyranny of simple tests”

  1. On a StackOverflow question, I made a quick recap of my little experiment about multi-threading applications. It matches, then extends your own conclusion.
    About Critical Sections (and TMonitor), it states: “Don’t abuse critical sections, let them be as small as possible, and rely on some atomic modifiers if you need some concurrent access – see e.g. InterlockedIncrement / InterlockedExchangeAdd.”
    Using InterlockedDecrement would have been the right way to implement your test. Just try it, and you’ll see it’s faster than any other possibility.
    InterlockedExchange (from SysUtils.pas) is a good way of updating a buffer or a shared object. You create an updated version of some content, then exchange a shared pointer to the data (e.g. a TObject instance) in one low-level CPU operation; the change is made visible to the other threads very quickly, with very good multi-thread scaling. You’ll have to take care of the data integrity yourself, but it works very well in practice.
    See http://stackoverflow.com/questions/6072269/need-multi-threading-memory-manager/6076407#6076407
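    A minimal sketch of what that suggestion might look like against the test above – note it counts down a single value shared between all the threads, and that TInterlockedThread and SharedCounter are hypothetical names rather than anything from the listing:

    type
      TInterlockedThread = class(TTestThread)
      protected
        procedure Execute; override;
      end;

    var
      SharedCounter: Integer = CountdownFrom;

    procedure TInterlockedThread.Execute;
    begin
      //no critical section needed - InterlockedDecrement is atomic and
      //returns the new value of the shared counter
      repeat
      until InterlockedDecrement(SharedCounter) <= 0;
    end;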

    • Arnaud, the point wasn’t to test a decrement but a critical section; if the Dec() bothers you, feel free to replace it with an assignment or any other small payload 😉

  2. A conclusion is that multi-threading is hard, and even simple wrapper/helper classes can introduce concurrency issues, as illustrated by TMonitor’s very own small allocation.

    Or in other words, trust Microsoft, trust OmniThreadLib, don’t trust the RTL 😉

    • Sad… 😦
      But true!

      What about TMultiReadExclusiveWriteSynchronizer?
      I have used it several times, and found this little class to be efficient – more efficient than a critical section when you have multiple readers and only seldom write to the data (which is a very common case).
      Do you know of any problems? The code seems safe and proven, even in old Delphi versions.

    • You managed to miss the point of the post entirely, though I’m not entirely surprised ;-). Here it is put more bluntly: your initial claim that TMonitor doesn’t work at all, based on an even simpler test, was rubbish, and your revised claim only stands up in a very artificial situation. When I was playing around with this, the pattern of noise moved from one implementation to another simply by rearranging the test order and messing about with what dummy data was allocated. This suggests to me that the tests we’ve been doing have negligible practical value – if in a real situation using TRTLCriticalSection is significantly quicker than TMonitor then great, but I wouldn’t bank on that being the case.

      • Thanks for proving what I was thinking from the very beginning, although I am no “pretty much by definition expert on the Windows threading model” 😉
        *steals some of Mason’s popcorn*

      • Is it rubbish? You had to write *special* code to make it behave well – that’s a sign of flawed design; RtlCriticalSection behaves well all the time. Do you consider having to work around design flaws the sign of a good working library? If it happens in simple cases, it’s bound to happen in complex situations too, with no easy diagnostics or workaround. Murphy’s law.

        Reminds me of the sketch about a bad tailor who keeps arguing he does a good job – it’s the people wearing his suits who don’t stand up correctly 🙂

      • Eric – er, yeah, your initial claim (that TMonitor is highly likely not to actually synchronise, due to a ‘race condition’) was indeed ‘rubbish’. Almost as rubbish as my previous attempt to demo TMonitor in fact! 😉

        “Do you consider having to work around design flaws the sign of a good working library?”

        The only ‘workaround’ presented is explicitly initialising the monitor up front, which takes all of two lines (perhaps an EnsureInitialize method could be added to cut that down to one line?). Adding some dynamically created data is adding realism, not a ‘workaround’.

        ‘If it happens in simple cases, it’s bound to happen in complex situations too, with no easy diagnostics or workaround.’

        Er, no, what our testing has indicated is the opposite for this particular case.

        ‘Reminds me of the sketch about a bad tailor who keeps arguing he does a good job – it’s the people wearing his suits who don’t stand up correctly’

        Actually, it’s you who is turning matters on their head, thinking of quickly written artificial tests as somehow more ‘real’ than anything else…

        • Alas, initializing up front is a circumstantial “fix”: whether it’ll work or not will depend on the state of the memory allocator, which, outside of simple tests, you have no control over. (sorry for typos, answering from phone)

        • “Adding some dynamically created data is adding realism, not a ‘workaround’.”

          Just so you understand, the dynamic data you add in your tests is needed purely so that the areas dynamically allocated by TMonitor won’t sit in the same cache line. For that you need a particular sequence of allocations, of a particular size, and the dynamic memory allocated in between the TMonitor allocations must never be deallocated – otherwise, should another TMonitor allocate its internal data, it would grab the spot that was just freed.

          For instance, if you have in memory:

          TMonitor 1’s data
          Some dynamic Data
          TMonitor 2’s data

          If you free the dynamic data and have a 3rd TMonitor allocate its stuff, you’ll end up with

          TMonitor 1’s data
          TMonitor 3’s data
          TMonitor 2’s data

          And the issue will be back.

          Also, if the intermediate dynamic data you allocate doesn’t fit in the *same* bucket size as your first TMonitor’s data, you’ll still end up with TMonitor 1’s and TMonitor 2’s data being contiguous.

          That’s a pretty severe restriction: even in the very simple case of your demo, just add a few more fields to your TTestThread and you’ll see the issue pop back.
