Last TMonitor post, I promise (well, at least until the ‘hotfix’ for TMonitor.Wait gets forgotten about…): it seems TMonitor’s apparent slowness in uncontended scenarios, as discussed in my previous post, is less than it first appeared. In fact, by tweaking my original test, I can now get it to perform faster than TCriticalSection so long as more than one thread is created. Credit for this discovery must go to commentator Krystian over at Eric Grange’s blog – specifically, the problem with TMonitor’s apparent performance was primarily a function of its needing to dynamically allocate a small bit of memory for its own state, which, when several TMonitors are initialised in quick succession, leads them to allocate memory in the same processor cache line. Ensure the TMonitor is initialised up front though, along with something else getting allocated at the same time, and the problem goes away.
While this has TMonitor now besting TCriticalSection, TRTLCriticalSection still comes out on top: quite simply, even though TCriticalSection could barely be any lighter, the simple fact of it being a class ‘kills’ its performance. This is then aggravated by the fact that its Enter and Leave methods are inlined to call the virtual Acquire and Release rather than vice versa. OK, other things being equal it should be the other way round, but really…
Anyhow, onto the revised test: since there were complaints about me using TCountdownEvent and TStopwatch, I’ve replaced them with direct API calls. While using the VCL would have been perfectly acceptable, I’ve nonetheless kept the test program as a console application. Further, I’ve taken memory deallocation out of the picture by freeing the thread objects explicitly (explicit memory allocation was already outside the timed section, given I created the thread objects before timing started). So, without further ado, here’s the new code:
program TMonitorVsTCriticalSectionV2;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes, SyncObjs;

type
  TTestThread = class(TThread)
  strict private
    FSomeHeapData: IInterface;
  public
    constructor Create; virtual;
  end;

  TTestThreadClass = class of TTestThread;

constructor TTestThread.Create;
begin
  inherited Create(True);
  FSomeHeapData := TInterfaceList.Create;
end;

procedure RunTest(const TestName: string; ThreadCount: Integer;
  ThreadClass: TTestThreadClass);
var
  I: Integer;
  Threads: array of TThread;
  ThreadHandles: array of THandle;
  StartCounts, EndCounts, CountsFreq: Int64;
begin
  SetLength(ThreadHandles, ThreadCount);
  SetLength(Threads, ThreadCount);
  for I := 0 to ThreadCount - 1 do
  begin
    Threads[I] := ThreadClass.Create;
    ThreadHandles[I] := Threads[I].Handle;
  end;
  QueryPerformanceCounter(StartCounts);
  for I := 0 to ThreadCount - 1 do
    Threads[I].Start;
  WaitForMultipleObjects(ThreadCount, @ThreadHandles[0], True, INFINITE);
  QueryPerformanceCounter(EndCounts);
  QueryPerformanceFrequency(CountsFreq);
  //free the threads explicitly to take memory deallocation out of the equation
  for I := 0 to ThreadCount - 1 do
    Threads[I].Free;
  Writeln(TestName, ' ', ThreadCount, ' thread(s) took ',
    Round((EndCounts - StartCounts) * 1000 / CountsFreq), 'ms');
end;

const
  CountdownFrom = $FFFFFF; //increase if necessary...
  MaxThreads = 10;

type
  TCriticalSectionThread = class(TTestThread)
  protected
    FCriticalSection: TCriticalSection;
    procedure Execute; override;
  public
    constructor Create; override;
    destructor Destroy; override;
  end;

  TCriticalSectionThreadNoVirt = class(TCriticalSectionThread)
  protected
    procedure Execute; override;
  end;

  TMonitorThread = class(TTestThread)
  protected
    procedure Execute; override;
  public
    constructor Create; override;
  end;

  TRTLCriticalSectionThread = class(TTestThread)
  strict private
    FCriticalSection: TRTLCriticalSection;
  protected
    procedure Execute; override;
  public
    constructor Create; override;
    destructor Destroy; override;
  end;

  TRTLCriticalSectionThreadDynAlloc = class(TTestThread)
  strict private
    FCriticalSection: PRTLCriticalSection;
  protected
    procedure Execute; override;
  public
    constructor Create; override;
    destructor Destroy; override;
  end;

constructor TCriticalSectionThread.Create;
begin
  inherited Create;
  FCriticalSection := TCriticalSection.Create;
end;

destructor TCriticalSectionThread.Destroy;
begin
  FCriticalSection.Free;
  inherited Destroy;
end;

procedure TCriticalSectionThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    FCriticalSection.Enter;
    try
      Dec(Counter);
    finally
      FCriticalSection.Leave;
    end;
  until (Counter <= 0);
end;

type
  TCSAccess = class(TCriticalSection); //cracker class to reach FSection

procedure TCriticalSectionThreadNoVirt.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    //call the API directly on the underlying record, avoiding virtual dispatch
    EnterCriticalSection(TCSAccess(FCriticalSection).FSection);
    try
      Dec(Counter);
    finally
      LeaveCriticalSection(TCSAccess(FCriticalSection).FSection);
    end;
  until (Counter <= 0);
end;

constructor TMonitorThread.Create;
begin
  inherited;
  //force our monitor to be initialised
  TMonitor.Enter(Self);
  TMonitor.Exit(Self);
end;

procedure TMonitorThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    TMonitor.Enter(Self);
    try
      Dec(Counter);
    finally
      TMonitor.Exit(Self);
    end;
  until (Counter <= 0);
end;

constructor TRTLCriticalSectionThread.Create;
begin
  inherited Create;
  InitializeCriticalSection(FCriticalSection);
end;

destructor TRTLCriticalSectionThread.Destroy;
begin
  DeleteCriticalSection(FCriticalSection);
  inherited Destroy;
end;

procedure TRTLCriticalSectionThread.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    EnterCriticalSection(FCriticalSection);
    try
      Dec(Counter);
    finally
      LeaveCriticalSection(FCriticalSection);
    end;
  until (Counter <= 0);
end;

constructor TRTLCriticalSectionThreadDynAlloc.Create;
begin
  inherited Create;
  New(FCriticalSection);
  InitializeCriticalSection(FCriticalSection^);
end;

destructor TRTLCriticalSectionThreadDynAlloc.Destroy;
begin
  DeleteCriticalSection(FCriticalSection^);
  Dispose(FCriticalSection);
  inherited Destroy;
end;

procedure TRTLCriticalSectionThreadDynAlloc.Execute;
var
  Counter: Integer;
begin
  Counter := CountdownFrom;
  repeat
    EnterCriticalSection(FCriticalSection^);
    try
      Dec(Counter);
    finally
      LeaveCriticalSection(FCriticalSection^);
    end;
  until (Counter <= 0);
end;

var
  I, J: Integer;
begin
  for I := 1 to 3 do
  begin
    Writeln('*** ROUND ', I, ' ***');
    for J := 1 to MaxThreads do
    begin
      RunTest('TMonitor                               ', J, TMonitorThread);
      RunTest('TCriticalSection                       ', J, TCriticalSectionThread);
      RunTest('TCriticalSection (avoid virtual calls) ', J, TCriticalSectionThreadNoVirt);
      RunTest('TRTLCriticalSection (New/Dispose)      ', J, TRTLCriticalSectionThreadDynAlloc);
      RunTest('TRTLCriticalSection                    ', J, TRTLCriticalSectionThread);
      WriteLn;
    end;
  end;
  Write('Press ENTER to exit...');
  ReadLn;
end.
As said, for two or more threads, I now consistently get TMonitor to both outperform TCriticalSection and not exhibit the weirdness it did in my original test – instead, said weirdness is transposed to TCriticalSection. However, as before, using TRTLCriticalSection directly performs best by far.
Since everyone likes definite conclusions, I guess I can only conclude by saying this: you should avoid virtual method calls at the very least, and preferably classes too. Indeed, even dynamic memory allocations should be verboten – stick to locals and opaque (or at least semi-opaque) records whose data are allocated for you by the Windows API. Sounds about right, eh?
On a StackOverflow question, I posted a quick recap of my little experiments with multi-threaded applications. It matches, then extends, your conclusions.
About Critical Sections (and TMonitor), it states: “Don’t abuse on critical sections, let them be as small as possible, and rely on some atomic modifiers if you need some concurrent access – see e.g. InterlockedIncrement / InterlockedExchangeAdd;”
Using InterlockedDecrement would have been the right way to implement your test. Just try it, and you’ll see it’s faster than any other possibility.
InterlockedExchange (from SysUtils.pas) is a good way of updating a buffer or a shared object. You create an updated version of some content, then you exchange a shared pointer to the data (e.g. a TObject instance) in one low-level CPU operation; this notifies the other threads of the change very quickly, with very good multi-thread scaling. You’ll have to take care of data integrity yourself, but it works very well in practice.
See http://stackoverflow.com/questions/6072269/need-multi-threading-memory-manager/6076407#6076407
Arnaud, the point wasn’t to test a decrement, but a critical section – if the Dec() disturbs you, feel free to replace it with an assignment or any other small payload 😉
My point was just that for a dec or an assignment, you should use atomic Interlock*() functions instead of a Critical Section or a TMonitor.
A conclusion is that multi-threading is hard, and even simple wrapper/helper classes can introduce concurrency issues, as illustrated by TMonitor’s very own small allocation.
Or in other words, trust Microsoft, trust OmniThreadLib, don’t trust the RTL 😉
Sad… 😦
But true!
What about TMultiReadExclusiveWriteSynchronizer ?
I have used it several times, and found this little class to be efficient – more efficient than a critical section when you have multiple readers and only seldom write to the data (which is a very common case).
Do you know of any problems? The code seems safe and proven, even in old Delphi versions.
You managed to miss the point of the post entirely, though I’m not entirely surprised ;-). Here it is, made more bluntly: your initial claim that TMonitor doesn’t work at all, based on an even simpler test, was rubbish, and your revised claim only stands up in a very artificial situation. When I was playing around with this, the pattern of noise went from one implementation to another simply by rearranging the test order and messing about with what dummy data was allocated. This suggests to me the tests we’ve been doing have negligible practical value – if in a real situation using TRTLCriticalSection is significantly quicker than TMonitor then great, but I wouldn’t bank on that being the case.
Thanks for proving what I was thinking from the very beginning although I am no “pretty much by definition expert on the Windows threading model” 😉
*steals some of Masons popcorn*
Is it rubbish? You had to write *special* code to make it behave well; that’s a sign of flawed design – RtlCriticalSection behaves well all the time. Do you consider having to work around design flaws the sign of a good working library? If it happens in simple cases, it’s bound to happen in complex situations too, with no easy diagnostics or workaround. Murphy’s law.
Reminds me of a sketch about a bad tailor who keeps arguing he does a good job – it’s the people wearing his suits who don’t stand up correctly 🙂
Eric – er, yeah, your initial claim (that TMonitor is highly likely to not always synchronise, due to a ‘race condition’) was indeed ‘rubbish’. Almost as rubbish as my previous attempt to demo TMonitor in fact! 😉
“Do you consider having to work around design flaws the sign of a good working library?”
The only ‘workaround’ presented is explicitly initialising the monitor up front, which takes all of two lines (perhaps an EnsureInitialize method could be added to cut that down to one line?). Adding some dynamically created data is adding realism, not a ‘workaround’.
‘If it happens in simple cases, it’s bound to happen in complex situations too, with no easy diagnostics or workaround.’
Er, no, what our testing has indicated is the opposite for this particular case.
‘Reminds me of a sketch about a bad tailor who keeps arguing he does a good job – it’s the people wearing his suits who don’t stand up correctly’
Actually, it’s you who is turning matters on their head, thinking of quickly written artificial tests as somehow more ‘real’ than anything else…
Alas, initialising up front is a circumstantial “fix”: whether it’ll work or not will depend on the state of the memory allocator, which, outside of simple tests, you have no control over. (Sorry for typos, answering from phone.)
“Adding some dynamically created data is adding realism, not a ‘workaround’.”
Just so you understand: the dynamic data you add in your tests needs to be added precisely so that the dynamic areas allocated by TMonitor won’t be in the same cache line. For that you need a particular sequence of allocations and a particular allocation size, along with non-deallocation of the dynamic memory that got allocated in between the TMonitor allocations, so that should another TMonitor allocate its internal data, it won’t get the spot that was just freed.
F.i. if you have in memory:
TMonitor 1’s data
Some dynamic Data
TMonitor 2’s data
If you free the dynamic data and have a 3rd TMonitor allocate its stuff, you’ll end up with
TMonitor 1’s data
TMonitor 3’s data
TMonitor 2’s data
And the issue will be back.
Also, if the intermediate dynamic data you allocate doesn’t fit in the *same* allocator bucket size as your first TMonitor’s data, you’ll still have TMonitor 1’s and TMonitor 2’s data be contiguous.
That’s a pretty severe restriction, even in the very simple case of your demo, just add a few more fields in your TTestThread, and you’ll see the issue pop back.