Pearls in the ‘nextgen’-ified RTL source

Back in September 2012, a post appeared in non-tech reporting that Delphi’s PCRE wrapper was many, many times slower than Python’s. With sample code attached, the problem was undeniable, though the immediate cause was soon identified: the Delphi wrapper’s failure to include a ‘don’t validate the UTF-8’ flag (PCRE was traditionally a UTF-8 based library, so the Delphi wrapper was using UTF8String). Putting everyone’s work together, I posted a QC report. Soon after I did that, an even better solution was noted: have the wrapper wrap a newer version of PCRE that supports UTF-16 internally, i.e. Delphi’s native string encoding, and so avoid UTF-8 roundtrips entirely.

To be fair to Embarcadero, the second solution might have been considered a bit problematic in practice, given PCRE’s UTF-16 mode was only six months old at that point, and using it may have been tricky on OS X. This is because on that platform the Delphi wrapper uses the system PCRE dylib rather than statically linking equivalent C object files, since DCCOSX only consumes object files produced by the Windows C++Builder compiler (or at least, only did when I last looked into the matter). On the other hand, the additional-flag fix involves adding just a couple of lines… so perhaps it could be implemented fairly quickly?

Alas, it hasn’t been implemented as yet. Oh well – I can see that shipping iOS and Android support were much bigger fish to fry. Does this mean the unit in question hasn’t been touched at all? Oh no: it has been extensively fiddled about with, because the UTF8String type was removed from the so-called ‘nextgen’ (i.e., LLVM-based) compilers. As such, an elegant UTF8String interface has been replaced with an ordinary string one that now has to use the ugly ‘marshaller’ API and TBytes internally. Even worse, it now includes pearls like the following:

function CopyBytes(const S: TBytes; Index, Count: Integer): TBytes;
var
  Len, I: Integer;
begin
  Len := Length(S);
  if Len = 0 then
    Result := TEncoding.UTF8.GetBytes('')
  else
  begin
    if Index < 0 then Index := 0
    else if Index > Len then Count := 0;
    Len := Len - Index;
    if Count <= 0 then
      Result := TEncoding.UTF8.GetBytes('')
    else
    begin
      if Count > Len then Count := Len;
      SetLength(Result, Count);
      for I := 0 to Count - 1 do
        Result[I] := S[Index + I];
    end;
  end;
end;

If you’re reading this and thinking ‘oh no – it looks like the Move procedure has been removed!’, don’t worry, because it hasn’t. Likewise, Delphi hasn’t suddenly gone all Java-esque and dropped the equation of an empty dynamic array with a nil one – i.e., this code:

    Result := TEncoding.UTF8.GetBytes('')

really is what it seems, namely an obscure way of assigning nil that, if you step through it, passes through several method calls and IF tests to do the deed.
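
For comparison, here is a minimal sketch of how the same routine might look using Move and a plain nil result. The name CopyBytesWithMove and the little test program around it are mine, purely for illustration; this is not what ships in the RTL, just one way the byte-by-byte loop and the GetBytes('') calls could be avoided.

```delphi
program CopyBytesSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical replacement: same clamping rules as the RTL routine,
// but an empty result is simply nil and the copy is a single Move.
function CopyBytesWithMove(const S: TBytes; Index, Count: Integer): TBytes;
var
  Available: Integer;
begin
  Result := nil;
  if Index < 0 then Index := 0;
  // Number of bytes from Index to the end of the source.
  Available := Length(S) - Index;
  if (Available <= 0) or (Count <= 0) then Exit;
  if Count > Available then Count := Available;
  SetLength(Result, Count);
  Move(S[Index], Result[0], Count);
end;

var
  Source, Part: TBytes;
begin
  Source := TBytes.Create(10, 20, 30, 40, 50);
  Part := CopyBytesWithMove(Source, 1, 3);
  WriteLn(Length(Part), ' bytes, first is ', Part[0]);
  // A request past the end comes back as nil, i.e. an empty array.
  WriteLn(Length(CopyBytesWithMove(Source, 9, 2)));
end.
```

The only design choice worth noting is that Result starts out as nil, so every 'nothing to copy' case falls out of a single early Exit.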


5 thoughts on “Pearls in the ‘nextgen’-ified RTL source”

  1. A dynarray is no worse a function result than a string. But why did they not reimplement UTF8String as an implicitly cast record?

    Okay, at least they don’t generate a permutation of all values to filter the first that would suffice, like they did (do?) for normalising rects

  2. “an elegant UTF8String interface has been replaced with an ordinary string one that now has to use the ugly ‘marshaller’ API”

    Could you show an example of using this API, please? I’ve read about removing support for ANSI strings but haven’t seen much code for how to still use them when support is removed. The example you just posted is odd I agree, and I’d like to see more about the new way to do it and why it’s strange.

    • Well I didn’t say it’s ‘strange’ – it’s just nowhere near as elegant as the D2009+ UTF8String, given the latter has the usual string niceties (strong typing, copy on write, Copy standard function support etc.) and converts to and from UnicodeString with a simple cast. Whoever made the decision to drop UTF8String was extremely short-sighted, since removing a bit of maintenance work for the compiler guy(s) has just caused more work elsewhere, and moreover, caused the same work to be duplicated several times over and with solutions that are sub-optimal compared to what was possible before.

      That said, I wouldn’t bother with the ‘marshaller’ API – just call LocaleCharsFromUnicode and UnicodeFromLocaleChars directly. Added to the System unit in XE, these map to WideCharToMultiByte and MultiByteToWideChar on Windows, and have backfilled implementations on other platforms.
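
As a rough sketch of what calling those two routines directly might look like for UTF-8, assuming a compiler version in which MarshaledAString is declared (on older ones PAnsiChar plays that role): the helper names StringToUtf8Bytes and Utf8BytesToString and the Utf8CodePage constant are made up for this example, and each conversion calls the routine twice, first with a nil buffer to get the required length and then to fill the buffer.

```delphi
program Utf8RoundTripSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  Utf8CodePage = 65001; // code page identifier for UTF-8 (local name, assumed)

// Hypothetical helper: encode a UnicodeString as UTF-8 bytes.
function StringToUtf8Bytes(const S: string): TBytes;
var
  Len: Integer;
begin
  Result := nil;
  if S = '' then Exit;
  // First call: nil destination, so only the required byte count is returned.
  Len := LocaleCharsFromUnicode(Utf8CodePage, 0, PWideChar(S), Length(S),
    nil, 0, nil, nil);
  SetLength(Result, Len);
  if Len > 0 then
    LocaleCharsFromUnicode(Utf8CodePage, 0, PWideChar(S), Length(S),
      MarshaledAString(Result), Len, nil, nil);
end;

// Hypothetical helper: decode UTF-8 bytes back into a UnicodeString.
function Utf8BytesToString(const B: TBytes): string;
var
  Len: Integer;
begin
  Result := '';
  if Length(B) = 0 then Exit;
  // First call: nil destination, so only the required char count is returned.
  Len := UnicodeFromLocaleChars(Utf8CodePage, 0, MarshaledAString(B),
    Length(B), nil, 0);
  SetLength(Result, Len);
  if Len > 0 then
    UnicodeFromLocaleChars(Utf8CodePage, 0, MarshaledAString(B), Length(B),
      PWideChar(Result), Len);
end;

var
  Original: string;
  Encoded: TBytes;
begin
  Original := 'caf' + #$00E9; // three ASCII letters plus one two-byte character
  Encoded := StringToUtf8Bytes(Original);
  WriteLn(Length(Encoded)); // 5 bytes expected
  WriteLn(Utf8BytesToString(Encoded) = Original);
end.
```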
