Compiling a Hunspell DLL, step by step

2010 February 6
by CR

As I got asked, I thought I might as well make a separate post out of it, so here goes. To compile a new Hunspell DLL, you could do the following:

  1. Make sure you have a suitable C++ compiler at the ready. I’ve no idea about C++Builder, but Visual C++ Express does the job fine, and it’s free.
  2. Make sure you have something like the (open source) 7-Zip installed so you can open a compressed tarball file (.tar.gz).
  3. Download the Hunspell source from Hunspell’s SourceForge homepage.
  4. Extract said source to a suitable directory. If you inspect the extracted files, you’ll come across more than one MSVC project file: specifically, you should be able to find one for MSVC 6 in src\hunspell (hunspell.dsp) and an MSVC 2005 project group (‘solution’) in src\win_api (Hunspell.sln, hunspell.vcproj, libhunspell.vcproj, testparser.vcproj). Basically, Hunspell is a C++ library, though at some point a new DLL project file with C exports (the one in win_api) was added to the standard distribution to make using Hunspell with (e.g.) Delphi simpler. Later still, however, a C interface with slightly different signatures was added to the main source. Consequently, it doesn’t really matter which project file you choose, since my wrapper supports DLLs produced by either. Also, while only the win_api project file is set to create a DLL by default, you’ll still have to fiddle about with it before getting it to compile, and once done, it will create a slightly larger DLL compared to the other one.
  5. That in mind, load up MSVC Express, go to File|Open|Project/Solution…, and find your way to src\hunspell\hunspell.dsp. Accept the resulting prompt to convert the project file to the newest format.
  6. Go to Project|Add Existing Item, and from src\hunspell, choose every .cxx and .hxx file. I also suggest adding Hunspell.rc from src\win_api.
  7. Go to Project|Properties, select Configuration Properties in the tree view, then All Configurations in the Configurations combo box (the latter lies directly above the tree view). Next, change Configuration Type to Dynamic Library (.dll), then select C/C++ in the tree view and set Additional Include Directories to ..\win_api.
  8. You should now be able to build the project, which will produce hunspell.dll under src\hunspell\Debug. To create a release build, select the Release configuration (most easily from the combo box immediately to the right of the ‘run’ toolbar button), and rebuild. This should cause another hunspell.dll to be created, only this time under src\hunspell\Release.
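
To check that a freshly built DLL actually loads and answers, something along the following lines may help. It is only a rough sketch, not code from my wrapper: it assumes the DLL exports the C interface from the main source under the undecorated names Hunspell_create, Hunspell_destroy and Hunspell_spell with the cdecl calling convention (inspect the DLL's export table if GetProcAddress fails), and the DLL and dictionary file names are placeholders. CheckWord is a hypothetical helper name.

```pascal
program HunspellSmokeTest;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

type
  // Signatures assumed from the C interface in Hunspell's main source;
  // cdecl and the export names used below are assumptions as well.
  THunspellCreate = function(AffPath, DicPath: PAnsiChar): Pointer; cdecl;
  THunspellDestroy = procedure(Handle: Pointer); cdecl;
  THunspellSpell = function(Handle: Pointer; AWord: PAnsiChar): Integer; cdecl;

// Returns 1 if the word is accepted, 0 if it is rejected, and -1 if the
// DLL, one of its exports or the dictionary could not be loaded.
function CheckWord(const DllPath: string;
  const AffPath, DicPath, AWord: AnsiString): Integer;
var
  Lib: HMODULE;
  CreateProc: THunspellCreate;
  DestroyProc: THunspellDestroy;
  SpellProc: THunspellSpell;
  Handle: Pointer;
begin
  Result := -1;
  Lib := LoadLibrary(PChar(DllPath));
  if Lib = 0 then Exit;
  try
    @CreateProc := GetProcAddress(Lib, 'Hunspell_create');
    @DestroyProc := GetProcAddress(Lib, 'Hunspell_destroy');
    @SpellProc := GetProcAddress(Lib, 'Hunspell_spell');
    if not (Assigned(CreateProc) and Assigned(DestroyProc) and
      Assigned(SpellProc)) then Exit;
    Handle := CreateProc(PAnsiChar(AffPath), PAnsiChar(DicPath));
    if Handle = nil then Exit;
    try
      // The word must already be in the dictionary's own encoding
      if SpellProc(Handle, PAnsiChar(AWord)) <> 0 then
        Result := 1
      else
        Result := 0;
    finally
      DestroyProc(Handle);
    end;
  finally
    FreeLibrary(Lib);
  end;
end;

begin
  // Placeholder file names; substitute your own DLL and dictionary paths
  Writeln(CheckWord('hunspell.dll', 'en_US.aff', 'en_US.dic', 'hello'));
end.
```

Loading the exports dynamically rather than via static external declarations means a missing or differently-built DLL shows up as a -1 result, not as a failure to start the program.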

Naïvely Latin-1

2010 January 23
by CR

Just checking out DelphiFeeds.com, I see a post at the top which makes an error (or at least veers towards it) that I find weirdly irritating: conflating ASCII with the Windows Latin-1 code page. Let’s get some things clear:

  • Latin-1 covers more than just modern English.
  • Latin-1 is itself a full 8-bit code page in the sense of having code points with values greater than 127.
  • Many of these code points are needed even for basic English. (That this is typically with respect to French loan words is neither here nor there.)

Consequently, for ‘human-readable’ case conversions, Delphi programmers should always have used AnsiUpperCase/AnsiLowerCase rather than just UpperCase and LowerCase, even when only English was (legitimately) assumed. Just try calling UpperCase('café') to see what I mean.
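
A small sketch of the difference, for what it's worth. UpperCase only converts the ASCII letters 'a' to 'z', whereas AnsiUpperCase defers to the operating system's locale-aware conversion. The sample word is built with #$E9 so that the source file's encoding doesn't matter; in pre-Unicode Delphi versions this assumes a Latin-1 system code page. UpperCaseIgnoresSomething is just an illustrative helper name.

```pascal
program CaseDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// True when the ASCII-only UpperCase leaves at least one character of S
// unconverted that the locale-aware AnsiUpperCase does convert.
function UpperCaseIgnoresSomething(const S: string): Boolean;
begin
  Result := UpperCase(S) <> AnsiUpperCase(S);
end;

var
  Sample: string;
begin
  Sample := 'caf'#$E9; // 'cafe' with an acute accent on the final letter
  Writeln(UpperCaseIgnoresSomething(Sample));
  Writeln(UpperCaseIgnoresSomething('cafe'));
end.
```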

Another update to my Exif (and now IPTC) code (v1.0.0)

2010 January 18
by CR

Having been stuck on v0.9.x for ages due to my somewhat arbitrary versioning scheme (or lack of one!), I thought I might as well get a v1.0.0 out, so here it is. Changes since v0.9.9 are as follows:

  • Added support for IPTC metadata as stored in Adobe APP13 segments —
    • Implemented an IPTC reader/writer class, TIPTCData, in a new unit, CCR.Exif.IPTC.pas.
    • The interface of TIPTCData is broadly modelled on TExifData’s — thus, there are ‘sections’ and ‘tags’, with high level tag properties on TIPTCData itself.
    • At a lower level, the RemoveMetadataFromJPEG global routine can now delete IPTC data, and you can enumerate the data blocks of an Adobe APP13 segment from an IJPEGSegment instance.
  • TJPEGImageEx has received a few amendments —
    • Added an IPTCData property.
    • Added an overload to Assign that allows for the preservation of any metadata, interpreted by my code or not, when a bitmap is assigned.
    • Fixed a bug in which calling the regular Assign didn’t cause the ExifData property to be updated.
  • Two more Nikon maker note types now parsed. Thanks go to Stefan Grube for updating the Exif List demo’s MakerNotes.ini for this.
  • Fixed bug of JPEG parsing code not realising a segment with a marker number of 0 has no data.
  • Fixed typo in TStreamHelper.ReadLongInt spotted by Jeff Hamblin.
  • Changed the types of the ExifImageWidth, ExifImageHeight and FocalLengthIn35mmFilm properties of TCustomExifData so as to give them MissingOrInvalid and AsString sub-properties. (Basically, they now use custom record types that have methods and operator overloads.)
  • Changed behaviour of TCustomExifData’s enumerator to not skip empty sections.
  • The LoadFromJPEG methods of TExifData are now procedures rather than functions.
  • Added a couple more demos, namely an IPTC editor and a console app to strip specified types of metadata from one or more JPEG files.
  • Removed all previously deprecated symbols.

[Update 19/1/10 -- grr, gremlins. Try downloading again to get a version compilable in D2009 or D2010 (CCR.Exif.JPEGUtils.pas and CCR.Exif.IPTCUtils.pas should now be marked v1.0.0a).]

(See http://www.iptc.org/std/IIM/4.1/specification/IIMV4.1.pdf for the IPTC specification.)

Small thought

2009 December 13
by CR

It’s hardly an original thought, but why does the off-topic group at the Embarcadero forums still exist? At my time of writing (I’m doing so in advance of actually posting BTW), it’s dominated by some American guy whining about supposed censorship from political opponents — typical ideologue, he assumes everyone else is ‘really’ as politically obsessed as himself, and so finds political motivation everywhere regardless of how non-political the words of others appear to be. In response to what turned out to be only a temporary banning of him, a supporter writes:

Nobody will be left to post here if it goes on, the whole newsgroup will stop to exist. Nobody posts a lot here anymore, not like a few years ago anyway.

To which I can only reply: good riddance. The group does only harm to Embarcadero (a technology company, not a low-rent rival for The Huffington Post or Pajamas Media) for hosting it.

PS: while the off-topic group has its dubious prominence (it’s proudly listed near the top of the forums homepage), interesting technical things like Eli Boling’s blog are buried away. Admittedly, this largely saves him from the ‘you’re all a bunch of idiots who don’t know what they’re doing’ crowd, but still…

Dynamic arrays — pure reference types, except when they’re not

2009 December 9
by CR

A reasonable way to understand the semantics of dynamic arrays in Delphi is to recall the sort of code you might have used as a substitute before they were introduced in Delphi 4. Assuming only a single dimension to keep things simple, stage one would be to declare a dummy static array type, together with a corresponding pointer type:

type
  PRectArray = ^TRectArray;
  TRectArray = array[0..$FFFFF] of TRect;

Allocation and reallocation may then be done using the appropriately-named ReallocMem routine. More exactly, you can use GetMem and FreeMem as well, though since ReallocMem can do both initial allocation and final deallocation, there’s no need — just remember to initialise the variable to nil if declared in a local routine, and always call ReallocMem at the end to free the array:
read more…
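
For readers of this excerpt, here is a minimal sketch of the ReallocMem pattern just described. It is an illustration written for this summary and not the code from the full post; GrowPreservesContents is a hypothetical name. Note that ReallocMem keeps existing contents when growing a block but does not initialise the newly added elements.

```pascal
program OldStyleArray;
{$APPTYPE CONSOLE}
uses
  Types; // TRect and the Rect function

type
  PRectArray = ^TRectArray;
  TRectArray = array[0..$FFFFF] of TRect;

// Allocates InitialCount rectangles, doubles the allocation, and reports
// whether the original elements survived the reallocation.
function GrowPreservesContents(InitialCount: Integer): Boolean;
var
  Rects: PRectArray;
  I: Integer;
begin
  Result := True;
  Rects := nil; // local pointers are not initialised automatically
  try
    // Initial allocation: ReallocMem on a nil pointer acts like GetMem
    ReallocMem(Rects, InitialCount * SizeOf(TRect));
    for I := 0 to InitialCount - 1 do
      Rects^[I] := Rect(I, I, I + 10, I + 10);
    // Growing: existing elements are kept, new ones are left uninitialised
    ReallocMem(Rects, 2 * InitialCount * SizeOf(TRect));
    for I := 0 to InitialCount - 1 do
      if Rects^[I].Left <> I then
        Result := False;
  finally
    // A size of 0 frees the block and sets the pointer back to nil
    ReallocMem(Rects, 0);
  end;
end;

begin
  Writeln(GrowPreservesContents(4));
end.
```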

Mr Hater gets constructive

2009 December 7
by CR

I haven’t looked at his blog in a while, but I see Mr Hater had an unusually constructive post last month. As with other examples of the genre, it does have its parochial edge (see in particular point 13: Embarcadero are going wrong since, having failed to create a direct replacement, they haven’t restarted maintaining the BDE?!). Nonetheless, it’s an interesting list with some good points. If reading such things is what you like to do, then have a look.

Are dynamic arrays in Delphi half-baked?

2009 December 4
by CR

In a nice little series on ‘Delphi in a Unicode world’ written and published around the time of Delphi 2009’s release, Nick Hodges writes the following on the topic of using strings as binary buffers:

A common idiom is to use a string as a data buffer. It’s common because it’s been easy — manipulating strings is generally pretty straight forward. However, existing code that does this will almost certainly need to be adjusted given the fact that string now is a UnicodeString.

There are a couple of ways to deal with code that uses a string as a data buffer. The first is to simply declare the variable being used as a data buffer as an AnsiString instead of string [...] The second and preferred way dealing with this situation [, however,] is to convert your buffer from a string type to an array of bytes, or TBytes. TBytes is designed specifically for this purpose, and works as you likely were using the string type previously.

Now, I’m totally at one with those who think misusing the string type for binary buffers was a silly thing to do. Nevertheless, to say TBytes was ‘designed specifically for this purpose’ is equally silly in my view, since in being a simple typedef for a dynamic array of bytes that was only added in D2007 (dynamic arrays themselves being added way back in D4), it patently wasn’t.

More to the point, despite having an implementation that redeployed that of the original AnsiString type for more general purposes, dynamic arrays at large — and thus, TBytes specifically — suffer from various key shortcomings in comparison:

  1. No copy-on-write semantics. The fact that dynamic arrays and strings share key RTL functions (Copy, Length and SetLength) frequently leads me to forget this, as well as the fact that dynamic arrays aren’t in fact pure reference types in use.
  2. The equals (=) and not equals (<>) operators compare references rather than data. (Note how the string type is simply more flexible here, since you can just cast to Pointer if you do want to compare string references.)
  3. You can’t use the addition (+) operator. For sure, using this in a tight loop is highly inefficient, but if it’s so terrible in principle, why allow it for strings? [Edit: before you get the wrong idea, see my response to Luigi Sandon -- 'LDS' -- in the comments.]
  4. You cannot assign an array constant to a dynamic array. Cf. how there isn’t a practical distinction between string constants and string variables: they’re all just ‘strings’, and even under the hood, a string constant is just a string with a dummy reference count.
  5. No copy-on-write semantics means you lose much of the const-ness of constant parameters and read-only properties. Basically, the consumers of an object can change the elements of a read-only dynamic array property where they can’t change the characters of a read-only string property. Admittedly, the loss of the const-ness of constant parameters is much alleviated by the open array syntax (though let’s not dilute this by encouraging the use of parameters declared as TBytes rather than ‘const array of Byte’, eh?).* Nonetheless, it is still an unfortunate side effect of dynamic arrays not being implemented as quasi-value types, à la AnsiString and UnicodeString.
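
Points 1 and 2 can be seen in a few lines. The following is a small sketch (the three function names are made up for illustration) that assumes a Delphi version whose SysUtils declares TBytes:

```pascal
program DynArraySemantics;
{$APPTYPE CONSOLE}
uses
  SysUtils; // TBytes

// Writing through a second dynamic array variable: does the first see it?
function ArrayAliasSeesWrite: Boolean;
var
  A, B: TBytes;
begin
  SetLength(A, 3);
  A[0] := 1;
  B := A;      // copies the reference only
  B[0] := 99;  // no copy-on-write, so this writes to the shared data
  Result := (A[0] = 99);
end;

// The same experiment with strings, which are copied on write.
function StringAliasSeesWrite: Boolean;
var
  S, T: string;
begin
  S := 'abc';
  T := S;
  T[1] := 'x'; // T gets its own copy before the write happens
  Result := (S[1] = 'x');
end;

// Two distinct arrays with identical contents, compared using '='.
function EqualContentsCompareEqual: Boolean;
var
  A, B: TBytes;
begin
  SetLength(A, 3);
  B := Copy(A); // a separate array holding the same bytes
  Result := (A = B); // compares the references, not the data
end;

begin
  Writeln(ArrayAliasSeesWrite);       // TRUE
  Writeln(StringAliasSeesWrite);      // FALSE
  Writeln(EqualContentsCompareEqual); // FALSE
end.
```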

In my view, it is these features that make manipulating strings ‘pretty straight forward’, and moreover, not prone to bugs through not fully understanding the type’s internal semantics. The fact that dynamic arrays do not have them, then, makes the idea of TBytes being some sort of genuine substitute for the misused old AnsiString quite false. That said, one particular issue with dynamic arrays especially gets my beef, but I’ll leave elucidating that to another time…

* Thus:

procedure Test(const Arg1: TBytes; const Arg2: array of Byte);
begin
  Arg1[0] := 99; //compiles!
  Arg2[0] := 99; //doesn't compile
end;

New revision of my Exif library (v0.9.9)

2009 November 21
by CR

I’ve just put up another revision of my Delphi Exif parsing code. This revision has two main themes:

  1. Sanity checks have been added to the parsing code, meaning every single TIFF offset is now checked. Connected to this, and by popular demand (or so it seems), the balance between accepting malformed metadata and raising an exception has now swung a bit towards the former.
  2. Better maker note support: specifically, the tag structures of Canon, Panasonic and Sony MakerNotes are now understood. The interpretation of maker note tag values is still left to the user however.
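
To give a flavour of what an offset sanity check involves, here is a generic sketch. It is not the library's actual code, and IsValidRange is a hypothetical helper: the idea is simply that an offset read from a file, plus the size of the data it points to, must stay within the bounds of the stream being parsed.

```pascal
program OffsetCheck;
{$APPTYPE CONSOLE}

// True if DataSize bytes starting at Offset lie wholly inside a block of
// TotalSize bytes. Written so that the arithmetic cannot overflow.
function IsValidRange(const Offset, DataSize, TotalSize: Int64): Boolean;
begin
  Result := (Offset >= 0) and (DataSize >= 0) and
    (Offset <= TotalSize) and (DataSize <= TotalSize - Offset);
end;

begin
  Writeln(IsValidRange(8, 12, 100));   // TRUE
  Writeln(IsValidRange(96, 12, 100));  // FALSE: runs past the end
end.
```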

Other, more minor changes include:

  • Fixed typo in GPS direction tag setter which meant the value could never be changed.
  • Added memory leak fix to CCR.XMPUtils.pas suggested by David Hoyle.
  • Added delay loading semantics to the XMPPacket property of TCustomExifData, the idea being that attempts to read Exif tags should not ever lead to an EInvalidXMPPacket exception being raised. Equivalent behaviour has been built into the new maker note parser code too.
  • More helper methods of the TryGetXXXValue and ReadXXX kind.
  • Surfaced two interop IFD tags as properties on TCustomExifData.
  • Maker note data are now moved back to their original position on save if the OffsetSchema tag had been set. (Actually, this should have been the case for the previous release but for a bug, typing Inc where I meant Dec.)
  • Demos rejigged a bit — PanasonicMakerNoteView.exe removed (its functionality has been added to an improved ExifList.exe), and two new console ones added (CreateXMPSidecar.exe and PanaMakerPatch.exe). You can download compiled versions of the demos from here.

One final note — idly Googling, I’ve found that there’s at least one person around who believes it might be realistic to backport my code to Delphi 7. Two words of advice: don’t bother. You’ll just have one problem after another.

Using Hunspell — a code page-aware wrapper

2009 October 30
by CR

Yes, it’s been done before – indeed, I’ve used Brian Moelke’s simple Hunspell wrapper from a few years back myself – but a few posts on the Embarcadero forums in the past couple of months have prompted me to write up my own.

Basically, if you haven’t heard of it, Hunspell is the open source spell checking engine used in OpenOffice, and very good it is too, at least for English – I’ve found it much better than Ispell, for example, in terms of both speed and the quality of its suggestions.

The Hunspell source itself can be downloaded from SourceForge here – you’ll need a C++ compiler to build a DLL from it (VC++ Express is fine for this purpose, and so might C++Builder – I don’t know). Calling a resulting DLL is then fairly straightforward, though one slightly tricky thing – and where my own code has its main reason for being – is in using dictionaries with foreign code pages, such as a Greek dictionary on an English system. The difficulty here is that while Hunspell itself supports UTF-8 encoded dictionaries, most actually-existing ones have an ANSI encoding – and the strings you pass to the Hunspell engine must have the encoding of the dictionary being used, the engine itself doing no conversions. In light of that, my wrapper transparently does any needed conversions for you, with the key methods having Ansi and Unicode overloads when compiling in Delphi 2006 or 2007. I’ve also tried to write the source in a D2009+ friendly manner.
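
The conversion at issue can be sketched with the Windows API directly. This is an illustration only and not the wrapper's implementation; EncodeForCodePage is a hypothetical helper name, and 1253 (Windows Greek) stands in for whatever code page the dictionary's .aff file declares. The result is meant to be handed to the DLL as a PAnsiChar, not assigned to an ordinary string variable.

```pascal
program CodePageDemo;
{$APPTYPE CONSOLE}
uses
  Windows;

// Encodes a UTF-16 string using the given Windows code page, returning the
// raw bytes in an AnsiString. Returns '' on empty input or on failure.
function EncodeForCodePage(const S: WideString; CodePage: Cardinal): AnsiString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then Exit;
  // First call: ask how many bytes the converted text needs
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(S), Length(S),
    nil, 0, nil, nil);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  // Second call: perform the conversion into the allocated buffer
  WideCharToMultiByte(CodePage, 0, PWideChar(S), Length(S),
    PAnsiChar(Result), Len, nil, nil);
end;

var
  Alpha: WideString;
begin
  Alpha := WideChar($03B1); // Greek small letter alpha
  // Code page 1253 stores alpha as the single byte $E1
  Writeln(Ord(EncodeForCodePage(Alpha, 1253)[1]));
end.
```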

It may well turn out that no one but myself finds it useful, but anyhow, it’s available here if you’re interested. The ZIP includes a demo app (as one might expect), together with a prebuilt Hunspell DLL compiled with the current-at-my-time-of-typing version of the Hunspell source, namely v1.2.8.

New revision of my Exif library (v0.9.8a)

2009 October 19
by CR

It’s taken a while, but I’ve just completed another revision of my MPL’ed Exif reader/writer code, Exif being the standard format for JPEG metadata (see here). The biggest new feature in terms of effort spent is much better XMP support – the default behaviour is now to update the equivalent XMP property whenever an Exif tag value is changed, though only when the former already exists. If you want, you can get the behaviour of Vista’s (and quite possibly Windows 7’s) Windows Explorer instead, which is to always create an XMP value whenever an Exif one is set, with a single property change – set XMPWritePolicy to xwAlwaysUpdate.

In terms of actual usefulness though, possibly a bigger change is the fact that by default, MakerNote tag data are now always written out to their original location. Unlike implementing proper XMP support, which was a right drag, this turned out to be pretty straightforward. Other than that, I’ve also fixed some bugs and fiddled around with some of the lower level code a bit – in particular, where I had previously assumed the ExifImageWidth and ExifImageHeight tags would always have longword values, I now support word-sized ones too. Moreover, the ‘correct’ positions of the JFIF, Exif and XMP segments are now enforced by TExifData.SaveToJPEG, any comment segments (for example) being moved below.

That said, looking at the CodeCentral stats, it seems quite a few people have downloaded earlier revisions of the code, which makes me think – it could do with a better name! Unfortunately, the most obvious one (dExif) has already been taken. So, any ideas..?

Update 1 (19/10/09): in the hours since I first posted this, I’ve slightly amended the original ZIP to avoid some D2009 issues — the current version of CCR.Exif.pas is thus 0.9.8a.

Update 2 (29/10/09): I’ve also now slightly amended CCR.Exif.XMPUtils.pas. Like CCR.Exif.pas, it now stands at version 0.9.8a.