Word Documents

Matt Lawrence matt.lawrence at virgin.net
Fri Dec 9 13:41:52 GMT 2005


Paul Makepeace wrote:
> Sam Smith wrote:
> 
>> On Wed, 7 Dec 2005, Steve Mynott wrote:
>>
>>> On Tue, Dec 06, 2005 at 10:49:57PM +0000, Sam Smith typed:
>>>
>>>> Does anyone know if there's a way to tell, from perl (on
>>>> Unix) whether a word document has track changes turned on?
>>>
>>>
>>> Why don't you save a document without track changes and then with
>>> track changes on and try a binary compare to work out the difference?
>>>
>>> (Although admittedly modern versions of Word documents always seem to
>>> think they have been changed after opening and you may find several
>>> binary changes).
>>
>>
>> I tried that, it didn't help.
>>
>> I was hoping that it would be something like read byte X and
>> jump to the offset stored in it. It isn't. Which is no
>> surprise.
> 
> 
> The reason it's unlikely to work is that Word's binary "format" is 
> essentially a serialized blob of the in-memory representation of the 
> document. (This, IIRC, led to some interesting side-effects like users 
> having access to the undo history of other people's documents.)
> 
> Depending how much time you have you could spelunk the sources or ask on 
> the developer lists of OpenOffice, Abiword, or Antiword.

OLE::Storage and OLE::Storage_Lite can help you access these .doc files.
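
Both modules expect an OLE2 compound file, which always begins with the
eight-byte signature D0 CF 11 E0 A1 B1 1A E1. A minimal sketch of a check
for that signature follows; it is written in Python rather than Perl, uses
only the standard library, and the function name looks_like_ole2 is made up
for illustration. It says nothing about track changes; it only confirms
that the file is the kind of container these modules can open.

```python
# Sketch only: OLE2 compound files (the container used by Word 97-2003
# .doc files) start with this fixed 8-byte signature.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(path):
    """Return True if the file at `path` begins with the OLE2 signature."""
    with open(path, "rb") as handle:
        # A file shorter than 8 bytes yields a short read and compares unequal.
        return handle.read(len(OLE2_SIGNATURE)) == OLE2_SIGNATURE
```

A file that fails this check (for example an RTF file saved with a .doc
extension) is not worth passing to an OLE parser at all.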

The file format is described in a zipped HTML file available here:

http://wvware.sourceforge.net/word97.zip

I've had some limited success getting data out of this format in the 
past, although I haven't (yet) managed to extract that particular data.

I started working on a module to access data in the Word format, but 
it's fallen on to the back burner and is far from ready for public 
consumption. I'd be happy to share what I have so far if you think it'll 
help.

Matt

