Preserving our digital future

John Chelsom started the XML Summer School in 2000, and continues as a board member and lecturer at this annual event. Since 2010 he has been the lead architect of the open source cityEHR product – an XML-based electronic health records system which combines clinical data with medical knowledge bases and is currently used in a number of hospitals in England.

2016 saw the unearthing of the oldest written document yet found in the British Isles – on a wooden Roman tablet from about 50 AD. We put huge effort into creating the digital text published today, but how much of it will still be around to read 2000 years from now?

At the XML Summer School over ten years ago, someone asked our expert panel for guidance on the best way to archive digital publications. The most creative answer, from Robin Cover (a renowned digital archivist!), was that we should carve our text on tablets of stone, marked up in XML. I can still remember laughing heartily at his suggestion, but now I’m beginning to think he was right.

Robin’s argument was that text alone is not good enough for representing the information we want to preserve – we also need some representation of structure and metadata. For that he proposed XML – its encoding is just plain text and, given a sufficiently large sample, its logic can be decoded without having the original specifications. His proof was to go back to the digital assets of the 1960s – how many of us would have software available that could read documents created way back then? Well, if those documents were marked up in GML, the Generalized Markup Language invented by Charles Goldfarb at IBM in 1969, we would find they could still be read by any software that handles XML, the Extensible Markup Language descended from GML by way of SGML. Such software is all around us and much of it is free, including any web browser or plain text editor. Try doing the same with a proprietary word processing, desktop publishing or typesetting format, where the original application ceased to exist even ten years ago.
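To make Robin’s point a little more concrete, here is a minimal sketch in Python using only the standard library. The sample document and its element names (document, metadata, title, author, body, para) are invented for this illustration and are not taken from any real schema. It simply shows that the markup is plain text a person can read unaided, while free, widely available software can still pick out the structure and metadata.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names (document, metadata, title, author,
# body, para) are invented for this illustration, not a real schema.
SAMPLE = """<document>
  <metadata>
    <title>Preserving our digital future</title>
    <author>John Chelsom</author>
  </metadata>
  <body>
    <para>Text alone is not enough.</para>
    <para>We also need structure and metadata.</para>
  </body>
</document>"""


def summarise(xml_text):
    """Return the metadata and paragraph text found in the sample markup."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("metadata/title"),
        "author": root.findtext("metadata/author"),
        "paragraphs": [para.text for para in root.findall("body/para")],
    }


if __name__ == "__main__":
    print(summarise(SAMPLE))
```

The same string could just as easily be opened in a plain text editor, which is the whole point: nothing about the format depends on the program that created it.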

As for the tablets of stone, if we found a 50-year-old GML file, what are the chances we’d be able to read the media it was stored on? Preservation of our digital assets is dependent on the technology used to store them, and even in fifty years we have seen many technologies come and go: paper tape streamers, tape drives and floppy disk drives will no doubt be followed into the dustbin of technology by DVDs and USB sticks over the next fifty years. So in thousands of years’ time, when the electricity has been switched off and archaeologists are picking over the debris of our silicon age, it’s still more likely they will find the text written on stone than on tapes, disks or chips. You may think this sounds a little crazy, but the Memory of Mankind project in Austria is aiming to do exactly that – preserving contemporary human knowledge on stone, buried deep in a salt mine for future generations to find.

I sometimes tell people that publishing should be viewed as an investment in digital assets – the more value we can create in those assets and the more we can reuse them, the greater will be the return on our investment. Our most cherished digital assets deserve to be preserved for the future, if only to protect our investment in producing them. And though many of us won’t be ready just yet to carve our documents in stone, we should at least be thinking about the first step of representing those documents in XML.

