Prepress & Integration
newspaper techniques, June 1999

Markup languages
XML – teaching newspapers how to read themselves
By Steve Shipside

In this section:
Although HTML cannot be considered a failure as a markup language, it does perhaps fall a bit short of the solution newspapers desire ... that’s where XML enters the picture.
XML needs a lot of practice within the industry to become an industry tool.
The XML family tree.
Just as XML tries to bring meaning to the world of text layout, it can also bring added dimension to images and graphics.
The impact of XML on journalists.
NAA’s Classified Ad Standard based on XML.

Newspapers, though they may have the wisdom of nations within their pages, are not in themselves very smart things. Inert, dumb and with no self-knowledge, they rely for their impact on a truly remarkable data processor – us. People are phenomenal information processors. Just look at the person in the street studying a paper – skimming headlines, soaking up captions and then delving into dense information to emerge with precisely the detail he or she wants, like a heron impaling a fish.

It’s a good thing we are so gifted, because even in the information age, with the web completely re-writing the rules of publishing, we still rely heavily on that human talent for data fishing. HTML may have transformed the web from an academic’s notice board to the richest information source on the planet, but it has little more self-awareness than its rag and pulp predecessors. In order to take the next big step forward, it seems, we are simply going to have to teach publications a little of our own skill.

No one could ever describe HTML as a failure – it has undoubtedly ushered in a new era of global information access, but for all the fanfare surrounding the web, it is important to remember that HTML is seriously limited.
As a markup language it can specify certain details about how information is displayed – font, page positioning, etc. – but it simply has no idea about the content of that information. It can ensure that a string of characters is displayed on a new line after the heading, but it has no way of knowing that it has just spelt out the name of the person who wrote the article – something that would be plainly obvious to any human scanning the page.

If you could go just that one step further, and include not just information about appearance but also about content, then the possibilities are immense. By identifying the content, you make it searchable – you turn all your published output into a database at a stroke, and that in turn means that you can select elements of publications, retrieve them, and re-use them for different purposes.

Not that you are even expected to know for sure what those purposes are yet. With the advent of interactive TV, smart phones with built-in displays, and wireless-capable PDAs such as the Palm Pilot, no one can be quite sure of how the public will want their information delivered in the near future. Anyone who has battled with the hoary issue of browser incompatibility will be aware that HTML, when presented with a screen of an unexpected definition, cheerfully scrambles your beloved pages. However, if the markup language were aware of what was a headline, a standfirst (or introductory paragraph), a Reuters story or last Thursday’s Dilbert cartoon, it could be told exactly what to display. A browser that you don’t know, because it hasn’t been developed yet, on a portable device you haven’t seen, because it doesn’t exist, would know how to ask for your day’s headlines, lead stories or weather maps, all without you having to do a thing to the format of your existing media.

To do all that, of course, requires something a little beyond the reach of <CENTER><B><FONT COLOR="#FF0000"><FONT SIZE=+3>.
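The gap between marking up appearance and marking up content can be shown in a few lines. The sketch below is only an illustration: the element names `story`, `headline`, `byline` and `body`, and the sample text, are invented for this example and are not taken from any published standard. It uses the XML parser from Python's standard library, and the presentational sample is written as well-formed markup so that the same parser can read both.

```python
import xml.etree.ElementTree as ET

# Presentational markup: it says how the text should look, not what it is.
presentational = """
<page>
  <center><b><font size="+3">Council approves new bridge</font></b></center>
  <p>Jane Doe</p>
  <p>The city council voted on Tuesday to fund a second river crossing.</p>
</page>
""".strip()

# Content markup: hypothetical element names chosen for this sketch.
content_aware = """
<story>
  <headline>Council approves new bridge</headline>
  <byline>Jane Doe</byline>
  <body>The city council voted on Tuesday to fund a second river crossing.</body>
</story>
""".strip()


def find_author(markup):
    """Return the byline text if the markup identifies one, else None."""
    root = ET.fromstring(markup)
    node = root.find(".//byline")
    return node.text if node is not None else None


author_from_layout = find_author(presentational)
author_from_content = find_author(content_aware)

print(author_from_layout)   # None
print(author_from_content)  # Jane Doe
```

In the first sample the author's name is just another paragraph, so a program has nothing to search for. In the second, the same name can be retrieved by asking for the byline.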
The task needs a vocabulary of its own, to describe all the different elements of content that matter to the publisher, the art department, the subscriptions department, editorial, ads, and the reader. Plus it needs a grammar, a structure so that browsers or retrieval tools understand, for example, that several of those vocabulary words in one context might mean several elements, or they all might be applicable to one picture or piece of text. Which is where XML comes in.

> More than just a language

XML, or eXtensible Markup Language, is a little bit of a misnomer, since it is not truly a language. Instead it is a metalanguage, an off-putting term that means a language with which to describe languages, a set of rules and tools from which individual industries can create their own specialised dialects, without rendering them incomprehensible to a browser that is versed in the original metalanguage.

XML is actually derived from the same parentage as HTML: HTML is an application of SGML (Standard Generalised Markup Language), while XML is a simplified subset of it. SGML isn’t a language either; it’s an even more complex metalanguage, established as an international standard (complete with an ISO number, 8879), deemed too comprehensive (i.e. incomprehensible) to be used wholesale, and so has been simplified down to the rules and tools we now know, and some would say love, as XML.

“XML proponents claim it will cure everything that is wrong with HTML and enable the seamless exchange of data between different applications and operating systems,” said Allan Marshall, Group Technology Director of Associated Newspapers in London. “By separating content and structure from presentation, the same XML source document can be written once, then displayed in a variety of ways.”

“Today pages are taken from our print sites, distilled into PDF files, and passed electronically to the online site where ICENI software separates stories, captions, headlines,” explains Marshall. “Eventually the XML-based editorial system will do away with the need for the ICENI application.”

For further information on XMLNews, go to http://www.xmlnews.org.

An XML language for the newspaper industry would include crucial details such as publication number, volume, issue, publication title, date, edition, zones of distribution and page. Since a single publication may be published several times, it will also need to handle separate information added each time. Try doing that with HTML.

XML also presents an answer to the perennial headache of flexible layout, whereby a layout may change with each edition so that a paragraph that appears on page one of one edition may be on page seven of the next. That headache gets even worse with electronic delivery ‘on the fly’, where information is requested and delivered live to user specifications. Therefore an XML language must have a system to handle links and jumps, so that each block of text has an attribute linking it to a counting and layout attribute. That is set into the header of the data, which is always sent with it, so that the document can keep track of itself, however it changes.

Similarly, researchers would be delighted to find attributes that checked and indicated word, character and page counts. The editorial team, and especially the subs and production editor, could check a workflow feature that tracks the editing history of a document as it flows to and fro within a publishing system. Plus, of course, blocks of text could be identified by their type and function, so a search engine, or even a browser, could tell the difference between a summary, a caption, or a column.
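The newspaper vocabulary described above can be made concrete with a small sketch. Everything named here is an assumption made for illustration: the `issue` and `block` elements, and the `story`, `type`, `edition` and `page` attributes, are invented and are not drawn from NITF, XMLNews or any real DTD. The point is only that once blocks carry their type, edition and page, questions about the issue become simple lookups.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: every element and attribute name below is
# invented for this sketch and does not come from any real news standard.
ISSUE = """
<issue title="The Daily Example" volume="12" number="134" date="1999-06-01">
  <block story="bridge" type="headline" edition="1" page="1">Bridge plan approved</block>
  <block story="bridge" type="summary" edition="1" page="1">Council backs second crossing.</block>
  <block story="bridge" type="caption" edition="1" page="7">The proposed site at dawn.</block>
  <block story="bridge" type="headline" edition="2" page="7">Bridge plan approved</block>
</issue>
""".strip()

root = ET.fromstring(ISSUE)


def blocks(issue, block_type, edition):
    """Text of every block of one type in one edition."""
    return [b.text for b in issue.iter("block")
            if b.get("type") == block_type and b.get("edition") == edition]


def page_of(issue, story, block_type, edition):
    """Page on which a given story's block appears in a given edition."""
    for b in issue.iter("block"):
        key = (b.get("story"), b.get("type"), b.get("edition"))
        if key == (story, block_type, edition):
            return int(b.get("page"))
    return None


def word_count(issue, edition):
    """Total number of words across all blocks of one edition."""
    return sum(len(b.text.split())
               for b in issue.iter("block") if b.get("edition") == edition)


captions = blocks(root, "caption", "1")
first_edition_page = page_of(root, "bridge", "headline", "1")
second_edition_page = page_of(root, "bridge", "headline", "2")
first_edition_words = word_count(root, "1")

print(captions)                                 # ['The proposed site at dawn.']
print(first_edition_page, second_edition_page)  # 1 7
print(first_edition_words)                      # 12
```

The same headline is found on page one of the first edition and page seven of the second without anyone re-keying it, which is the kind of self-tracking the text describes.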
For the first time a computer could arguably spot a joke when it came across one. Importantly, this tagging system also allows a means of specifying ‘no-run’ blocks, the elements of text that are not to be displayed, either because they are not suited to a certain type of device, or perhaps because of restricted rights access (building rights information into documents is another obvious bonus).

Ever since electronic publishing sneaked onto the scene, re-purposing and re-publishing have been a key strategy. What XML adds to that is not only a means of ensuring that all your media assets are made available for re-purposing but also the possibility of automating that whole process.

For more information on the NewsPak service, go to http://www.newspak.com. For details on NITF, try http://www.mediacenter.org.

> Practice makes perfect

If you’ve noticed a certain hypothetical air about many of these possibilities, it’s because XML, although ratified by the W3C (the World Wide Web Consortium, which sets web standards), is, as mentioned, a metalanguage and not industry specific. In order for it to become a hard and fast industry tool, it needs to be used to hammer out an industry language, and trade bodies and software manufacturers must come to an agreement about the necessary vocabulary and grammar.

This sounds like a formula for disaster, you might think, and indeed the sceptics would have been gratified in May this year to see the headlines to the effect that WavePhore had announced a new XML format, XMLNews, and with it a web service called NewsPak. Corel promptly blessed the new-born format by announcing support for XMLNews within WordPerfect Office 2000.

“The XMLNews initiative is going to have a lot of momentum,” said Dr. David Megginson, Chair of the World Wide Web Consortium’s XML Information Set Working Group, and the man WavePhore turned to for XMLNews.
“XMLNews will come down the wire from WavePhore, the company that distributes content from the world’s largest news organisations and information providers. The news industry has been waiting a long time for someone to stop just talking about new standards and start actually implementing them – this is it.”

Excellent, except that in this bold fait accompli, WavePhore may have managed to put more than a few noses out of joint. XMLNews consists of two XML formats: XMLNews-Story for content and XMLNews-Meta for metadata. WavePhore describes XMLNews-Story as one of the “fully compatible subset[s] of the September 1998 XML version of the News Industry Text Format (NITF) developed by the International Press Telecommunications Council and the Newspaper Association of America.” However, while WavePhore cheerfully acknowledges the NITF committee, it appears that WavePhore didn’t actually consult the committee prior to announcing XMLNews.

It would be a very dull industry that didn’t have the occasional spat, and the good news is that this one looks certain to goad NITF into action. NITF is an XML standard developed by Reuters, Associated Press, Agence France Presse, Dow Jones, Ifra, and others with the International Press Telecommunications Council and the Newspaper Association of America to replace the outdated ANPA 1312 newswire format. Originally written in SGML, NITF was simplified into an XML format last year and, according to its creators, can “lower editing and transmission costs while making it easy to repackage news for publication in multiple media,” pointing out that “by settling upon a single markup language, news organisations can share news articles and graphics among print, broadcast, electronic, Internet and archive systems without the need for costly translations and manual editing.
Using a language that embraces the latest internationally accepted standards assures newspapers and broadcasters that stories can flow unimpeded between their news systems and the Internet.”

The sharp-eyed will have noted that emphasis on a “single markup language.” Perhaps spurred into action by the attention surrounding XMLNews, the NITF lobby has recently begun to round up allies to endorse it. “We spend an enormous amount of time and money supporting over 150 different news formats in our archive products,” explains Glenn Cruickshank, director of Tribune Solutions, which supplies the NewsView archive product line. “NITF makes our life enormously easier and our customers can spend more time improving their content rather than converting data.”

Similarly Christian Ratenburg, product manager of Denmark-based CCI Europe, confirms that “it makes sense, business-wise, because NITF has allowed us to spend more time on adding value to the benefit of our customers.” Intype, makers of Handoff, has also adopted NITF for web publishing, as confirmed by Bob Gale, senior program manager: “NITF allows us to deliver a lot of power and simplicity to Web news publishers.”

> Commercially available

What’s really important here is not the familiar internecine quibbling of rival standards. Those standards are only subsets or supersets of each other, all derived from the one metalanguage, XML. No, what really counts is that while the likes of NITF have always received endorsements from industry bodies and newswire agencies like Reuters and AFP, we are now hearing endorsements from the makers of tools. XML-based solutions are no longer a vague promise but a commercially available reality.

Nothing emphasises that point more clearly than the news that Microsoft, that de facto establisher of day-to-day standards, has released an XML Parser for third-party developers to incorporate (free of charge) into their applications.
The offer has already been pounced upon by a number of software developers, including networking giant Novell.

> The XML family tree

All the acronyms you need to become an instant XML guru.

Metadata: data about data; information which enables a computer to order, prioritise or ‘understand’ the raw data that it is presenting. In the case of XML that means going beyond knowing that a series of characters is to be laid out with a space in the middle just below the header, and instead realising that those characters represent the first and second names of the author.

Metalanguage: a language to describe languages, laying down guidelines for usage and vocabulary but allowing scope for individual languages to differ in the terms and instructions needed for their specific purpose.

XML (eXtensible Markup Language): actually a misnomer since XML is not in itself a markup language but a metalanguage used to design markup languages for specific tasks. It’s ‘extensible’ because unlike a markup language it is not a fixed format and can be added to or modified, whilst still maintaining a linguistic structure or ‘grammar’.

SGML (Standard Generalised Markup Language): an international standard (ISO 8879) metalanguage from which both HTML and XML were defined. For practical purposes XML can be seen as more complex, more sophisticated, and far more powerful than HTML, but not as complex as the full SGML set of tags and instructions.

PGML (Precision Graphics Markup Language): basically an XML-aware graphics format geared towards converting PostScript and PDF documents and enabling them to hold information such as author, nature of the illustration, or even usage fees and rights.

SVG (Scalable Vector Graphics): a format which Adobe in particular hopes will become the standard vector graphic format to rival familiar web bitmap formats such as GIFs and JPEGs. Based on the PGML specification, it brings designer illustration features including kerning, text along paths, and unlimited fonts.

VML (Vector Markup Language): Microsoft’s chosen XML graphics format, being built into the Office suite of applications.

RDF (Resource Description Framework): a system for archiving and retrieving data, often referred to as the library card system for XML documents.

XLL (eXtensible Linking Language): arguably the future of hyperlinking and a language which actually consists of two parts: one (the XPointer specification) is the addressing system, pointing to the document to go to; the other (XLink) describes the different possible links and how they behave. In practice one hope for the future of XLL is for linking data to be storable in an external table, rather than just in the document as with HTML, which would make life considerably easier when it comes to updating large numbers of links.

DTD (Document Type Definition): the ‘vocabulary’ of an XML language, including how many specific tags it has, and how they are used.

XSL (eXtensible Stylesheet Language): a series of rules to automatically reformat material for different platforms – to the delight of re-publishers and data re-purposers everywhere.

NITF (News Industry Text Format): a language devised specifically for news and news wires by the NAA (Newspaper Association of America) and IPTC (International Press Telecommunications Council), originally SGML-based and since recast in XML.

NML (News Markup Language): a set of XML tags aimed at describing content, now being incorporated into NITF.

ICE (Information and Content Exchange): a protocol (based on XML) proposed by Ziff Davis, Vignette, Tribune Media Services, News Internet Services, CNET, and Hollinger International as an industry standard for exchanging content from one site to another in order to facilitate updates and content purchasing between web-based news sources.
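Two ideas from the text above, the rule-driven reformatting that the family tree attributes to XSL and the ‘no-run’ blocks described earlier, can be sketched together. This is not XSL: it is a plain Python stand-in for the idea, and the element names, the `norun` attribute and the `web` and `handheld` targets are all assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# One source document. The hypothetical "norun" attribute lists the
# targets on which a block must not be displayed.
STORY = """
<story>
  <headline>Bridge plan approved</headline>
  <summary>Council backs second crossing.</summary>
  <body norun="handheld">The city council voted on Tuesday to fund a second river crossing.</body>
</story>
""".strip()

# One rule set per target: each maps an element name to an output template.
# A stand-in for a stylesheet, not real XSL.
RULES = {
    "web": {
        "headline": "<h1>{}</h1>",
        "summary": "<p><b>{}</b></p>",
        "body": "<p>{}</p>",
    },
    "handheld": {
        "headline": "{}",
        "summary": "- {}",
        "body": "{}",
    },
}


def render(story_xml, target):
    """Render one source document for one target, skipping no-run blocks."""
    root = ET.fromstring(story_xml)
    lines = []
    for element in root:
        # Skip blocks marked as not to run on this target.
        if target in element.get("norun", "").split():
            continue
        template = RULES[target].get(element.tag)
        if template is not None:
            # No escaping is done here; the sample text has no special characters.
            lines.append(template.format(element.text))
    return "\n".join(lines)


web_page = render(STORY, "web")
handheld_text = render(STORY, "handheld")

print(web_page)
print(handheld_text)
```

The source is written once. Supporting a new device means adding a rule set rather than touching the story, and the body text is withheld from the handheld version purely because of how it is tagged.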
“XML is a key piece of the puzzle for building multi-tier, Web-enabled applications because it helps solve the applications integration problem,” according to Richard Hamblen, Developer Platform Marketing Manager at Microsoft. “Microsoft is the first to deliver a native XML solution for developers that reaches from the browser back to the database behind the Web server.” Microsoft Internet Explorer 5, the industry’s first XML-compliant browser software, is already available, and Microsoft database technologies such as SQL Server 7.0 save and retrieve in XML.

For Microsoft, the appeal of XML is in flexible e-commerce and electronic data interchange, ultimately linking up all published web data with searchable databases. It potentially does away with the need for standard order forms for products, since any XML-enabled browser would be able to spot that certain text blocks relate to products and availability, while other areas need to be filled in to complete a transaction. Forward-thinking publishers too should relish that possibility, not least since it means that any and every page can be a subscription form, or part of the marketing and merchandising service.

Initially proposed back in 1996, and debated back and forth for the last three years, XML is no longer the hypothetical solution it is still widely supposed to be. Teaching newspapers, and indeed any online documents, to read themselves is not the pipe dream it might at first appear. With Microsoft aggressively promoting XML as the future of data exchange and e-commerce, and Adobe pushing it enthusiastically as the key to online graphics, it was never going to go away, but this year it has truly achieved breakthrough with integration into products – both generic and news trade specific – and the release of XML parsers as components for integration into the next generation of tools.
With the NITF initiative now gaining momentum, events like WavePhore’s XMLNews can only serve to spur the adoption of XML across the board, and quibbles over differing dialects of what is ultimately the same language will hopefully be ironed out in the rush to communicate. For the publishing world, XML looks set to become the electronic lingua franca for the dawn of the new millennium.