XML – teaching newspapers how to read themselves

Prepress & Integration
June 1999
newspaper techniques
Markup languages
Steve Shipside
Although HTML cannot be considered a failure as a markup language, it does
perhaps fall a bit short of the solution newspapers desire ... that’s where XML
enters the picture. > p. 11. XML needs a lot of practice within the industry to become
an industry tool. > p. 12. The XML family tree. > p. 14. Just as XML tries to bring
meaning to the world of text layout, it can also bring added dimension to images
and graphics. > p. 16. The impact of XML on journalists. > p. 18. NAA’s Classified Ad
Standard based on XML. > p. 20.
Newspapers, though they may have
the wisdom of nations within their pages,
are not in themselves very smart things.
Inert, dumb and with no self-knowledge,
they rely for their impact on a truly remarkable data processor – us. People are
phenomenal information processors. Just
look at the person in the street studying a
paper – skimming headlines, soaking up
captions and then delving into dense information to emerge with precisely the detail
he or she wants, like a heron impaling a
fish.
It’s a good thing we are so gifted, because even in the information age, with the
web completely re-writing the rules of
publishing, we still rely heavily on that human talent for data fishing. HTML may
have transformed the web from an academic’s notice board to the richest information source on the planet, but it has little
more self-awareness than its rag and pulp
predecessors. In order to take the next big
step forward, it seems, we are simply going
to have to teach publications a little of our
own skill.
No one could ever describe HTML as
a failure – it has undoubtedly ushered in a
new era of global information access, but
for all the fanfare surrounding the web, it
is important to remember that HTML is seriously limited. As a markup language it
can specify certain details about how information is displayed – font, page positioning, etc., but it simply has no idea about
the content of that information. It can ensure that a string of characters is displayed on a new line after the heading, but
it has no way of knowing that it has just
spelt out the name of the person who wrote
the article – something that would be
plainly obvious to any human scanning the
page.
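The gap between knowing how text looks and knowing what it is can be shown in a few lines. The sketch below is illustrative only: the element names `story`, `headline` and `byline` are invented for this example and do not come from any published news format.

```python
import xml.etree.ElementTree as ET

# Presentation-only markup: a program sees bold text after a heading, nothing more.
html_fragment = "<div><h1>Floods hit the coast</h1><br/><b>Jane Example</b></div>"

# Content markup: hypothetical element names chosen for this illustration.
xml_fragment = (
    "<story>"
    "<headline>Floods hit the coast</headline>"
    "<byline>Jane Example</byline>"
    "</story>"
)


def find_author(markup):
    """Return the text of the first <byline> element, or None if there is none."""
    root = ET.fromstring(markup)
    node = root.find(".//byline")
    return node.text if node is not None else None


html_author = find_author(html_fragment)  # the <b> tag says nothing about authorship
xml_author = find_author(xml_fragment)    # the <byline> tag names its own content
print(html_author, xml_author)
```

The same name appears in both fragments, but only the second one tells a program that it belongs to the author.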
If you could go just that one step further, and include not just information
about appearance but also about content,
then the possibilities are immense. By identifying the content, you make it searchable
– you turn all your published output into a
database at a stroke, and that in turn
means that you can select elements of publications, retrieve them, and re-use them
for different purposes. Not that you are
even expected to know for sure what those
purposes are yet. With the advent of interactive TV, smart phones with built-in displays, and wireless-capable PDAs such as
the Palm Pilot, no one can be quite sure of
how the public will want their information
delivered in the near future. Anyone who
has battled with the hoary issue of browser
incompatibility will be aware that HTML,
when presented with a screen of an unexpected definition, cheerfully scrambles
your beloved pages.
However, if the markup language
were aware of what was a headline, a standfirst
(or introductory paragraph), a Reuters
story or last Thursday’s Dilbert cartoon, it
could be told exactly what to display. A
browser that you don’t know, because it
hasn’t been developed yet, on a portable
device you haven’t seen, because it doesn’t
exist, would know how to ask for your
day’s headlines, lead stories or weather
maps, all without you having to do a thing
to the format of your existing media.
To do all that, of course, requires
something a little beyond the reach of
<CENTER><B><FONT COLOR="#FF0000"><FONT SIZE=+3>. It
needs a vocabulary of its own, to describe
all the different elements of content that
matter to the publisher, the art department,
the subscriptions department, editorial, ads,
and the reader.
Plus it needs a grammar, a structure
so that browsers or retrieval tools understand, for example, that several of those
vocabulary words in one context might
mean several elements, or they all might be
applicable to one picture or piece of text.
Which is where XML comes in.
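The split between vocabulary and grammar can be sketched with a toy example. Everything here is an assumption made for illustration: the element names and the `GRAMMAR` table are hypothetical, and in a real XML language the same job is done by a DTD rather than by hand-written code.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: each element name maps to the children it may contain.
GRAMMAR = {
    "story": {"headline", "standfirst", "body", "picture"},
    "picture": {"caption", "credit"},
    "headline": set(),
    "standfirst": set(),
    "body": set(),
    "caption": set(),
    "credit": set(),
}


def check(element, errors=None):
    """Collect every place where a document breaks the toy grammar."""
    if errors is None:
        errors = []
    allowed = GRAMMAR.get(element.tag)
    if allowed is None:
        # The word is not in the vocabulary at all.
        errors.append(f"unknown element <{element.tag}>")
        return errors
    for child in element:
        # The word exists, but the grammar does not permit it in this position.
        if child.tag in GRAMMAR and child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        check(child, errors)
    return errors


good = "<story><headline>H</headline><picture><caption>C</caption></picture></story>"
bad = "<story><headline>H</headline><caption>C</caption></story>"

good_errors = check(ET.fromstring(good))
bad_errors = check(ET.fromstring(bad))
print(good_errors, bad_errors)
```

In the second document every word is in the vocabulary, yet the caption is attached to the story rather than to a picture, and it is the grammar that catches this.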
More than just a language
XML, or eXtensible Markup Language, is a little bit of a misnomer, since it
is not truly a language. Instead it is a
metalanguage, an off-putting term that
means a language with which to describe
languages: a set of rules and tools from
which individual industries can create their
own specialised dialects, without rendering
them incomprehensible to a browser that is
versed in the original metalanguage.
XML is actually derived from the
same parentage as HTML: HTML is an application of SGML (Standard Generalised
Markup Language), and XML is a simplified subset of it. SGML isn’t a language
either; it’s an even more complex metalanguage, established as an international standard (complete with an ISO number, 8879)
but deemed too comprehensive (i.e. incomprehensible) to be used wholesale, and so has
been simplified down to the rules and tools
we now know, and some would say love, as
XML.
“XML proponents claim it will cure
everything that is wrong with HTML and
enable the seamless exchange of data between different applications and operating
systems,” said Allan Marshall, Group Technology Director of Associated Newspapers
in London. “By separating content and
structure from presentation, the same XML
source document can be written once, then
displayed in a variety of ways.”
For further information on XMLNews, go to
http://www.xmlnews.org.
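Marshall’s point about writing a document once and displaying it in several ways can be sketched as follows. The story markup, the two output functions and the 20-character screen width are all assumptions invented for this example, not part of XMLNews or any other format named in this article.

```python
import xml.etree.ElementTree as ET
from html import escape

# One source document, using hypothetical element names.
SOURCE = """<story>
  <headline>Council approves new bridge</headline>
  <byline>A. Reporter</byline>
  <body>Work is due to start next spring.</body>
</story>"""


def to_html(story):
    """Render the story for a web browser."""
    return (
        f"<h1>{escape(story.findtext('headline'))}</h1>"
        f"<p><i>{escape(story.findtext('byline'))}</i></p>"
        f"<p>{escape(story.findtext('body'))}</p>"
    )


def to_small_screen(story, width=20):
    """Render only the headline, shortened to fit a narrow display."""
    headline = story.findtext("headline")
    if len(headline) <= width:
        return headline
    return headline[: width - 3] + "..."


story = ET.fromstring(SOURCE)
web_page = to_html(story)
pager_line = to_small_screen(story)
print(web_page)
print(pager_line)
```

Neither output function changes the source; a third device would simply need a third function.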
“Today pages are taken from our
print sites, distilled into PDF files, and
passed electronically to the online site
where ICENI software separates stories,
captions, headlines,” explains Marshall.
“Eventually the XML-based editorial system will do away with the need for the
ICENI application.”
An XML language for the newspaper
industry would include crucial details such
as publication number, volume, issue, publication title, date, edition, zones of distribution and page. Since a single publication
may be published several times it will also
need to handle separate information added
each time. Try doing that with HTML.
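As a rough illustration of what such details might look like in practice, the sketch below records them as attributes and reads them back per edition. The element and attribute names are hypothetical and are not taken from NITF or any other published standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical issue record: one publication, published in two editions.
ISSUE = """<issue title="The Daily Example" number="1234" volume="12" date="1999-06-01">
  <edition name="first" zones="north south">
    <page number="1"/>
  </edition>
  <edition name="late" zones="city">
    <page number="1"/>
    <page number="7"/>
  </edition>
</issue>"""


def describe(issue_xml):
    """Return the issue-level details plus the information added by each edition."""
    issue = ET.fromstring(issue_xml)
    editions = {}
    for edition in issue.findall("edition"):
        editions[edition.get("name")] = {
            "zones": edition.get("zones").split(),
            "pages": [int(page.get("number")) for page in edition.findall("page")],
        }
    return {
        "title": issue.get("title"),
        "volume": int(issue.get("volume")),
        "number": int(issue.get("number")),
        "date": issue.get("date"),
        "editions": editions,
    }


summary = describe(ISSUE)
print(summary["title"], sorted(summary["editions"]))
```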
XML also presents an answer to the
perennial headache of flexible layout
whereby a layout may change with each
edition so that a paragraph that appears on
page one of one edition may be on page
seven of the next. That headache gets even
worse with electronic delivery ‘on the fly’
where information is requested and delivered live to user specifications. Therefore
an XML language must have a system to
handle links and jumps so that each block
of text has an attribute linking it to a
counting and layout attribute. That is set
into the header of the data which is always
sent with it, so that the document can keep
track of itself, however it changes.
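One way to picture this is a header that records where each block sits in each edition, with the blocks themselves carrying only stable identifiers. This is a sketch under stated assumptions: the `layout`, `place` and `block` elements and the `next` attribute are invented here and are not drawn from any real specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the header travels with the text and maps blocks to pages.
DOC = """<document>
  <header>
    <layout edition="first"><place block="b1" page="1"/><place block="b2" page="1"/></layout>
    <layout edition="late"><place block="b1" page="1"/><place block="b2" page="7"/></layout>
  </header>
  <block id="b1" next="b2">Opening paragraph.</block>
  <block id="b2">Continuation of the story.</block>
</document>"""


def page_of(doc, block_id, edition):
    """Look up the page a block occupies in a given edition, or None if unplaced."""
    for layout in doc.iter("layout"):
        if layout.get("edition") != edition:
            continue
        for place in layout.findall("place"):
            if place.get("block") == block_id:
                return int(place.get("page"))
    return None


def jump_line(doc, block_id, edition):
    """Return a 'continued' note when the following block lands on another page."""
    block = next(b for b in doc.iter("block") if b.get("id") == block_id)
    following = block.get("next")
    if following is None:
        return None
    here = page_of(doc, block_id, edition)
    there = page_of(doc, following, edition)
    return f"Continued on page {there}" if there != here else None


doc = ET.fromstring(DOC)
first_jump = jump_line(doc, "b1", "first")
late_jump = jump_line(doc, "b1", "late")
print(first_jump, late_jump)
```

The text of the blocks never changes between editions; only the header does, which is what lets the document keep track of itself.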
Similarly, researchers would be delighted to find attributes that checked and
indicated word, character and page counts.
The editorial team, and especially the subs
and production editor, could check a workflow feature that tracks the editing history
of a document as it flows to and fro within
a publishing system. Plus, of course, blocks
of text could be identified by their type
and function, so a search engine, or even a
browser could tell the difference between a
summary, a caption, or a column. For the
first time a computer could arguably spot a
joke when it came across one. Importantly
this tagging system also allows a means of
specifying ‘no-run’ blocks, the elements of
text that are not to be displayed, either because they are not suited to a certain type
of device, or perhaps because of restricted
rights access (building rights information
into documents is another obvious bonus).
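The typed blocks, counts and ‘no-run’ rules described above might be handled along these lines. The `type`, `no-run` and `rights` attributes are assumptions made up for this sketch, as are the device names.

```python
import xml.etree.ElementTree as ET

# Hypothetical story: every block declares its type, and some carry restrictions.
STORY = """<story>
  <block type="headline">Harbour reopens</block>
  <block type="summary">Ferries resume after storm repairs.</block>
  <block type="caption" no-run="pager">The repaired sea wall at dawn.</block>
  <block type="column" rights="subscriber">Full analysis for subscribers only.</block>
</story>"""


def blocks_for(story, device, rights=()):
    """Return (type, text) pairs that may be shown on this device with these rights."""
    shown = []
    for block in story.findall("block"):
        # Skip blocks marked as unsuitable for the requesting device.
        if device in block.get("no-run", "").split():
            continue
        # Skip blocks that need a right the reader does not hold.
        needed = block.get("rights")
        if needed and needed not in rights:
            continue
        shown.append((block.get("type"), block.text))
    return shown


def word_count(story):
    """Count the words across all blocks of the story."""
    return sum(len(block.text.split()) for block in story.findall("block"))


story = ET.fromstring(STORY)
pager_view = blocks_for(story, "pager")
subscriber_view = blocks_for(story, "web", rights=("subscriber",))
print([kind for kind, _ in pager_view], word_count(story))
```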
For more information on the NewsPak service, go to
http://www.newspak.com.
For details on NITF, try http://www.mediacenter.org.
Ever since electronic publishing
sneaked onto the scene, re-purposing and
re-publishing have been a key strategy.
What XML adds to that is not only a
means of ensuring that all your media assets are made available for re-purposing
but also the possibility of automating that
whole process.
Practice makes perfect
If you’ve noticed a certain hypothetical air about many of these possibilities,
it’s because XML, although ratified by the W3C
(the World Wide Web Consortium, which sets web
standards), is, as mentioned, a metalanguage and not industry specific. In order
for it to become a hard and fast industry
tool, it needs to be used to hammer out an
industry language, and trade bodies and
software manufacturers must come to an
agreement about the necessary vocabulary
and grammar. This sounds like a formula
for disaster, you might think, and indeed
the sceptics would have been gratified in
May this year to see the headlines to the
effect that WavePhore had announced a
new XML format, XMLNews, and with it a
web service called NewsPak. Corel promptly blessed the new-born format by announcing support for XMLNews within
WordPerfect Office 2000.
“The XMLNews initiative is going to
have a lot of momentum,” said Dr. David
Megginson, Chair of the World Wide Web
Consortium’s XML Information Set Working Group, and the man WavePhore turned
to for XMLNews. “XMLNews will come
down the wire from WavePhore, the company that distributes content from the
world’s largest news organisations and information providers. The news industry has
been waiting a long time for someone to
stop just talking about new standards and
start actually implementing them – this is
it.” Excellent, except that in this bold fait
accompli, WavePhore may have managed
to put more than a few noses out of joint.
XMLNews consists of two XML formats: XMLNews-Story for content and
XMLNews-Meta for metadata. WavePhore describes XMLNews-Story as “fully
compatible subset[s] of the September 1998
XML version of the News Industry Text
Format (NITF) developed by the International Press Telecommunications Council
and the Newspaper Association of America.” However, while WavePhore cheerfully
acknowledges the NITF committee, it appears that WavePhore didn’t actually consult the committee prior to announcing
XMLNews.
It would be a very dull industry that
didn’t have the occasional spat, and the
good news is that it looks certain to goad
NITF into action. NITF is an XML standard
developed by Reuters, Associated Press,
Agence France Presse, Dow Jones, Ifra, and
others with the International Press Telecommunications Council and the Newspaper
Association of America to replace the
outdated ANPA 1312 newswire format.
Originally written in SGML, NITF was simplified into an XML format last year and,
according to its creators, can “lower editing
and transmission costs while making it
easy to repackage news for publication in
multiple media,” pointing out that “by settling upon a single markup language, news
organisations can share news articles and
graphics among print, broadcast, electronic,
Internet and archive systems without the
need for costly translations and manual editing. Using a language that embraces the
latest internationally accepted standards
assures newspapers and broadcasters that
stories can flow unimpeded between their
news systems and the Internet.”
The sharp-eyed will have noted that
emphasis on a “single markup language.”
Commercially available
Perhaps spurred into action by the attention surrounding XMLNews, the NITF lobby
has recently begun to round up allies to
endorse it. “We spend an enormous amount
of time and money supporting over 150
different news formats in our archive products,” explains Glenn Cruickshank, director
of Tribune Solutions which supplies the
NewsView archive product line.
“NITF makes our life enormously easier and our customers can spend more time
improving their content rather than converting data.” Similarly Christian Ratenburg, product manager of Denmark-based
CCI Europe, confirms that “it makes sense,
business-wise, because NITF has allowed us
to spend more time on adding value to the
benefit of our customers.”
Intype, makers of Handoff, has also
adopted NITF for web publishing, as confirmed by Bob Gale, senior program manager: “NITF allows us to deliver a lot of
power and simplicity to Web news publishers.”
What’s really important here is not
the familiar internecine quibbling of rival
standards. Those standards are only
sub/super sets of each other, all derived
from the one metalanguage, XML. No,
what really counts is that while the likes of
NITF have always received endorsements
from industry bodies and newswire agencies like Reuters and AFP, we are now
hearing endorsements from the makers of
tools. XML-based solutions are no longer a
vague promise but a commercially available reality.
Nothing emphasises that point more
clearly than the news that Microsoft, that
de facto establisher of day-to-day standards, has released an XML parser for
third-party developers to incorporate (free
of charge) into their applications. The offer
has already been pounced upon by a number of software developers, including
networking giant Novell. “XML is a key piece
of the puzzle for building multi-tier, Web-enabled applications because it helps solve
the applications integration problem,” according to Richard Hamblen, Developer
Platform Marketing Manager at Microsoft.
“Microsoft is the first to deliver a native
XML solution for developers that reaches
from the browser back to the database behind the Web server.”
The XML family tree
All the acronyms you need to become an instant XML guru.
Metadata: data about data; information which enables a computer to order,
prioritise or ‘understand’ the raw data that it is presenting. In the case of XML
that means going beyond knowing that a series of characters is to be laid out
with a space in the middle just below the header, and instead realising that
those characters represent the first and second names of the author.
Metalanguage: a language to describe languages, laying down guidelines
for usage and vocabulary but allowing scope for individual languages to differ in
the terms and instructions needed for their specific purpose.
XML (eXtensible Markup Language): actually a misnomer since XML is not
in itself a markup language but a metalanguage used to design markup languages for specific tasks. It’s ‘extensible’ because unlike a markup language it
is not a fixed format and can be added to or modified, whilst still maintaining a
linguistic structure or ‘grammar’.
SGML (Standard Generalised Markup Language): an international standard
(ISO 8879) metalanguage from which both HTML and XML were defined. For
practical purposes XML can be seen as more complex, more sophisticated, and
far more powerful than HTML, but not as complex as the full SGML set of tags
and instructions.
PGML (Precision Graphics Markup Language): basically an XML-aware
graphics format geared towards converting PostScript and PDF documents and
enabling them to hold information such as author, nature of the illustration, or
even usage fees and rights.
SVG (Scalable Vector Graphics): a format which Adobe in particular hopes
will become the standard vector graphic format to rival familiar web bitmap formats such as GIFs and JPEGs. Based on the PGML specification, it brings designer illustration features including kerning, text along paths, and unlimited fonts.
VML (Vector Markup Language): Microsoft’s chosen XML graphics format,
being built into the Office suite of applications.
RDF (Resource Description Framework): a system for archiving and retrieving data,
often referred to as the library card system for XML documents.
XLL (eXtensible Linking Language): arguably the future of hyperlinking and
a language which actually consists of two parts: one (the XPointer specification)
is the addressing system, pointing to the document to go to; the other (XLink)
describes the different possible links and how they behave. In practice one
hope for the future of XLL is for linking data to be storable in an external table,
rather than just in the document as with HTML, which would make life considerably easier when it comes to updating large numbers of links.
DTD (Document Type Definition): the ‘vocabulary’ of an XML language, including how many specific tags it has, and how they are used.
XSL (eXtensible Stylesheet Language): a series of rules to automatically reformat material for different platforms – to the delight of re-publishers and data
re-purposers everywhere.
NITF (News Industry Text Format): a language devised specifically for news and news wires by the NAA (Newspaper Association of America)
and IPTC (The International Press Telecommunications Council), originally written in SGML and since simplified into an XML format.
NML (News Markup Language): a set of XML tags aimed at describing content, now being incorporated into NITF.
ICE (Information and Content Exchange): a protocol (based on XML) proposed by Ziff Davis, Vignette, Tribune Media Services, News Internet Services,
CNET, and Hollinger International as an industry standard for exchanging content from one site to another in order to facilitate updates, and content purchasing between web-based news sources.
Microsoft Internet Explorer 5, the industry’s first XML-compliant browser software, is already available, and Microsoft
database technologies such as SQL Server
7.0 save and retrieve in XML. For Microsoft, the appeal of XML is in flexible e-commerce and electronic data interchange,
ultimately linking up all published web
data with searchable databases. It potentially does away with the need for standard
order forms for products, since any XML-enabled browser would be able to spot that
certain text blocks relate to products and
availability, while other areas need to
be filled in to complete a transaction. Forward-thinking publishers too should relish
that possibility, not least since it means
that any and every page can be a subscription form, or part of the marketing and
merchandising service.
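How a program might spot the product details and the blanks to be filled in can be sketched as below. The page markup is hypothetical: `product`, `price` and `field` are names invented for this example, not part of any Microsoft or e-commerce specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: ordinary copy, a product description and two fields to complete.
PAGE = """<page>
  <body>Subscribe today and never miss an edition.</body>
  <product sku="SUB-12" available="yes">
    <name>Twelve-month subscription</name>
    <price currency="GBP">96.00</price>
  </product>
  <field name="subscriber-name" required="yes"/>
  <field name="delivery-address" required="yes"/>
</page>"""


def order_outline(page_xml):
    """Pick out what is on sale and what the reader must supply to order it."""
    page = ET.fromstring(page_xml)
    products = [
        {
            "sku": product.get("sku"),
            "name": product.findtext("name"),
            "price": product.findtext("price"),
            "in_stock": product.get("available") == "yes",
        }
        for product in page.iter("product")
    ]
    to_fill = [f.get("name") for f in page.iter("field") if f.get("required") == "yes"]
    return products, to_fill


products, to_fill = order_outline(PAGE)
print(products[0]["name"], to_fill)
```

Because the product and the fields are identified by their tags rather than by their position on the page, any page carrying them could serve as the subscription form.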
Initially proposed back in 1996, and
debated back and forth for the last three
years, XML is no longer the hypothetical
solution it is still widely supposed to be.
Teaching newspapers, and indeed any online documents, to read themselves is not
the pipe dream it might at first appear.
With Microsoft aggressively promoting XML as the future of data exchange and e-commerce, and Adobe pushing it enthusiastically as the key to online
graphics, it was never going to go away,
but this year it has truly achieved a breakthrough with integration into products –
both generic and news trade specific – and
the release of XML parsers as components
for integration into the next generation of
tools. With the NITF initiative now gaining
momentum, events like WavePhore’s XMLNews can only serve to spur the adoption
of XML across the board, and quibbles over
differing dialects of what is ultimately the
same language will hopefully be ironed out
in the rush to communicate. For the publishing world, XML looks set to become
the electronic lingua franca for the dawn of
the new millennium.