WebDevelopersJournal.com: Tips on Web Page Design, HTML and Graphics

XML Content Syndication

Multiple Formats From One XML File

"Applied XML Solutions," a new book from Benoît Marchal, shows professional developers how to apply XML to a variety of real-world applications. These include using XML as a scripting substitute and using XSLT to facilitate communication between incompatible systems. Here we present an interesting extract from the book - the chapter devoted to content syndication: producing HTML, WML and RSS from XML.
November 15, 2000
Extract published courtesy of Sams Publishing.

This article is in four parts.

The Web is many things to many people but, for publishers and authors, it is another medium comparable to print, radio, and TV. Don't get me wrong, I recognize that the Internet has unique characteristics, but its reach is comparable to other popular media.

As proof, look at initiatives by existing publishers to offer their content online (visit www.informit.com), the emergence of new publishers (such as www.earthweb.com), and, of course, the growing involvement of authors (such as my own www.marchal.com).

Furthermore, a growing number of companies that are not necessarily publishers use their Web sites to distribute information, articles, and reports (such as developer.iplanet.com).

However, the medium is still young and changing. At the peak of the rivalry between Microsoft and Netscape, the so-called "browser war," Web fashion was changing every six months. We are now enjoying more stability, but, mark my words, the browser war is about to start again with new actors. And this time, it will be more painful for the under-prepared.

According to the W3C, non-desktop browsers might account for as much as 75% of all surfers by 2002. Non-desktop browsers include mobile phones, PDAs (such as the PalmPilot), and WebTV.

Most of these devices simply won't use HTML. During the browser war, designers could at least rely on some level of commonality between the two major browsers. This won't be the case any more because mobile phones use a special language, Wireless Markup Language (WML), which is incompatible with HTML.

What to do? Should content providers (publishers, authors, and companies) limit themselves to either HTML or WML? Should they support both formats? Should they prepare for even more formats?

Developing original content (articles, books, reports, and so on) is expensive. To offset the cost, content owners want to distribute their content as widely as possible. Ideally, it should not matter whether the reader uses a PC, a mobile phone, or another device.

In this chapter, we will see how XML helps address this challenge. As you know, XML's roots are in the publishing industry, and that heritage guarantees that there is no lack of quality tools for publishing problems.

Architecture

Webmasters typically edit their Web sites with an HTML editor. The major disadvantage of this approach is that it freezes the site. Indeed, to change the presentation, you must manually re-edit every page. It's possible to do, but it's a lot of work.

The XML solution is to separate authoring from publishing. The author of the pages writes the document in XML. While doing so, she ignores presentation. She instead adopts an XML vocabulary that focuses on the organization of the document: sections, titles, abstracts, and more.

Publishing the document then simply requires converting the document into HTML, WML, or another popular format. Fortunately, this can be automated because the original XML document is structure rich. The operative word here is automated.

For medium to large sites, it is more cost effective to automate publishing. Rewriting a couple of pages by hand is feasible; however, for a hundred pages, it is too expensive.

Figure 4.1 illustrates how we'll apply these principles in this chapter.



Figure 4.1: XML separates authoring and publishing.

The three main elements are as follows:

  • Documents in structure-rich XML

  • XSLT style sheets that implement the conversion to HTML, WML, and RSS (more on RSS in the next section)

  • A servlet that is responsible for applying the style sheets
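The three elements above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the servlet from this chapter: the class name Publisher, the publish method, the two inline style sheets, and the cut-down news document are all invented for the example. It uses nothing beyond the JAXP transformation API (javax.xml.transform) that ships with the JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: one XML source, one XSLT style sheet per output format.
final class Publisher {
    private static final String XSL_OPEN =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";

    // Structure-rich source document (made up for this example).
    static final String NEWS_XML =
        "<News><Item><Title>Jetty</Title>"
        + "<Abstract>Open Source Java Server.</Abstract></Item></News>";

    // Style sheet for desktop browsers.
    static final String TO_HTML = String.join("\n",
        XSL_OPEN,
        "<xsl:output method='html'/>",
        "<xsl:template match='/News'>",
        "  <html><body><xsl:apply-templates select='Item'/></body></html>",
        "</xsl:template>",
        "<xsl:template match='Item'>",
        "  <h2><xsl:value-of select='Title'/></h2>",
        "  <p><xsl:value-of select='Abstract'/></p>",
        "</xsl:template>",
        "</xsl:stylesheet>");

    // Style sheet for WAP phones (no DOCTYPE, to keep the sketch short).
    static final String TO_WML = String.join("\n",
        XSL_OPEN,
        "<xsl:output method='xml' omit-xml-declaration='yes'/>",
        "<xsl:template match='/News'>",
        "  <wml><card id='news' title='News'>",
        "    <xsl:apply-templates select='Item'/>",
        "  </card></wml>",
        "</xsl:template>",
        "<xsl:template match='Item'>",
        "  <p><b><xsl:value-of select='Title'/></b>: "
            + "<xsl:value-of select='Abstract'/></p>",
        "</xsl:template>",
        "</xsl:stylesheet>");

    // Picks the style sheet for the requested format and applies it.
    static String publish(String xml, String format) {
        String styleSheet;
        if ("html".equals(format)) {
            styleSheet = TO_HTML;
        } else if ("wml".equals(format)) {
            styleSheet = TO_WML;
        } else {
            throw new IllegalArgumentException("Unsupported format: " + format);
        }
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(styleSheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(publish(NEWS_XML, "html"));
        System.out.println(publish(NEWS_XML, "wml"));
    }
}
```

The point of the design is that the author's document never changes: supporting a new device means adding one more style sheet and one more branch in publish.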

Extensible Stylesheet Language

To publish XML documents we will use XSL, the Extensible Stylesheet Language. More specifically, we will use XSLT (XSL Transformations), the part of XSL that deals with transforming documents.

XSLT is a scripting language optimized for conversion between XML documents. In that respect it differs from earlier style sheet languages, such as CSS (Cascading Style Sheets), or word processor style sheets.

CSS describes how each element should be presented onscreen: which font, which color, which size, and more.

XSLT transforms the XML document into another XML document. It goes much further than simple presentation instructions. In fact, XSLT can completely reorganize a document and, for example, add a table of contents or delete a section.
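To make the idea of reorganizing a document concrete, here is a small sketch run through the JDK's built-in XSLT processor. Everything in it is an assumption made for illustration: the Reorganizer class, the tiny input document, and the Contents and Entry element names are not part of any standard. The style sheet adds a table of contents built from the titles and deletes the paragraphs from each item.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: an XML-to-XML transformation that restructures its input.
final class Reorganizer {
    // Made-up input: two items, each with a title, an abstract and a paragraph.
    static final String NEWS_XML =
        "<News>"
        + "<Item><Title>Jetty</Title>"
        + "<Abstract>Open Source Java Server.</Abstract>"
        + "<Para>Jetty supports standard Java servlets.</Para></Item>"
        + "<Item><Title>Hypersonic SQL</Title>"
        + "<Abstract>Open Source SQL Database.</Abstract>"
        + "<Para>Hypersonic SQL supports the JDBC API.</Para></Item>"
        + "</News>";

    // Contents and Entry are invented element names for the table of contents.
    static final String ADD_CONTENTS = String.join("\n",
        "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "<xsl:output method='xml' omit-xml-declaration='yes'/>",
        "<xsl:template match='/News'>",
        "  <News>",
        "    <Contents>",
        "      <xsl:for-each select='Item'>",
        "        <Entry><xsl:value-of select='Title'/></Entry>",
        "      </xsl:for-each>",
        "    </Contents>",
        "    <xsl:apply-templates select='Item'/>",
        "  </News>",
        "</xsl:template>",
        "<xsl:template match='Item'>",
        "  <Item><xsl:copy-of select='Title|Abstract'/></Item>",
        "</xsl:template>",
        "</xsl:stylesheet>");

    // Applies the style sheet and returns the reorganized document as a string.
    static String reorganize(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(ADD_CONTENTS)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reorganize(NEWS_XML));
    }
}
```

The second template copies only Title and Abstract, so the Para elements simply never reach the output. A presentation-only language such as CSS has no equivalent for either step.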

How does that help? The trick is to transform from a structure-rich XML document into a format that contains display instructions, such as HTML or WML.

A browser (or another viewer) can render the second document onscreen or on paper. What display format should you use? The following are some popular options:

  • HTML—Strictly speaking, HTML is not an XML vocabulary. This is not an XML-to-XML transformation. However, HTML is so popular, and so close to XML, that the W3C decided to support it.

  • XHTML—The XML version of HTML.

  • WML—The markup language for WAP devices.

  • Open eBook—The format for eBooks, based on HTML.

  • XSLFO—A new display language that is optimized for printed documents. At the time of writing, two XSLFO viewers exist: a browser (www.indelv.com) and a PDF converter (xml.apache.org).

The XSLT standard is available online at www.w3.org/TR/xslt.

XML Vocabulary

XML does not define any vocabulary. It is up to developers to create vocabularies for their applications.

For this application, we have two realistic options. The first option is to use DocBook (www.docbook.org) or another standard SGML/XML vocabulary for documents. DocBook is particularly attractive because it is widely used and well supported.

However, DocBook is so rich that it is too complicated for such a simple project.

The second option, and the one we'll adopt in this chapter, is to create our own vocabulary—one that is simple and limited to only the tags we need.

Listing 4.1 illustrates the vocabulary we'll use in this chapter. As you can see, it is almost trivial: It's just a list of news items.

Listing 4.1 index.xml

<?xml version="1.0"?>
<News>
  <URL>http://localhost:8080/publish/index</URL>
  <Item>
   <Title>Applied XML Solutions</Title>
   <Author>Beno&#238;t Marchal</Author>
   <Abstract>A new intermediate/advanced book for XML
     developers.</Abstract>
   <Para>Learn advanced XML programming with Applied XML
     Solutions. This hands-on teaching book is filled with
     practical examples.</Para>
   <Para>Applied XML Solutions is a great complement to XML by
     Example.</Para>
  </Item>
  <Item>
   <Title>Jetty</Title>
   <Author>Greg Wilkins</Author>
   <Abstract>Open Source Java Server.</Abstract>
   <Para>Jetty is a powerful, open-source Java web server. It
     supports standard Java servlets making it the ideal
     development environment.</Para>
   <Para>Jetty is also highly-configurable which helps custom
     developments.</Para>
  </Item>
  <Item>
   <Title>Hypersonic SQL</Title>
   <Author>Thomas M&#252;ller</Author>
   <Abstract>Open Source SQL Database.</Abstract>
   <Para>Hypersonic SQL is an open source database that
     supports the JDBC API.</Para>
   <Para>Hypersonic SQL is efficient and can run in three
     modes: in-memory, standalone or client/server. This
     provides lots of flexibility when writing
     software.</Para>
  </Item>

</News>

The list starts with a URL that points to the server where the document resides. The W3C suggests using the xml:base attribute for this purpose, but it turns out that Xalan, the XSLT processor I use, has a problem with the xml namespace, so I use a URL element as a workaround:


<URL>http://localhost:8080/publish/index</URL>

Each item has a title, author, abstract, and list of paragraphs:

<Item>
  <Title>Applied XML Solutions</Title>
  <Author>Beno&#238;t Marchal</Author>
  <Abstract>A new intermediate/advanced book for XML
   developers.</Abstract>
  <Para>Learn advanced XML programming with Applied XML
   Solutions. This hands-on teaching book is filled with
   practical examples.</Para>
  <Para>Applied XML Solutions is a great complement to XML by
   Example.</Para>
</Item>

Figure 4.2 illustrates the structure.


Figure 4.2: The document structure in XML.
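Because the structure is this regular, a program can walk it with very little code. The following sketch reads a shortened copy of the listing with the DOM parser bundled in the JDK and summarizes each item. The NewsReader class, its summarize method, and the summary format are assumptions made for the example; they do not appear in the book's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: walking the News / Item structure with DOM.
final class NewsReader {
    // Shortened copy of index.xml; &#238; is the character reference for "î".
    static final String NEWS_XML =
        "<?xml version=\"1.0\"?>"
        + "<News>"
        + "<URL>http://localhost:8080/publish/index</URL>"
        + "<Item><Title>Applied XML Solutions</Title>"
        + "<Author>Beno&#238;t Marchal</Author>"
        + "<Abstract>A new intermediate/advanced book.</Abstract>"
        + "<Para>Learn advanced XML programming.</Para></Item>"
        + "<Item><Title>Jetty</Title>"
        + "<Author>Greg Wilkins</Author>"
        + "<Abstract>Open Source Java Server.</Abstract>"
        + "<Para>Jetty is a powerful Java web server.</Para>"
        + "<Para>Jetty is also highly configurable.</Para></Item>"
        + "</News>";

    // Text of the first descendant element with the given tag name.
    private static String firstText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    // Returns one line per Item: title, author and number of paragraphs.
    static List<String> summarize(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("Item");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                int paras = item.getElementsByTagName("Para").getLength();
                lines.add(firstText(item, "Title") + " by "
                    + firstText(item, "Author") + ", Para count: " + paras);
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the document", e);
        }
    }

    public static void main(String[] args) {
        for (String line : summarize(NEWS_XML)) {
            System.out.println(line);
        }
    }
}
```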

How can you develop such a format? When should you use existing formats (such as DocBook) rather than develop your own? Unfortunately, there are no hard rules that you can follow to guarantee success.

As you develop your XML vocabulary, remember that a good vocabulary achieves a reasonable compromise between two opposite goals: On the one hand, it must mark up as much information as possible; on the other hand, it must be simple.

It is important to mark up as much data as is realistically possible because the markup drives the transformation to HTML, WML, and others. If something has not been marked up, transforming it will be difficult (or outright impossible).
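A short experiment shows why. XSLT selects content with XPath expressions, and an XPath expression can only find what has been tagged. In the sketch below, the first item marks up its author while the second, deliberately altered for this illustration, buries the name in a paragraph. The MarkupQuery class and its select method are hypothetical; the XPath API (javax.xml.xpath) is part of the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: only marked-up information can be selected.
final class MarkupQuery {
    // The second item is altered on purpose: its author is plain prose.
    static final String NEWS_XML =
        "<News>"
        + "<Item><Title>Jetty</Title><Author>Greg Wilkins</Author>"
        + "<Para>Jetty is an open-source Java web server.</Para></Item>"
        + "<Item><Title>Hypersonic SQL</Title>"
        + "<Para>An open source database by Thomas M&#252;ller.</Para></Item>"
        + "</News>";

    // Evaluates an XPath expression and returns the text of each matching node.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Finds the one author that is tagged; the other is invisible to XPath.
        System.out.println(select(NEWS_XML, "/News/Item/Author"));
    }
}
```

The query finds only the author that sits in its own element. Recovering the second name would mean parsing English sentences, which is exactly the kind of work the markup is supposed to spare us.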

Yet, as you define the vocabulary, be realistic. If you provide too many tags and too many options, you will confuse authors. This is particularly true if authors don't use the format regularly.

A format that is too complex can be dangerous because it gives the false impression that we're creating quality documents, whereas, in fact, authors usually ignore most of the markup. I am sure you have already encountered a database with a complex table organization. In most cases, developers have misused it and retrieving useful information is difficult. The same could happen with a markup vocabulary that is too complex.

Tip - Consider using an XML editor, as introduced in the previous chapter, to guide authors.

Copyright Sams Publishing. All rights reserved.

Part 2
