Web Publications

1. Introduction

1.1 Background

This section is non-normative.

For millennia now, the written word has been the primary means of encoding and sharing ideas and information. The publication as a bounded edition, made public, has been used to carry intellectual and artistic works of innumerable forms: novels, plays, poetry, journals, magazines, newspapers, articles, laws, treatises, pamphlets, atlases, comics, manga, notebooks, memos, manuals, and albums of all sorts.

More recently, with the advent of the information age, print has been ceding ground to digital, and the Web has become a major forum for the public dissemination of ideas. But the Web is unbounded: information and resources are only loosely connected through hyperlinks. While this model has helped the Web thrive in many areas, it has proven problematic for traditional information publishing—users often cannot access works in their entirety, especially when offline, and have not been able to easily access, compile and download content for curation and personal use. That, in turn, has fed the continuing development of non-Web document formats to redress these problems, and made it necessary to create both Web-ready content and alternative offline renditions to ensure publications are fully available.

This specification aims to reduce these barriers and reinvigorate publishing by combining the best aspects of both models—the persistent availability and portability of bounded publications with the pervasive accessibility, addressability, and interconnectedness of the Open Web Platform.

1.2 Scope

This section is non-normative.

This specification only defines requirements for the production and rendering of valid Web Publications. As much as possible, it leverages existing Open Web Platform technologies to achieve its goal—that being to allow for a measure of boundedness on the Web without changing the way that the Web itself operates.

Moreover, the specification is designed to adapt automatically to updates to Open Web Platform technologies in order to ensure that Web Publications continue to interoperate seamlessly as the Web evolves (e.g., by referencing the latest published versions instead of specific versions).

Further, this specification does not attempt to constrain the nature of a Web Publication: any type of work that can be represented on the Web constitutes a potential Web Publication.

1.3 Terminology

Wherever appropriate, this document relies on terminology defined by the note on "Publishing and Linking on the Web" [publishing-linking], including, in particular, user, user agent, browser, and address.

Identifier: An identifier is metadata that can be used to refer to Web content in a persistent and unambiguous manner. URLs, URNs, DOIs, ISBNs, and PURLs are all examples of persistent identifiers frequently used in publishing.
Manifest: A manifest represents structured information about a Web Publication, such as informative metadata, a list of all resources, and a default reading order.
Non-empty: For the purposes of this specification, non-empty is used to refer to an element, attribute or property whose text content or value consists of one or more characters after whitespace normalization, where whitespace normalization rules are defined per the host format.
URL: In this specification, the general term URL is used as in other W3C specifications like HTML [html], and is defined by the URL Standard of the WHATWG [url]. In particular, such a URL allows for the usage of characters from Unicode following [rfc3987]. See the note in the HTML5 document for further details.
Web Publication: A Web Publication is a collection of one or more resources, organized together through a manifest into a single logical work with a default reading order. The Web Publication is uniquely identifiable and presentable using Open Web Platform technologies.
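
The following non-normative sketch illustrates the "non-empty" test defined above. Whitespace normalization is defined per host format, so the rule used here (collapse runs of whitespace and trim both ends) is only an assumption made for the example, as is the function name is_non_empty.

```python
def is_non_empty(value):
    """Return True if value has one or more characters after normalization."""
    if value is None:
        return False
    # Assumed normalization rule: collapse whitespace runs, trim both ends.
    # A real host format may define different rules.
    normalized = " ".join(value.split())
    return len(normalized) > 0
```

Under this assumption, a value consisting only of spaces, tabs, or line breaks is treated as empty.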

3. Information Set

Editor's note

The name "infoset" may change depending on feedback. Although this term has a different meaning for individuals familiar with XML, alternatives such as "properties" and "metadata" do not fully capture the nature or purpose.

3.1 Overview

This section is non-normative.

A Web Publication is defined by a set of properties and features known as its information set (infoset). The infoset is both abstract and concrete. It is abstract in the sense that it represents a set of information that a user agent has to be able to compile about the Web Publication, but it also becomes concrete when the user agent creates an internal representation of the information.

The infoset does not require a specific serialization. It is primarily compiled from a Web Publication's manifest, whose serialization requirements are defined in 4.4 Serialization. It is therefore possible to express the same infoset via different manifests, although a Web Publication will only have one manifest.

Although the manifest is the primary source of the infoset, some information may be obtained independent of it. For example, fallback rules for properties defined in the following subsections allow a user agent to compile information that the author has not provided in the manifest, whether as an intentional optimization or by accidental omission.

3.2 Requirements

The Web Publication infoset MUST include the following information:

The address of the Web Publication.
The Web Publication resources.
The default reading order of the Web Publication.

In addition, the infoset SHOULD include the following information:

The title (or "name") of the Web Publication.
The default (natural) language of the Web Publication.
A canonical identifier.
The table of contents.
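
As a non-normative illustration, a user agent's internal representation of the infoset might resemble the sketch below. The class name Infoset, its field names, and the helper missing_items are hypothetical; this specification does not define a concrete data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Infoset:
    # Items the infoset MUST include.
    address: str
    resources: List[str]
    reading_order: List[str]
    # Items the infoset SHOULD include.
    title: Optional[str] = None
    language: Optional[str] = None
    canonical_identifier: Optional[str] = None
    table_of_contents: Optional[List[str]] = None


REQUIRED = ("address", "resources", "reading_order")
RECOMMENDED = ("title", "language", "canonical_identifier", "table_of_contents")


def missing_items(infoset):
    """Return (required, recommended) names whose values are absent or empty."""
    required = [name for name in REQUIRED if not getattr(infoset, name)]
    recommended = [name for name in RECOMMENDED if not getattr(infoset, name)]
    return required, recommended
```

A user agent could treat a non-empty first list as an error and a non-empty second list as a cue to apply the fallback rules described in the following subsections.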

Editor's note

These requirements reflect the current minimum consensus, though a number of issues remain open that could change whether an item is required or recommended. See the following sections for more information.

Issue 15

Ignoring issues such as location, serialization, etc., what is the minimum viable manifest? (Note: this is now specifically related to the infoset.)

Issue 21: manifest: metadata

Whether the minimum manifest must include any metadata, or a specific slot to handle metadata. (Note: this is now more specifically related to the infoset.)

3.3 Title

The title provides the human-readable name of the Web Publication.

When specified in the manifest, the title MUST be non-empty.

If a user agent requires a title and one is not available in the infoset, it MAY create one. This specification does not mandate how such a title is created. The user agent might:

use the first non-empty title element found in the default reading order;
provide a language-specific placeholder title (e.g., 'Untitled Publication');
use the URL of the manifest;
calculate a title using its own algorithm.
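
The sketch below is non-normative. It shows one way a user agent might combine the strategies listed above. The order in which the strategies are tried, the function name resolve_title, and its parameters are assumptions of the example, not requirements of this specification.

```python
def resolve_title(infoset_title, reading_order_titles, manifest_url,
                  placeholder="Untitled Publication"):
    """Pick a title for display, falling back when the infoset has none."""
    # 1. Prefer the infoset title, then the first non-empty title element
    #    found in the default reading order (assumed priority).
    for candidate in [infoset_title, *reading_order_titles]:
        if candidate is not None and candidate.strip():
            # Collapse internal whitespace (assumed normalization rule).
            return " ".join(candidate.split())
    # 2. Otherwise use the URL of the manifest, if known.
    if manifest_url:
        return manifest_url
    # 3. Finally, use a placeholder; a real user agent would localize it.
    return placeholder
```

For example, resolve_title(None, ["  ", "Chapter One"], "https://example.org/manifest.json") returns "Chapter One".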

Note

A user agent is not expected to produce a meaningful title [WCAG20] for a Web Publication when one is not specified.

Issue 20

(See also issue #24.) The question is whether the manifest MUST include a title or not.

3.4 Language

The language specified in the Web Publication's infoset identifies the natural language(s) of its content.

This language is not used in the processing or rendering of the Web Publication (including the manifest), and is not a replacement for identifying the language of each resource as defined by its format. It instead allows a user agent to provide supplementary enhancements, such as the ability to download a custom dictionary or to preload a language-specific text-to-speech module.

When specified, the language MUST be a tag that conforms to [BCP47].

If a user agent requires the language and one is not available in the infoset, it MAY attempt to determine the language. This specification does not mandate how such a language tag is created. The user agent might:

use the non-empty language declaration of the manifest;
use the first non-empty language declaration found in the default reading order;
calculate the language using its own algorithm.

If a language tag cannot be determined, the value "und" (undetermined) MUST be used.
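
The following non-normative sketch shows how the fallback options and the "und" default might fit together. The function name resolve_language, its parameters, and the order of the fallbacks are assumptions of the example. The sketch does not check that a tag conforms to [BCP47]; a real user agent would need to do so.

```python
def resolve_language(infoset_language, manifest_language=None,
                     reading_order_languages=()):
    """Return a language tag for the publication, or "und" if none is found."""
    # Assumed priority: infoset value, then the manifest's own language
    # declaration, then declarations found in the default reading order.
    candidates = (infoset_language, manifest_language, *reading_order_languages)
    for tag in candidates:
        if tag is not None and tag.strip():
            return tag.strip()
    # No language could be determined.
    return "und"
```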

Issue 29

The question is whether the manifest MUST include the language(s) of the content or not.

3.5 Canonical Identifier

A Web Publication's canonical identifier is a unique identifier that resolves to the preferred version of the Web Publication. The canonical identifier SHOULD be an address, but, if not, it MUST be possible to make a one-to-one mapping to an address (e.g., a DOI can be resolved to a URL via a DOI resolver).

If a Web Publication is hosted at more than one address, this identifier allows a user agent to identify the shared relationship between the versions and to determine which of the available addresses is primary.

The canonical identifier is also intended to provide a measure of permanence above and beyond the Web Publication's address. Even if a Web Publication is permanently relocated to a new address, for example, the canonical identifier will provide a way of finding the new location (e.g., a DOI registry could be updated with the new URL, or a redirect could be added to the URL of the canonical identifier).
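
As a non-normative illustration of the one-to-one mapping from an identifier to an address, the sketch below handles two cases: an identifier that is already an HTTP(S) URL, and a DOI written with a "doi:" prefix, which is mapped to the public resolver at https://doi.org/. The function name, the prefix convention, and the DOI used in the usage note are assumptions of the example. Other identifier schemes would need their own resolvers.

```python
from urllib.parse import quote


def canonical_identifier_to_address(identifier):
    """Map a canonical identifier to an address (URL), where possible."""
    lowered = identifier.lower()
    if lowered.startswith(("http://", "https://")):
        # The identifier is already an address.
        return identifier
    if lowered.startswith("doi:"):
        # Assumed convention: "doi:<name>" resolves via the doi.org proxy.
        name = identifier[4:].strip()
        return "https://doi.org/" + quote(name, safe="/")
    raise ValueError("no known mapping to an address for: " + identifier)
```

With this sketch, the hypothetical identifier "doi:10.1234/example" maps to "https://doi.org/10.1234/example".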

When assigned, the canonical identifier needs to be unique to one and only one Web Publication, independent of its address(es). Ensuring uniqueness is outside the scope of this specification, however. The actual uniqueness achievable depends on such factors as the conventions of the identifier scheme used and the degree of control over assignment of identifiers.

Note

If the canonical identifier is a URL, it can be used as the target of a "canonical" link [rfc6596] (e.g., a [html] link element whose rel attribute has the value canonical or a Link HTTP header field [rfc5988] similarly identified).

Issue 56: The canonical-ness of identification needs clarification

The question is whether a canonical identifier is necessary to call out explicitly in the infoset, or whether it is/can be handled by other metadata.

3.6 Address

A Web Publication's address is a URL that refers to a Web Publication and enables the retrieval of a representation of the manifest of the Web Publication.

The availability of this address does not preclude the creation and use of other identifiers and/or addresses to retrieve a representation of a Web Publication in whole or part.

Note

The Web Publication's address can also be used as the value for an identifier link relation [link-relation].

3.7 Resources

The infoset MUST include a list of the Web Publication's resources, although the list is not required to be exhaustive. Resources in the default reading order MUST be included in this list.
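
The requirement that every resource in the default reading order appears in the resource list can be checked mechanically. The non-normative sketch below assumes both lists hold absolute URLs that have already been normalized for comparison; the function name is hypothetical.

```python
def missing_from_resources(resources, reading_order):
    """Return reading-order URLs that are absent from the resource list."""
    listed = set(resources)
    # Preserve reading order in the result so errors are easy to report.
    return [url for url in reading_order if url not in listed]
```

An empty result means the requirement is satisfied. The resource list may still contain additional entries, since it is not required to be exhaustive.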

Issue 22: manifest: requirements for offline

The discussion led to the question of whether the manifest/infoset MUST list all resources or not. In this sense, this became a duplicate of issue #23, which ended up at the same question.

Issue 23: MUST the manifest include information about secondary resources or not?

The question is whether the manifest/infoset MUST list all resources or not.

3.8 Default Reading Order

The default reading order is a specific progression through a set of Web Publication resources.

A user might follow alternative pathways through the content, but in the absence of such interaction the default reading order defines the expected progression from one resource to the next.

The default reading order MUST include at least one resource.

The default reading order is either specified directly in the manifest or a link is provided to an [html] nav element whose list of links is processed to create one.

The process for extracting a default reading order from a nav element is as follows:

extract a list of resource paths referenced from the href attribute of all a elements;
strip any fragment identifiers from the references;
resolve all relative paths to full URLs;
remove all consecutive references to the same resource, leaving only the first.
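
The steps above can be sketched as follows. This non-normative example uses Python's standard html.parser and urllib.parse modules. The names reading_order_from_nav and _NavLinkCollector are hypothetical, and the sketch assumes that links from every nav element in the document are collected, whereas a user agent would normally process only the nav element the manifest links to.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _NavLinkCollector(HTMLParser):
    """Collect href values of a elements nested inside nav elements."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def reading_order_from_nav(html_text, base_url):
    """Build a default reading order from the links in a nav element."""
    collector = _NavLinkCollector()
    collector.feed(html_text)  # step 1: extract the href references
    collector.close()
    order = []
    for href in collector.hrefs:
        without_fragment, _ = urldefrag(href)           # step 2
        absolute = urljoin(base_url, without_fragment)  # step 3
        if not order or order[-1] != absolute:          # step 4
            order.append(absolute)
    return order
```

Only consecutive repeats are removed, so a resource that is linked again later in the nav element appears in the result a second time.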

If a user agent requires a default reading order and one is not provided in the infoset, it MAY attempt to construct one. This specification does not mandate how such a default reading order is created. The user agent might:

use only the resource the user accessed to reach the manifest;
search the list of resources for a nav element to use;
calculate the default reading order using its own algorithm.

Issue 26

Issue 35: Proposal: an HTML-first Table of Contents approach to Web Publication

Define the default reading order of a Web Publication to be the files referenced in the first

Issue 36

Issue 39: Do all documents in the reading order have to be reachable from the ToC

There is a consensus that a Web Publication must have a reading order and must/should have a table of contents (the main navigation entry point).

3.9 Table of Contents

The table of contents provides access to major sections of the Web Publication. There are no requirements on the completeness of the table of contents, except that, when specified, it MUST link to at least one resource in the default reading order.
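
The single requirement on the table of contents can be checked as in the non-normative sketch below. It assumes that the links and the reading order are absolute URLs, and that a link into a section of a resource (a URL with a fragment identifier) counts as a link to that resource. The function name is hypothetical.

```python
from urllib.parse import urldefrag


def toc_links_into_reading_order(toc_links, reading_order):
    """Return True if any table of contents link targets the reading order."""
    in_order = set(reading_order)
    # Assumption: fragments are ignored when matching links to resources.
    return any(urldefrag(link)[0] in in_order for link in toc_links)
```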

The table of contents is either specified directly in the manifest or a link is provided to an [html] nav element containing one.

If a user agent requires a table of contents and one is not specified, it MAY construct one. This specification does not mandate how such a table of contents is created. The user agent might:

attempt to locate a table of contents in the default reading order (e.g., an HTML document with a nav element that has the role attribute value doc-toc);
use the titles of resources in the default reading order;
calculate a table of contents using its own algorithms.

Issue

This question arises only if this mechanism is accepted: the question is whether a table of contents navigation element can refer, via links, to any resource that is not listed in the default reading order.

Editor's note

The issue of using the HTML nav element as a possible encoding of the table of contents is mentioned or explicitly addressed in a number of issues listed below.

Issue 26

Issue 35: Proposal: an HTML-first Table of Contents approach to Web Publication

Define the resources in the default reading order of a Web Publication to be the files referenced in the first

Issue 36

Issue 39: Do all documents in the reading order have to be reachable from the ToC

There is a consensus that a Web Publication must have a reading order and must/should have a table of contents (the main navigation entry point).

A. Acknowledgements

This section is non-normative.

The following people contributed to the development of this specification:

Greg Albers (J. Paul Getty Trust)
Boris Anthony (The Rebus Foundation)
Christopher Auclair (VitalSource | Ingram Content Group)
Luc Audrain (Hachette Livre)
Baldur Bjarnason (The Rebus Foundation)
Nick Brown (VitalSource | Ingram Content Group)
Fred Chasen (Invited Experts without Member Access)
Timothy Cole (University of Illinois at Urbana-Champaign)
Jason Colman (University of Michigan Library)
Rachel Comerford (Macmillan Higher Education)
Garth Conboy (Google, Inc.)
Dave Cramer (Hachette Livre)
Romain Deltour (DAISY Consortium)
Marisa DeMeglio (DAISY Consortium)
Vagner Diniz (NIC.br - Brazilian Network Information Center)
Brady Duga (Google, Inc.)
Ben Dugas (Rakuten, Inc.)
Roger Espinosa (University of Michigan Library)
Reinaldo Ferraz (NIC.br - Brazilian Network Information Center)
Heather Flanagan (Invited Experts without Member Access)
Jun Gamo (Voyager Japan, Inc.)
Hadrien Gardeur (Feedbooks)
Matt Garrish (DAISY Consortium)
Harriett Green (University of Illinois at Urbana-Champaign)
Markku Hakkinen (Educational Testing Service)
Katie Haritos-Shea (Knowbility, Inc)
Ivan Herman (W3C Staff)
Leslie Hulse (HarperCollins Publishers)
Rick Johnson (VitalSource | Ingram Content Group)
Deborah Kaplan (Invited Experts without Member Access)
Bill Kasdorf (Book Industry Study Group)
George Kerscher (DAISY Consortium)
Yuri Khramov (Evident Point Software Corp.)
Toshiaki Koike (Voyager Japan, Inc.)
Peter Krautzberger (krautzource UG)
Matt Kuznicki (Datalogics, Inc.)
Charles LaPierre (Benetech)
Laurent Le Meur (EDRLab)
Vladimir Levantovsky (Monotype)
Mia Lipner (Pearson plc)
Edwina Lui (Kaplan Publishing)
Phil Madans (Hachette Livre)
Christopher Maden (University of Illinois at Urbana-Champaign)
Jia Ma (Kaplan Publishing)
Bill McCoy (W3C Staff)
Jonathan McGlone (University of Michigan Library)
Hugh McGuire (The Rebus Foundation)
Maureen McMahon (Kaplan Publishing)
Selma Morais (NIC.br - Brazilian Network Information Center)
Shinyu Murakami (Vivliostyle Inc.)
Makoto Murata (Vivliostyle Inc.)
Cristina Mussinelli (Associazione Italiana Editori)
Chris Powell (University of Michigan Library)
Jeff Printy (Macmillan Higher Education)
Ryan Pugatch (Hachette Livre)
Leonard Rosenthol (Adobe Systems Inc.)
Nicholas Ruffilo (VitalSource | Ingram Content Group)
Robert Sanderson (J. Paul Getty Trust)
Wolfgang Schindler (PONS GmbH)
Jodi Schneider (University of Illinois at Urbana-Champaign)
Ben Schroeter (Pearson plc)
Tzviya Siegman (Wiley)
Avneesh Singh (DAISY Consortium)
Susanna Skinner (HarperCollins Publishers)
David Stroup (Pearson plc)
Mateus Teixeira (W. W. Norton)
Jonathan Thurston (Pearson plc)
Daniel Weck (DAISY Consortium)
John Weise (University of Michigan Library)
Jason White (Educational Testing Service)
David Wood (Ephox Corporation)
Richard Wright (EDRLab)
Evan Yamanishi (W. W. Norton)
Maurice York (University of Michigan Library)
Benjamin Young (Wiley)

The Working Group would also like to thank the members of the Digital Publishing Interest Group for all the hard work they did paving the road for this specification.
