Copyright © 2017 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This specification defines a collection of information that describes the structure of Web Publications, so that user agents or developers may create user experiences well-suited to reading publications, such as sequential navigation and offline reading. This information includes the default reading order, a list of resources, and publication-wide metadata.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This first public working draft provides a preliminary outline of a Web Publication. Many details are under active consideration within the Publishing Working Group and are subject to change. The most prominent known issues have been identified in this document and links provided to comment on them.
In particular, the Working Group seeks feedback on the following issues:
This document was published by the Publishing Working Group as an Editor's Draft. Comments regarding this document are welcome. Please send them to public-publ-wg@w3.org (subscribe, archives).
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 March 2017 W3C Process Document.
This section is non-normative.
For millennia now, the written word has been the primary means of encoding and sharing ideas and information. The publication as a bounded edition, made public, has been used to carry intellectual and artistic works of innumerable forms: novels, plays, poetry, journals, magazines, newspapers, articles, laws, treatises, pamphlets, atlases, comics, manga, notebooks, memos, manuals, and albums of all sorts.
More recently, with the advent of the information age, print has been ceding ground to digital, and the Web has become a major forum for the public dissemination of ideas. But the Web is unbounded: information and resources are only loosely connected through hyperlinks. While this model has helped the Web thrive in many areas, it has proven problematic for traditional information publishing—users often cannot access works in their entirety, especially when offline, and have not been able to easily access, compile and download content for curation and personal use. That, in turn, has fed the continuing development of non-Web document formats to redress these problems, and made it necessary to create both Web-ready content and alternative offline renditions to ensure publications are fully available.
This specification aims to reduce these barriers and reinvigorate publishing by combining the best aspects of both models—the persistent availability and portability of bounded publications with the pervasive accessibility, addressability, and interconnectedness of the Open Web Platform.
This section is non-normative.
This specification only defines requirements for the production and rendering of valid Web Publications. As much as possible, it leverages existing Open Web Platform technologies to achieve its goal—that being to allow for a measure of boundedness on the Web without changing the way that the Web itself operates.
Moreover, the specification is designed to adapt automatically to updates to Open Web Platform technologies in order to ensure that Web Publications continue to interoperate seamlessly as the Web evolves (e.g., by referencing the latest published versions instead of specific versions).
Further, this specification does not attempt to constrain the nature of a Web Publication: any type of work that can be represented on the Web constitutes a potential Web Publication.
Wherever appropriate, this document relies on terminology defined by the note on "Publishing and Linking on the Web" [publishing-linking], including, in particular, user, user agent, browser, and address.
An identifier is metadata that can be used to refer to Web Content in a persistent and unambiguous manner. URLs, URNs, DOIs, ISBNs, or PURLs are all examples of persistent identifiers frequently used in publishing.
A manifest represents structured information about a Web Publication, such as informative metadata, a list of all resources, and a default reading order.
For the purposes of this specification, non-empty is used to refer to an element, attribute or property whose text content or value consists of one or more characters after whitespace normalization, where whitespace normalization rules are defined per the host format.
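As an illustration only, the non-empty test can be sketched as follows. Because whitespace normalization is defined per host format, the rule used here (collapse runs of whitespace and trim both ends) is an assumption, and the function name is hypothetical.

```python
from typing import Optional


def is_non_empty(value: Optional[str]) -> bool:
    """Return True if the value has at least one character after normalization."""
    if value is None:
        return False
    # Assumption: normalization collapses whitespace runs and trims both ends.
    # The actual rules depend on the host format.
    normalized = " ".join(value.split())
    return len(normalized) > 0
```

A value consisting only of spaces, tabs, or line breaks is therefore treated as empty under this assumed rule.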
In this specification, the general term URL is used as in other W3C specifications like HTML [html], and is defined by the URL Standard of the WHATWG [url]. In particular, such a URL allows for the usage of characters from Unicode following [rfc3987]. See the note in the HTML5 document for further details.
A Web Publication is a collection of one or more resources, organized together through a manifest into a single logical work with a default reading order. The Web Publication is uniquely identifiable and presentable using Open Web Platform technologies.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].
This specification defines two conformance classes: one for Web Publications and one for user agents that process them.
A Web Publication is conformant to this specification if it meets the following criteria:
A user agent is conformant to this specification if it meets the following criteria:
The name "infoset" may change depending on feedback. Although this term has a different meaning for individuals familiar with XML, alternatives such as "properties" and "metadata" do not fully capture the nature or purpose. See issue #63 for discussion.
A Web Publication is defined by a set of properties known as its information set (infoset). The infoset is both abstract and concrete. It is abstract in the sense that it represents a set of information that a user agent has to be able to compile about the Web Publication, but it also becomes concrete when the user agent creates an internal representation of the information.
The infoset does not require a specific serialization. It is primarily compiled from a Web Publication's manifest, whose serialization requirements are defined in 4.4 Serialization. It is therefore possible to express the same infoset via different manifests, although a Web Publication will only have one manifest.
Although the manifest is the primary source of the infoset, some information may be obtained independent of it. For example, fallback rules for properties defined in the following subsections allow a user agent to compile information that the author has not provided in the manifest, whether as an intentional optimization or by accidental omission.
The Web Publication infoset MUST include the following information:
In addition, the infoset SHOULD include the following information:
These requirements reflect the current minimum consensus, though a number of issues remain open that could change whether an item is required or recommended. See the following sections for more information.
Whether the minimum manifest must include any metadata, or a specific slot to handle metadata. (Note: this is now more specifically related to the infoset.)
The title provides the human-readable name of the Web Publication.
When specified in the manifest, the title MUST be non-empty.
If a user agent requires a title and one is not available in the infoset, it MAY create one. This specification does not mandate how such a title is created. The user agent might:
title element found in the default reading order;
A user agent is not expected to produce a meaningful title [WCAG20] for a Web Publication when one is not specified.
The language specified in the Web Publication's infoset identifies the natural language(s) of its content.
This language is not used in the processing or rendering of the Web Publication (including the manifest), and is not a replacement for identifying the language of each resource as defined by its format. It instead allows a user agent to provide supplementary enhancements, such as downloading a custom dictionary or preloading a language-specific text-to-speech module.
When specified, the language MUST be a tag that conforms to [BCP47].
If a user agent requires the language and one is not available in the infoset, it MAY attempt to determine the language. This specification does not mandate how such a language tag is created. The user agent might:
If a language tag cannot be determined, the value "und" (undetermined) MUST be used.
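The fallback to "und" might be sketched as follows. This is not a conformance check for [BCP47]: the regular expression only tests the rough shape of a tag (it rejects, for example, private-use tags beginning with "x-"), and the function name and the optional guessed value are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: a deliberately loose shape test, not full BCP 47 validation.
_TAG_SHAPE = re.compile(r"^[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*$")


def resolve_language(infoset_language: Optional[str],
                     guessed_language: Optional[str] = None) -> str:
    """Prefer the infoset language, then a user agent guess, then "und"."""
    for candidate in (infoset_language, guessed_language):
        if candidate and _TAG_SHAPE.match(candidate.strip()):
            return candidate.strip()
    # No usable tag could be determined.
    return "und"
```

How a user agent arrives at the guessed value is not mandated by this specification.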
The question is whether the language declared for the manifest content is the same as the language of the publication, and how to deal with multilingual publications.
A Web Publication's canonical identifier is a unique identifier that resolves to the preferred version of the Web Publication. The canonical identifier SHOULD be an address, but, if not, it MUST be possible to make a one-to-one mapping to an address (e.g., a DOI can be resolved to a URL via a DOI resolver).
If a Web Publication is hosted at more than one address, this identifier allows a user agent to identify the shared relationship between the versions and to determine which of the available addresses is primary.
The canonical identifier is also intended to provide a measure of permanence above and beyond the Web Publication's address. Even if a Web Publication is permanently relocated to a new address, for example, the canonical identifier will provide a way of locating the new location (e.g., a DOI registry could be updated with the new URL, or a redirect could be added to the URL of the canonical identifier).
When assigned, the canonical identifier needs to be unique to one and only one Web Publication, independent of its address(es). Ensuring uniqueness is outside the scope of this specification, however. The actual uniqueness achievable depends on such factors as the conventions of the identifier scheme used and the degree of control over assignment of identifiers.
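As a rough sketch of the one-to-one mapping from a canonical identifier to an address, the following handles only two cases: identifiers that are already HTTP(S) URLs, and DOIs resolved through the public doi.org resolver. The function name and the "doi:" prefix handling are assumptions; other identifier schemes would need their own resolvers.

```python
from typing import Optional
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def canonical_identifier_to_address(identifier: str) -> Optional[str]:
    """Map a canonical identifier to an address, or None if no mapping is known."""
    identifier = identifier.strip()
    if identifier.lower().startswith(("http://", "https://")):
        # The identifier is already an address.
        return identifier
    if identifier.lower().startswith("doi:"):
        # Assumption: accept an optional "doi:" prefix before the DOI name.
        identifier = identifier[len("doi:"):]
    if identifier.startswith("10."):
        # DOI names start with the "10." directory indicator.
        return DOI_RESOLVER + quote(identifier, safe="/")
    return None
```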
If the canonical identifier is a URL, it can be used as the target of a "canonical" link [rfc6596] (e.g., an [html] link element whose rel attribute has the value canonical or a Link HTTP header field [rfc5988] similarly identified).
The question is whether a canonical identifier is necessary to call out explicitly in the infoset, or whether it is/can be handled by other metadata.
A Web Publication's address is a URL that refers to a Web Publication and enables the retrieval of a representation of the manifest of the Web Publication.
The availability of this address does not preclude the creation and use of other identifiers and/or addresses to retrieve a representation of a Web Publication in whole or part.
The infoset MUST include a list of the Web Publication's resources, although the list is not required to be exhaustive. Resources in the default reading order MUST be included in this list.
The discussion led to the question of whether the manifest/infoset MUST list all resources or not. In this sense, this became a duplicate of issue #23, which ended up at the same question.
The question is whether the manifest/infoset MUST list all resources or not.
The question is whether the manifest MUST list resources in the default reading order or whether this can be inferred.
The default reading order is a specific progression through a set of Web Publication resources.
A user might follow alternative pathways through the content, but in the absence of such interaction the default reading order defines the expected progression from one resource to the next.
The default reading order MUST include at least one resource.
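The two constraints stated so far (the default reading order includes at least one resource, and every resource in it appears in the resource list) could be checked along these lines. The function name, the plain lists of URL strings, and the wording of the messages are assumptions, since no serialization is mandated here.

```python
from typing import List


def check_reading_order(resources: List[str], reading_order: List[str]) -> List[str]:
    """Return a list of problems; an empty list means both constraints hold."""
    problems = []
    if not reading_order:
        problems.append("default reading order must include at least one resource")
    listed = set(resources)
    for url in reading_order:
        if url not in listed:
            problems.append("reading order resource missing from resource list: " + url)
    return problems
```

The resource list may contain entries that are not in the default reading order, so the check runs in one direction only.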
The default reading order is either specified directly in the manifest or a link is provided to an [html] nav element whose list of links is processed to create one.
The process for extracting a default reading order from a nav element is as follows:
href attribute of all a elements;
If a user agent requires a default reading order and one is not provided in the infoset, it MAY attempt to construct one. This specification does not mandate how such a default reading order is created. The user agent might:
nav element to use;
Define the default reading order of a Web Publication to be the files referenced in the first
There is a consensus that a Web Publication must have a reading order and must/should have a table of contents (the main navigation entry point).
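The nav-based approach described in this section, which collects the href attribute of a elements inside a nav element, might be sketched as follows. The class and function names are hypothetical, and two behaviours are assumptions rather than requirements of this draft: relative references are resolved against a base URL supplied by the caller, and links from every nav element in the parsed document are collected.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class NavLinkCollector(HTMLParser):
    """Collect href values of <a> elements that appear inside <nav>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.nav_depth = 0
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            href = dict(attrs).get("href")
            if href:
                # Assumption: resolve relative references against the base URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def reading_order_from_nav(html_text: str, base_url: str) -> List[str]:
    """Return the linked URLs in document order."""
    collector = NavLinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

Links outside any nav element are ignored, and document order is preserved so that the result can serve as a progression from one resource to the next.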
The table of contents provides access to major sections of the Web Publication. There are no requirements on the completeness of the table of contents, except that, when specified, it MUST link to at least one resource in the default reading order.
The table of contents is either specified directly in the manifest or a link is provided to an [html] nav element containing one.
If a user agent requires a table of contents and one is not specified, it MAY construct one. This specification does not mandate how such a table of contents is created. The user agent might:
nav element that has the role attribute value doc-toc);
This question arises only if this mechanism is accepted: the question is whether a table of contents navigation element can refer, via links, to any resource that is not listed in the default reading order.
The issue of using the HTML nav element as a possible encoding of the table of contents is mentioned or explicitly addressed in a number of issues listed below.
Define the resources in the default reading order of a Web Publication to be the files referenced in the first
There is a consensus that a Web Publication must have a reading order and must/should have a table of contents (the main navigation entry point).
In addition to the stated requirements, the Web Publication infoset SHOULD also provide the following publication metadata:
Identifiers and modification dates are an important aid in the distribution and storage of portable formats (past, present, and future), so authors are urged to include this information whenever they can.
If we are going to have a modification date, especially one that is a "SHOULD", we need to be extremely clear what it means.
The infoset MAY also provide:
These items apply to the Web Publication in its entirety. Metadata for individual content documents should be embedded in or linked from the content documents themselves.
The Web Publication Infoset SHOULD NOT include complex, specialised, and industry-specific metadata and authors should limit the metadata included in the manifest to the items above.
If the information set does not cover the publication's needs, authors should link to external metadata files in whichever formats and schema are most commonly accepted as the authoritative metadata format for their intended audiences. They can include multiple metadata links in multiple formats, one for each intended audience, if needed.
User Agents MAY decide to support metadata extensions, other metadata schemes, or additional metadata properties if they so choose. User Agents must ignore metadata properties that they do not recognise.
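The rule that unrecognised metadata properties are ignored could look like the following in a user agent. The property names in the recognised set are purely hypothetical, as this draft does not yet fix a vocabulary.

```python
from typing import Any, Dict, Set

# Hypothetical property names, used for illustration only.
RECOGNISED_PROPERTIES: Set[str] = {"title", "language", "identifier", "modified"}


def filter_metadata(metadata: Dict[str, Any],
                    recognised: Set[str] = RECOGNISED_PROPERTIES) -> Dict[str, Any]:
    """Keep recognised properties and silently ignore everything else."""
    return {name: value for name, value in metadata.items() if name in recognised}
```

A user agent that supports metadata extensions would pass a larger recognised set rather than change the filtering rule.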
This section is non-normative.
A manifest is a specific serialization of a Web Publication's infoset.
The requirements for a conforming Web Publication manifest are as follows:
Should the manifest be in an external file, embedded in a specified manner, or should either option be allowed?
Should the table of contents be a separate HTML file or is the listing of resources in the default reading order an implicit table of contents?
In case the (concrete) manifest is expressed in JSON (see issue #7), should it be defined “on top” (i.e., as some form of an extension) of the Web Application Manifest specification, or should it be a fully separate specification?
If we have a collection of information about a web publication as a whole ("manifest") that exists separately from most of the publication's resources, we need to find a way to associate the manifest with the other publication resources.
The manifest serialization MUST provide a general linking mechanism for defining a relationship between the Web Publication and other resources on the Web as well as the type of those relationships.
This mechanism is used to express many parts of the Web Publication's infoset, including but not limited to:
rel='canonical' link relation. [rfc6596]
There are some overlaps between this list and, e.g., the separate section on canonical identifiers.
This linking mechanism may also be used to express other common link structures on the open Web. For example:
Some of these link structures, such as dynamic search links, may require support for URI templates [rfc6570] to be meaningfully useful in the context of Web Publications. User agents should(?) support URI templates in order to make it easier for publishers to integrate dynamic server-side features into their publications with minimal coding and effort.
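To give a sense of what URI template support involves, the sketch below implements only the simplest form of expansion in [rfc6570] (Level 1, simple string expansion of {var}). Operators such as {?q} are left untouched, and the function name, template, and variable name are assumptions for illustration.

```python
import re
from typing import Dict
from urllib.parse import quote


def expand_simple(template: str, variables: Dict[str, str]) -> str:
    """Expand {name} expressions; undefined variables expand to nothing."""

    def substitute(match):
        value = variables.get(match.group(1))
        if value is None:
            return ""
        # Percent-encode everything outside the unreserved character set.
        return quote(str(value), safe="")

    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)
```

A dynamic search link could then be produced from a hypothetical template such as https://example.org/pub/search?q={query} by supplying the reader's search terms as the value of query.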
Need to determine intersection between web pubs and the lifecycle of a web app.
Placeholder for UA discovering and obtaining a manifest.
Placeholder for UA processing of manifest and creation of infoset.
Placeholder for UA initiating an enhanced reading experience.
Placeholder for UA updating manifest and/or WP content.
This section contains placeholders for possible reading enhancements the UA may/should/must provide. The list is subject to addition, modification and removal as the enhancements get discussed in more detail.
Placeholder for offline reading of a publication.
Placeholder for inter-publication search.
Placeholder for paginated reading experience.
This section is non-normative.
In addition to a Web Publication's address(es), bookmarking, annotation, and other use cases require internal locators which can be used to identify, locate, retrieve, and/or reference locations and content fragments within a Web Publication (cf. Web Publications Use Cases and Requirements [pwp-ucr] and Digital Publishing Annotation Use Cases [dpub-annotation-uc]).
In choosing to organize a publication into multiple, individually addressable resources, each with its own URL, the creator(s) of the publication provide locators for individual objects within the publication. Thus Web Publication Resource URLs serve as Web Publication internal locators for directly identifying, retrieving, and/or referencing each constituent resource of a Web Publication in its entirety.
Within an individual Web Publication Resource, the creator of the resource may further provide anchors or other structures (e.g., the value of an id attribute of a <p> element in an HTML constituent resource, the Document Object Model (DOM) of an XML constituent resource) as a way to facilitate identifying, locating, retrieving, and/or referencing locations and more granular content fragments within that individual constituent resource and thereby within the Web Publication. Intra-resource, publisher-provided anchors and structures of this sort are often used to mint fragment identifiers and Web Publication Locators, as described below.
The generic syntax for fragment identifiers is defined in RFC 3986 [rfc3986]. The fragment identifier component of a URL comes at the end of the URL and is preceded by a # character. The fragment identifier component of a URL is separated prior to dereferencing and is not sent to the server. The identifying information within the fragment identifier component itself is dereferenced solely by the user agent. Interpretation and resolution of a fragment identifier may be dependent on the media-type of the resource retrieved when the preceding part of the URL is dereferenced.
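The separation of the fragment identifier before dereferencing can be illustrated with Python's standard urllib.parse.urldefrag. The wrapper function name is hypothetical.

```python
from typing import Tuple
from urllib.parse import urldefrag


def split_fragment(url: str) -> Tuple[str, str]:
    """Return (URL to send to the server, fragment kept by the user agent)."""
    resource_url, fragment = urldefrag(url)
    return resource_url, fragment
```

Only the first value would be used in the request; the second is interpreted by the user agent once the resource and its media type are known.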
A broad range of fragment identifier formats, each specific to resources of one or more media-types, have been defined in various IETF RFCs, W3C Recommendations, etc. These are typically included by reference in IANA-registered media-type specifications [iana-media-types]. A fragment identifier of a format appropriate for a Web Publication Resource's media type may be used to identify, retrieve, and/or reference content fragments within that resource.
Could reference or duplicate here with modifications the second table from 4.2.1 of Web Anno Data Model Rec. [annotation-model].
User agents minting fragment identifiers often take advantage of publisher-provided anchors and structures. For example, the value of the id attribute of a <p> element in an HTML Web Publication Resource can be used to mint a fragment identifier linking to that <p> element (i.e., a paragraph). This type of fragment identifier is illustrated in Example 1, which assumes the existence within the HTML resource of an element with id p33, e.g., <p id="p33">.
https://dauwhe.github.io/html-first/HeatRadiation/OPS/s009-Chapter-001.html#p33
Using other fragment identifier formats, user agents may mint fragment identifiers to serve as Web Publication internal locators even in the absence of any explicit publisher-provided anchor or structure. For example, a media fragment identifier [media-frags] for a still image embedded in a Web Publication may be minted without reference to any publisher-provided anchor or structure, as shown in Example 2. In this example, the fragment is a rectangular image segment that is 200x150 pixels with its upper left corner at 100 pixels in from the left edge and 500 pixels down from the top edge of the full image.
http://zebu.uoregon.edu/hudf/hudf_300dpi.jpg#xywh=100,500,200,150
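A user agent might read the spatial dimension of such a media fragment [media-frags] roughly as follows. This sketch handles only pixel values (with or without the optional pixel: prefix) and returns None for anything else, including percent: values; the function name and the shape of the returned dictionary are assumptions.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urldefrag


def parse_xywh(url: str) -> Optional[Dict[str, int]]:
    """Extract x, y, width and height from a pixel-based xywh fragment."""
    _, fragment = urldefrag(url)
    values = parse_qs(fragment).get("xywh")
    if not values:
        return None
    spec = values[0]
    if spec.startswith("pixel:"):
        spec = spec[len("pixel:"):]
    parts = spec.split(",")
    # Anything that is not four non-negative integers is not handled here.
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        return None
    x, y, width, height = (int(part) for part in parts)
    return {"x": x, "y": y, "width": width, "height": height}
```

For the URL in Example 2 this yields x=100, y=500, width=200 and height=150, matching the rectangle described above.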
For some use cases it is essential to identify and reference a Web Publication Resource or a location in or a segment of a Web Publication Resource in the scope or context of a Web Publication which includes this resource. The fragment identifier approaches described above do not satisfy this requirement since only the URL of the constituent Web Publication Resource containing the location or content fragment of interest is expressed. Web Publication Locators address this issue by providing the means to express both the URL of the Web Publication Resource and the URL of the Web Publication.
TO DO: Need to update title and insert here reference to this now separate document. Do we also need an informative reference here to EPUB CFI document?
TO DO: illustrate with example of simple, easy to understand Web Publication Locator such as might be used in annotating a simple Web Publication. More complicated examples can be left to the external "Locators for Web Publications" document.
The semantics of Web Publication Locators are a mapping and extension of the Web Annotation Data Model [annotation-model] and Vocabulary [annotation-vocab] for describing and referencing a segment of a web resource. As a result Web Publication Locators provide the expressiveness needed for a broad range of annotation and bookmarking use cases. Additionally Web Publication Locators provide a way to identify and reference a location within a Web Publication, i.e., as distinct from identifying and referencing a content fragment consisting of a span of characters or bytes. A Web Publication Locator can be used to identify, retrieve and/or reference a fragment of a Web Publication that spans multiple Web Publication Resources.
The separate document will need to illustrate these use cases, i.e., identifying a location as distinct from a fragment and identifying a fragment that spans multiple Web Publication Resources. If not, we'll need to illustrate here.
In composing a Web Publication Locator, the canonical identifier of the Web Publication should be used in preference to any alternative addresses. This facilitates the collation of Web Publication Locators associated with a particular Web Publication. URLs of Web Publication Resources appearing in a Web Publication Locator should match the URL of the Web Publication Resource provided in the Web Publication Infoset.
This section is non-normative.
The following people contributed to the development of this specification:
The Working Group would also like to thank the members of the Digital Publishing Interest Group for all the hard work they did paving the road for this specification.