Internet Engineering Task Force
E. Zierau, Ed.
Internet-Draft
The Royal Danish Library
Intended status: Informational
June 9, 2017
Expires: December 11, 2017

Scheme Specification for the pwid URI
draft-pwid-uri-specification-02

Abstract

This document specifies a Uniform Resource Identifier (URI) for Persistent Web IDentifiers to web material in web archives using the 'pwid' scheme name. The purpose of the standard is to support general, global, sustainable, human-readable, technology-agnostic, persistent and precise web references for such web materials.

The PWID URI can assist in two ways. First, it provides a potentially resolvable, precise and persistent reference scheme for documents, a need that is not sufficiently covered by existing web reference practices. Second, it provides a standardized way to specify web elements in a web collection, also known as a web corpus. Definitions of web collections are often needed for extraction of data used in production of research results, e.g. for evaluations in the future. Current practices are not persistent, as they often rely on some version of CDX, which varies between implementations.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 11, 2017.
Zierau                  Expires December 11, 2017                [Page 1]
Internet-Draft    Scheme Specification for the pwid URI        June 2017

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Demonstrable, New, Long-Lived Utility
   3.  Syntactic Compatibility
   4.  Well Defined
   5.  Definition of Operations
   6.  Context of Use
   7.  Internationalization and Character Encoding
   8.  Scheme Name Considerations
   9.  Interoperability Considerations
   10. Acknowledgements
   11. IANA Considerations
   12. Clear Security and Privacy Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Author's Address

1.  Introduction

The purpose of the PWID URI is to represent general, global, sustainable, human-readable and technology-agnostic web archive resource references in a scheme that can be used for technical solutions.

The motivation for defining a PWID URI scheme is the growing challenge of referencing web resources, both when citing web resources from papers and when defining a web collection or corpus.

   o  Citation guidelines generally do not cover general and persistent referencing techniques for web resources that are not registered by Persistent Identifier systems (like DOI [DOI]). However, an increasing number of references point to resources that only exist on the web, e.g. blogs that turned out to have a historical impact. In order to obtain persistency for a reference, the target needs to be stable. As the live web is 'alive' and in constant change, persistency can only be obtained by referring to archived snapshots of the web. The PWID URI is therefore focused on referencing archived web material in a technology-agnostic way (research documented in [IPRES] and [ResawRef]).

   o  There are many different requirements for construction of collection definitions for web material besides precision and persistency. Recent research has found that various legal and sustainability issues lead to a need for a collection to be defined by references to the web parts in the collection. The PWID URI is needed in such definitions in order to fulfil these requirements and to enable a collection to cover web materials from more than one archive (research documented in [ResawColl]).

For the sake of usability and sustainability, the definition of the PWID URI scheme is limited to the minimum information required to precisely identify a resource in an arbitrary web archive.
Recent research has found that this is obtained with the following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI

      *  Archival timestamp

   o  Intended coverage (page, part, subsite etc.)

The PWID URI scheme represents this information in an unambiguous way, thus enabling technical solutions to be defined based on this scheme.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Demonstrable, New, Long-Lived Utility

The purpose of the PWID URI is to represent the needed referencing information (as listed in the introduction) in a scheme that can be used for technical solutions.

As described in [ResawColl], such references can be represented in a textual way. However, a strict, unambiguous syntax is needed in order to ensure that references can be used for computational purposes. This is relevant for web collection definitions, which need a strict scheme in order to be a basis for automatic extraction. Furthermore, readers of research papers today expect to be able to access a referenced resource by clicking an actionable URI. A similar facility will therefore be expected for references to available archived web material.

Interest in this new PWID URI scheme has already been shown. A paper about the invention of the PWID URI, "Persistent Web References - Best Practices and New Suggestions" [IPRES], was accepted for the iPres 2016 conference and nominated as best paper. At the RESAW 2017 conference there are two related papers: one on referencing practices [ResawRef] and one on research data management practices [ResawColl]. The interest shown so far indicates that this is a recognized issue, and that the PWID URI can fill a gap.
The PWID URI could function as a URN as defined in RFC 2141 [RFC2141], but it is not defined as such. The ambition is to make an easily understandable and technology-independent persistent identifier, for which a "urn:" prefix would be disturbing. At the same time, the PWID definition can enjoy the same common syntactic, semantic, and shared language benefits that the URI presentation confers.

It should be noted that for closed web archives, the PWID URI can be resolved within a closed environment. Likewise, the PWID URI can be resolved within the coming web archive research infrastructure, which is currently being proposed in the RESAW community [RESAW].

3.  Syntactic Compatibility

The syntax of the PWID URI scheme is specified below in Augmented Backus-Naur Form (ABNF) as defined in RFC 5234 [RFC5234], and it conforms to the URI syntax defined in RFC 3986 [RFC3986].

The syntax definition of the PWID URI is:

   pwid-uri = pwid-scheme ":" pwid-spec

   pwid-scheme = "pwid"

   pwid-spec = archive-id ":" archival-time ":" coverage-spec
               ":" archived-item

   archive-id = +( unreserved )

   archival-time = full-date datetime-delim full-pwid-time

   datetime-delim = "_" / "T"

   full-pwid-time = time-hour ["."] time-minute ["."] time-second "Z"

   coverage-spec = "part" / "page" / "subsite" / "site"
                   / "collection" / "recording" / "snapshot"
                   / "other"

   archived-item = URI / archived-item-id

   archived-item-id = +( unreserved )

where

   o  'unreserved' is defined as in RFC 3986 [RFC3986].

   o  'coverage-spec' values are not case sensitive (i.e. "PAGE" / "PART" / "PaGe" / ... are valid values as well).

   o  'archival-time' is a UTC timestamp conforming to the W3C profile of ISO 8601 [ISO8601] (also defined in RFC 3339 [RFC3339]), with a few exceptions for the 'datetime-delim' and 'full-pwid-time': "." is used instead of ":" in order not to collide with the ":" used for delimitation of URI parts.
      The 'full-date' is defined as in RFC 3339 [RFC3339].

      The 'archival-time' must represent the time specified in the archive, and can therefore be specified at any of the levels of granularity described in [W3CDTF], in accordance with the WARC standard ISO 28500 [ISO28500].

      The 'datetime-delim' "_" is accepted for readability, in the same way as the W3C profile accepts " ". Here "_" is used instead because it is an allowed URI character. In line with RFC 3339 [RFC3339], the "T" may alternatively be lower case "t".

      'time-hour', 'time-minute' and 'time-second' are defined as in RFC 3339 [RFC3339]. In line with RFC 3339 [RFC3339], the "Z" may alternatively be lower case "z".

   o  'URI' is defined as in RFC 3986 [RFC3986].

The 'coverage-spec' defines the type of archived item, clarifying what is being referred to:

   o  part: the single archived element, e.g. a PDF, an HTML text or an image

   o  page: the full context as a page, e.g. an HTML page with referred images

   o  subsite: the full context as a subsite within its domain, e.g. a document represented in a web structure

   o  site: the full context as a site within its domain

   o  collection: a collection/corpus definition, e.g. defined as described in [ResawColl]

   o  snapshot: a snapshot (image) representation of web material, e.g. a web page

   o  recording: a recording of a web browsing

   o  other: if something else

Note that the 'coverage-spec' is a parameter that could have been specified as a query. However, since the 'pwid-uri' can include a URI as 'archived-item', specifying the 'coverage-spec' as a query would introduce ambiguity: it would not be clear whether the query belonged to the 'pwid-uri' or to the 'archived-item'.

4.  Well Defined

The information in a PWID URI can be used for locating a web archive resource, for any kind of web archive.
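
As an illustration of the syntax in Section 3, the following Python sketch splits a PWID URI into its four components. It is not part of this specification and rests on several assumptions: the names Pwid and parse_pwid are hypothetical, "+( unreserved )" is read as "one or more unreserved characters", only complete date-and-time timestamps are accepted (coarser levels of granularity are not handled), and the 'archived-item' is only checked for being non-empty rather than validated against RFC 3986.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# The eight 'coverage-spec' values listed in the grammar.
COVERAGE_VALUES = frozenset({
    "part", "page", "subsite", "site",
    "collection", "recording", "snapshot", "other",
})

# 'unreserved' from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~".
UNRESERVED_RE = re.compile(r"[A-Za-z0-9._~-]+")

# full-date, then "_" or "T"/"t", then hh[.]mm[.]ss, then "Z"/"z".
ARCHIVAL_TIME_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
    r"[_Tt]"
    r"([0-9]{2})\.?([0-9]{2})\.?([0-9]{2})"
    r"[Zz]"
)


class Pwid(NamedTuple):
    """Hypothetical container for the parts of a PWID URI."""
    archive_id: str
    archival_time: datetime
    coverage: str
    archived_item: str


def parse_pwid(text: str) -> Pwid:
    """Parse a PWID URI, raising ValueError if it does not match."""
    # The first three components cannot contain ":", so the first
    # four colons delimit the parts; the archived item keeps its own.
    parts = text.split(":", 4)
    if len(parts) != 5 or parts[0].lower() != "pwid":
        raise ValueError("not a pwid URI: %r" % text)
    _, archive_id, time_text, coverage, item = parts

    if not UNRESERVED_RE.fullmatch(archive_id):
        raise ValueError("invalid archive-id: %r" % archive_id)

    match = ARCHIVAL_TIME_RE.fullmatch(time_text)
    if match is None:
        raise ValueError("invalid archival-time: %r" % time_text)
    # datetime() itself raises ValueError for impossible dates; note
    # that it also rejects a leap second value of 60.
    archival_time = datetime(*map(int, match.groups()),
                             tzinfo=timezone.utc)

    # 'coverage-spec' is not case sensitive; normalize to lower case.
    coverage = coverage.lower()
    if coverage not in COVERAGE_VALUES:
        raise ValueError("invalid coverage-spec: %r" % coverage)

    if not item:
        raise ValueError("empty archived-item")
    return Pwid(archive_id, archival_time, coverage, item)
```

Splitting on the first four colons, rather than matching the whole string with one expression, keeps an embedded 'archived-item' URI such as "http://www.dr.dk" intact even though it contains ":" itself.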
A PWID URI includes the minimum information for web archive materials, which enables resolvability, manually or by a resolver. One of the reasons for defining PWID as a URI is to enable a general, technology-agnostic, persistent representation to be resolvable at any time. The information needed is:

   o  Web archive identification, to find the archive holding the material

   o  Archived URI or identifier of the item, as part of identifying the material

   o  Date and time associated with the archived URI/item, as part of precise identification of the material

   o  Coverage of what is referred to, as clarification of what the referred material covers (page, part etc.)

For example, the PWID URI:

   pwid:archive.org:2016-01-22_11.20.29Z:page:http://www.dr.dk

has the information:

   o  archive.org: currently known identifier, in the form of the Internet Archive domain name for its open access web archive

   o  2016-01-22_11.20.29Z: date and time associated with the archived URI

   o  page: clarification that the reference covers the full web page with all its inherited parts selected by the web archive

   o  http://www.dr.dk: archived URI of the item

With knowledge of the current (2017) Internet Archive open access web interface having the form: https://web.archive.org/web/