File Based URN Resolver

Discussion Draft v1.0

Last Updated: 8 Feb, 1999


Abstract


Introduction

URNs define a way of resolving a given name to a resource without specifying the resource's location. The idea is to let a resource move around without affecting the calling application. Efforts to date have focused on defining the syntax of a URN and on getting one implementation together that resolves a URN to a real resource. This implementation is based around DNS and uses the experimental NAPTR [1] and SRV [2] records.

In many environments, DNS will not be available. The lowest common denominator is to specify exactly how to resolve a URN from information in a text configuration file. Because URN references vary so widely in scope, this configuration file must be extremely flexible. Many of the cues for this have been taken directly from the work on the DNS-based NAPTR [3, 4] record.


File Format Outline

The basic file format is standard ASCII text and is called urn_bindings. The contents of this file enable the system to parse a URN based on the namespace identifier and provide a resolution of the URN to a URL specifying the exact resource. Given this URL, the internals of the URI resolution may construct a resource stream as needed.

The file consists of groupings of resolution services that are delimited by namespace identifiers. Each logical group starts with the namespace ID (NID) and is followed by the resources needed to resolve that namespace. The declaration of a new NID, or the end of the file, terminates the current NID's services.

The first step in the resolution service is to extract some meaningful information from the raw URN. In the DNS-based system, there is a set of rewrite rules that are used to extract the next-level domain to query for further resolution. For the file-based resolver, everything is resolved within the context of the one file, so there is no need to consult any further services. Instead, we use the same regular expression syntax to grab the name of a group that may be able to provide further resolution. The aim of this group is to provide the next level of resolution.

At the group level, we are now committed to resolving the URN to a particular resource. Within this group is the list of resources that we may resolve the URN to. Because the URN must be rewritten into the resource, we also need another regular expression to extract the required information and add it to the raw resource string.

All resources are specified as URLs because we must give an exact location of where to locate the resource. URLs are the most compact expression available to us in a text file.

Resolving a URN

Based on the above outline and the example file below, a URN is resolved against the file as follows.

At the top level, we know exactly what URN NIDs may be handled. Anything not defined in this file cannot be resolved. Resolvable namespaces are indicated using the "NID:" keyword:

  NID: vrml

The namespace identifier may be treated as case-insensitive when comparing it during a resolution request.
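
The case-insensitive comparison can be sketched in Python as follows. This is only an illustration: the function names nid_of and is_resolvable are hypothetical, and the set of known NIDs is assumed to have been read from the "NID:" lines of the bindings file.

```python
def nid_of(urn):
    """Return the lower-cased namespace identifier of a 'urn:<NID>:<NSS>' string."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return parts[1].lower()


def is_resolvable(urn, known_nids):
    """True if the URN's NID matches one of the declared NIDs, ignoring case."""
    return nid_of(urn) in {nid.lower() for nid in known_nids}


if __name__ == "__main__":
    declared = {"vrml", "cid"}  # assumed to come from the NID: lines
    print(is_resolvable("URN:VRML:umel:texture/wood.gif", declared))
    print(is_resolvable("urn:isbn:0451450523", declared))
```

Normalising both sides to lower case keeps the lookup a plain set membership test.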

Next we need to specify a group of resources that we think are appropriate for resolving the namespace. To do this, we specify a regular expression that extracts a group name from the URN. For simplicity and commonality, we use exactly the same format of regular expression as may be specified in the regexp field of a NAPTR record. Applying this regexp to the URN should give us the name of the group to look at.

The regular expression is specified in the file using the "REGEXP:" keyword and shall immediately follow the namespace identifier that it applies to. For example, applying the following definition:

    NID: vrml
    REGEXP: /urn:vrml:([^\/:]+)/\1/i

to the URN

    urn:vrml:umel:texture/wood.gif

would result in the group name umel being generated.
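
A minimal Python sketch of this step is given here. It assumes the substitution expression has already been split into its ERE and replacement parts, uses Python's re module as a stand-in for POSIX ERE matching, and takes the result to be the expanded replacement alone (which is what the umel example implies). The name group_name is hypothetical.

```python
import re


def group_name(urn, ere, repl):
    """Apply a group regexp to a URN and return the group name, or None.

    ere and repl are the two parts of a /ere/repl/i expression with the
    delimiters already removed (an assumption of this sketch).
    """
    match = re.search(ere, urn, re.IGNORECASE)  # the "i" flag
    if match is None:
        return None
    # expand() substitutes \1, \2, ... with the captured text.
    # The result is lower-cased here; the draft permits this under "i".
    return match.expand(repl).lower()


if __name__ == "__main__":
    print(group_name("urn:vrml:umel:texture/wood.gif",
                     r"urn:vrml:([^\/:]+)", r"\1"))
```

A URN that does not match the pattern yields None, which a caller could treat as "no group can resolve this name".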

Groups are specified in the file using the "GRP:" keyword. Under each group is a list of the resources that may be used to complete the resolution and potential access of the named object. These are specified in order of preference, from the highest preference first to the lowest preference last. The preference list is terminated when a new group or namespace is started.

A resource is specified with the "RES:" keyword. Following this is a single, fully specified URL that is quoted. Whitespace is then used to delimit the start of the regular expression. The regular expression is used to extract information from the full, original URN and append it to the URL specified in the resource. For example, applying the following specification:

    GRP: umel
      RES: "file:///c:/urn/media/"  /urn:vrml:umel:([^\/]+)\/(.*)/\1\/\2/i
      RES: "http://urn.vrml.org/umel/" /urn:vrml:umel:([^\/]+)\/(.*)/\1\/\2/i

to our previously defined URN would lead to the production of the URLs:

    file:///c:/urn/media/texture/wood.gif
    http://urn.vrml.org/umel/texture/wood.gif

However, changing the production rule for the second resource to:

    RES: "http://urn.vrml.org/umel/fetch_resource.pl"  /urn:vrml:umel:([^\/]+)\/(.*)/?category=\1+object=\2/i

should result in the full URL:

    http://urn.vrml.org/umel/fetch_resource.pl?category=texture+object=wood.gif
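
The resource step can be sketched in Python as below. This is an illustration under several assumptions: the function name resolve and the tuple layout are hypothetical, Python's re module stands in for POSIX ERE, the patterns are given with delimiters and delimiter escapes already removed, and each category capture uses "+" so that a whole path segment is matched.

```python
import re


def resolve(urn, resources):
    """Return candidate URLs for a URN, in the order the resources are listed.

    resources is a list of (base_url, ere, repl) tuples.
    """
    urls = []
    for base_url, ere, repl in resources:
        match = re.search(ere, urn, re.IGNORECASE)
        if match is None:
            continue  # this resource cannot rewrite the URN
        # Replace every \N in repl with the text of the N'th group.
        suffix = re.sub(r"\\(\d+)",
                        lambda ref: match.group(int(ref.group(1))) or "",
                        repl)
        urls.append(base_url + suffix)
    return urls


UMEL = [
    ("file:///c:/urn/media/",
     r"urn:vrml:umel:([^/]+)/(.*)", r"\1/\2"),
    ("http://urn.vrml.org/umel/fetch_resource.pl",
     r"urn:vrml:umel:([^/]+)/(.*)", r"?category=\1+object=\2"),
]

if __name__ == "__main__":
    for url in resolve("urn:vrml:umel:texture/wood.gif", UMEL):
        print(url)
```

Because the list is returned in file order, a caller can simply try each URL in turn until one can be opened, which matches the preference ordering of the group.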


File Syntax

file          = namespace_ids
namespace_ids = namespace_id | namespace_id namespace_ids
namespace_id  = "NID:" namespace_str grp_regexp groups
namespace_str = any valid namespace identifier string (see [6])

grp_regexp    = "REGEXP:" grp_exp
grp_exp       = (see [1], NAPTR RR Format, Regexp)

groups        = group | group groups
group         = "GRP:" grp_str resources
grp_str       = 1*GRP_CHAR
GRP_CHAR      = "-" | "." | "a"  | ... | "z" | "A" | ... | "Z" | "0" | ... | "9"

resources     = resource | resource resources
resource      = "RES:" <">url<"> res_regexp
url           = Any valid fully qualified URL (see [7])

res_regexp    = delim_char ere delim_char repl delim_char *flags
delim_char    = "/" | "!" ... (Any non-digit or non-flag character other than
                backslash '\'. All occurrences of a delim_char in a res_regexp
                must be the same character.)
ere           = POSIX Extended Regular Expression (see [5],
                section 2.8.4)
repl          = repl_str | backref | repl repl_str | repl backref
repl_str      = 1*REPL_CHAR
backref       = "\" 1POS_NUMBER
flags         = "i"
REPL_CHAR     = "-" | "?" | "+" | "%" | "." | ":" | "#" |
                "a"  | ... | "z" | "A" | ... | "Z" | "0" | ... | "9"
                 (see [7] for full list)
POS_NUMBER    = "1" | "2" | ... | "9" | "10" | ... ; 0 is not an
                allowed backref value.

The following notes apply to the res_regexp regular expression substitution.

The hash '#' character shall be treated as the start of a comment. Anything following the character shall be ignored. The comment is terminated by an end of line character.

Backref expressions in the repl portion of the substitution expression are replaced by the (possibly empty) string of characters enclosed by '(' and ')' in the ERE portion of the substitution expression. N may be any positive number and specifies the N'th backref expression. The N'th backref expression is the one that begins with the N'th '(' and continues to the matching ')'. For example, the ERE:

    (A(B(C)DE)(F)G)

has backref expressions:

    \1 = ABCDEFG
    \2 = BCDE
    \3 = C
    \4 = F

The "i" flag indicates that the ERE matching shall be performed in a case-insensitive fashion. Furthermore, any backref replacements may be normalised to lower case when the "i" flag is given.
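
The backref numbering rule and the effect of the "i" flag can be checked with a short Python sketch. Python's re module is used here only as an approximation of POSIX ERE; for this simple pattern the group numbering follows the same opening-parenthesis rule.

```python
import re

ERE = r"(A(B(C)DE)(F)G)"

# Groups are numbered by the position of their opening parenthesis.
match = re.fullmatch(ERE, "ABCDEFG")
backrefs = {n: match.group(n) for n in range(1, match.re.groups + 1)}

# With the "i" flag the match ignores case, and a resolver may
# normalise each backref to lower case before substituting it.
loose = re.fullmatch(ERE, "abcDEfg", re.IGNORECASE)
normalised = {n: loose.group(n).lower() for n in range(1, loose.re.groups + 1)}

if __name__ == "__main__":
    for n, text in backrefs.items():
        print(f"\\{n} = {text}")
    print(normalised)
```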

The first character in the substitution expression shall be used as the character that delimits the components of the substitution expression. There must be exactly three non-escaped occurrences of the delimiter character in a substitution expression. Since escaped occurrences of the delimiter character will be interpreted as occurrences of that character, digits shall not be used as delimiters. Backrefs would be confused with literal digits if this were allowed. Similarly, if flags are specified in the substitution expression, the delimiter character must not also be a flag character.
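
The delimiter rules can be sketched as a small splitting routine in Python. The name split_subst is hypothetical, and the sketch deliberately leaves escape sequences untouched in the returned parts; unescaping the delimiter is left to the caller.

```python
FLAG_CHARS = "i"  # the only flag defined by the grammar


def split_subst(expr):
    """Split 'D ere D repl D flags' into (ere, repl, flags).

    D is the first character of expr. Escaped characters (a backslash
    followed by any character) are never treated as delimiters.
    """
    if not expr:
        raise ValueError("empty substitution expression")
    delim = expr[0]
    if delim.isdigit() or delim == "\\" or delim in FLAG_CHARS:
        raise ValueError(f"illegal delimiter {delim!r}")
    parts, current, i = [], [], 1
    while i < len(expr):
        ch = expr[i]
        if ch == "\\" and i + 1 < len(expr):
            current.append(expr[i:i + 2])  # keep the escape pair together
            i += 2
        elif ch == delim:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    if len(parts) != 3:
        raise ValueError("expected exactly three unescaped delimiters")
    ere, repl, flags = parts
    if any(flag not in FLAG_CHARS for flag in flags):
        raise ValueError(f"unknown flag in {flags!r}")
    return ere, repl, flags


if __name__ == "__main__":
    print(split_subst(r"/urn:vrml:([^\/:]+)/\1/i"))
```

Choosing a delimiter such as "!" avoids the need to escape the slashes that are common in URNs and URLs.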

The URL of the resource shall always be quoted. This avoids confusion about the boundary between the URL and the beginning of the regular expression. Under some operating systems, file: URL types may include spaces in the directory names. By quoting the string, it is easier to delimit the extent of the URL.

Example File

# vrml name spaces:
# urn:vrml:umel:/some/dir/file.ext
NID: vrml
  REGEXP: /urn:vrml:([^\/:]+)/\1/i
  GRP: umel
    RES: "file:///c:/urn/media/"  /urn:vrml:umel:([^\/]+)\/(.*)/\1\/\2/i
    RES: "http://urn.vrml.org/umel/" /urn:vrml:umel:([^\/]+)\/(.*)/\1\/\2/i
  GRP: eai
    RES: "http://urn.vrml.org/eai/" /urn:vrml:eai:([^\/]+)\/(.*)/\1\/\2/i

# Experimental CID namespace (from draft NAPTR spec)
# urn:cid:199606121851.1@mordred.gatech.edu
NID: cid
  REGEXP: /urn:cid:.+@([^\.]+\.)(.*)$/\2/i
  GRP: gatech.edu
    RES: "http://www.gatech.edu/cgi-bin/resources.pl" /urn:cid:.+@([^\.]+\.)(.*)$/\?uid=\1/i
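
A file in this format could be read with a line-based parser along the following lines. This is a sketch under stated assumptions: the name parse_bindings and the returned dictionary layout are hypothetical, substitution expressions are stored unparsed, and only whole-line comments are recognised because '#' is also listed as a legal REPL_CHAR.

```python
import re

# A quoted URL, whitespace, then the substitution expression.
RES_LINE = re.compile(r'"([^"]*)"\s+(\S.*)')


def parse_bindings(text):
    """Parse urn_bindings text into
    {nid: {"regexp": str, "groups": {group: [(url, subst), ...]}}}."""
    bindings = {}
    nid = grp = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        keyword, sep, rest = line.partition(":")
        rest = rest.strip()
        if not sep:
            raise ValueError(f"line {lineno}: expected 'KEYWORD:'")
        if keyword == "NID":
            nid, grp = rest.lower(), None  # NIDs compare case-insensitively
            bindings[nid] = {"regexp": None, "groups": {}}
        elif keyword == "REGEXP" and nid is not None and grp is None:
            bindings[nid]["regexp"] = rest
        elif keyword == "GRP" and nid is not None:
            grp = rest
            bindings[nid]["groups"][grp] = []
        elif keyword == "RES" and grp is not None:
            match = RES_LINE.fullmatch(rest)
            if match is None:
                raise ValueError(f"line {lineno}: malformed RES entry")
            bindings[nid]["groups"][grp].append((match.group(1), match.group(2)))
        else:
            raise ValueError(f"line {lineno}: unexpected {keyword!r}")
    return bindings


SAMPLE = r'''
# Experimental CID namespace
NID: cid
  REGEXP: /urn:cid:.+@([^\.]+\.)(.*)$/\2/i
  GRP: gatech.edu
    RES: "http://www.gatech.edu/cgi-bin/resources.pl" /urn:cid:.+@([^\.]+\.)(.*)$/\?uid=\1/i
'''

if __name__ == "__main__":
    print(parse_bindings(SAMPLE))
```

Resources are kept in a list per group so that the order of preference given in the file is preserved.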


References

[1] Mealling, M. & Daniel, R. "The Naming Authority (NAPTR) DNS Resource Record", IETF Internet Draft 0 (draft-ietf-urn-naptr-rr-00.txt), Nov 1998.

[2] Gulbrandsen, A. & Vixie, P. "A DNS Resource Record for Specifying the Location of Services", RFC 2052, Oct 1996.

[3] Daniel, R. & Mealling, M. "Resolution of Uniform Resource Identifiers using the Domain Name System", RFC 2168, June 1997.

[4] Mealling, M. & Daniel, R. "URI Resolution Services Necessary for URN Resolution", IETF Internet Draft 7 (draft-ietf-urn-resolution-services-07.txt), Nov 1998.

[5] IEEE Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 2: Shell and Utilities (Vol 1), IEEE Std 1003.2-1992, The Institute of Electrical and Electronics Engineers, New York, 1993. ISBN: 1-55937-255-9.

[6] Moats, R. "URN Syntax", RFC 2141, May 1997.

[7] Berners-Lee, T., Masinter, L., and M. McCahill, Editors, "Uniform Resource Locators (URL)", RFC 1738, December 1994.


http://www.vlc.com.au/~justin/java/urn/file_based_resolver.html
Comments: couch@ccis.adisys.com.au