MODULE IRI

 

 

 

(c) Carlos Viegas Damásio, October 2003

 

1. Description
This module implements a set of library predicates for parsing and working with IRI references, according to RFC 2396 and RFC 2732 and the draft proposals of RFC 2396 bis and Internationalized Resource Identifiers. It provides:
  • An IRI parser.
  • Resolution of IRI references according to RFC 2396 bis.
  • Conversion predicates from atoms and strings to IRI reference terms, and vice versa.
  • The mapping of IRIs to ordinary URIs.

Currently, the parser does not fully validate IPv4 and IPv6 addresses, i.e. some invalid IPv4 and IPv6 addresses may be accepted. Since this module depends on draft specifications, the user is advised to restrict the usage of this module to the parsing of ordinary URI references.

 

2. Internationalized Resource Identifiers References Representation

The user can use this module to construct a Prolog term representation of Internationalized Resource Identifier references and to resolve them. IRIs can be parsed either from lists of Unicode character codes terminated by -1, or from atoms whose names encode the IRI in UTF-8. In both cases, the following term is constructed when a syntactically correct IRI is provided:

Internationalized Resource Identifiers References Representation

iriref(Scheme,Authority,Path,Query,Fragment)

  Scheme:    The term scheme( ListOfCodes ) if the IRI reference has a scheme component, where ListOfCodes is a list of Unicode character codes. The empty list [] if there is no scheme component in the IRI reference.
  Authority: The term authority( UserInfo, Host, Port ), where UserInfo, Host, and Port are (possibly empty) lists of Unicode character codes. The empty list [] if there is no authority component in the IRI reference.
  Path:      The term path( rel, Segments ) or path( abs, Segments ), representing a relative or an absolute path respectively. Segments is a (possibly empty) list of terms of the form segment( ListOfCodes ), where ListOfCodes is a (possibly empty) list of Unicode character codes. The empty list [] if there is no path component.
  Query:     The term query( ListOfCodes ) if the IRI reference has a query component, where ListOfCodes is a list of Unicode character codes. The empty list [] if there is no query component in the IRI reference.
  Fragment:  The term fragment( ListOfCodes ) if the IRI reference has a fragment component, where ListOfCodes is a list of Unicode character codes. The empty list [] if there is no fragment component in the IRI reference.
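
To make the shape of the term concrete, the sketch below builds by hand the term that, following the description above, should correspond to the hypothetical IRI 'http://example.org/a/b?x=1#top'. The predicates sample_iriref/1 and component_atom/2 are illustrative names and are not part of the module; the block uses only atom_codes/2.

```prolog
% Hand-built term for the hypothetical IRI 'http://example.org/a/b?x=1#top'.
% atom_codes/2 is used only to avoid writing the code lists by hand.
sample_iriref(iriref(scheme(S),
                     authority([], H, []),      % no user info, no port
                     path(abs, [segment(A), segment(B)]),
                     query(Q),
                     fragment(F))) :-
    atom_codes(http, S),
    atom_codes('example.org', H),
    atom_codes(a, A),
    atom_codes(b, B),
    atom_codes('x=1', Q),
    atom_codes(top, F).

% component_atom(+Component, -Atom)
% Readable form of a scheme, query or fragment component.
% An absent component (the empty list) is mapped to the empty atom.
component_atom([], '').
component_atom(scheme(Codes), Atom)   :- atom_codes(Atom, Codes).
component_atom(query(Codes), Atom)    :- atom_codes(Atom, Codes).
component_atom(fragment(Codes), Atom) :- atom_codes(Atom, Codes).
```

For example, the goal sample_iriref(iriref(S,_,_,_,_)), component_atom(S, A) binds A to the atom http.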

The main predicates are parseIRIref/2, parseIRIref/3 and atom2iriref/2, which parse an IRI reference and construct its term representation, and resolveIRIref/3, which resolves a relative reference with respect to a base IRI. The separators of the IRI components (':', '@', '/', '?' and '#') are not maintained in the IRI reference term representation.

 
3. Installation of the IRI Module
  1. Unpack the package containing the source files to a library directory. This package should contain the following files:
    •  iri.P and iriparse.P. The latest version of utilities.P and builtins.P should also be available, and are included in the package.
    • The file iriparse.G, containing the source code for generating iriparse.P, if necessary. To generate iriparse.P the user should use our lookup DCG parser generator.
    • This user's manual and the file testiri.P are also provided. The test file illustrates the parsing and resolution of IRIs.
  2. Compile the main file with the goal ?-[iri].
  3. The module can be tested by compiling testiri.P and executing the goals ?- testiris. and ?- testresolution.
  4. To use the module's predicates, import them with import declarations. The full set of predicates is described in the following section.
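
As an illustration of step 4, import declarations in XSB could look like the fragment below. It assumes that the module is named iri (after the file iri.P) and that it exports the predicates listed; this manual does not state either point explicitly, so check the source before relying on it.

```prolog
% Assumption: the module is called iri and exports these predicates.
:- import parseIRIref/2, parseIRIref/3 from iri.
:- import resolveIRIref/3 from iri.
:- import atom2iriref/2, iriref2atom/2 from iri.
```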
 
4. Usage of the IRI Module

4.1 Parsing of IRI references:

IRI references can be parsed using parseIRIref/2 and parseIRIref/3. For efficiency, the programmer should use parseIRIref/2, which requires lists of Unicode character codes terminated by -1.

  • parseIRIref( + TermUCSList, IRIref )

Given a list of Unicode character codes, terminated with -1, parseIRIref/2 returns the term representation of the IRI reference. It fails if the first argument is not a syntactically correct IRI reference. Notice that the separators of the IRI components do not appear in the IRI reference term representation.
The production ihostname of Internationalized Resource Identifiers is not fully implemented: the parser only checks that the ihostname part contains no illegal characters. The syntax of IPv6 addresses is not checked, and IPv4 addresses are checked only if UserInfo is present.

Example:

| ?- append( "http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING", [-1], _Codes ),
     parseIRIref( _Codes, Ref ).

Ref = iriref(scheme([104,116,116,112]),
             authority([],[119,119,119,46,105,99,115,46,117,99,105,46,101,100,117],[]),
             path(abs,[segment([112,117,98]),
                       segment([105,101,116,102]),
                       segment([117,114,105]),
                       segment([104,105,115,116,111,114,105,99,97,108,46,104,116,109,108])
                      ]
                 ),
             [],
             fragment([87,65,82,78,73,78,71])
            );

  • parseIRIref( + Terminated, + ListOfCodes, IRIref )

The first argument Terminated takes the value yes or no, indicating whether the list of Unicode character codes in the second argument is already terminated with -1. A call of the form parseIRIref( yes, ListOfCodes, IRIref ) is equivalent to parseIRIref( ListOfCodes, IRIref ). If the first argument is no, then -1 is appended to the list in the second argument and parseIRIref/2 is called. Appending requires traversing the whole list, so this second form should be used sparingly.
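
The extra work done when Terminated is no can be pictured with the sketch below. The helper terminated_codes/2 is a hypothetical name, not a module predicate; it only shows that adding the -1 end marker walks and copies the entire code list.

```prolog
% terminated_codes(+Codes, -Terminated)
% Copies Codes and adds the -1 end marker expected by parseIRIref/2.
% Every element of Codes is visited once, hence the linear cost.
terminated_codes([], [-1]).
terminated_codes([C|Cs], [C|Ts]) :-
    terminated_codes(Cs, Ts).
```

According to the description above, terminated_codes( Codes, T ), parseIRIref( T, IRIref ) should then behave like parseIRIref( no, Codes, IRIref ).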

 

4.2 Testing and Inspection of IRI reference terms:

The following predicates determine the type of a parsed or constructed IRI reference:

  • isIRIref(+ IRIref )

    This predicate succeeds when the function symbol of its argument is that of an IRI reference term. For efficiency, it does not check whether the component arguments are correct.

  • isIRI(+ IRIref )

    This predicate succeeds when its argument is an IRI, i.e. an IRI reference with a non-empty scheme component.

  • isAbsoluteIRI(+ IRIref )

    This predicate succeeds when its argument is an absolute IRI, i.e. an IRI without fragment part.

  • isRelativeIRI(+ IRIref )

    This predicate succeeds when its argument is a relative IRI, i.e. an IRI reference with an empty scheme component.
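
The four tests above can be read as patterns over the iriref/5 term of Section 2. The sketch below spells them out; the sketch_ names are hypothetical and the module's own definitions may differ.

```prolog
% Any term with the iriref/5 function symbol; arguments are not inspected.
sketch_is_iriref(iriref(_, _, _, _, _)).

% An IRI: the scheme component is present.
sketch_is_iri(iriref(scheme(_), _, _, _, _)).

% An absolute IRI: scheme present and no fragment ([] in the last argument).
sketch_is_absolute_iri(iriref(scheme(_), _, _, _, [])).

% A relative IRI reference: the scheme component is the empty list.
sketch_is_relative_iri(iriref([], _, _, _, _)).
```

Under this reading, the term RelIRI shown in Section 4.4 satisfies sketch_is_relative_iri/1 but not sketch_is_iri/1.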

To obtain the several components of an IRI reference term, the following predicates may be used:

  • getIRIrefScheme(+ IRIref,Scheme)

    Obtains the scheme component of a given IRI reference. The scheme component is a term of the form scheme( ListOfCodes ) or an empty list, as described in Section 2.
     
  • getIRIrefAuthority(+ IRIref,Authority)

    Obtains the authority component of a given IRI reference. The authority component is a term of the form authority( UserInfo, Host, Port) or an empty list, as described in Section 2.
     
  • getIRIrefPath(+ IRIref,Path)

    Obtains the path component of a given IRI reference. The path component is a term of the form path( AbsRel, Segments ) or an empty list, as described in Section 2.
     
  • getIRIrefQuery(+ IRIref,Query)

    Obtains the query component of a given IRI reference. The query component is a term of the form query( ListOfCodes ) or an empty list, as described in Section 2.
     
  • getIRIrefFragment(+ IRIref,Fragment)

    Obtains the fragment component of a given IRI reference. The fragment component is a term of the form fragment( ListOfCodes ) or an empty list, as described in Section 2.
     

4.3 Construction of IRI references:

The next predicates provide mechanisms to dynamically construct IRI references. The advised method to construct IRIs is to parse them from lists of Unicode character codes. The predicates described in this section should be used with care since no checking of arguments is performed.

  • createEmptyIRIref( IRIref )

    This predicate creates an empty IRI reference.
     
  • createIRIref( + Scheme, + Authority, + Path, + Query, + Fragment, IRIref )

    This predicate creates an IRI reference from the several components of the IRI reference. The input arguments are either empty lists or component terms as described in Section 2 above.

     
  • setIRIrefScheme( + OldIRIref, + Scheme, NewIRIref )

    The predicate setIRIrefScheme/3 replaces the scheme component in the IRI reference term OldIRIref by the list of Unicode character codes in argument Scheme, returning the new IRI reference term in the last argument NewIRIref.
     
  • setIRIrefAuthority( + OldIRIref, + UserInfo, + Host, + Port, NewIRIref )

    The predicate setIRIrefAuthority/5 replaces the authority component in the IRI term OldIRIref by the authority term constructed from the lists of Unicode character codes arguments UserInfo, Host and Port. The new IRI reference term is returned in the last argument NewIRIref.
     
  • setIRIrefPath( + OldIRIref, + AbsRel, + Path, NewIRIref )

    The predicate setIRIrefPath/4 replaces the path component in the IRI term OldIRIref by the path term constructed from the list of segments in argument Path, and the flag AbsRel, which may take the values abs or rel. The new IRI reference term is returned in the last argument NewIRIref.
     
  • setIRIrefQuery( + OldIRIref, + Query, NewIRIref )

    The predicate setIRIrefQuery/3 replaces the query component in the IRI reference term OldIRIref by the list of Unicode character codes in argument Query, returning the new IRI reference term in the last argument NewIRIref.
     
  • setIRIrefFragment( + OldIRIref, + Fragment, NewIRIref )

    The predicate setIRIrefFragment/3 replaces the fragment component in the IRI reference term OldIRIref by the list of Unicode character codes in argument Fragment, returning the new IRI reference term in the last argument NewIRIref.
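
The effect of the construction predicates on the term of Section 2 can be sketched as follows. The sketch_ names are hypothetical, and the representation of the empty IRI reference as five empty lists is an assumption drawn from Section 2 rather than from the module source. Like the predicates described above, the sketch performs no checking of its arguments.

```prolog
% Assumed shape of an empty IRI reference: every component absent.
sketch_empty_iriref(iriref([], [], [], [], [])).

% Replace one component, wrapping the given code lists in the
% corresponding component term; the other components are kept.
sketch_set_scheme(iriref(_, A, P, Q, F), Codes,
                  iriref(scheme(Codes), A, P, Q, F)).

sketch_set_authority(iriref(S, _, P, Q, F), UserInfo, Host, Port,
                     iriref(S, authority(UserInfo, Host, Port), P, Q, F)).

sketch_set_path(iriref(S, A, _, Q, F), AbsRel, Segments,
                iriref(S, A, path(AbsRel, Segments), Q, F)).

sketch_set_fragment(iriref(S, A, P, Q, _), Codes,
                    iriref(S, A, P, Q, fragment(Codes))).
```

Chaining these calls on the empty reference builds a term step by step, in the same way the module's set predicates are meant to be combined.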

4.4 Resolution of IRI references:

The IRI module implements resolution of IRI references according to the algorithms described in RFC 2396 bis. Consequently, empty references are allowed, and abnormal ".." segments of a relative path (those left over once the base path has been exhausted) are removed from the resulting IRI.

  • resolveIRIref( + IRIref, + BaseIRI, ResIRI)

    The first argument of resolveIRIref/3 is an arbitrary IRI reference term, while BaseIRI should be an IRI term, i.e. one with a scheme component. The resolved IRI is returned in the last argument.

    Example:

    | ?- atom2iriref( 'http://www.example.com:8080/a/b/c', BaseIRI ),
         atom2iriref( '../x/y&query#123', RelIRI ),
         resolveIRIref( RelIRI, BaseIRI, ResIRI ),
         iriref2atom( ResIRI, Resolved ).

    BaseIRI = iriref(scheme([104,116,116,112]),
                     authority([],[119,119,119,46,101,120,97,109,112,108,101,46,99,111,109],[56,48,56,48]),
                     path(abs,[segment([97]),segment([98]),segment([99])]),
                     [],
                     []
                    )
    RelIRI  = iriref([],
                     [],
                     path(rel,[segment([46,46]),segment([120]),segment([121,38,113,117,101,114,121])]),
                     [],
                     fragment([49,50,51])
                    )
    ResIRI  = iriref(scheme([104,116,116,112]),
                     authority([],[119,119,119,46,101,120,97,109,112,108,101,46,99,111,109],[56,48,56,48]),
                     path(abs,[segment([97]),segment([120]),segment([121,38,113,117,101,114,121])]),
                     [],
                     fragment([49,50,51])
                    )
    Resolved = http://www.example.com:8080/a/x/y&query#123;
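
The path part of the example above can be reproduced with a small sketch of merging and dot-segment removal on lists of segment/1 terms. This is a simplified illustration, not the module's algorithm: all predicate names are hypothetical, and the trailing-slash handling that the full resolution algorithm applies to a final "." or ".." segment is left out.

```prolog
% sketch_merge_paths(+BaseSegments, +RelSegments, -Merged)
% Drop the last base segment, append the relative segments,
% then remove "." and ".." segments.
sketch_merge_paths(BaseSegs, RelSegs, Merged) :-
    drop_last(BaseSegs, Dir),
    app(Dir, RelSegs, Raw),
    remove_dots(Raw, [], Merged).

% drop_last(+List, -ListWithoutLastElement)
drop_last([], []).
drop_last([_], []) :- !.
drop_last([S|Ss], [S|Ds]) :- drop_last(Ss, Ds).

% remove_dots(+Segments, +ReversedOutputSoFar, -Result)
% [46] is "." and [46,46] is "..".
remove_dots([], Acc, Out) :- rev(Acc, [], Out).
remove_dots([segment([46])|T], Acc, Out) :- !,
    remove_dots(T, Acc, Out).
remove_dots([segment([46,46])|T], Acc, Out) :- !,
    pop(Acc, Acc1),
    remove_dots(T, Acc1, Out).
remove_dots([S|T], Acc, Out) :-
    remove_dots(T, [S|Acc], Out).

% pop/2 silently ignores a ".." that has nothing left to remove.
pop([], []).
pop([_|T], T).

% Local list helpers, so that the sketch needs no library imports.
app([], L, L).
app([H|T], L, [H|R]) :- app(T, L, R).

rev([], A, A).
rev([H|T], A, R) :- rev(T, [H|A], R).
```

With the base segments a, b, c and the relative segments "..", x, y, the merged path is a, x, y, which matches the path of ResIRI in the transcript.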

 

4.5 Conversion and mapping of IRI references

  • atom2iriref( + AtomInUTF8, IRIref).

    This predicate converts an IRI reference, given as an atom whose name is a UTF-8 sequence of octets, to the IRI reference term representation. It fails if the atom is not a syntactically correct IRI reference.
     
  • iriref2atom( + IRIref, AtomInUTF8 ).

    Predicate iriref2atom/2 converts the IRI reference term representation to an atom in UTF-8 encoding.

    Example:

    | ?- atom2iriref( 'mailto:Carlos.Damasio@di.fct.unl.pt', IRIref ),
         iriref2atom(IRIref, Atom ).

    IRIref = iriref(scheme([109,97,105,108,116,111]),
                           [],
                           path(rel,[segment([67,97,114,108,111,115,46,68,97,109,97,115,105,111,
                                              64,100,105,46,102,99,116,46,117,110,108,46,112,116])]
                               ),
                           [],
                           [])
    Atom = mailto:Carlos.Damasio@di.fct.unl.pt;

     
  • iriref2string( + IRIref, StringInUTF8)
    iriref2string( + IRIref, StringInUTF8, RestStringInUTF8 ).

    The iriref2string predicates convert an IRI reference term representation to a list of character codes in UTF-8 encoding. The three-argument version returns an incomplete list, where RestStringInUTF8 is the variable tail.
     
  • iri2uri( + UCSList, URIList )
    iri2uri( + UCSList, URIList, RestURIList )


    The iri2uri predicates convert an IRI reference represented by a list of Unicode character codes to a proper Uniform Resource Identifier, using the algorithm described in Internationalized Resource Identifiers. The three-argument version returns an incomplete list, where RestURIList is the variable tail.

    Example:

    | ?- iri2uri( "mailto://Carlos.Damásio@di.fct.unl.pt", L ),
         atom_codes( URI, L ).

    L = [109,97,105,108,116,111,58,47,47,67,97,114,108,111,115,46,68,97,109,37,67,
         50,37,65,48,115,105,111,64,100,105,46,102,99,116,46,117,110,108,46,112,116]
    URI = mailto://Carlos.Dam%C2%A0sio@di.fct.unl.pt;
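
The core of the mapping, UTF-8 encoding followed by percent-escaping, can be sketched as below. This is a simplification under stated assumptions: every code above 127 is escaped, whereas the Internationalized Resource Identifiers draft is more specific about which characters are converted, and all predicate names are hypothetical.

```prolog
% sketch_iri2uri(+UCSCodes, -URICodes)
% ASCII codes are copied; any other code point is UTF-8 encoded
% and each resulting octet is written as %HH.
sketch_iri2uri([], []).
sketch_iri2uri([C|Cs], Out) :-
    C < 128, !,
    Out = [C|Rest],
    sketch_iri2uri(Cs, Rest).
sketch_iri2uri([C|Cs], Out) :-
    utf8_octets(C, Octets),
    escape_octets(Octets, Out, Rest),
    sketch_iri2uri(Cs, Rest).

% utf8_octets(+CodePoint, -Octets) for code points from 128 upwards.
utf8_octets(C, [B1,B2]) :- C < 2048, !,
    B1 is 192 \/ (C >> 6),
    B2 is 128 \/ (C /\ 63).
utf8_octets(C, [B1,B2,B3]) :- C < 65536, !,
    B1 is 224 \/ (C >> 12),
    B2 is 128 \/ ((C >> 6) /\ 63),
    B3 is 128 \/ (C /\ 63).
utf8_octets(C, [B1,B2,B3,B4]) :-
    B1 is 240 \/ (C >> 18),
    B2 is 128 \/ ((C >> 12) /\ 63),
    B3 is 128 \/ ((C >> 6) /\ 63),
    B4 is 128 \/ (C /\ 63).

% escape_octets(+Octets, -Codes, ?Tail): "%HH" for each octet.
escape_octets([], Rest, Rest).
escape_octets([B|Bs], [37,H,L|T], Rest) :-
    Hi is B >> 4,
    Lo is B /\ 15,
    hex_digit(Hi, H),
    hex_digit(Lo, L),
    escape_octets(Bs, T, Rest).

% hex_digit(+Value, -Code): 0..9 -> "0".."9", 10..15 -> "A".."F".
hex_digit(D, C) :- D < 10, !, C is 48 + D.
hex_digit(D, C) :- C is 55 + D.
```

With this sketch, code 225 (U+00E1, 'á') is written as %C3%A1, while code 160 yields the %C2%A0 that appears in the transcript above.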

     
  • filename2uri( + UCSList, URIList )
    filename2uri( + UCSList, URIList, RestURIList )


    The filename2uri predicates convert an absolute file path, represented by a list of ASCII character codes, to a Uniform Resource Identifier, escaping excluded characters. The three-argument version returns an incomplete list, where RestURIList is the variable tail. This predicate uses specific built-in XSB predicates to detect the underlying operating system in order to recognize path separators ('\' on Windows-based systems).
    On Windows operating systems, the absolute file path must contain the drive letter. On non-Windows operating systems, the path must start with '/'.

    Example (Windows):

    | ?- filename2uri( "C:\My Documents\Jo%A0o", L, [-1] ),
         parseIRIref( L, _IRI ),
         iriref2atom( _IRI, FilePath ).

    L = [102,105,108,101,58,47,47,67,58,47,77,121,37,50,48,68,111,99,117,109,101,110,116,115,47,74,111,37,65,48,111,-1]
    FilePath = file://C:/My%20Documents/Jo%A0o;

    no
    | ?- filename2uri( "C:/My Documents/Jo%A0o", L, [-1] ),
         parseIRIref( L, _IRI ),
         iriref2atom( _IRI, FilePath ).

    L = [102,105,108,101,58,47,47,67,58,47,77,121,37,50,48,68,111,99,117,109,101,110,116,115,47,74,111,37,65,48,111,-1]
    FilePath = file://C:/My%20Documents/Jo%A0o

    Example (Non-Windows):

    | ?- filename2uri( "/My Documents/Jo%A0o", L, [-1] ),
         parseIRIref( L, _IRI ),
         iriref2atom( _IRI, FilePath ).

    L = [102,105,108,101,58,47,77,121,37,50,48,68,111,99,117,109,101,110,116,115,47,74,111,37,65,48,111,-1]
    FilePath = file:/My%20Documents/Jo%A0o;

    no
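
The shape of the results in the examples above can be approximated with the sketch below. It is only an illustration under several assumptions: it decides between the two path styles from the first characters of the path instead of querying the operating system as the module does, it escapes only the space character rather than the full set of excluded characters, and the predicate names are hypothetical.

```prolog
% sketch_filename2uri(+PathCodes, -URICodes)
% A path starting with '/' (code 47) gets the prefix "file:".
% A path starting with a drive letter and ':' (code 58) gets "file://".
sketch_filename2uri([47|Rest], URI) :- !,
    atom_codes('file:', Prefix),
    escape_path([47|Rest], Escaped),
    app(Prefix, Escaped, URI).
sketch_filename2uri([Drive,58|Rest], URI) :-
    atom_codes('file://', Prefix),
    escape_path([Drive,58|Rest], Escaped),
    app(Prefix, Escaped, URI).

% escape_path(+Codes, -EscapedCodes)
% ' ' (32) becomes "%20"; '\' (92) becomes '/' (47); the rest is copied.
escape_path([], []).
escape_path([32|Cs], [37,50,48|Es]) :- !, escape_path(Cs, Es).
escape_path([92|Cs], [47|Es]) :- !, escape_path(Cs, Es).
escape_path([C|Cs], [C|Es]) :- escape_path(Cs, Es).

% Local append, so that the sketch needs no library imports.
app([], L, L).
app([H|T], L, [H|R]) :- app(T, L, R).
```

Applied to the code lists of the paths used in the examples, the sketch gives the same atoms as the transcripts, apart from the -1 terminator that the three-argument calls add through their tail argument.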

 

 

5. Copyright

This is an academic and experimental tool. It cannot be used for commercial purposes without explicit consent of the author.

 

6. Disclaimer

This is an academic and experimental tool. I do not give any guarantee of any form regarding the use of this tool.

 

Last update: October 30th, 2003