<?xml version="1.0" ?>

<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite XML ver. 1//EN"
"teixlite.dtd" []>

<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Migrating TEI DTD extensions to XML</title>
        <author>Tobias Rischer</author>
      </titleStmt>
      <publicationStmt>
        <publisher>published by the TEI; part of MIW03</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>this electronic form is original</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="section">
        <head>Migrating TEI DTD extensions to XML</head>

        <div type="subsection">
          <head>General remarks</head>

          <p>This section is meant for people who have modified the TEI DTD
          and want to migrate these modifications from SGML to XML, i.e., who
          want to use the XML-based P4 DTD with equivalent modifications.  We
          begin with some general remarks, then describe an example DTD
          modification that covers the most important issues, outline a
          recommended migration procedure, and carry out the key step hands-on
          with the example.</p>

          <p>If the elements or content models that the TEI provides don't
          quite meet the requirements of your project, there is an official
          escape route: you can modify the DTD in a number of well-defined ways
          and your documents still remain <q>TEI conformant.</q> This involves
          creating two extension files, setting some parameter entities,
          possibly defining new elements or redefining existing ones, and
          making these modifications known to the parser in the DTD subset at
          the beginning of the document.</p>

          <p>Although the process is a lot simpler than it looks at first
          glance, many people have taken unofficial escape routes, especially
          the users of the TEI Lite DTD, who would have been required to first
          switch to a full TEI DTD before applying local extensions.  It is
          admittedly simpler to just open your local copy of <q>teilite.dtd</q>
          and change a few lines.  Only later will you find out why the TEI
          Guidelines don't advertise this, and one of those moments could be
          the migration of your customized DTD to XML.</p>

	  <p>If you are in this situation now, there are two and a half ways 
	  to proceed:</p>
	  
	  <list> 
	  
	    <item n="1">Redo your modifications the official way for the P4 DTD,
	    using extension files: find out what is changed in your local copy of the
	    TEI Lite DTD, and create proper extension files for the TEI P4 DTD to the
	    same effect.  You will find it is less difficult than you thought, and
	    the next migration will be much easier.  You'll find useful advice for
	    this process in the Guidelines, and perhaps also in the rest of this section.
	    This is the way we would advocate.</item>
	    
	    <item n="2">Redo your modifications as before: find out what you changed
	    in the TEI Lite DTD, and apply the same changes to your local copy of the
	    TEI XLite DTD.  We do <hi>not</hi> advocate this procedure, but it is, of
	    course, a practical possibility.</item>
	    
	    <item n="3">Take a step back: are those modifications really still
	    needed?  Maybe they were made to work around a bug of TEI P3, and
	    this bug has gone?  Or they were intended for a feature that was
	    never used?</item>
	    
	  </list>

          <p>This being said, the rest of this section shall support you in
          migrating DTD extensions made using the official procedures.  So:
          what types of TEI extensions exist and what is involved in migrating
          them from SGML to XML?  The Guidelines distinguish four kinds of
          modification:</p>

          <cit>
            <q>
              <list>
                <item n="1">deletion of elements;</item>
                <item n="2">renaming of elements;</item>
                <item n="3">extension of classes;</item>
                <item n="4">modification of content models or attribute lists.</item>
              </list>
            </q>
            <bibl>TEI Guidelines P4, section 29.1</bibl>
          </cit>

          <p>For practical purposes, the fourth item can be subdivided
          into:</p>

          <list>
            <item n="4a">redefinition of attribute lists;</item>
            <item n="4b">modification of existing content models;</item>
            <item n="4c">definition and integration of new elements (i.e.,
            attaching the new elements to the existing tree).</item>
          </list>

          <p>The first three cases are extremely easy; the items of group 4
          require more detailed attention.  The following is a short list of
          some critical issues involved.  In the following subsections, we will
          work through a fictitious example that covers most of these issues.
          </p>

          <list>

            <item n="1">Case of element and attribute names is significant in
            XML; you have to be consistent now.</item>

            <item n="2">It is likely that some of the modifications in your existing
	    P3 extension files involved copying (and then perhaps modifying) pieces
	    of the TEI DTD files; you should check whether those DTD pieces have
	    changed from P3 to P4.</item>

            <item n="3">Some people have made modifications to work around
            problems in the TEI P3 DTD; if they are fixed in P4, the workaround
            could cause errors (a notorious example seems to be <gi>placeName</gi>).</item>


            <item n="4">The SGML DTD syntax for element declarations requires
            two characters of <q>-</q> or <q>O</q> that indicate whether start
            and end tag are required or can be omitted.  These indicators don't
            exist anymore in XML DTDs and your private DTD snippets need to be
            modified.</item>

            <item n="5">The content model for XML elements is more restricted
            than for SGML elements.  We won't go into fine detail, but the
            following two points deserve attention:

              <list>

                <item n="5a">The only type of character data is PCDATA; you
                cannot define CDATA content to bypass the parser.</item>

                <item n="5b">The inclusion exception syntax does not exist in
                XML DTDs. In SGML, you could specify an element to be legal
                everywhere within element <gi>X</gi> <emph>and its
                descendants</emph> in a single line by using the inclusion
                exception syntax.  This is not possible in XML; you have to
                add the included element to the content models of
                <gi>X</gi> and of all its descendants individually.</item>

              </list>
            </item>
          </list>
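
          <p>As an illustration of items 4 and 5b, the following sketch
          contrasts SGML and XML declarations for a made-up document type.
          The element names <gi>x</gi>, <gi>y</gi> and <gi>remark</gi> are
          hypothetical and not part of the TEI; the XML content models are
          one possible rewriting under that assumption, not the only one.</p>

          <eg>
<![CDATA[
<!-- SGML: minimization indicators, plus an inclusion exception that
     allows "remark" anywhere inside "x" and its descendants          -->
<!ELEMENT x       - -   (y+)        +(remark)    >
<!ELEMENT y       - O   (#PCDATA)                >
<!ELEMENT remark  - -   (#PCDATA)                >

<!-- XML: no indicators and no exceptions; "remark" has to be written
     into every content model in which it may occur                   -->
<!ELEMENT x             (remark*, (y, remark*)+) >
<!ELEMENT y             (#PCDATA | remark)*      >
<!ELEMENT remark        (#PCDATA | remark)*      >
]]>
          </eg>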

        </div>

        <div type="subsection">
          <head>A tutorial example</head>

          <p>In this subsection, we will do some simple TEI DTD modifications
          in SGML.  This will then serve as a tutorial example for the
          migration to XML.  While working on this example, the main problems
          in converting DTD extensions should be covered.  Not everyone will
          need everything treated here, and some needs might not be covered,
          but this should be an easy, hands-on starting point for most
          projects. <note>As an additional benefit, this small tutorial might
          induce people to do their modifications the proper way instead of
          hacking TEI Lite.</note></p>

          <p>Let's assume that five years ago, we wanted a TEI P3 DTD for prose
          that meets the following extra requirements (these requirements are
          tutorial examples only; this is no statement on whether they are
          recommended TEI practice):</p>

          <list>

            <item n="1">the <gi>pb</gi> element shall get an extra attribute
            <q>imageUrl</q> that contains a URL for an image of the page;</item>

            <item n="2">there shall be a new element <gi>ps</gi> for the
            postscriptum of letters, containing normal phrase level
            content;</item>

            <item n="3">the elements <gi>div1</gi> and <gi>div2</gi> shall be renamed
	    to <gi>volume</gi> and <gi>letter</gi> because our source material
	    is a collection of letters organized that way, and we want to keep that
	    structure and make it explicit;</item>

            <item n="4">an element <gi>toDo</gi> shall be available everywhere in
            the text for editorial meta-comments on the ongoing encoding.
            Therefore, the content shall be CDATA to allow easy typing of element
            names and entities that are talked about in these notes.</item>

          </list>

          <p>These requirements can be cast into TEI SGML by creating two
          files, <q>my_sgml.ent</q> and <q>my_sgml.dtd</q> that look as
          follows:</p>

          <eg>
<![CDATA[
<!-- file "my_sgml.ent" -->

    <!-- suppress "pb" so it can be redefined (add attribute "imageUrl") -->
    <!ENTITY % pb        'IGNORE' >

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'   >

    <!-- rename "div1" and "div2" to "volume" and "letter" -->
    <!ENTITY % n.div1    'volume'   >
    <!ENTITY % n.div2    'letter'   >

    <!-- suppress TEI.2 so it can be redefined (inclusion of "toDo") -->
    <!ENTITY % TEI.2     'IGNORE' >

<!-- file "my_sgml.dtd" -->


    <!-- modified copy of "pb" element: "imageUrl" added           -->
    <!ELEMENT %n.pb;     - O       EMPTY                             >
    <!ATTLIST %n.pb;
        id               ID        #IMPLIED
        lang             IDREF     %INHERITED
        rend             CDATA     #IMPLIED
        ed               CDATA     #IMPLIED
        n                CDATA     #IMPLIED
        imageUrl         CDATA     #IMPLIED
        TEIform          CDATA     'pb'                              >

    <!-- new element "ps"                                          -->
    <!ELEMENT PS         - -       (%paraContent)                    >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo       - -       CDATA                             >

    <!-- redefined TEI.2: added "toDo" as inclusion                -->
    <!ELEMENT %n.TEI.2;  - O       (%n.teiHeader;, %n.text;) +(toDo) >
    <!ATTLIST %n.TEI.2;  %a.global;
       TEIform           CDATA     'TEI.2'                           >
]]>
          </eg>

          <p>A sample document <q>godot.sgml</q> using these extensions would
          look like this:</p>

          <eg>
<![CDATA[
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.prose          "INCLUDE"  >
<!ENTITY % TEI.extensions.dtd SYSTEM "my_sgml.dtd">
<!ENTITY % TEI.extensions.ent SYSTEM "my_sgml.ent">

]>
<TEI.2>
... (header omitted) ...
  <text>
    <body>
      <volume n="1">
         <letter n="1">
           <pb n="1" imageUrl="godot.tif">
           <salute>Dearest Doug,</salute>
           <p>I will come really soon now.</p>
           <signed>Godot</signed>
           <ps>PS: <hi>Thanks</hi> for all the fish
             <toDo>is this <hi> really correct tagging?</toDo>
           </ps>
         </letter>
         <toDo>Encode next letter</toDo>
       </volume>
    </body>
  </text>
</TEI.2>
]]>
          </eg>

          <p>This example will be migrated to TEI P4 XML in the next two
          subsections.</p>

        </div>

        <div type="subsection">
        <head>Suggested migration procedure</head>

        <p>Although the following step-by-step list may sound over-protective,
        this approach is recommended to help you keep a clear head while you
        are converting DTD and documents.  You are switching from P3 to P4,
        from SGML to XML DTDs, from SGML to XML documents and from SGML to XML
        parsers at the same time, and it can be difficult to find one&apos;s
        way through these many potential pitfalls.</p>

        <list>

          <item n="1">Pick an interesting test document from your repository
	  and make sure you can parse it as it is (in SGML form) against your
	  current DTD setup (TEI P3 with your extensions). Maybe make up an
	  example like we did above.</item>

          <item n="2">Set up a parallel SGML parser environment to parse
against the TEI P4 DTD.  Before you try to parse your sample file, make
sure it works with <hi>very</hi> simple standard TEI files.</item>

          <item n="3">Now try to parse your sample document against P4
(in SGML mode). Theoretically, this shouldn't be a problem, but chances
are high that you will get errors.  Create an intermediate version
of your SGML extension files that fixes the problems (the most likely
cause is that your extensions work around a bug in the P3 DTD that was
repaired in P4).</item>

          <item n="4">Create new XML extension files based on the SGML
ones. We will do this for our example in the next subsection. If you
want to continue supporting SGML documents, consider making the
extension files <soCalled>dual-use</soCalled> (which shall mean:
compatible with XML and SGML, see below) so you need to maintain only
one set.</item>

          <item n="5">If you have created dual-use extension files, use them
          to parse your SGML document against P4 in SGML mode.  Don't proceed
          until they are correct.</item>

          <item n="6">Make sure that your XML parser is set up properly by
          parsing a minimal TEI XML document without extensions.</item>

          <item n="7">Convert the test document to XML, as described elsewhere
          in this paper. Having a short made-up example makes this step
          easier.  It may be necessary to adapt the tools used so they know
          about additional tags and attributes in your documents (case
          normalization!).</item>

          <item n="8">Try to parse your XML-converted test document.  Errors
          that you see could come from the document conversion or from the
          migrated DTD extensions.  Fix them all.</item>

          <item n="9">When you can parse in XML mode (hooray!) and you have
          dual-use extensions, go back and try the SGML parse again.  Then
          try more documents.  Now you could also consult the Pizza Chef and
          get yourself a flat XML DTD for greater convenience.</item>

        </list>
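
        <p>For steps 2 and 6, a test file along the following lines may be
        enough.  This is only a sketch: the system identifier
        <q>tei2.dtd</q> is the one assumed in the example above and has to
        match your local installation, and the header content is made up.
        For the SGML test in step 2, the XML declaration and the
        <q>TEI.XML</q> line would be dropped.</p>

        <eg>
<![CDATA[
<?xml version="1.0" ?>
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.XML    "INCLUDE"  >
<!ENTITY % TEI.prose  "INCLUDE"  >

]>
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Parser test</title>
      </titleStmt>
      <publicationStmt>
        <p>unpublished test file</p>
      </publicationStmt>
      <sourceDesc>
        <p>no source, made up for testing</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Hello, parser.</p>
    </body>
  </text>
</TEI.2>
]]>
        </eg>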

      </div> <!-- subsection procedure -->

      <div type="subsection">
      <head>Migrating the example DTD</head>

      <p>Of the procedure recommended above, we will now focus on rewriting the
      DTD extensions in XML, with the example DTD modification described
      earlier as a basis.  We will be creating the files <q>my_xml.ent</q> and
      <q>my_xml.dtd</q>.</p>
      
      <p>Before we start out, though, a strategic decision has to be
made: shall we burn our bridges and support only XML in the future?  The
P4 DTD provides mechanisms to parse both XML and SGML, and we can do the
same for our customized DTD if we need to support SGML in parallel
for a while.  It takes a little more thought and effort, but in
return you get the comfort of a safe transition period.</p>
      
      <p>In this document, we shall call these
<soCalled>dual-use</soCalled> extensions, and in our example, we shall
demonstrate both dual-use and pure XML extensions.</p>

      <p>One obvious syntactic difference between SGML and XML DTDs is
the <soCalled>omitted tag minimization parameters</soCalled> that appear
as <q>-</q> and <q>O</q> in SGML element declarations and indicate
whether start and end tags need to be present or not. They are
superfluous and gone in XML, where minimization is not allowed. The TEI
P4 DTD provides and uses parameter entities <q>%om.RO</q> and
<q>%om.RR</q> to be used in their place.  For SGML parsing, they expand
to <q>- O</q> and <q>- -</q> respectively; for XML parsing they expand
to nothing. <q>om.RO</q> is used for elements that require only a start
tag (mostly empty elements); <q>om.RR</q> is used for elements that
require start and end tag (non-empty elements should be defined that
way).  We can make use of this mechanism for our dual-use DTD
extensions.  Another useful parameter entity is <q>%TEI.XML</q>, which
expands to <q>IGNORE</q> for SGML parsing and to <q>INCLUDE</q> for XML
parsing. <note>If you feel that this discussion is too DTD-technical for
you, don't worry.  You probably don't need to understand the background
if you follow the examples.</note></p>
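
      <p>The effect of the two minimization entities can be sketched as
      follows.  The element names <gi>marker</gi> and <gi>remark</gi> are
      hypothetical and serve only as illustration; the second and third
      groups show what each kind of parser would see once the entity
      references are expanded.</p>

      <eg>
<![CDATA[
<!-- as written in a dual-use extension file -->
<!ELEMENT marker   %om.RO;   EMPTY      >
<!ELEMENT remark   %om.RR;   (#PCDATA)  >

<!-- what an SGML parser sees after entity expansion -->
<!ELEMENT marker   - O       EMPTY      >
<!ELEMENT remark   - -       (#PCDATA)  >

<!-- what an XML parser sees after entity expansion -->
<!ELEMENT marker             EMPTY      >
<!ELEMENT remark             (#PCDATA)  >
]]>
      </eg>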
      
      <p>So let's go to work:</p>
      
      <p>If we first check for consistent case of the element and
attribute names in our DTD, we discover that element <gi>ps</gi> was
once written in uppercase and once in lowercase.  We decide that
lowercase shall be the correct spelling.</p>

      <p>Some things are easy: the renaming of <gi>div1</gi> and
<gi>div2</gi> to <gi>volume</gi> and <gi>letter</gi> and the <gi>ps</gi>
tag can remain untouched, except that the <q>- -</q> in the definition
of <gi>ps</gi> needs to be either removed or (if we aim at a dual-use
DTD) replaced by <q>%om.RR;</q>.</p>

      <p>This decision comes up again with the <gi>pb</gi> tag.  In XML, we
      can just write an ATTLIST containing only <q>imageUrl</q> and it will be
      merged with the existing ATTLIST in the TEI DTD files; there is no need
      to suppress and copy the definition of <gi>pb</gi>.</p>

      <p>For continuing support of SGML, we have to suppress and
redefine the element as before.  We find it in the TEI DTD files, copy
the P4 definition and add our extra attribute; compared with our old P3
copy, the element declaration now has <q>%om.RO;</q> in place of
<q>- O</q>.</p>

      <p>The most difficult problem is the <gi>toDo</gi> tag.  For one, the
      content model CDATA needs to become PCDATA.  This means that existing
      documents will most probably break, but there is no choice.  It might be
      a solution to turn all the <gi>toDo</gi> content into CDATA marked
      sections with an automated search and replace as part of the document
      conversion, or to escape the contained markup to entity references using
      a similar procedure.</p>
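
      <p>For the <gi>toDo</gi> element of our sample document, the two
      alternatives would come out roughly as follows; either form is
      acceptable to an XML parser once <gi>toDo</gi> is declared with
      <q>#PCDATA</q> content.</p>

      <eg>
<![CDATA[
<!-- SGML original: CDATA content, markup characters typed literally -->
<toDo>is this <hi> really correct tagging?</toDo>

<!-- XML, alternative 1: CDATA marked section around the content -->
<toDo><![CDATA[is this <hi> really correct tagging?]]]]><![CDATA[></toDo>

<!-- XML, alternative 2: markup characters escaped as entity references -->
<toDo>is this &lt;hi&gt; really correct tagging?</toDo>
]]>
      </eg>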

      <p>Also, the simple way of allowing <gi>toDo</gi> everywhere is no longer
      possible.  This could be a good occasion to check how that element is
      actually used in practice and where it is really needed.  A compromise in
      our example could be to add it to the class <q>Incl</q> that is part of
      every content model within <gi>text</gi>.  Sometimes, a more complex
      redefinition of content models could be necessary.  If that is
your situation, you may want to consult the unofficial TEI document
<ref>ED W 69, chapter 8</ref> for in-depth coverage.</p>

      <p>The first runs with the XML parser result in many warnings
because of redefined parameter entities; this is normal.  Some syntax
correction is required where XML is stricter than SGML: we forgot a
semicolon in a parameter entity reference, and <q>%paraContent;</q> must
not be in parentheses, while the <q>#PCDATA</q> for <gi>toDo</gi> has to
be.  When you don't know how to get rid of an error, it can be useful to
browse the TEI DTD files and compare with your own usage.</p>
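
      <p>The parenthesis rule is easier to see side by side.  The sketch
      below assumes that <q>%paraContent;</q> expands in P4 to a complete
      mixed-content model that brings its own parentheses (the expansion
      shown in the comment is abridged; check the TEI DTD files for the
      actual declaration).  XML accepts <q>#PCDATA</q> only in the outermost
      group of a content model, so wrapping the entity reference in a
      further pair of parentheses is rejected.</p>

      <eg>
<![CDATA[
<!-- SGML extension file: accepted by the SGML parser -->
<!ELEMENT PS   - -   (%paraContent)  >

<!-- assumed, abridged expansion of the paraContent entity in P4:
       (#PCDATA | ... phrase-level elements ...)*                    -->

<!-- XML extension file: lowercase name, semicolon, no extra
     parentheses; a bare #PCDATA keeps its own parentheses           -->
<!ELEMENT ps         %paraContent;   >
<!ELEMENT toDo       (#PCDATA)       >
]]>
      </eg>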

      <p>The cross-check of the dual-use version with the SGML parser
      exposes one small additional problem: the document now uses the
      character entities &amp;lt; and &amp;gt;, which are predefined in XML
      but not in SGML; once discovered, this is easily fixed.  In the
example file, you will find a solution that looks a little complicated
but works flawlessly with SGML and XML.</p>

      <p>The reworked extension files in the XML-only form look like this:</p>

      <eg>
<![CDATA[
<!-- file "my_xml.ent" -->

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'    >

    <!-- rename "div1" and "div2" to "volume" and "letter" -->
    <!ENTITY % n.div1    'volume'   >
    <!ENTITY % n.div2    'letter'   >

    <!-- make new element "toDo" available everywhere in "text" -->
    <!ENTITY % x.Incl    ' toDo |' >

<!-- file "my_xml.dtd" -->

    <!-- additional attribute for "pb" element.                    -->
    <!ATTLIST %n.pb;
        imageUrl         CDATA     #IMPLIED                          >

    <!-- new element "ps"                                          -->
    <!ELEMENT ps                   %paraContent;                     >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo                 (#PCDATA)                         >
]]>
      </eg>

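      <p>In the dual-use form, the DTD extension file comes out a little
      longer (<q>my_dual.ent</q> is the same as <q>my_xml.ent</q> above, plus
      the line <q>&lt;!ENTITY % pb 'IGNORE' &gt;</q> that suppresses the
      standard <gi>pb</gi> so that it can be redefined):</p>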

      <eg>
<![CDATA[
<!-- file "my_dual.dtd" -->

    <!-- modified copy of "pb" element: "imageUrl" added           -->
    <!ELEMENT %n.pb;     %om.RO;   EMPTY                             >
    <!ATTLIST %n.pb;
          %a.global;
          ed             CDATA     #IMPLIED
          imageUrl       CDATA     #IMPLIED
          TEIform        CDATA     'pb'                              >

    <!-- new element "ps"                                          -->
    <!ELEMENT ps         %om.RR;   %paraContent;                     >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo       %om.RR;   (#PCDATA)                         >

    <!-- entity definitions for SGML, XML has them predefined.     -->
    
    <![%TEI.XML;[
      <!ENTITY % TEI.SGML 'IGNORE'>
    ]]]]><![CDATA[>
    <!ENTITY % TEI.SGML 'INCLUDE'>
    
    <![%TEI.SGML;[
       <!ENTITY amp "&#038;" >
       <!ENTITY  lt "&#060;" >
       <!ENTITY  gt "&#062;" >
    ]]]]><![CDATA[>
]]>

      </eg>

      <p>We can easily convert our short test document manually.  The only
      changes are the added XML declaration, the XML-specific parameter
      entity, the names of the extension files, the empty-tag syntax for
      <gi>pb</gi> and the escaping of the content of <gi>toDo</gi>.  These
      steps could serve as models for automated conversion of large
      documents.</p>

      <eg>
<![CDATA[
<?xml version="1.0" ?>
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.XML            "INCLUDE"  >
<!ENTITY % TEI.prose          "INCLUDE"  >
<!ENTITY % TEI.extensions.dtd SYSTEM "my_dual.dtd">
<!ENTITY % TEI.extensions.ent SYSTEM "my_dual.ent">

]>
<TEI.2>
... (header omitted) ...
  <text>
    <body>
      <volume n="1">
        <letter n="1">
          <pb n="1" imageUrl="godot.tif"/>
          <salute>Dearest Doug,</salute>
          <p>I will come really soon now.</p>
          <signed>Godot</signed>
          <ps>PS: <hi>Thanks</hi> for all the fish
            <toDo>is this &lt;hi&gt; really correct tagging?</toDo>
          </ps>
        </letter>
        <toDo>Encode next letter</toDo>
      </volume>
    </body>
  </text>
</TEI.2>
]]>
      </eg>

    </div>
  </div>
</body>
</text>
</TEI.2>

