<?xml version="1.0" ?><!--
Date:         Mon, 16 Dec 2002 10:01:57 +0100
From:         Tobias Rischer <tobias@RISCHER.COM>
Subject:      first version of my section
To:           TEI-MIGR-E@LISTSERV.BROWN.EDU
Content-Type: text/plain; CHARSET=us-ascii

-->

<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite XML ver. 1//EN"
"/home/lou/TEI/web/Software/tei-emacs/xml/dtds/tei/teixlite.dtd" []>

<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Migrating TEI DTD extensions to XML</title>
        <author>Tobias Rischer</author>
      </titleStmt>
      <publicationStmt>
        <publisher>published by the TEI; part of MIW03</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>this electronic form is original</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="section">
        <head>Migrating TEI DTD extensions to XML</head>

        <div type="subsection">
          <head>General remarks</head>

          <p>This section is intended for people who have modified the TEI DTD
          and want to migrate these modifications from SGML to XML, i.e., who
          want to use the XML-based P4 DTD with equivalent modifications.  We
          begin with some general remarks, then describe an example DTD
          modification that covers the most important issues, outline a
          recommended migration procedure, and finally carry out the key step
          of that procedure on the example.</p>

          <p>If the elements or content models that the TEI provides don't
          quite meet the requirements of your project, there is an official
          escape route: you can modify the DTD in a number of well-defined ways
          and your documents still remain <q>TEI conformant.</q> This involves
          creating two extension files, setting some parameter entities,
          possibly defining new elements or redefining existing ones, and
          making these modifications known to the parser in the DTD subset at
          the beginning of the document.</p>

          <p>Although the process is a lot simpler than it looks at first
          glance, many people have taken unofficial escape routes, especially
          the users of the TEI Lite DTD, who would have been required to first
          switch to a full TEI DTD before applying local extensions. It is
          admittedly simpler to just open your local copy of <q>teilite.dtd</q>
          and change a few lines.  Only later will you find out why the TEI
          Guidelines don't advertise this, and one of those moments could be
          the migration of your customized DTD to XML.</p>

          <p>This section supports you in migrating DTD extensions made
          using the official procedures.  If you went your own way when you
          hacked your DTD, there is only one piece of general advice to give:
          at this point you can either take the new XML-based TEI DTDs and
          reapply your hacks, or you can use the occasion to become conformant
          by remaking your modifications the official way.  We recommend the
          latter.  The hassle of just finding the modifications in a patched
          DTD should convince you that the little extra effort needed for
          conformance is worthwhile.</p>

          <p>That said, what types of TEI extensions exist, and what is
          involved in migrating them from SGML to XML?  The Guidelines
          distinguish four kinds of modification:</p>

          <cit>
            <q>
              <list>
                <item n="1">deletion of elements;</item>
                <item n="2">renaming of elements;</item>
                <item n="3">extension of classes;</item>
                <item n="4">modification of content models or attribute lists.</item>
              </list>
            </q>
            <bibl>TEI Guidelines P4, section 29.1</bibl>
          </cit>

          <p>For practical purposes, the fourth item can be subdivided
          into:</p>

          <list>
            <item n="4a">redefinition of attribute lists;</item>
            <item n="4b">modification of existing content models;</item>
            <item n="4c">definition and integration of new elements (i.e.,
            hooking the new elements into the existing tree).</item>
          </list>

          <p>The first three cases are extremely easy; the items of group 4
          require more detailed attention.  The following is a short list of
          the critical issues involved.  In the following subsections, we will
          work through a fictitious example that covers most of these issues.
          </p>

          <list>

            <item n="1">The case of element and attribute names is significant
            in XML, so you now have to be consistent.</item>

            <item n="2">Some modifications require copying and modifying pieces
            of the TEI DTD files; you should check whether those DTD pieces
            have changed from P3 to P4.</item>

            <item n="3">Some people have made modifications to work around
            problems in the TEI P3 DTD; if they are fixed in P4, the workaround
            could cause errors.</item>

            <!-- TODO: example !?  table/figure still isn't fixed. -->

            <item n="4">The SGML DTD syntax for element declarations requires
            two tag omission indicators, each either <q>-</q> or <q>O</q>,
            that state whether the start tag and the end tag are required or
            can be omitted.  These indicators no longer exist in XML DTDs, so
            your private DTD snippets need to be modified.</item>

            <item n="5">The content model for XML elements is more restricted
            than for SGML elements.  We won't go into fine detail, but the
            following two points deserve attention:

              <list>

                <item n="5a">The only type of character data is PCDATA; you
                cannot declare CDATA content to bypass the parser.</item>

                <item n="5b">The inclusion exception syntax does not exist in
                XML DTDs. In SGML, you could specify an element to be legal
                everywhere within element <gi>X</gi> <emph>and its
                descendants</emph> in a single line by using the inclusion
                exception syntax.  This is not possible in XML; you have to
                add the included element to the content models of <gi>X</gi>
                and of its descendants individually.</item>

              </list>
            </item>
          </list>
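
          <p>To make points 4 and 5b more concrete, the following sketch
          contrasts SGML and XML declarations.  The element names
          <gi>chapter</gi>, <gi>title</gi>, <gi>para</gi> and <gi>remark</gi>
          are hypothetical and not part of the TEI DTD; they are assumptions
          made for illustration only.  The XML version is an approximation: it
          does not reproduce every place in which the SGML inclusion would
          have allowed <gi>remark</gi>.</p>

          <eg>
<![CDATA[
<!-- SGML: tag omission indicators, plus an inclusion exception   -->
<!-- that allows "remark" inside "chapter" and its descendants    -->
<!ELEMENT chapter  - -  (title, para+)       +(remark) >
<!ELEMENT title    - O  (#PCDATA)                      >
<!ELEMENT para     - O  (#PCDATA)                      >
<!ELEMENT remark   - -  (#PCDATA)                      >

<!-- XML: no omission indicators and no inclusion exception;      -->
<!-- "remark" is added to each content model individually         -->
<!ELEMENT chapter  (remark*, title, remark*, (para, remark*)+) >
<!ELEMENT title    (#PCDATA | remark)*                 >
<!ELEMENT para     (#PCDATA | remark)*                 >
<!ELEMENT remark   (#PCDATA)                           >
]]>
          </eg>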

        </div>

        <div type="subsection">
          <head>A tutorial example</head>

          <p>In this subsection, we will do some simple TEI DTD modifications
          in SGML.  This will then serve as a tutorial example for the
          migration to XML.  While working on this example, the main problems
          in converting DTD extensions should be covered.  Not everyone will
          need everything treated here, and some needs might not be covered,
          but this should be an easy, hands-on starting point for most
          projects. <note>As an additional benefit, this small tutorial might
          induce people to do their modifications the proper way instead of
          hacking TEI Lite.</note></p>

          <p>Let's assume that five years ago, we wanted a TEI P3 DTD for prose
          that meets the following extra requirements (they are tutorial
          examples only and make no statement on whether they are recommendable
          TEI practice):</p>

          <list>

            <item n="1">the <gi>pb</gi> element shall get an extra attribute
            <q>imageUrl</q> that contains a URL for an image of the page;</item>

            <item n="2">there shall be a new element <gi>ps</gi> for the
            postscriptum of letters, containing normal phrase level
            content;</item>

            <item n="3">the element <gi>p</gi> shall be renamed to <gi>para</gi>
            for some strange reason;</item>

            <item n="4">an element <gi>toDo</gi> shall be available everywhere in
            the text for editorial meta-comments on the ongoing encoding.
            Its content shall be CDATA to allow easy typing of the element
            names and entities that are discussed in these notes.</item>

          </list>

          <p>These requirements can be cast into TEI SGML by creating two
          files, <q>my_sgml.ent</q> and <q>my_sgml.dtd</q>, that look as
          follows:</p>

          <eg>
<![CDATA[
<!-- file "my_sgml.ent" -->

    <!-- suppress "pb" so it can be redefined (add attribute "imageUrl") -->
    <!ENTITY % pb        'IGNORE' >

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'   >

    <!-- rename "p" to "para" -->
    <!ENTITY % n.p       'para'   >

    <!-- suppress TEI.2 so it can be redefined (inclusion of "toDo") -->
    <!ENTITY % TEI.2     'IGNORE' >

<!-- file "my_sgml.dtd" -->


    <!-- modified copy of "pb" element: "imageUrl" added           -->
    <!ELEMENT %n.pb;     - O       EMPTY                             >
    <!ATTLIST %n.pb;
        id               ID        #IMPLIED
        lang             IDREF     %INHERITED
        rend             CDATA     #IMPLIED
        ed               CDATA     #IMPLIED
        n                CDATA     #IMPLIED
        imageUrl         CDATA     #IMPLIED
        TEIform          CDATA     'pb'                              >

    <!-- new element "ps"                                          -->
    <!ELEMENT PS         - -       (%paraContent)                    >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo       - -       CDATA                             >

    <!-- redefined TEI.2: added "toDo" as inclusion                -->
    <!ELEMENT %n.TEI.2;  - O       (%n.teiHeader;, %n.text;) +(toDo) >
    <!ATTLIST %n.TEI.2;  %a.global;
       TEIform           CDATA     'TEI.2'                           >
]]>
          </eg>

          <p>A sample document <q>godot.sgml</q> using these extensions would
          look like this:</p>

          <eg>
<![CDATA[
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.prose          "INCLUDE"  >
<!ENTITY % TEI.extensions.dtd SYSTEM "my_sgml.dtd">
<!ENTITY % TEI.extensions.ent SYSTEM "my_sgml.ent">

]>
<TEI.2>
... (header omitted) ...
  <text>
    <body>
      <div type="letter">
        <pb n="1" imageUrl="godot.tif">
        <salute>Dearest Doug,</salute>
        <para>I will come really soon now.</para>
        <signed>Godot</signed>
        <ps>PS: <hi>Thanks</hi> for all the fish
          <toDo>is this <hi> really correct tagging?</toDo>
        </ps>
      </div>
      <toDo>Encode next letter</toDo>
    </body>
  </text>
</TEI.2>
]]>
          </eg>

          <p>This example will be migrated to TEI P4 XML in the next two
          subsections.</p>

        </div>

        <div type="subsection">
        <head>Suggested migration procedure</head>

        <p>Although the following step-by-step list may sound over-protective,
        this approach is recommended to help you keep a clear head while you
        are converting DTD and documents.  You are switching from P3 to P4,
        from SGML to XML DTDs, from SGML to XML documents and from SGML to XML
        parsers at the same time, and it can be difficult to find one&apos;s
        way through these many potential pitfalls.</p>

        <list>

          <item n="1">Pick an interesting test document from your repository
          and make sure you can parse it as it is (in SGML form).
          Alternatively, make up a short example as we did above.</item>

          <item n="2">Make sure you can parse the same document against P4 (in
          SGML mode). This shouldn't be a problem normally, so it is mostly
          (but not only) a test for your P4 parser environment.</item>

          <item n="3">Create new XML extension files based on the SGML ones.
          We will do this for our example in the next subsection. If you want
          to continue supporting SGML documents, consider making the extension
          files <soCalled>double-use</soCalled> (that is, compatible with both
          XML and SGML) so that you need to maintain only one set.</item>

          <item n="4">If you have created double-use extension files, use them
          to parse your SGML document against P4 in SGML mode.  Don't proceed
          until they are correct.</item>

          <item n="5">Make sure that your XML parser is set up properly by
          parsing a minimal TEI XML document without extensions.</item>

          <item n="6">Convert the test document to XML, as described elsewhere
          in this paper. Having a short made-up example makes this step
          easier.  It may be necessary to adapt the tools used so that they
          know about the additional tags and attributes in your documents
          (case normalization!).</item>

          <item n="7">Try to parse your XML-converted test document.  Errors
          that you see could come from the document conversion or from the
          migrated DTD extensions.  Fix them all.</item>

          <item n="8">When you can parse in XML-mode (hooray!) and you have
          double-use extensions, go back and try the SGML parse again.  Then
          try more documents.  Now you could also consult the Pizza Chef and
          get yourself a flat XML DTD for greater convenience.</item>

        </list>

      </div> <!-- subsection procedure -->

      <div type="subsection">
      <head>Migrating the example DTD</head>

      <p>Of the procedure recommended above, we will now focus on rewriting the
      DTD extensions in XML, with the example DTD modification described
      earlier as a basis.  We will be creating the files <q>my_xml.ent</q> and
      <q>my_xml.dtd</q>.</p>

      <p>If we first check for consistent case of the element and attribute
      names, we discover that element <gi>ps</gi> was once written in uppercase
      and once in lowercase.  We decide that lowercase shall be the correct
      spelling.</p>

      <p>Some things are easy: the entity declarations that rename <gi>p</gi>
      to <gi>para</gi> and add <gi>ps</gi> to its class can remain
      untouched.</p>

      <p>We have to decide now what to do with the <gi>pb</gi> tag.  In XML, we
      can just write an ATTLIST containing only <q>imageUrl</q> and it will be
      merged with the existing ATTLIST in the TEI DTD files; there is no need
      to suppress and copy the definition of <gi>pb</gi>.</p>

      <p>On the other hand, the P4 DTD can be used to parse both XML and SGML,
      and we can achieve the same for our customized DTD if we need to support
      SGML in parallel for a while.  In that case, we need a different
      solution that makes use of the parameter entities <q>om.RO</q> etc.,
      which the P4 DTD introduces so that its declarations can be parsed as
      both SGML and XML.  <q>om.RO</q> is used for elements that require only
      a start tag (mostly empty elements); <q>om.RR</q> is used for elements
      that require both start and end tag (non-empty elements should be
      defined that way).  See the example files for the application of these
      parameter entities.</p>
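
      <p>The idea behind these entities can be sketched as follows.  This is
      an approximation written for this tutorial, not a verbatim copy of the
      P4 DTD; it assumes that the entities are declared conditionally on
      <q>TEI.XML</q> and relies on the rule that the first declaration of an
      entity wins.  Check your copy of the P4 DTD files for the exact
      declarations.</p>

      <eg>
&lt;!-- default is SGML; an XML document sets TEI.XML to 'INCLUDE'  -->
&lt;!ENTITY % TEI.XML  'IGNORE' >

&lt;!-- XML: the omission indicators expand to nothing              -->
&lt;![%TEI.XML;[
&lt;!ENTITY % om.RO    ''       >
&lt;!ENTITY % om.RR    ''       >
]]&gt;

&lt;!-- SGML: ignored if the entities were already declared above   -->
&lt;!ENTITY % om.RO    '- O'    >
&lt;!ENTITY % om.RR    '- -'    >

&lt;!-- a declaration that can be parsed in both modes              -->
&lt;!ELEMENT ps        %om.RR;  %paraContent;                      >
      </eg>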

      <p>The most difficult problem is the <gi>toDo</gi> tag.  For one, the
      content model CDATA needs to become PCDATA.  This means that existing
      documents will most probably break, but there is no choice.  It might be
      a solution to turn all the <gi>toDo</gi> content into CDATA marked
      sections with an automated search and replace as part of the document
      conversion, or to escape the contained markup to entity references using
      a similar procedure.</p>
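
      <p>As a rough illustration, the two options would turn the first
      <gi>toDo</gi> of our sample document into one of the following forms.
      This is only a sketch of the result; the search and replace that
      produces it is not shown.</p>

      <eg>
&lt;!-- option 1: keep the text as typed, inside a CDATA marked section --&gt;
&lt;toDo&gt;&lt;![CDATA[is this &lt;hi&gt; really correct tagging?]]&gt;&lt;/toDo&gt;

&lt;!-- option 2: escape the markup characters as entity references   --&gt;
&lt;toDo&gt;is this &amp;lt;hi&amp;gt; really correct tagging?&lt;/toDo&gt;
      </eg>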

      <p>Also, the simple way of allowing <gi>toDo</gi> everywhere is no longer
      possible.  This could be a good occasion to check how that element is
      actually used in practice and where it is really needed.  A compromise in
      our example could be to add it to the class <q>Incl</q> that is part of
      every content model within <gi>text</gi>.  Sometimes, a more complex
      redefinition of content models could be necessary.</p>

      <p>The first runs with the XML parser result in many warnings because of
      redefined parameter entities; this is normal.  Some syntax corrections
      are required where XML is stricter than SGML: we forgot a semicolon in a
      parameter entity reference, <q>%paraContent;</q> must not be enclosed in
      parentheses (the entity already expands to a complete, parenthesized
      content model), while the <q>#PCDATA</q> for <gi>toDo</gi> has to be.
      When you don't know how to get rid of an error, it can be useful to
      browse the TEI DTD files and compare them with your own usage.</p>

      <p>The cross-check of the double-use version with the SGML parser
      exposes a small additional problem: the document now uses the
      character entities &amp;lt; and &amp;gt;, which are predefined in XML
      but not in SGML; once discovered, this is easily fixed.</p>

      <p>The reworked extension files in the XML-only form look like this:</p>

      <eg>
<![CDATA[
<!-- file "my_xml.ent" -->

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'    >

    <!-- rename "p" to "para" -->
    <!ENTITY % n.p       'para'    >

    <!-- make new element "toDo" available everywhere in "text" -->
    <!ENTITY % x.Incl    ' toDo |' >

<!-- file "my_xml.dtd" -->

    <!-- additional attribute for "pb" element.                    -->
    <!ATTLIST %n.pb;
        imageUrl         CDATA     #IMPLIED                          >

    <!-- new element "ps"                                          -->
    <!ELEMENT ps                   %paraContent;                     >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo                 (#PCDATA)                         >
]]>
      </eg>

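      <p>In the double-use form, the DTD extension file comes out a little
      longer (<q>my_double.ent</q> is the same as <q>my_xml.ent</q>, except
      that <gi>pb</gi> must again be suppressed with <q>IGNORE</q>, as in
      <q>my_sgml.ent</q>, because it is redeclared here):</p>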

      <eg>
<![CDATA[
<!-- file "my_double.dtd" -->

    <!-- modified copy of "pb" element: "imageUrl" added           -->
    <!ELEMENT %n.pb;     %om.RO;   EMPTY                             >
    <!ATTLIST %n.pb;
          %a.global;
          ed             CDATA     #IMPLIED
          imageUrl       CDATA     #IMPLIED
          TEIform        CDATA     'pb'                              >

    <!-- new element "ps"                                          -->
    <!ELEMENT ps         %om.RR;   %paraContent;                     >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo       %om.RR;   (#PCDATA)                         >

    <!-- entity definitions for SGML, XML has them predefined;     -->
    <!-- "amp" and "lt" are double-escaped, as XML requires        -->
    <!ENTITY amp "&#38;#38;" >
    <!ENTITY  lt "&#38;#60;" >
    <!ENTITY  gt "&#62;"     >

]]>
      </eg>
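
      <p>For completeness, here is a sketch of what the matching entity file
      could look like.  It assumes that the original declaration of
      <gi>pb</gi> has to be suppressed, as in <q>my_sgml.ent</q>, because
      <q>my_double.dtd</q> declares the element again; the other lines are
      taken over from <q>my_xml.ent</q>.</p>

      <eg>
<![CDATA[
<!-- file "my_double.ent" (sketch) -->

    <!-- suppress "pb" so it can be redefined in "my_double.dtd" -->
    <!ENTITY % pb        'IGNORE'  >

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'    >

    <!-- rename "p" to "para" -->
    <!ENTITY % n.p       'para'    >

    <!-- make new element "toDo" available everywhere in "text" -->
    <!ENTITY % x.Incl    ' toDo |' >
]]>
      </eg>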

      <p>We can easily convert our short test document manually.  The only
      changes needed are the added XML declaration, the XML-specific
      parameter entity <q>TEI.XML</q>, the references to the new extension
      files, the empty-tag syntax for <gi>pb</gi> and the escaping of the
      content of <gi>toDo</gi>.  These steps could serve as models for the
      automated conversion of large documents.</p>

      <eg>
<![CDATA[
<?xml version="1.0" ?>
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.XML            "INCLUDE"  >
<!ENTITY % TEI.prose          "INCLUDE"  >
<!ENTITY % TEI.extensions.dtd SYSTEM "my_double.dtd">
<!ENTITY % TEI.extensions.ent SYSTEM "my_double.ent">

]>
<TEI.2>
... (header omitted) ...
  <text>
    <body>
      <div type="letter">
        <pb n="1" imageUrl="godot.tif"/>
        <salute>Dearest Doug,</salute>
        <para>I will come really soon now.</para>
        <signed>Godot</signed>
        <ps>PS: <hi>Thanks</hi> for all the fish
          <toDo>is this &lt;hi&gt; really correct tagging?</toDo>
        </ps>
      </div>
      <toDo>Encode next letter</toDo>
    </body>
  </text>
</TEI.2>
]]>
      </eg>

    </div>
  </div>
</body>
</text>
</TEI.2>

