<?xml version="1.0" ?>

<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite XML ver. 1//EN"
"teixlite.dtd" [

      <!ATTLIST xref url CDATA #REQUIRED >
      <!ATTLIST xptr url CDATA #REQUIRED >
]>

<TEI.2>
  <!-- TODO: add version info? This is v3 (June 2003) -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Migrating TEI DTD extensions to XML</title>
        <author>Tobias Rischer</author>
      </titleStmt>
      <publicationStmt>
        <publisher>published by the TEI; part of MIW03</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>this electronic form is original</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
<text>
<body>
<div type="section">
<head>Migrating TEI DTD extensions to XML</head>

<div type="subsection">
<head>General remarks</head>

<p>This section is intended for people who have modified the TEI DTD
and want to migrate these modifications from SGML to XML, i.e., who
want to use the XML-based P4 DTD with equivalent modifications.  We
begin with some general remarks, then describe an example DTD
modification that covers the most important issues, outline a
recommended migration procedure, and carry out the key step hands-on
using the example.</p>

<p>If the elements or content models that the TEI provides don't
quite meet the requirements of your project, there is an official
escape route: you can modify the DTD in a number of well-defined ways
and your documents still remain <q>TEI conformant</q>. This involves
creating two extension files, setting some parameter entities,
possibly defining new elements or redefining existing ones, and
making these modifications known to the parser in the DTD subset at
the beginning of the document.</p>

<p>Although the process is a lot simpler than it looks at first
glance, many people have taken unofficial escape routes, especially
the users of the TEI Lite DTD, who would have been required to first
switch to a full TEI DTD before applying local extensions.  It is
admittedly simpler to just open your local copy of <q>teilite.dtd</q>
and change a few lines.  Only later will you find out why the TEI
Guidelines don't advertise this, and one of those moments could be
the migration of your customized DTD to XML.</p>

<p>If you are in this situation now, there are three ways 
to proceed:</p>

<list type="ordered"> 

<item n="1">Redo your modifications the official way for the P4 DTD,
using extension files: find out what is changed in your local copy of the
TEI Lite DTD, and create proper extension files for the TEI P4 DTD to the
same effect.  You will find it is easier than you thought, and the
next migration will be much easier still.  You'll find useful advice for
this process in the Guidelines, and perhaps also in the rest of this section.
This is the way we would advocate.</item>

<item n="2">Redo your modifications as before: find out what you changed
in the TEI Lite DTD, and apply the same changes to your local copy of the
TEI XLite DTD.  We do <hi>not</hi> advocate this procedure, but it is, of
course, a practical possibility.</item>

<item n="3">Take a step back: are those modifications really still
needed?  Maybe they were made to work around a bug of TEI P3, and
this bug has gone?  Or they were intended for a feature that was
never used?</item>

</list>

<p>That said, the rest of this section is meant to support you in migrating
DTD extensions made using the official procedures.  So: what types of TEI
extensions exist, and what is involved in migrating them from SGML to XML?
The Guidelines distinguish four kinds of modification (see
<xref url="http://www.tei-c.org/P4X/MD.html#MDMD">TEI Guidelines P4, section 29.1</xref>):</p>
<list type="ordered">
<item n="1">deletion of elements;</item>
<item n="2">renaming of elements;</item>
<item n="3">extension of classes;</item>
<item n="4">modification of content models or attribute lists.</item>
</list>
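
<p>As a rough sketch, each of the first three kinds comes down to a
single parameter entity declaration in the entity extension file.  The
choice of <gi>sp</gi>, <gi>p</gi> and <gi>ps</gi> is for illustration
only, and <gi>ps</gi> is assumed to be a new element that is declared
elsewhere:</p>

<eg>
<![CDATA[
    <!-- 1. deletion: suppress the declaration of "sp" -->
    <!ENTITY % sp        'IGNORE'>

    <!-- 2. renaming: "p" becomes "para" -->
    <!ENTITY % n.p       'para'>

    <!-- 3. class extension: add "ps" to the class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'>
]]>
</eg>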

<p>The first three cases are extremely easy; the fourth
requires more detailed attention.  For practical purposes, it can 
be subdivided into:</p>

<list type="ordered">
<item n="4a">redefinition of attribute lists;</item>
<item n="4b">modification of existing content models;</item>
<item n="4c">definition and integration of new elements (i.e.,
attaching the new elements to the existing tree).</item>
</list>

<p>The following is a short list of
some critical issues involved.  In the following subsections, we will
work through a fictitious example that covers most of these issues.
</p>

<list type="ordered">

<item n="1">Element and attribute names are case-sensitive in
XML, so you now have to be consistent.</item>

<item n="2">It is likely that some of the modifications in your existing
P3 extension files involved copying (and then probably modifying) pieces
of the TEI DTD files; you should check whether those DTD pieces have
changed from P3 to P4.</item>

<item n="3">Some people have made modifications to work around
problems in the TEI P3 DTD; if they are fixed in P4, the workaround
could cause errors (a notorious example is <gi>persName</gi>).</item>

<item n="4">The SGML DTD syntax for element declarations requires
two characters of <q>-</q> or <q>O</q> that indicate whether start
and end tag are required or can be omitted.  These indicators don't
exist anymore in XML DTDs and your private DTD snippets need to be
modified.</item>

<item n="5">The content model for XML elements is more restricted
than for SGML elements.  We won't go into fine detail, but the
following two points deserve attention:

<list type="ordered">

<item n="5a">The only type of character data is PCDATA; you
cannot define CDATA content to bypass the parser.</item>

<item n="5b">The inclusion exception syntax does not exist in
XML DTDs. In SGML, you could specify an element to be legal
everywhere within element <gi>X</gi> <emph>and its children</emph>
in a single line by using the inclusion exception
syntax.  This is not possible in XML; you have to add the
element to each of the affected content models individually.</item>

</list>
</item>
</list>
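
<p>To make items 4 and 5b concrete, the following minimal sketch shows
a set of SGML declarations and one possible XML counterpart.  The
element names <gi>x</gi>, <gi>a</gi>, <gi>b</gi> and <gi>mark</gi> are
made up for illustration and do not occur in the TEI DTD:</p>

<eg>
<![CDATA[
<!-- SGML: minimization indicators and an inclusion exception;
     "mark" is legal anywhere inside "x", also within "a" and "b" -->
    <!ELEMENT x      - -   (a | b)*      +(mark) >
    <!ELEMENT a      - -   (#PCDATA)             >
    <!ELEMENT b      - -   (#PCDATA)             >
    <!ELEMENT mark   - O   EMPTY                 >

<!-- XML: no indicators and no inclusion exception; "mark" has to
     be listed in every content model in which it may occur       -->
    <!ELEMENT x            (a | b | mark)*       >
    <!ELEMENT a            (#PCDATA | mark)*     >
    <!ELEMENT b            (#PCDATA | mark)*     >
    <!ELEMENT mark         EMPTY                 >
]]>
</eg>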

</div>

<div type="subsection">
<head>A tutorial example</head>

<p>In this subsection, we will do some simple TEI DTD modifications
in SGML.  This will then serve as a tutorial example for the
migration to XML.  Working through this example should cover the main
problems in converting DTD extensions.  Not everyone will
need everything treated here, and some needs might not be covered,
but this should be an easy, hands-on starting point for most
projects. <note place="foot">As an additional benefit, this small tutorial might
induce people to do their modifications the proper way instead of
hacking TEI Lite.</note></p>

<p>Let's assume that five years ago, we wanted a TEI P3 DTD for prose
that meets the following extra requirements (these requirements are
tutorial examples only; they are no statement on whether this is
recommendable TEI practice):</p>

<list type="ordered">

<item n="1">personal names shall be marked with the <gi>persName</gi> tag
(this requires the TEI extensions for names and dates and a workaround
for a bug in the P3 DTD);</item>

<item n="2">the <gi>pb</gi> element shall get an extra attribute
<q>imageUrl</q> that contains a URL for an image of the page;</item>

<item n="3">there shall be a new element <gi>ps</gi> for the
postscriptum of letters, containing normal phrase level
content;</item>

<item n="4">the elements <gi>div1</gi> and <gi>div2</gi> shall be renamed
to <gi>volume</gi> and <gi>letter</gi> because our source material
is a collection of letters organized that way, and we want to keep that
structure and make it explicit;</item>

<item n="5">an element <gi>toDo</gi> shall be available everywhere in
the text for editorial meta-comments on the ongoing encoding.
Its content shall be CDATA to allow easy typing of the element
names and entities that these notes talk about (in CDATA content,
tags and entity references are not recognized by the SGML parser).</item>

</list>

<p>These requirements can be cast into TEI SGML by creating two
files, <q>my_sgml.ent</q> and <q>my_sgml.dtd</q>, that look as
follows:</p>

<eg>
<![CDATA[
<!-- file "my_sgml.ent" -->

    <!-- fix persName problem in TEI P3 -->
    <!ENTITY % x.data    'persName |'>

    <!-- suppress "pb" so it can be redefined (add attribute "imageUrl") -->
    <!ENTITY % pb        'IGNORE'>

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'>

    <!-- rename "div1" and "div2" to "volume" and "letter" -->
    <!ENTITY % n.div1    'volume'>
    <!ENTITY % n.div2    'letter'>

    <!-- suppress TEI.2 so it can be redefined (inclusion of "toDo") -->
    <!ENTITY % TEI.2     'IGNORE'>

<!-- file "my_sgml.dtd" -->

    <!-- modified copy of "pb" element: "imageUrl" added           -->
    <!ELEMENT %n.pb;     - O       EMPTY                             >
    <!ATTLIST %n.pb;
        id               ID        #IMPLIED
        lang             IDREF     %INHERITED
        rend             CDATA     #IMPLIED
        ed               CDATA     #IMPLIED
        n                CDATA     #IMPLIED
        imageUrl         CDATA     #IMPLIED
        TEIform          CDATA     'pb'                              >

    <!-- new element "ps"                                          -->
    <!ELEMENT PS         - -       (%paraContent)                    >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo       - -       CDATA                             >

    <!-- redefined TEI.2: added "toDo" as inclusion                -->
    <!ELEMENT %n.TEI.2;  - O       (%n.teiHeader;, %n.text;) +(toDo) >
    <!ATTLIST %n.TEI.2;  %a.global;
       TEIform           CDATA     'TEI.2'                           >
]]>
</eg>

<p>A sample document <q>godot.sgml</q> using these extensions would
look like this:</p>

<eg>
<![CDATA[
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.prose          "INCLUDE"  >
<!ENTITY % TEI.names.dates    "INCLUDE"  >
<!ENTITY % TEI.extensions.dtd SYSTEM "my_sgml.dtd">
<!ENTITY % TEI.extensions.ent SYSTEM "my_sgml.ent">

]>
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Letter from Godot</title>
        <author>John Doe</author>
      </titleStmt>
      <publicationStmt>
        <publisher>published by the TEI</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>paper original lost after encoding</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <volume n="1">
        <letter n="1">
          <pb n="1" imageUrl="godot.tiff">
          <salute>Dearest <persName>Doug</persName>,</salute>
          <p>I will come really soon now.</p>
          <signed><persName>Godot</persName></signed>
          <ps>PS: <hi>Thanks</hi> for all the fish
            <toDo>is this <hi> really correct tagging?</toDo>
          </ps>
        </letter>
        <toDo>Encode next letter</toDo>
      </volume>
    </body>
  </text>
</TEI.2>
]]>
</eg>

<p>This example shall be migrated to TEI P4 XML in the next two
subsections.</p>

</div>

<div type="subsection">
<head>Suggested migration procedure</head>

<p>Although the following step-by-step list may sound over-protective,
this approach is recommended to help you keep a clear head while you
are converting DTD and documents.  You are switching from P3 to P4,
from SGML to XML DTDs, from SGML to XML documents and from SGML to XML
parsers at the same time, and it can be difficult to find one&apos;s
way through these many potential pitfalls.</p>

<list type="ordered">

<item n="1">Pick an interesting test document from your repository
and make sure you can parse it as it is (in SGML form) against your
current DTD setup (TEI P3 with your extensions). Maybe make up an
example like we did above.</item>

<item n="2">Set up a parallel SGML parser environment to parse
against the TEI P4 DTD.  Before you try to parse your sample file, make
sure it works with <hi>very</hi> simple standard TEI files.</item>

<item n="3">Now try to parse your sample document against P4
(in SGML mode). Theoretically, this shouldn't be a problem, but
chances are high that you will get errors.  Create an intermediate version
of your SGML extension files that fixes the problems (the cause is most
likely that your extensions work around a bug of the P3 DTD that was
repaired in P4).</item>

<item n="4">Create new XML extension files based on the SGML
ones. We will do this for our example in the next subsection. If you want
to continue supporting SGML documents, consider making the extension
files <soCalled>dual-use</soCalled> (which shall mean: compatible with
XML and SGML, see below) so you need to maintain only one set. If you had
to make changes for the previous step (moving from P3 to P4 SGML), you
have to decide whether the SGML part of your dual-use setup shall be P3
compatible or P4 compatible.  Normally, you should be fine using P4 SGML,
and this is the assumption in the following text.</item>

<item n="5">If you have created dual-use extension files, use them
to parse your SGML document against P4 in SGML mode.  Don't proceed
until they are correct.</item>

<item n="6">Make sure that your XML parser is set up properly by
parsing a minimal TEI XML document without extensions.</item>

<item n="7">Convert the test document to XML, as described elsewhere
in this paper. Having a short made-up example makes this step
easier, as you avoid the extra confusion of errors from a large-scale text
migration.  It may be necessary to adapt the tools used so they know
about additional tags and attributes in your documents (case
normalization!).</item>

<item n="8">Try to parse your XML-converted test document.  Errors
that you see could come from the document conversion or from the
migrated DTD extensions.  Fix them all.</item>

<item n="9">When you can parse in XML-mode (hooray!) and you have
dual-use extensions, go back and try the SGML parse again.  Then
try more documents.  Now you could also consult the Pizza Chef and
get yourself a flat XML DTD for greater convenience.</item>

</list>

</div> <!-- subsection procedure -->

<div type="subsection">
<head>Migrating the example DTD</head>

<p>Of the procedure recommended above, we will now focus on rewriting the
DTD extensions in XML, with the example DTD modification described
earlier as a basis.  We will be creating the files <q>my_xml.ent</q> and
<q>my_xml.dtd</q>.</p>

<p>Before we start out, though, a strategic decision has to be
made: shall we burn the bridges and support only XML in the future?  The
P4 DTD provides mechanisms to parse both XML and SGML, and we can do the
same for our customized DTD if we need to support SGML in parallel
for a while.  It takes a little more thought and effort, but in
return you get the comfort of a safe transition period.  The key to this
is the parameter entity <q>TEI.XML</q> that is defined as <q>INCLUDE</q>
in XML mode, and as <q>IGNORE</q> otherwise.</p>

<p>In this document, we shall call these
<soCalled>dual-use</soCalled> extensions, and in our example, we shall
demonstrate both dual-use and pure XML extensions. <note>The mechanism
of using the <q>TEI.XML</q> parameter entity for dual-use extensions
can also be used to select between character entity sets for SGML and XML;
see the relevant section of this document.</note></p>

<p>One obvious syntactic difference between SGML and XML DTDs is
the <soCalled>omitted tag minimization parameters</soCalled> that appear
as <q>-</q> and <q>O</q> in SGML element declarations and indicate
whether start and end tags need to be present or not. They are
superfluous and gone in XML, where minimization is not allowed. The TEI
P4 DTD provides and uses the parameter entities <q>%om.RO;</q> and
<q>%om.RR;</q> in their place.  For SGML parsing, they expand
to <q>- O</q> and <q>- -</q> respectively; for XML parsing, they expand
to nothing. <q>om.RO</q> is used for elements that require only a start
tag (mostly empty elements); <q>om.RR</q> is used for elements that
require both start and end tag (non-empty elements should be defined that
way).  We can make use of this mechanism for our dual-use DTD
extensions.  Another useful parameter entity is <q>%TEI.XML;</q>, which
expands to <q>IGNORE</q> for SGML parsing and to <q>INCLUDE</q> for XML
parsing. It can be used to create marked sections for entity and element definitions
that will only be seen for SGML or XML respectively. <note place="foot">If you 
feel that this discussion is too DTD-technical for
you, don't worry.  You probably don't need to understand the background
if you follow the examples.</note></p>
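
<p>The interplay of these entities can be sketched roughly as follows.
This is an illustration of the mechanism under the assumption that the
first declaration of a parameter entity is the binding one; it is not a
verbatim copy of the P4 DTD files:</p>

<eg>
<![CDATA[
    <!-- "IGNORE" unless the document has already set "INCLUDE"    -->
    <!ENTITY % TEI.XML   'IGNORE'                                    >

    <!-- seen only when parsing XML: the indicators vanish         -->
    <![%TEI.XML;[
      <!ENTITY % om.RO   ''                                          >
      <!ENTITY % om.RR   ''                                          >
    ]]]]><![CDATA[>

    <!-- fallback for SGML: ignored if already declared above      -->
    <!ENTITY % om.RO     '- O'                                       >
    <!ENTITY % om.RR     '- -'                                       >

    <!-- an extension declaration that both parsers accept         -->
    <!ELEMENT ps         %om.RR;   %paraContent;                     >
]]>
</eg>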

<p>So let's go to work:</p>

<p>We suggested above that before thinking about P4 XML, we should make sure 
that our document
can be parsed with the P4 DTD in SGML mode.  When doing this for our example, we
get a frightening number of errors like <q>nsgmls:my_sgml.dtd:14:45:E: 
content model is ambiguous: when the current token is the 1st occurrence 
of "JOIN", both the 1st and 2nd occurrences of "PERSNAME" are possible</q>.
Two occurrences of <gi>persName</gi>?  It turns out that the workaround
that was necessary to use <gi>persName</gi> with the P3 DTD is no longer
needed and causes trouble instead.  So we can remove it from the working
copy of our modification file <q>my_sgml.ent</q>. <note place="foot">A similar
problem would have occurred if we had used <q>x.globincl</q> to implement
our <gi>toDo</gi> element.  Since the inclusion mechanism is gone, the 
<q>globincl</q> class is gone as well. Instead, TEI P4 has a new class 
<q>Incl</q> and elements that shall be available everywhere within <gi>text</gi>
need to be added to <q>x.Incl</q>.</note></p>

<p>Now let's move towards XML: if we first check for consistent case of the element and
attribute names in our DTD, we discover that element <gi>ps</gi> was
once written in uppercase and once in lowercase.  We decide that
lowercase shall be the correct spelling.</p>

<p>Some things are easy: the renaming of <gi>div1</gi> and
<gi>div2</gi> to <gi>volume</gi> and <gi>letter</gi> and the <gi>ps</gi>
tag can remain untouched, except that the <q>- -</q> in the definition
of <gi>ps</gi> needs to be either removed or (if we aim at a dual-use
DTD) replaced by <q>%om.RR;</q>.</p>

<p>The dual-use decision comes up again with the <gi>pb</gi> tag.  In XML, we
can just write an ATTLIST containing only <q>imageUrl</q> and it will be
merged with the existing ATTLIST in the TEI DTD files; there is no need
to suppress and copy the definition of <gi>pb</gi>.</p>

<p>For continuing support of SGML, we have to suppress and
redefine the element as before.  We find it in the TEI DTD files (teicore2.dtd),
copy the P4 definition and add our extra attribute; compared with our old
SGML file, the <q>- O</q> in the element definition is now replaced by
<q>%om.RO;</q>.</p>

<p>The most difficult problem is the <gi>toDo</gi> tag.  For one, the
content model CDATA needs to become PCDATA.  This means that existing
documents will most probably break, but there is no choice.  It might be
a solution to turn all the <gi>toDo</gi> content into CDATA marked
sections with an automated search and replace as part of the document
conversion, or to escape the contained markup to entity references using
a similar procedure.</p>
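
<p>For a single <gi>toDo</gi> element from our sample document, the two
options might look as follows.  The first form is the SGML one, where
the CDATA content model keeps the parser from reading
<q>&lt;hi&gt;</q> as a tag:</p>

<eg>
<![CDATA[
<!-- SGML, content model CDATA -->
<toDo>is this <hi> really correct tagging?</toDo>

<!-- XML, option 1: wrap the content in a CDATA marked section -->
<toDo><![CDATA[is this <hi> really correct tagging?]]]]><![CDATA[></toDo>

<!-- XML, option 2: escape the markup characters -->
<toDo>is this &lt;hi&gt; really correct tagging?</toDo>
]]>
</eg>

<p>With the second option, literal ampersands in the old content have
to be escaped as well.</p>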

<p>Also, the simple way of allowing <gi>toDo</gi> everywhere is no longer
possible.  This could be a good occasion to check how that element is
actually used in practice and where it is really needed.  A compromise in
our example could be to add it to the class <q>Incl</q> that is part of
every content model within <gi>text</gi>.  Sometimes, a more complex
redefinition of content models could be necessary.  If that is
your situation, you may want to consult the unofficial TEI document
<xref url="http://www.tei-c.org/Vault/ED/edw69.htm">ED W 69, chapter 8</xref> 
for in-depth coverage.</p>

<p>The first runs with the XML parser result in many warnings
because of redefined parameter entities; this is normal.  Some syntax
correction is required where XML is stricter than SGML: we forgot a
semicolon in a parameter entity reference, and <q>%paraContent;</q> must not
be in parentheses, while the <q>#PCDATA</q> for <gi>toDo</gi> has to be.
When you don't know how to get rid of an error, it can be useful to
browse the TEI DTD files and compare with your own usage.</p>

<p>The cross-check of the dual-use version with the SGML parser
exposes a small additional problem: the document now uses the
character entities &amp;lt; and &amp;gt;, which are predefined in XML
but not in SGML.  Once discovered, this is easily fixed.  In the
dual-use extension file, you will find a solution that looks a little
complicated but works flawlessly with SGML and XML.</p>

<p>The reworked extension files in the XML-only form look like this:</p>

<eg>
<![CDATA[
<!-- file "my_xml.ent" -->

    <!-- persName fix removed for TEI P4 -->

    <!-- add new element "ps" to class for div-bottom -->
    <!ENTITY % x.divbot  'ps |'    >

    <!-- rename "div1" and "div2" to "volume" and "letter" -->
    <!ENTITY % n.div1    'volume'   >
    <!ENTITY % n.div2    'letter'   >

    <!-- make new element "toDo" available everywhere in "text" -->
    <!ENTITY % x.Incl    ' toDo |' >

<!-- file "my_xml.dtd" -->

    <!-- additional attribute for "pb" element.                    -->
    <!ATTLIST %n.pb;
        imageUrl         CDATA     #IMPLIED                          >

    <!-- new element "ps"                                          -->
    <!ELEMENT ps                   %paraContent;                     >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo                 (#PCDATA)                         >
]]>
</eg>

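<p>In the dual-use form, the DTD extension file comes out a little longer 
(<q>my_dual.ent</q> is identical to <q>my_xml.ent</q> above, except that it
also has to suppress <gi>pb</gi> with
<q>&lt;!ENTITY % pb 'IGNORE'&gt;</q> so that the element can be redefined):</p>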

<eg>
<![CDATA[
<!-- file "my_dual.dtd" -->

    <!-- modified copy of "pb" element: "imageUrl" added           -->
    <!ELEMENT %n.pb;     %om.RO;   EMPTY                             >
    <!ATTLIST %n.pb;
          %a.global;
          ed             CDATA     #IMPLIED
          imageUrl       CDATA     #IMPLIED
          TEIform        CDATA     'pb'                              >

    <!-- new element "ps"                                          -->
    <!ELEMENT ps         %om.RR;   %paraContent;                     >
    <!ATTLIST ps         %a.global;                                  >

    <!-- new element "toDo"                                        -->
    <!ELEMENT toDo       %om.RR;   (#PCDATA)                         >

    <!-- entity definitions for SGML, XML has them predefined.     -->
    
    <![%TEI.XML;[
      <!ENTITY % TEI.SGML 'IGNORE'>
    ]]]]><![CDATA[>
    <!ENTITY % TEI.SGML 'INCLUDE'>
    
    <![%TEI.SGML;[
       <!ENTITY amp "&#038;" >
       <!ENTITY  lt "&#060;" >
       <!ENTITY  gt "&#062;" >
    ]]]]><![CDATA[>
]]>

</eg>

<p>We can easily convert our short test document manually.  All that
needs to change is the initial XML declaration, the XML-specific
parameter entity, the empty-tag syntax for <gi>pb</gi> and the escaping
of the content of <gi>toDo</gi>.  These steps could serve as models for
automated conversion of large documents.</p>

<eg>
<![CDATA[
<?xml version="1.0" ?>
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [

<!ENTITY % TEI.XML            "INCLUDE"  >
<!ENTITY % TEI.prose          "INCLUDE"  >
<!ENTITY % TEI.names.dates    "INCLUDE"  >
<!ENTITY % TEI.extensions.dtd SYSTEM "my_dual.dtd">
<!ENTITY % TEI.extensions.ent SYSTEM "my_dual.ent">

]>
... (header omitted) ...
  <text>
    <body>
      <volume n="1">
        <letter n="1">
          <pb n="1" imageUrl="godot.tiff"/>
          <salute>Dearest <persName>Doug</persName>,</salute>
          <p>I will come really soon now.</p>
          <signed><persName>Godot</persName></signed>
          <ps>PS: <hi>Thanks</hi> for all the fish
            <toDo>is this &lt;hi&gt; really correct tagging?</toDo>
          </ps>
        </letter>
        <toDo>Encode next letter</toDo>
      </volume>  
    </body>
  </text>
</TEI.2>
]]>
</eg>

</div>
</div>
</body>
</text>
</TEI.2>

