User:Jhellingman/Philippine Works in Progress/Wolff CED

Introduction

John U. Wolff spent about 10 years producing his Cebuano-English dictionary. The result is one of the best dictionaries ever made, not just for Cebuano, but for any Philippine language. Starting from scratch, Wolff and his team collected samples of language usage and compiled a dictionary with a wealth of information: not just an English gloss that translates the head word, but also the affixes allowed for each word and fully translated sample sentences. All this combined makes it a great resource for people trying to learn Cebuano. (Almost all other Cebuano dictionaries are made to serve a Cebuano-speaking public learning English.)

The publisher, the Southeast Asia Program of Cornell University, has dedicated this important work to the public domain, and the author, John U. Wolff, has offered his help in digitizing this work and making it available on-line, on the condition that we maintain the work's integrity.

The work is available in 12 projects of 200 single-column pages each (corresponding to roughly 100 pages of the original dictionary).

The previews are of the processed parts on DP, in various states of completion. Note that these previews are made without human intervention using scripts. The HTML shows the text rendered for human reading; the XML is tagged to identify what each part of the text represents. As this is done automatically, based on textual hints, this is not always accurate, but can be improved manually after completion of all volumes.

Part 1 proj_submit_pgposted HTML | XML | XML with formatting
Part 2 project_delete HTML | XML
Part 3 project_delete HTML | XML
Part 4 project_delete HTML | XML
Part 5 project_delete HTML | XML
Part 6 project_delete HTML | XML
Part 7 project_delete HTML | XML
Part 8 project_delete HTML | XML
Part 9 project_delete HTML | XML
Part 10 project_delete HTML | XML
Part 11 project_delete HTML | XML
Part 12 project_delete HTML | XML

All parts in a searchable database. Note that this is work in progress, and the interface may still have some bugs.

All work-in-progress files and related processing and presentation tools are maintained using Google Code, in the phildict project.

All parts that completed F1 combined and corrected:

Note that the author of this dictionary has sometimes thrown in some funny examples (see for example the entry for bákud or dígù, and imagine the scenes unfolding), and does not hesitate to use American idiom at times. So you can expect to learn some English as well. Furthermore, unlike older dictionaries, this dictionary is not censored. All vulgar words and expressions the author could collect have found their place, including examples of reversing slang.

Good Words | Bad Words

Proofreading Instructions

These instructions need to be followed during the proofing rounds.

Orthography. The orthography of this book mostly follows the (de-facto) standard orthography of Cebuano, except for two important deviations:

The letters e and o are not used, as they are not used distinctively. The only vowels found in Cebuano words in this dictionary are thus a, i, and u.

A system of diacritics is used to indicate stress, vowel length, and syllable-final glottal stop.

The following diacritics are used:

acute: to indicate the stress on a word. á, í, ú
grave: to indicate a following glottal stop. à, ì, ù
wedge: to indicate stress on a short ultimate syllable. (Type as [va], [vi], [vu])
macron and acute: to indicate a long syllable, combined with a potential stress-shift. (Type as [=á], [=í], [=ú])

We keep the orthography exactly as in the source.

The italic font used in this book unfortunately has a b and h that are easily confused. This needs special attention.

Affixes. Cebuano uses a complex system of affixes that are placed before, after, around, or inside root-words to form new words. This dictionary is organized by root, listing possible affixes under the root entry. Those affixes are given as patterns, as follows:

-x: x is placed after the root.
x-: x is placed before the root.
-x-: x is placed after the first consonant of the root; when the root starts with a vowel, it is placed before the root. For example, -in- with the entry balun produces binalun.
x-y: x is placed before the root, and y after the root.

Since I will use these patterns to generate the full forms for an index, it is important that these patterns are proofed exactly as they are.
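As an illustration of how the four pattern shapes above could be expanded into full forms, here is a minimal Python sketch. It is not the project's actual index generator; the function name `apply_pattern` is hypothetical. It only concatenates: stress-shift arrows are stripped rather than applied, and sound changes (such as the nasal assimilation implied by paN-), roots starting with the digraph ng, and any glottal-stop hyphen that may be needed when an affix meets a vowel are all ignored.

```python
import unicodedata


def is_vowel(ch):
    """True for a, i, u with or without a diacritic."""
    return unicodedata.normalize("NFD", ch)[0].lower() in "aiu"


def apply_pattern(pattern, root):
    """Combine an affix pattern (-x, x-, -x-, x-y) with a root.

    Assumption of this sketch: stress-shift arrows are dropped, not applied.
    """
    p = pattern.replace("(->)", "").replace("(<-)", "")
    if len(p) > 2 and p.startswith("-") and p.endswith("-"):
        infix = p[1:-1]
        if is_vowel(root[0]):
            # Vowel-initial root: the infix goes before the root.
            return infix + root
        # Otherwise it goes after the first consonant.
        return root[0] + infix + root[1:]
    if p.startswith("-") and len(p) > 1:
        return root + p[1:]            # -x: suffix
    if p.endswith("-") and len(p) > 1:
        return p[:-1] + root           # x-: prefix
    if "-" in p.strip("-"):
        prefix, suffix = p.split("-", 1)
        return prefix + root + suffix  # x-y: circumfix
    raise ValueError("not an affix pattern: " + pattern)


print(apply_pattern("-in-", "balun"))   # binalun
print(apply_pattern("-an(->)", "dabuk"))  # dabukan
```

The second call shows why exact proofing matters: the arrow is part of the pattern, and a generator has to recognize it before it can attach the suffix.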

Verb patterns. The dictionary includes codes in square brackets that indicate how a verb can be used. Typically, these codes are a letter followed by numbers. Often, two groups are shown, separated by a semicolon, the first with mostly capital letters, and the second with lower-case letters. Examples: [A12; b3], [A3; b(1)], etc. The 1 (one) in those groups is often misread as l (letter el); please look carefully and correct it. Please insert a space after the semicolon if it is not present.

Note that we will use these patterns to generate the inflected forms of verbs to be used in the on-line index to the dictionary.
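The two verb-code corrections described above (letter el for the digit one, and the missing space after the semicolon) can be sketched in Python as follows. This is only an illustration, not a tool used by the project. The name `fix_verb_codes` and the assumed code shape (an upper-case group, optionally followed by a semicolon and a lower-case group) are this sketch's own; brackets that do not fit that shape, such as [noun], are left untouched.

```python
import re

# Any bracketed span without nested brackets.
BRACKET_RE = re.compile(r"\[([^\[\]]*)\]")

# Assumed shape of a verb code: capitals with digits/parentheses,
# optionally "; " and lower-case letters with digits/parentheses.
CODE_SHAPE_RE = re.compile(
    r"(?:[A-Z]+[0-9()]*)+(?:; (?:[a-z]+[0-9()]*)+)?"
)


def fix_verb_codes(text):
    """Repair l-for-1 and spacing inside bracketed verb codes."""

    def repair(match):
        inner = match.group(1).replace("l", "1")
        inner = re.sub(r";\s*", "; ", inner)
        if CODE_SHAPE_RE.fullmatch(inner):
            return "[" + inner + "]"
        return match.group(0)  # not a verb code: keep as typed

    return BRACKET_RE.sub(repair, text)


print(fix_verb_codes("v [Al2;b3] do s.t."))
print(fix_verb_codes("-- nga [noun] a diehard"))
```

A check like this can only suggest corrections; a human still has to compare the result with the page image.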

Special Symbols. The dictionary includes arrows to indicate stress-shifts that are applied during affixation or may change the meaning of a root-word.

Left arrow: stress shifted towards beginning of word. (Always in parentheses, type as (<-))
Right arrow: stress shifted towards end of word. (Always in parentheses, type as (->))
Dagger: to indicate additional information is available in the supplement (Type as [+])
em-dash: when used to indicate the occurrence of the head word in a phrase. (Type as --, but do not close up the spaces)

In the preface, the following symbols also occur:

Phonetic symbol for glottal stop. (Type as [?], note that this symbol sometimes occurs in the text within brackets, which should be retained as well)
Phonetic symbol for ng sound (n with tail on right side). (Type as [ng])
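For those curious how the typed markup might later be turned back into real characters, here is a small Python sketch. The choice of Unicode code points is an assumption of this sketch, not a decision recorded for the project, and `decode_markup` is a hypothetical name. The double hyphen for the em-dash is deliberately left alone, since it needs context to convert safely.

```python
import unicodedata

# Typed markup -> character. The Unicode targets are assumptions.
MARKUP = {
    "(<-)": "(\u2190)",  # left arrow, parentheses kept
    "(->)": "(\u2192)",  # right arrow, parentheses kept
    "[+]": "\u2020",     # dagger
    "[?]": "\u0294",     # glottal stop symbol
    "[ng]": "\u014b",    # n with tail (eng)
}
for vowel in "aiu":
    acute = unicodedata.normalize("NFC", vowel + "\u0301")
    # Wedge (caron) on the vowel, typed as [va], [vi], [vu].
    MARKUP["[v" + vowel + "]"] = unicodedata.normalize(
        "NFC", vowel + "\u030c")
    # Macron plus acute, typed as [=á], [=í], [=ú].
    MARKUP["[=" + acute + "]"] = unicodedata.normalize(
        "NFC", vowel + "\u0304\u0301")


def decode_markup(text):
    """Replace every typed markup sequence with its character."""
    for typed, char in MARKUP.items():
        text = text.replace(typed, char)
    return text


print(decode_markup("-an(->) ... delivery. [+]"))
```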

Hyphenation. I have purposely not dehyphenated the text file. End-of-line hyphenation needs to be undone. For Cebuano, this can be easily done in most cases, when taking into consideration the following rules.

if the end-of-line hyphen occurs between two vowels, it can be dropped. (vowels are always pronounced separately)
if the end-of-line hyphen occurs between two consonants, it can be dropped (except between n and g).
if the end-of-line hyphen occurs after a consonant and the first letter on the next line is a vowel, it should stay. (In this case, the hyphen indicates a syllable-initial glottal stop.)
if the end-of-line hyphen occurs after a vowel and the first letter on the next line is a consonant, it can be dropped.

These rules only apply to Cebuano words. English words should be dehyphenated following the normal rules for English. Use -* when in doubt.

Note that a mid-word hyphen in Cebuano indicates a syllable starting with a glottal stop, that is, tan-aw, to see, is pronounced /tan[?]aw/. Such hyphens should not be removed.
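The four rules above can be written down as a short Python sketch, which may help to see how they interact. It is an illustration only: the name `join_line_break` is hypothetical, the n/g case is marked with -* on the assumption that this is what "when in doubt" calls for, and typed markup such as [va] at the break is not handled.

```python
import unicodedata


def is_vowel(ch):
    """True for a, i, u with or without a diacritic."""
    return unicodedata.normalize("NFD", ch)[0].lower() in "aiu"


def join_line_break(head, tail):
    """Rejoin a Cebuano word split at the end of a line.

    head is the part before the end-of-line hyphen (hyphen removed),
    tail is the part at the start of the next line.
    """
    last, first = head[-1], tail[0]
    if last.lower() == "n" and first.lower() == "g":
        # Exception to the consonant rule: leave it for a human.
        return head + "-*" + tail
    if not is_vowel(last) and is_vowel(first):
        # Consonant + vowel: the hyphen marks a glottal stop and stays.
        return head + "-" + tail
    # Vowel + vowel, consonant + consonant, vowel + consonant: drop it.
    return head + tail


print(join_line_break("tan", "aw"))    # tan-aw
print(join_line_break("ba", "lay"))    # balay
```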

Corrections. If you find anything that may need correction, please use the following pattern to indicate that: original[**typo:suggested correction].

Things to look at.

New entries start with a lowercase letter. Also, when the entry continues for a different part of speech, the POS code is a lower case letter. In both those cases, the OCR has often changed the preceding period into a comma. Please change it back to a period.
The number one (1) is often mixed up with the letter ell (l) or the capital letter I. Please double check. In the verb classification codes between brackets, it is always a number one.
The period after the letter t, in particular in the abbreviation s.t. (something) is often missing.
The letters h and b, especially in italics, are easily confused.
The accent above the i is often hard to read. Since the grave accent only appears at the end of words (except in reduplicated words, such as hulìhúlì), mid-word you either have an i, or an i with acute accent, í. It is never a circumflex, î, even though the blot might look like it.

No need to read beyond this point, if you are just helping in the proofing rounds.

Formatting Instructions

These instructions need to be followed during the formatting rounds.

Proper formatting of this text will be very important to be able to parse the text and turn it into an easy-to-use linguistic resource.

Since most entries are well-structured, and the structure is indicated with typographic features, it should not be too difficult to obtain the structure from the formatting. It is important though that the formatting is correct. After the F-rounds, I will use some specifically tuned programs to add semantic tags to it, based on the typography of this dictionary. After fine-tuning, this should be able to get at least 90% of the structure of this dictionary explicitly coded.

Typically, a head word (root entry) is followed by a number of meanings, sorted by part-of-speech and then by sense number, and followed by one or more examples. After that comes a range of sub-entries (derived words, often only indicated by an affixation pattern), each with the same structure.

Headwords. We rely on headwords being bold to recognize them. This includes patterns and phrases used in sub-entries. For this reason, the stress-shift symbols (<-) and (->), and the dash representing the headword in a sub-entry, should be within the bold markup.

Homonym Numbers. We rely on the homonym numbers being subscript to recognize them. The subscript formatting should have already been added during the P* rounds. Note that these numbers are to be included in the bold formatting of the headword.

Part-of-Speech. We rely on the PoS abbreviations being in italic to recognize them. Please wrap them in their own set of italic tags, even if they are preceded or followed by italics.

Sense Numbers. We rely on the sense numbers being bold to recognize them. Please wrap them in their own set of bold tags, even if they are preceded or followed by bold.

Cross References. All cross references are small-caps; we just rely on the typography to identify them (that is, an equals sign followed by a phrase in small caps). Note that although they are bold small-caps, it is sufficient to tag them as small-caps only, saving some clicks.

Binomial Scientific Names. All binomial scientific names are in bold italic. We rely on this formatting to recognize them as such.

English Equivalents. One feature that would make the dictionary considerably more useful would be the identification of direct translations or equivalents of the head word. These typically come quite early on in a meaning. The idea is that if you use the dictionary in reverse (that is, looking for a Cebuano equivalent of an English word), you want to find entries where that English word is given as an equivalent, not all entries where an English word just happens to be present in an example.

All English equivalents should be preceded by an at-sign, for example: @yellow. If an equivalent contains more than one word, group them with braces, as follows: @{wine bottle}

Note that this requires some editorial judgment, but no knowledge of Cebuano will be required. Prefer adding @ to single words, as they appear, rather than phrases, unless such a phrase as a whole is an appropriate English equivalent. Do not include the word to with verbs, just place an @ on the verb. If you feel unsure, then better not place an at-sign.
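To show how the at-sign markup could be picked up again after formatting, here is a minimal Python sketch. It is not the project's actual extraction code; `extract_equivalents` is a hypothetical name, and the set of characters allowed in a single word (letters, apostrophe, hyphen) is an assumption.

```python
import re

# @{several words} or @word; trailing punctuation is not part of the word.
EQUIV_RE = re.compile(r"@(?:\{([^{}]+)\}|([A-Za-z][A-Za-z'-]*))")


def extract_equivalents(text):
    """Return the English equivalents marked with an at-sign."""
    return [m.group(1) or m.group(2) for m in EQUIV_RE.finditer(text)]


print(extract_equivalents("make a @fire. @{wine bottle} or @yellow"))
# ['fire', 'wine bottle', 'yellow']
```

Because the braces delimit a phrase explicitly, a reverse index built this way never has to guess where a multi-word equivalent ends.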

Sample Entries

dábuk, page 187

dábuk₁ v [A; a] 1 make a fire. Pagdábuk dihà kay magdigámu ta, Make a fire because we're going to fix dinner. 2 fumigate an area. Dabúkan ta ang mangga arun mudaghan ang búnga, Let's subject the mango tree to smoke so that there will be lots of fruit. (->) n 1 fire in an open place. 2 place where an open fire is built. Duul ra sa balay ang dabuk (dabukan), They built the fire too close to the house. -an(->) = dabuk, 2.

dábuk₂ v [A; a] crush by pounding. Dabúka ang mani pára sa kaykay, Pound the peanuts for the cookies. (->) n crushed to fine bits, crumbled. Dabuk sa pán, Bread crumbs.

<b>dábuk_1</b> <i>v</i> [A; a] <b>1</b> make a @fire. <i>Pagdábuk dihà 
kay magdigámu ta,</i> Make a fire because
we're going to fix dinner. <b>2</b> @fumigate an area.
<i>Dabúkan ta ang mangga arun mudaghan ang 
búnga,</i> Let's subject the mango tree to
smoke so that there will be lots of fruit. <b>(->)</b>
<i>n</i> <b>1</b> @fire in an open place. <b>2</b> place where an
open fire is built. <i>Duul ra sa balay ang dabuk
(dabukan),</i> They built the fire too close 
to the house. <b>-an(->)</b> = <sc>DABUK</sc>, <i>2.</i>

<b>dábuk_2</b> <i>v</i> [A; a] @crush by @pounding. <i>Dabúka
ang mani pára sa kaykay,</i> Pound the peanuts
for the cookies. <b>(->)</b> <i>n</i> @crushed to fine
bits, @crumbled. <i>Dabuk sa pán,</i> Bread crumbs.

These are actually two entries, to demonstrate the use of the subscript numbers. Note how the Cebuano in the sample sentences ends with a comma, but the following English translation starts with a capital letter. This need not be marked or changed. Actually, I will use this feature to extract sample sentences.

Please note the following:

The pattern (->) (indicating a stress change, that is, dábuk becomes dabuk, default stress on the ultimate syllable not being indicated) is bold, including the parentheses.
The parentheses in the example are italic.
Punctuation follows the style of the preceding text, so commas, periods, etc. go inside the bold or italic mark-up. The same is true for the arrows in patterns, such as (->), which should be part of the bold tagging.
The cross reference is bold small-caps, but it is enough to mark just the small-caps and omit the bold. This should NOT be converted to lowercase. (Note that the cross-reference is internal to the entry, and actually points to the second meaning. Expanded, it says dabukan = dabuk, 2.)
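The convention mentioned for the sample sentences (italic Cebuano ending in a comma, followed by an English translation starting with a capital letter) lends itself to a simple pattern match. The Python sketch below is only an illustration of that idea, not the extraction code that will actually be used. `extract_examples` is a hypothetical name, and the translation is assumed to run up to the next tag or opening square bracket.

```python
import re

# Italic text ending in a comma, then text starting with a capital.
EXAMPLE_RE = re.compile(r"<i>([^<]*,)</i>\s*([A-Z][^<\[]*)")


def squeeze(text):
    """Collapse line breaks and runs of spaces into single spaces."""
    return " ".join(text.split())


def extract_examples(entry):
    """Return (Cebuano sentence, English translation) pairs."""
    return [(squeeze(m.group(1)), squeeze(m.group(2)))
            for m in EXAMPLE_RE.finditer(entry)]


entry = (
    "<b>d\u00e1buk_2</b> <i>v</i> [A; a] @crush by @pounding. <i>Dab\u00faka\n"
    "ang mani p\u00e1ra sa kaykay,</i> Pound the peanuts\n"
    "for the cookies. <b>(->)</b> <i>n</i> @crushed to fine\n"
    "bits, @crumbled. <i>Dabuk sa p\u00e1n,</i> Bread crumbs."
)
for cebuano, english in extract_examples(entry):
    print(cebuano, "=", english)
```

Part-of-speech codes such as the italic v and n are skipped because they do not end in a comma.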

ispisiyal, page 399

ispisiyal a 1 special, particularly good. Lutúan níyag sud-an nga ispisiyal ang íyang bisíta, She will fix special food for her guests. 2 especial, out of the ordinary. v 1 [APB12; c1] be, become special, particularly good. 2 [A; c1] do s.t. special, out of the ordinary. Ispisyalun (iispisiyal) ta kag tawag, I'll mention your name especially, apart from the others. 2a [A] do a particular dance at a ball where only certain people are invited to dance. Mag-ispisiyal run ug bayli, pára sa mga upisyális, The next number is a special number for the officers only. -- dilibiri n special delivery. v [A13; c6] send s.t. special delivery. †

<b>ispisiyal</b> <i>a</i> <b>1</b> special, particularly good. <i>Lutúan
níyag sud-an nga ispisiyal ang íyang bisíta,</i>
She will fix special food for her guests. <b>2</b>
especial, out of the ordinary. <i>v</i> <b>1</b> [APB12; 
c1] be, become special, particularly good.
<b>2</b> [A; c1] do s.t. special, out of the ordinary.
<i>Ispisyalun (iispisiyal) ta kag tawag,</i> I'll
mention your name especially, apart from
the others. <b>2a</b> [A] do a particular dance at 
a ball where only certain people are invited
to dance. <i>Mag-ispisiyal run ug bayli, pára sa
mga upisyális,</i> The next number is a special 
number for the officers only. <b>-- dilibiri</b> <i>n</i>
special delivery. <i>v</i> [A13; c6] send s.t. special 
delivery. [+]

The em-dash here stands for the full entry word, and should be separated from the following word by a space. It is to be included in the bold mark-up.
The dagger indicates that there is some information on this entry in the supplement. Type it as [+].

ispísu a 1 for liquids to be thick, of great density. Ispísu kaáyung sikwáti kay gidaghan níyag tablíya, The chocolate drink is thick because he puts lots of chocolate on it. 2 for colors to be intense as if thickly laid on. Ispísu kaáyu ang kaitum sa balhíbu sa ákung iring, My cat's hair is a deep black. 3 -- nga [noun] a diehard, fanatic follower or believer of. Ispísu giyud nang Katuliku, He is a devout Catholic. Ispísung Usminyista, Diehard follower of Osmeña. 4 in phrases: -- ug apdu brave (lit. having thick bile). -- ug dugù a having guts. b heartless, merciless. -- ug hambug laying bragging on thick. v [A B; c1] for liquids to become thick, cause them to do so. Muispísu (maispísu) an sabaw ug butangag úbi, The soup will become thick if you add yams.

<b>ispísu</b> <i>a</i> <b>1</b> for liquids to be thick, of great 
density. <i>Ispísu kaáyung sikwáti kay gidaghan 
níyag tablíya,</i> The chocolate drink is
thick because he puts lots of chocolate on it.
<b>2</b> for colors to be intense as if thickly laid
on. <i>Ispísu kaáyu ang kaitum sa balhíbu sa
ákung iring,</i> My cat's hair is a deep black. <b>3</b>
<b>-- nga [<i>noun</i>]</b> a diehard, fanatic follower or
believer of. <i>Ispísu giyud nang Katuliku,</i> He
is a devout Catholic. <i>Ispísung Usminyista,</i> 
Diehard follower of Osmeña. <b>4</b> <i>in phrases:</i>
<b>-- ug apdu</b> brave (lit. having thick bile). <b>--
ug dugù</b> <b>a</b> having guts. <b>b</b> heartless, merciless.
<b>-- ug hambug</b> laying bragging on thick. <i>v</i> [A 
B; c1] for liquids to become thick, cause 
them to do so. <i>Muispísu (maispísu) an sabaw
ug butangag úbi,</i> The soup will become
thick if you add yams.

Here, a meaning number is directly followed by a phrase in bold, and later on, a meaning is directly followed by a meaning counting letter. Please always format such meaning numbers separately from all other parts of the text.
The part [noun] is part of a (sub-entry) head word. Keep it within a set of bold-markers, even though the word noun is in italics.

No need to read beyond this point if you are just helping in the formatting rounds.

Post-Processing

This information is provided for those who are curious, and to allow for collaboration between those involved in post-processing.

Deliverables.

From the dictionary data, I plan to produce the following:

A monolithic plain text and HTML file, as per Project Gutenberg standards, about 5-6 megabytes of text.
A printable PDF file, formatted for A4 print-out in two columns.
A TEI tagged SGML/XML master file.
An XDXF formatted dictionary file, so it can be used on various applications and gadgets using that format.
A searchable database to be published on www.bohol.ph and www.gutenberg.ph, to provide students and speakers with an interactive learning and reference resource, similar to Kaufmann's Visayan-English Dictionary currently available.

Steps

Note that I start post-processing after completion of round F1. The reason for this is that my intensive post-processing steps will catch most, if not all remaining formatting issues, often by automated tools, and that the interest for this type of work is rather low, so allowing this work to complete F2 will take considerable time.

Uptag1

Perl script to add semantic tagging to various elements of the dictionary, based on the typographic tagging added during formatting.

This script is run several times, fixing the various issues found in the PGDP output, until the output is reasonably clean.

This script has the following assumptions (with some refinements):

single number or number followed by letter formatted bold: <number>.
single a, n, v formatted italic: <pos>.
bold formatted words: <form>.
small-caps formatted items: <xref>.
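The four assumptions above amount to a handful of substitutions. The following Python sketch illustrates them; it is not the Perl script itself, which includes refinements not shown here, and the name `uptag1` is used only for this example. Sense numbers are converted before the general bold rule so that they are not mistaken for forms.

```python
import re


def uptag1(text):
    """Map typographic tags to semantic tags, following the four rules."""
    # Bold single number, or number followed by a letter: sense number.
    text = re.sub(r"<b>(\d+[a-z]?)</b>", r"<number>\1</number>", text)
    # Italic single a, n or v: part of speech.
    text = re.sub(r"<i>([anv])</i>", r"<pos>\1</pos>", text)
    # Any remaining bold: a (sub-entry) head word or pattern.
    text = re.sub(r"<b>(.*?)</b>", r"<form>\1</form>", text,
                  flags=re.DOTALL)
    # Small caps: cross reference.
    text = re.sub(r"<sc>(.*?)</sc>", r"<xref>\1</xref>", text,
                  flags=re.DOTALL)
    return text


line = ("<b>d\u00e1buk_1</b> <i>v</i> [A; a] <b>1</b> make a @fire. "
        "<b>-an(->)</b> = <sc>DABUK</sc>, <i>2.</i>")
print(uptag1(line))
```

The trailing italic 2. in the example stays untagged, which is one of the cross-reference cases that need manual attention.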

Things that will need manual intervention here:

sense-numbers not marked bold (common).
sense-numbers marked bold, but as part of a following or preceding headword.
pos-codes not marked italic (common).
pos-codes marked italic as part of a preceding italic word.
sense-numbers and pos-codes that are supposed to be part of a cross reference (common).
head-words that are actually used as part of a sense (occasionally, but wreaks havoc on the tagging algorithm).
italic words that are not examples (common).

Uptag2

XSLT style-sheets to add high-level structure to the entries, based on the semantic tagged elements of the previous round.

This style-sheet derives the dictionary structure in the following way.

groups elements starting with <form> into sub-entries.
within each sub-entry: groups elements starting with <pos> into homonyms.
within each homonym: groups elements starting with <number> into senses.
within each sense: groups elements starting with <i> into examples.

During this process, in some cases, the various formatting may not be relevant for the structure. In those cases, the formatting element-name is temporarily changed to the element-name followed by an x, e.g., <i> tags that do not start an example sentence are changed to <ix>.
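The grouping idea behind this step can be sketched outside XSLT as well. The Python fragment below is an illustration under simplifying assumptions, not the style-sheet itself: an entry is represented as a flat list of (tag, text) pairs, and the hypothetical helper `group_on` starts a new group at every element with the given tag. Elements before the first starter form a leading group of their own.

```python
def group_on(items, starter):
    """Split a flat list of (tag, text) pairs into groups.

    A new group begins at every item whose tag equals starter.
    """
    groups = []
    for item in items:
        if item[0] == starter or not groups:
            groups.append([])
        groups[-1].append(item)
    return groups


flat = [
    ("form", "d\u00e1buk"), ("pos", "v"),
    ("number", "1"), ("text", "make a fire."),
    ("number", "2"), ("text", "fumigate an area."),
    ("form", "(->)"), ("pos", "n"),
]

sub_entries = group_on(flat, "form")            # level 1: sub-entries
senses = group_on(sub_entries[0], "number")     # deeper level: senses
print(len(sub_entries), len(senses))
```

Applying the same helper level by level (form, pos, number, i) mirrors the four steps listed above.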

Tools.

For post-processing and further uses of the dictionary data, the following will be needed.

0. Syllabification

   Syllabify("bábuy") -> { "bá", "buy" }

A tool is needed to split Cebuano words into syllables. Since the writing used in this dictionary is highly regular, this should not be too cumbersome, and can be based on research done by Jesse S. Banks in his Lingua-Phonology package. All that is needed is a definition of the sonority scale, and of the valid onsets and codas for Cebuano.

Based on this tool, other tasks become easier, for example, shifting stress, finding default stress, etc.
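As a rough idea of what such a tool could look like, here is a deliberately simple Python sketch. It does not use a sonority scale and is not based on Lingua-Phonology. It rests on assumptions of its own: every vowel is the nucleus of one syllable, ng counts as a single consonant, a hyphen counts as the (glottal stop) onset of the next syllable, and of the consonants between two vowels only the last one opens the next syllable. Loan words with consonant clusters in the onset will therefore be split incorrectly.

```python
import unicodedata


def _is_vowel(segment):
    return unicodedata.normalize("NFD", segment[0])[0].lower() in "aiu"


def _segments(word):
    """Split into letters, keeping ng and combining marks together."""
    word = unicodedata.normalize("NFC", word)
    segs, i = [], 0
    while i < len(word):
        if unicodedata.combining(word[i]) and segs:
            segs[-1] += word[i]
            i += 1
        elif word[i:i + 2].lower() == "ng":
            segs.append(word[i:i + 2])
            i += 2
        else:
            segs.append(word[i])
            i += 1
    return segs


def syllabify(word):
    """Return the syllables of a word written in the dictionary's spelling."""
    segs = _segments(word)
    vowels = [k for k, s in enumerate(segs) if _is_vowel(s)]
    if not vowels:
        return [word]
    syllables, start = [], 0
    for this_v, next_v in zip(vowels, vowels[1:]):
        # With consonants in between, the last one opens the next syllable.
        cut = next_v - 1 if next_v - this_v > 1 else next_v
        syllables.append("".join(segs[start:cut]))
        start = cut
    syllables.append("".join(segs[start:]))
    return syllables


print(syllabify("b\u00e1buy"))
print(syllabify("tan-aw"))
```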

Database Structure

The database structure for Wolff's dictionary will be fairly simple. Three main tables will be used:

CREATE TABLE  ceb_entry 
(
  entryid   int unsigned NOT NULL,
  entry     text NOT NULL
);

which contains all the entries (as well-formed XML fragments, with the <entry> tag as root element).

CREATE TABLE ceb_wordentry 
(
  wordid     int unsigned NOT NULL,
  entryid    int unsigned NOT NULL
);

linking the words and entries together, and

CREATE TABLE ceb_word 
(
  wordid     int unsigned NOT NULL,
  word       varchar(32) NOT NULL,
  normalized varchar(32) NOT NULL,
  type       tinyint unsigned NOT NULL default '0'
);

which serves as an index into the entries.

word is the word as it appears in the dictionary,

normalized is the word normalized to a simplified spelling, for Cebuano words that means all accents and hyphens removed. When queries are made, the query words are also simplified, and matched against this field. Exact matches are an option in the advanced search interface.
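The simplification described for the normalized field can be sketched in a few lines of Python. This is an illustration only; the function name `normalize_word` is hypothetical, and it removes every combining mark, which is more than is strictly needed for the diacritics used in the dictionary.

```python
import unicodedata


def normalize_word(word):
    """Remove all accents and hyphens from a Cebuano word."""
    decomposed = unicodedata.normalize("NFD", word)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.replace("-", "")


print(normalize_word("d\u00e1buk"))   # dabuk
print(normalize_word("tan-aw"))       # tanaw
```

The same function would be applied to the words in a query before matching them against the normalized column.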

The type indicates how the word is used in the entry, with the following values.

Cebuano headword
Cebuano used otherwise
English equivalents
English word used otherwise
Cebuano word derived from pattern (e.g., if the headword for a sub-entry is paN- and the main headword is abla, pangabla will be added as this type.)
Cebuano word derived from verb codes (including those derived from patterns).
l-dropped variant. (e.g., balay becoming báy)

I am considering using a bitfield for this field, such that the following coding can be used:

1 Cebuano headword
2 Cebuano used otherwise
4 English equivalents
8 English word used otherwise
16 Cebuano word derived from pattern (e.g., if the headword for a sub-entry is paN- and the main headword is abla, pangabla will be added as this type.)
32 Cebuano word derived from verb codes (including those derived from patterns).
64 l-dropped variant. (e.g., balay becoming báy)

Queries can then be implemented for combinations of any type, and words can have more than one flag set (for example, an l-dropped derived form).
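To make the bitfield coding concrete, here is a small Python sketch using the values listed above. The member names and the helper `matches` are hypothetical; only the numeric values come from the list.

```python
from enum import IntFlag


class WordType(IntFlag):
    CEBUANO_HEADWORD = 1
    CEBUANO_OTHER = 2
    ENGLISH_EQUIVALENT = 4
    ENGLISH_OTHER = 8
    PATTERN_DERIVED = 16
    VERB_CODE_DERIVED = 32
    L_DROPPED = 64


def matches(stored, wanted):
    """True if the stored type has at least one of the wanted flags."""
    return bool(WordType(stored) & wanted)


# An l-dropped form derived from a pattern carries two flags.
stored = WordType.PATTERN_DERIVED | WordType.L_DROPPED
print(int(stored))                              # 80
print(matches(stored, WordType.L_DROPPED))      # True
```

A query for "any Cebuano word" would then pass the union of all Cebuano flags as the wanted value.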

TEI Tagging Conventions.

TEI tags used in the dictionary will derive from the TEI P4 Guidelines. Tags will be selected from the TEI lite subset when possible, and the additions for printed dictionaries when needed.

The following principles will be followed:

Tagging is extra. Information in tags and attributes will be used to make explicit and supplement the information in the dictionary. No information will be removed or moved into tags.
Language will be tagged only when a language change is present. The top level element lang attribute will have the value en. The attribute lang will be set to en or ceb (or other appropriate languages) at the highest level possible. Biological scientific names will have the lang attribute set to la (for Latin).
Default language for elements:
- entry: en
- form, q: ceb

Intermediate tagging

The `uptag` scripts used to add tagging use existing (typographic) tags, combined with some heuristics, to determine the structure of an entry. This will not always work, as the typographic tags are sometimes not used in the expected way. To resolve this, "offending" tags are changed to the same tag followed by an 'x', for example `<i>` becomes `<ix>` if the italics do not represent a sample, and `<form>` becomes `<formx>` when the recognized form is not in a normal position of a head-word.

Later in the process, such tags will be changed back to the original tag.

Cross references

Cross references are encoded using `<xr>` for the entire cross-referencing phrase, and `<ref>` for the exact word. When there is no cross reference phrase, the `<xr>` is not required. Note that small-caps are used to formally indicate a cross reference, but many more cross references are implied by words in italics.

The formatting of cross-references to entries, pos, and meanings is highly inconsistent. I will attempt to normalize this, and link to the proper element directly.

TEI P4 versus TEI P5

The latest TEI P5 standard removes a number of tags proposed below. To accommodate TEI P5, the following changes can be made:

   <eg>
      Ispísung Usminyista,
      <trans>Diehard follower of Osmeña.</trans>
   </eg>

Becomes:

   <cit type="example">
      Ispísung Usminyista,
      <cit type="trans">Diehard follower of Osmeña.</cit>
   </cit>

The transform can be applied with XSLT automatically.
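The change itself is small enough to show in a few lines. The sketch below uses Python's standard ElementTree module rather than XSLT, purely as an illustration of the renaming shown above; the function name `p4_to_p5` is hypothetical. Only `<trans>` elements directly inside an `<eg>` are renamed, since `<trans>` is also used elsewhere in the sample entries.

```python
import xml.etree.ElementTree as ET


def p4_to_p5(root):
    """Rename <eg>/<trans> to the <cit> forms shown above, in place."""
    for eg in list(root.iter("eg")):
        for trans in eg.findall("trans"):
            trans.tag = "cit"
            trans.set("type", "trans")
        eg.tag = "cit"
        eg.set("type", "example")
    return root


source = (
    "<sense><eg>Isp\u00edsung Usminyista,"
    "<trans>Diehard follower of Osme\u00f1a.</trans></eg></sense>"
)
root = p4_to_p5(ET.fromstring(source))
print(ET.tostring(root, encoding="unicode"))
```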

References

A collection of references to projects that use TEI for the encoding of dictionaries.

Adaptive Transformation-based Learning for Improving Dictionary Tagging.

Ways of automatically tagging a dictionary, includes a discussion on Wolff's dictionary.

Making Dictionaries.
Corpus building for minority languages by Kevin P. Scannell.

An interesting approach to building a corpus.

TEI to DICT Howto.

Tools to convert a TEI dictionary to the DICT format.

Representing dictionaries with the TEI.

An interesting discussion on the ways to represent a dictionary in TEI (Powerpoint file).

Maori Dictionary Samoan Dictionary

TEI tagged dictionaries, unfortunately just using a typographic representation.

Strongs Dictionary

Another TEI tagged dictionary, using entryFree.

Using the TEI Scheme in Compiling a Korean Dictionary by Beom-mo Kang

Jaslo, A Japanese-Slovene Learners' Dictionary: Methods for Dictionary Enhancement by Tomaž Erjavec et al.

FreeDict

The FreeDict project uses TEI as their core format, and has a collection of very interesting Perl scripts to deal with TEI dictionaries.

Ang Dila Natong Bisaya, by Manuel Yap.

A Cebuano grammar written in Cebuano.

Sample TEI Tagged Texts

Single Entry

A possible way of tagging the sample entry would be (not yet validated against TEI!):

   <?xml version="1.0" encoding="utf-8"?>
   <entry>
       <form>ispísu</form>
       <hom>
           <gramGrp>
               <pos>a</pos>
           </gramGrp>
           <sense n="1">
               <number>1</number>
               <trans>for liquids to be thick, of great density.</trans>
               <eg>
                   Ispísu kaáyung sikwáti kay gidaghan níyag tablíya,
                   <trans>The chocolate drink is thick because he puts lots of chocolate on it.</trans>
               </eg>
           </sense>
           <sense n="2">
               <number>2</number>
               <trans>for colors to be intense as if thickly laid on.</trans>
               <eg>
                   Ispísu kaáyu ang kaitum sa balhíbu sa ákung iring,
                   <trans>My cat's hair is a deep black.</trans>
               </eg>
           </sense>
           <sense n="3">
               <number>3</number>
               <form>-- nga [noun]</form>
               <trans>a diehard, fanatic follower or believer of.</trans>
               <eg>
                   Ispísu giyud nang Katuliku,
                   <trans>He is a devout Catholic.</trans>
               </eg>
               <eg>
                   Ispísung Usminyista,
                   <trans>Diehard follower of Osmeña.</trans>
               </eg>
           </sense>
           <sense n="4">
               <number>4</number>
               <note>in phrases:</note>
               <sense>
                   <form>-- ug apdu</form>
                   <trans>brave (lit. having thick bile).</trans>
               </sense>
               <sense>
                   <form>-- ug dugù</form>
                   <sense n="a">
                       <number>a</number>
                       <trans>having guts.</trans>
                   </sense>
                   <sense n="b">
                       <number>b</number>
                       <trans>heartless, merciless.</trans>
                   </sense>
               </sense>
               <sense>
                   <form>-- ug hambug</form>
                   <trans>laying bragging on thick.</trans>
               </sense>
           </sense>
       </hom>
       <hom>
           <gramGrp>
               <pos>v</pos>
               <itype>[AB; c1]</itype>
           </gramGrp>
           <trans>for liquids to become thick, cause them to do so.</trans>
           <eg>
               Muispísu (maispísu) an sabaw ug butangag úbi,
               <trans>The soup will become thick if you add yams.</trans>
           </eg>
       </hom>
   </entry>

Double Entry

Another sample tagged according to TEI:

   <superEntry>
       <entry>
           <form>dábuk</form>
           <number>1</number>
           <hom>
               <gramGrp>
                   <pos>v</pos>
                   <itype>[A; a]</itype>
               </gramGrp>
               <sense>
                   <number>1</number>
                   <trans>make a fire.</trans>
                   <eg>
                       Pagdábuk dihà kay magdigámu ta,
                       <trans>Make a fire because we're going to fix dinner.</trans>
                   </eg>
               </sense>
               <sense>
                   <number>2</number>
                   <trans>fumigate an area.</trans>
                   <eg>
                       Dabúkan ta ang mangga arun mudaghan ang búnga,
                       <trans>Let's subject the mango tree to smoke so that there will be lots of fruit.</trans>
                   </eg>
               </sense>
           </hom>
           <hom>
               <form>(->)</form>
               <gramGrp>
                   <pos>n</pos>
               </gramGrp>
               <sense>
                   <number>1</number>
                   <trans>fire in an open place.</trans>
               </sense>
               <sense>
                   <number>2</number>
                   <trans>place where an open fire is built.</trans>
                   <eg>
                       Duul ra sa balay ang dabuk (dabukan),
                       <trans>They built the fire too close to the house.</trans>
                   </eg>
               </sense>
           </hom>
           <hom>
               <form>-an(->)</form>
               <xr>= <ref>dabuk, 2.</ref></xr>
           </hom>
       </entry>
       <entry>
           <form>dábuk</form>
           <number>2</number>
           <hom>
               <gramGrp>
                   <pos>v</pos>
                   <itype>[A; a]</itype>
               </gramGrp>
               <trans>crush by pounding.</trans>
               <eg>
                   Dabúka ang mani pára sa kaykay,
                   <trans>Pound the peanuts for the cookies.</trans>
               </eg>
           </hom>
           <hom>
               <form>(->)</form>
               <gramGrp>
                   <pos>n</pos>
               </gramGrp>
               <trans>crushed to fine bits, crumbled.</trans>
               <eg>
                   Dabuk sa pán,
                   <trans>Bread crumbs.</trans>
               </eg>
           </hom>
       </entry>
   </superEntry>

Cross References

Cross references are treated as additional senses. They can appear as part of a larger entry, or stand-alone, as in this example.

   <entry>
       <form>nan</form>
       <number>1</number>
       <hom>
           <sense>
               <number>1</number> <xr>= <ref>DAN</ref></xr>. 
           </sense>
           <sense>
               <number>2</number> 
               <trans>in narrations, particle preceding a statement that is off the subject but important for the course of the story.</trans>
               <eg>
                   Nan, kadtu si Antunyu, palainan ta lang, ákù tung ámu, 
                   <trans>Now, this Antonio, to change the subject, was my employer.</trans>
               </eg>
           </sense>
       </hom>
   </entry>

The stand-alone cross reference.

   <entry>
       <form>nan</form>
       <number>2</number>
       <hom>
           <sense>
               <xr>= <ref>UG</ref>, 1 (dialectal).</xr>
           </sense>
       </hom>
   </entry>
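
As a rough illustration of how such tagged entries could be consumed, the following Python sketch lists the cross references in an entry and reports whether the entry is stand-alone. The function name `cross_references` and the rule "an entry with no <trans> element is stand-alone" are assumptions made for this example, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# The stand-alone sample entry shown above.
ENTRY = """
<entry>
    <form>nan</form>
    <number>2</number>
    <hom>
        <sense>
            <xr>= <ref>UG</ref>, 1 (dialectal).</xr>
        </sense>
    </hom>
</entry>
"""

def cross_references(entry_xml):
    """Return (headword, target, stand_alone) for each <ref> in an entry.

    Assumption: an entry without any <trans> element consists only of
    cross references and is therefore treated as stand-alone.
    """
    entry = ET.fromstring(entry_xml)
    headword = entry.findtext("form")
    stand_alone = entry.find(".//trans") is None
    return [(headword, ref.text, stand_alone) for ref in entry.iter("ref")]

print(cross_references(ENTRY))
```

The text following the <ref> element (here ", 1 (dialectal).") is not interpreted; deciding whether it names a sense number or a role would need further rules.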

Phrasal Sub-entries

Sometimes, phrases are entered as sub-entries.

Issues in the dictionary structure

In a few cases, the structure of dictionary entries has some complexities.

- Cross references. These sometimes point to an entry, sometimes to an entry in a certain role, and sometimes to a specific meaning. This is normally not an issue, except that matching the typographic layout to the intended target is sometimes complex, and the typography is inconsistent. Note that cross references can point to sub-entries as well as to other entries.
- Sub-entries under a sense. Sometimes sub-entries (in bold) are part of a sense, whereas their regular location is at the end of the entry. This typically happens with short phrases entered as entries. They are sometimes numbered, sometimes not.
- Position of verb conjugation patterns. These sometimes appear before the sense numbers, and sometimes after. I take this to mean that in the first case the scope is all senses, and in the second case only the indicated sense (overriding any higher-level pattern).
- Numbering of senses. The numbering of senses is not always strictly 1, 2, 3, ..., and the exact semantics of the numbering system are not always clear. For verbs, the numbers probably help to identify the scope of a verb conjugation pattern.
- Additional parts of speech. Besides nouns, verbs, and adverbs/adjectives, the dictionary contains entries for particles, affixes, and phrases. These are either not identified as such, or identified only in prose.
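
The scope interpretation for verb conjugation patterns described above can be sketched in Python as follows. The sample <hom> element is hypothetical (it is not taken from the dictionary), and tagging a sense-level pattern as a <gramGrp> inside <sense> is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical homonym: one pattern before the senses, one inside sense 2.
HOM = """
<hom>
    <gramGrp> <pos>v</pos> <itype>[A; a]</itype> </gramGrp>
    <sense> <number>1</number> <trans>first meaning.</trans> </sense>
    <sense>
        <number>2</number>
        <gramGrp> <itype>[A; b]</itype> </gramGrp>
        <trans>second meaning.</trans>
    </sense>
</hom>
"""

def patterns_by_sense(hom_xml):
    """Map each sense number to the conjugation pattern that applies to it.

    A pattern given before the senses applies to all of them; a pattern
    given inside a sense overrides it for that sense only.
    """
    hom = ET.fromstring(hom_xml)
    inherited = hom.findtext("gramGrp/itype")
    result = {}
    for sense in hom.findall("sense"):
        own = sense.findtext("gramGrp/itype")
        result[sense.findtext("number")] = own if own is not None else inherited
    return result

print(patterns_by_sense(HOM))
```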

Verb Conjugation Codes

Wolff's dictionary uses short-hand codes to indicate how a verb can be used. Those codes take some time to get used to. A computer-based dictionary does not have the space constraints of a printed dictionary, so we can present this information in a more comprehensive, template-based style.

The meaning of these codes is explained in the preface, section 7.1 and further.

	Future	Past	Subjunctive
Active
Punctual	mu-	mi-, ni-, ning-, ming-	mu-
Durative	mag-, maga-	nag-, naga-, ga-	mag-, maga-
Potential	maka-, ka-	naka-, ka-	maka-, ka-
Direct Passive
Punctual	-un	gi-	-a
Durative	paga-un*	gina-*	paga-a*
Potential	ma-	na-	ma-
Local Passive
Punctual	-an	gi-an	-i
Durative	paga-an*	gina-an*	paga-i*
Potential	ma-an, ka-an	na-an	ma-i, ka-i
Instrumental Passive
Punctual	i-	gi-	i-
Durative	iga-*	gina-*	iga-*
Potential	ma-, ika-	na-, gika-	ma-, ika-
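
As a minimal sketch of what a template-based presentation could build on, the table above can be held as a lookup structure in Python. The names `TABLE`, `affixes`, and `attach` are invented for this example. The `attach` helper only glues the affix parts around a placeholder root; it ignores stress shifts and any sound changes at the affix boundary, so it should not be read as producing correct Cebuano forms.

```python
TENSES = ("future", "past", "subjunctive")

# (voice, aspect) -> cells for future, past, subjunctive, copied from the table.
TABLE = {
    ("active", "punctual"): ("mu-", "mi-, ni-, ning-, ming-", "mu-"),
    ("active", "durative"): ("mag-, maga-", "nag-, naga-, ga-", "mag-, maga-"),
    ("active", "potential"): ("maka-, ka-", "naka-, ka-", "maka-, ka-"),
    ("direct passive", "punctual"): ("-un", "gi-", "-a"),
    ("direct passive", "durative"): ("paga-un*", "gina-*", "paga-a*"),
    ("direct passive", "potential"): ("ma-", "na-", "ma-"),
    ("local passive", "punctual"): ("-an", "gi-an", "-i"),
    ("local passive", "durative"): ("paga-an*", "gina-an*", "paga-i*"),
    ("local passive", "potential"): ("ma-an, ka-an", "na-an", "ma-i, ka-i"),
    ("instrumental passive", "punctual"): ("i-", "gi-", "i-"),
    ("instrumental passive", "durative"): ("iga-*", "gina-*", "iga-*"),
    ("instrumental passive", "potential"): ("ma-, ika-", "na-, gika-", "ma-, ika-"),
}

def affixes(voice, aspect, tense):
    """Return the list of affixes in one cell of the table."""
    cell = TABLE[(voice, aspect)][TENSES.index(tense)]
    return [affix.strip() for affix in cell.split(",")]

def attach(affix, root):
    """Naively place a root at the hyphen of an affix (asterisk dropped)."""
    prefix, _, suffix = affix.rstrip("*").partition("-")
    return prefix + root + suffix

print([attach(a, "ROOT") for a in affixes("local passive", "potential", "future")])
```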

Verb conjugation codes can have two parts. The letters have the following meanings:

First part: active verbs

A - Action Verbs
- 1 without punctual
- 2 without durative
- 3 without potential
- S stress shift in indicated class
  - 1S stress shift in punctual
- P pa- can be added without change of meaning in indicated class
  - 3P maka- = makapa-
- N paN- can be added to root, mu- and maka- forms.
B - Stative Verbs
- 1 without mu-
- 2 without mag-
- 3 maka- has meaning "become [so-and-so]"
- 3(1) maka- has meaning "become [so-and-so]" and "cause to become [so-and-so]"
- 4 without na-
- 5 without naka-
- 6 without magka-
- S, N as with A.
C - Mutual Verbs
- 1 without mag-
- 2 without magka-
- 3 without makig-

Second part: passive verbs

a - verbs with direct passive affixes (focus is recipient of action)
- 1 - without local passive
- 2 - without instrumental passive
- 3 - only potential passive
- 4 - focus is suffering from or affected by thing referred to
b - verbs with local passive affix (focus is recipient of action) and instrumental passive affixes
- (1) - without instrumental passive affixes (except -i)
- 1 - focus is place of action or recipient of action
- 2 - focus is place of action; hi-an(->), hi-i also refers to accidental recipient of action.
- 3 - reason of action
- 3(1) - as 3, but only with potential affixes
- 4 - focus is thing affected
- 4(1) - as 4, but only with potential affixes
- 5 - local and direct passive are synonymous
- 6 - only local passive and instrumental -i (focus is place or beneficiary of action)
- 7 - focus is diminished or added to
- 8 - only potential local passives
c - verbs with instrumental passive (focus is thing conveyed or recipient)
- 1 - direct and instrumental passive are synonymous
- 2 - local and instrumental passive are synonymous (focus is recipient of action)
- 3 - only with potential affixes ika-, gika-
- 4 - optionally take ig-
- 5 - focus is the reason for the agent to come into a certain state
- 6 - without local passive affixes
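
To show how these codes might be taken apart mechanically, here is a small Python sketch that splits a code such as [A; a] into its class letters and modifiers. The regular expressions are an assumption based only on the lists above; the dictionary's actual code syntax may contain combinations this sketch does not handle, and the second sample code below is made up for illustration.

```python
import re

# Class letters as listed above.
CLASSES = {
    "A": "action verb",
    "B": "stative verb",
    "C": "mutual verb",
    "a": "direct passive",
    "b": "local passive",
    "c": "instrumental passive",
}

# A class letter followed by everything up to the next class letter or ";".
CLASS = re.compile(r"([ABCabc])([^ABCabc;]*)")
# Modifiers: 3(1), (1), 1S, 3P, single digits, or the letters S, P, N.
MODIFIER = re.compile(r"\d\(\d\)|\(\d\)|\d[SP]|\d|[SPN]")

def parse_code(code):
    """Split a conjugation code into (letter, class name, modifiers) tuples."""
    parsed = []
    for letter, tail in CLASS.findall(code.strip("[] ")):
        parsed.append((letter, CLASSES[letter], MODIFIER.findall(tail)))
    return parsed

print(parse_code("[A; a]"))
print(parse_code("[A13; b3(1)]"))  # hypothetical code, for illustration only
```

A digit directly followed by S or P is read as one modifier (for example 1S, stress shift in punctual), following the list above; where that reading is wrong for a given entry, the tokenizer would need to be adjusted.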

Encarnacion's Diccionario Español-Bisaya

Juan Felis de la Encarnacion's Diccionario Español-Bisaya first appeared in 1866, and went through several reprints. Although mostly of historical interest, we are also processing this dictionary through this site.

To translate the Spanish-based orthography to the style used by John U. Wolff, you'll need to apply the following replacements:

Encarnacion	Wolff	Note
ao	aw
ai	ay
c	k
e	i
gui	gi
j	h
ñ	ny
n[~g]	ng	depends on context
ng	ngg	depends on context, not before consonants.
o	u
oa	wa	depends on context, typically not when one of the pair is accented.
ua	wa	depends on context, typically not when one of the pair is accented.
qu	k

Note that the accents are used differently in this dictionary.

The glottal stop is not indicated by Encarnacion.
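
A first approximation of the replacement table can be sketched in Python as below. This is only a rough sketch under several assumptions: it handles lowercase input only, it applies every rule unconditionally even where the table says "depends on context", it leaves out the ng to ngg rule because that cannot be decided without context, and it does nothing about accents or the glottal stop. The sample spellings are made up to exercise the rules and are not quoted from Encarnacion.

```python
import re

# Replacements from the table above (the context-dependent ng -> ngg is omitted).
RULES = {
    "n[~g]": "ng",
    "gui": "gi",
    "ao": "aw",
    "ai": "ay",
    "oa": "wa",
    "ua": "wa",
    "qu": "k",
    "c": "k",
    "e": "i",
    "j": "h",
    "ñ": "ny",
    "o": "u",
}

# Longest keys first, so "gui" and "qu" win over single letters; one pass
# over the text, so the output of one rule is never fed into another.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(RULES, key=len, reverse=True))
)

def to_wolff(text):
    """Apply the table's replacements in a single left-to-right pass."""
    return PATTERN.sub(lambda match: RULES[match.group(0)], text)

for word in ("buquid", "cahoy", "sabao"):  # illustrative spellings only
    print(word, "->", to_wolff(word))
```

Because accented vowels are distinct characters, a pair such as oa or ua is left untouched when one of its vowels carries an accent, which happens to agree with the note in the table.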

Encarnacion is highly inconsistent in his use of o versus u, often using different spellings for the same word in a single entry. To remedy this, I propose to do the following during PP:

1. Follow the Spanish orthography for words derived from Spanish.
2. Follow the rules as given in Cabonce's dictionary for other words.
