User:Jhellingman/Philippine Works in Progress/Wolff CED
Introduction
John U. Wolff spent about 10 years producing his Cebuano-English dictionary. The result is one of the best dictionaries ever made, not just for Cebuano, but for any Philippine language. Starting from scratch, John U. Wolff and his team collected samples of language usage, and compiled a dictionary with a wealth of information: not just an English gloss that translates the head word, but also information on allowed affixes for each word, and fully translated sample sentences with each word. All this combined makes it a great resource for people trying to learn Cebuano. (Almost all other Cebuano dictionaries are made to serve a Cebuano-speaking public when learning English.)
The publisher, The Southeast Asia Program of Cornell University, has dedicated this important work to the Public Domain, and the author, John U. Wolff, has offered his help in digitizing this work and making it available on-line, on the condition that we maintain the work's integrity.
The work is available in 12 projects of 200 single column pages each (corresponding to roughly 100 pages of the original dictionary).
The previews are of the processed parts on DP, in various states of completion. Note that these previews are made without human intervention using scripts. The HTML shows the text rendered for human reading; the XML is tagged to identify what each part of the text represents. As this is done automatically, based on textual hints, this is not always accurate, but can be improved manually after completion of all volumes.
- Part 1 proj_submit_pgposted HTML | XML | XML with formatting
- Part 2 project_delete HTML | XML
- Part 3 project_delete HTML | XML
- Part 4 project_delete HTML | XML
- Part 5 project_delete HTML | XML
- Part 6 project_delete HTML | XML
- Part 7 project_delete HTML | XML
- Part 8 project_delete HTML | XML
- Part 9 project_delete HTML | XML
- Part 10 project_delete HTML | XML
- Part 11 project_delete HTML | XML
- Part 12 project_delete HTML | XML
All parts in a searchable database. Note that this is work in progress, and the interface may still have some bugs.
All work-in-progress files and related processing and presentation tools are maintained using Google Code, in the phildict project.
All parts that completed F1 combined and corrected:
Note that the author of this dictionary sometimes has thrown in some funny examples (see for example the entry for bákud or dígù, and imagine the scenes unfolding), and does not hesitate to use American idiom at times. So you can expect to learn some English as well. Furthermore, unlike older dictionaries, this dictionary is not censored. All vulgar words and expressions the author could collect have found their place, including examples of reversing slang.
Proofreading Instructions
These instructions need to be followed during the proofing rounds.
Orthography. The orthography of this book mostly follows the (de-facto) standard orthography of Cebuano, except for two important deviations:
- The letters e and o are not used, as they are not used distinctively. The only vowels found in Cebuano words in this dictionary are thus a, i, and u.
- A system of diacritics is used to indicate stress, vowel length, and syllable-final glottal stop.
The following diacritics are used:
- acute: to indicate the stress on a word: á, í, ú
- grave: to indicate a following glottal stop: à, ì, ù
- wedge: to indicate stress on a short ultimate syllable. (Type as [va], [vi], [vu])
- macron and acute: to indicate a long syllable, combined with a potential stress-shift. (Type as [=á], [=í], [=ú])
We keep the orthography exactly as in the source.
The italic font used in this book unfortunately has a b and h that are easily confused. This needs special attention.
Affixes. Cebuano uses a complex system of affixes that are placed before, after, around, or inside root-words to form new words. This dictionary is organized by root, listing possible affixes under the root entry. Those affixes are given as patterns, as follows:
- -x: x is placed after the root.
- x-: x is placed before the root.
- -x-: x is placed after the first consonant of the root; when the root starts with a vowel, it is placed before the root. For example, -in- with the entry balun produces binalun.
- x-y: x is placed before the root, and y after the root.
Since I will use these patterns to generate the full forms for an index, it is important that these patterns are proofed exactly as they are.
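The four pattern shapes above can be illustrated with a minimal Python sketch. The function name `apply_pattern` and the vowel set are assumptions made for illustration, not the actual index generator. The sketch treats an initial ng as a single consonant (an assumption), and it ignores stress shifts and the morphophonemic alternations that a real tool would have to handle.

```python
# Vowels as used in this dictionary's orthography (precomposed characters assumed).
VOWELS = set("aiuáíúàìù")

def apply_pattern(root, pattern):
    """Expand an affix pattern (-x, x-, -x-, x-y) against a root word."""
    if len(pattern) > 2 and pattern.startswith("-") and pattern.endswith("-"):
        infix = pattern[1:-1]
        if root[0].lower() in VOWELS:
            return infix + root              # vowel-initial root: placed before the root
        # Assumption: an initial "ng" counts as one consonant.
        onset = 2 if root[:2].lower() == "ng" else 1
        return root[:onset] + infix + root[onset:]
    if pattern.startswith("-"):
        return root + pattern[1:]            # -x: after the root
    if pattern.endswith("-"):
        return pattern[:-1] + root           # x-: before the root
    if "-" in pattern:
        prefix, suffix = pattern.split("-", 1)
        return prefix + root + suffix        # x-y: around the root
    raise ValueError("not an affix pattern: %r" % pattern)

print(apply_pattern("balun", "-in-"))
```

Because the function only looks at the position of the hyphens, a pattern that is proofed with a missing or misplaced hyphen would silently yield a different form, which is why exact proofing of the patterns matters.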
Verb patterns. The dictionary includes codes in square brackets that indicate how a verb can be used. Typically, these codes are a letter followed by numbers. Often, two groups are shown, separated by a semicolon, the first with mostly capital letters, and the second with lower case letters. Examples: [A12; b3], [A3; b(1)], etc. The 1 (one) in those groups is often misread as l (letter el); please look carefully and correct any such cases. Please insert a space after the semicolon if it is not present.
Note that we will use these patterns to generate the inflected forms of verbs to be used in the on-line index to the dictionary.
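The two checks on verb codes (a misread 1, and the space after the semicolon) could be pre-screened automatically. The following Python sketch is only an illustration: the function name `fix_verb_codes` is hypothetical, and the set of code letters (A, B, P, S, a, b, c) is an assumption taken from the examples on this page, not a complete list from the dictionary.

```python
import re

# Bracketed groups built only from the assumed code letters, digits,
# parentheses, semicolons, spaces, and the look-alikes "l" and "I".
VERB_CODE = re.compile(r"\[([ABPSabclI0-9(); ]+)\]")

def fix_verb_codes(text):
    """Replace l/I by 1 inside verb codes and put one space after each semicolon."""
    def fix(match):
        inner = match.group(1).replace("l", "1").replace("I", "1")
        inner = re.sub(r";\s*", "; ", inner)
        return "[" + inner + "]"
    return VERB_CODE.sub(fix, text)

print(fix_verb_codes("v [Al2;b3] make a fire."))
```

Other bracketed items, such as [noun], [ng], or [+], contain characters outside the assumed set and are left untouched.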
Special Symbols. The dictionary includes arrows to indicate stress-shifts that are applied during affixation or may change the meaning of a root-word.
- Left arrow: stress shifted towards beginning of word. (Always in parentheses, type as (<-))
- Right arrow: stress shifted towards end of word. (Always in parentheses, type as (->))
- Dagger: to indicate additional information is available in the supplement. (Type as [+])
- em-dash: when used to indicate the occurrence of the head word in a phrase. (Type as --, but do not close up the spaces)
In the preface, the following symbols also occur:
- Phonetic symbol for glottal stop. (Type as [?]; note that this symbol sometimes occurs in the text within brackets, which should be retained as well)
- Phonetic symbol for ng sound (n with tail on right side). (Type as [ng])
Hyphenation. I have purposely not dehyphenated the text file. End-of-line hyphenation needs to be undone. For Cebuano, this can be easily done in most cases, when taking into consideration the following rules.
- if the end-of-line hyphen occurs between two vowels, it can be dropped. (vowels are always pronounced separately)
- if the end-of-line hyphen occurs between two consonants, it can be dropped (except between n and g).
- if the end-of-line hyphen occurs after a consonant and the first letter on the next line is a vowel, it should stay. (In this case, the hyphen indicates a syllable-initial glottal stop.)
- if the end-of-line hyphen occurs after a vowel and the first letter on the next line is a consonant, it can be dropped.
These rules only apply to Cebuano words. English words should be dehyphenated following the normal rules for English. Use -* when in doubt.
Note that a mid-word hyphen in Cebuano indicates a syllable starting with a glottal stop, that is, tan-aw, to see, is pronounced /tan[?]aw/. Such hyphens should not be removed.
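The four rules for Cebuano words can be written as a small decision function. This is a sketch under stated assumptions: the name `join_line_break` is hypothetical, the n/g exception is interpreted here as "in doubt" and therefore marked with -*, and letters typed in bracket notation (such as [va]) are not handled.

```python
# Vowels as used in this dictionary's orthography (precomposed characters assumed).
VOWELS = set("aiuáíúàìù")

def is_vowel(ch):
    return ch.lower() in VOWELS

def join_line_break(head, tail):
    """Join two halves of a Cebuano word split by an end-of-line hyphen.

    head is the part before the hyphen (hyphen already removed),
    tail is the part at the start of the next line.
    """
    before, after = head[-1], tail[0]
    if before.lower() == "n" and after.lower() == "g":
        return head + "-*" + tail        # assumption: ambiguous, leave for a human
    if not is_vowel(before) and is_vowel(after):
        return head + "-" + tail         # consonant + vowel: hyphen marks a glottal stop
    return head + tail                   # all other cases: drop the hyphen

print(join_line_break("tan", "aw"))
print(join_line_break("mag", "digámu"))
```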
Corrections. If you find anything that may need correction, please use the following pattern to indicate that: original[**typo:suggested correction].
Things to look at.
- New entries start with a lowercase letter. Also, when the entry continues for a different part of speech, the POS code is a lower case letter. In both those cases, the OCR has often changed the preceding period into a comma. Please change it back to a period.
- The number one (1) is often mixed up with the letter ell (l) or the capital letter I. Please double check. In the verb classification codes between brackets, it is always a number one.
- The period after the letter t, in particular in the abbreviation s.t. (something) is often missing.
- The letters h and b, especially in italics, are easily confused.
- The accent above the i is often hard to read. Since the grave accent only appears at the end of words (except in reduplicated words, such as hulìhúlì), mid-word you either have an i, or an i with acute accent, í. It is never a circumflex, î, even though the blot might look like it.
No need to read beyond this point, if you are just helping in the proofing rounds.
Formatting Instructions
These instructions need to be followed during the formatting rounds.
Proper formatting of this text will be very important to be able to parse the text and turn it into an easy-to-use linguistic resource.
Since most entries are well-structured, and the structure is indicated with typographic features, it should not be too difficult to obtain the structure from the formatting. It is important though that the formatting is correct. After the F-rounds, I will use some specifically tuned programs to add semantic tags to it, based on the typography of this dictionary. After fine-tuning, this should be able to get at least 90% of the structure of this dictionary explicitly coded.
Typically, a head word (root entry) is followed by a number of meanings, sorted by part-of-speech, then numbered meanings, and followed by one or more examples. After that, a range of sub-entries (derived words, often only indicated by an affixation pattern), with the same structure.
Headwords. We rely on headwords being bold to recognize them. This includes patterns and phrases used in sub-entries. For this reason, the stress-shift symbols (<-) and (->), and the dash representing the headword in a sub-entry, should be within the bold markup.
Homonym Numbers. We rely on the homonym numbers being subscript to recognize them. The subscript formatting should have already been added during the P* rounds. Note that these numbers are to be included in the bold formatting of the headword.
Part-of-Speech. We rely on the PoS abbreviations being in italic to recognize them. Please wrap them in their own set of italic tags, even if they are preceded or followed by italics.
Sense Numbers. We rely on the sense numbers being bold to recognize them. Please wrap them in their own set of bold tags, even if they are preceded or followed by bold.
Cross References. All cross references are small-caps; we just rely on the typography to identify them (That is, an equals sign, followed by a phrase in small caps.) Note that although they are bold small-caps, it is sufficient to tag them as small-caps only, saving some clicks.
Binomial Scientific Names. All binomial scientific names are in bold italic. We rely on this formatting to recognize them as such.
English Equivalents. One feature that would make the dictionary considerably more useful would be the identification of direct translations or equivalents of the head word. These typically come quite early on in a meaning. The idea is that if you use the dictionary in reverse (that is, looking for a Cebuano equivalent of an English word), you want to find entries where that English word is given as an equivalent, not all entries where an English word just happens to be present in an example.
All English equivalents should be preceded by an at-sign, for example: @yellow. If an equivalent contains more than one word, group them with braces, as follows: @{wine bottle}
Note that this requires some editorial judgment, but no knowledge of Cebuano will be required. Prefer adding @ to single words, as they appear, rather than phrases, unless such a phrase as a whole is an appropriate English equivalent. Do not include the word to with verbs, just place an @ on the verb. If you feel unsure, then better not place an at-sign.
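To show how the at-sign markup could later be read back, here is a minimal Python sketch that collects the marked equivalents from a formatted line. The function name `english_equivalents` is hypothetical and the regular expression is only one possible reading of the convention above.

```python
import re

# Either @{several words} or @word (letters, with internal hyphen or apostrophe).
EQUIVALENT = re.compile(r"@\{([^}]+)\}|@([^\W\d_]+(?:[-'][^\W\d_]+)*)")

def english_equivalents(text):
    """Return the English equivalents marked with @ in a formatted line."""
    return [m.group(1) or m.group(2) for m in EQUIVALENT.finditer(text)]

print(english_equivalents("make a @fire. a @{wine bottle} that is @yellow."))
```

Trailing punctuation is not captured, so @fire. yields fire, which means the at-sign can be placed without touching the surrounding text.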
Sample Entries
dábuk1 v [A; a] 1 make a fire. Pagdábuk dihà kay magdigámu ta, Make a fire because we're going to fix dinner. 2 fumigate an area. Dabúkan ta ang mangga arun mudaghan ang búnga, Let's subject the mango tree to smoke so that there will be lots of fruit. (->) n 1 fire in an open place. 2 place where an open fire is built. Duul ra sa balay ang dabuk (dabukan), They built the fire too close to the house. -an(->) = dabuk, 2.
dábuk2 v [A; a] crush by pounding. Dabúka ang mani pára sa kaykay, Pound the peanuts for the cookies. (->) n crushed to fine bits, crumbled. Dabuk sa pán, Bread crumbs.
<b>dábuk_1</b> <i>v</i> [A; a] <b>1</b> make a @fire. <i>Pagdábuk dihà kay magdigámu ta,</i> Make a fire because we're going to fix dinner. <b>2</b> @fumigate an area. <i>Dabúkan ta ang mangga arun mudaghan ang búnga,</i> Let's subject the mango tree to smoke so that there will be lots of fruit. <b>(->)</b> <i>n</i> <b>1</b> @fire in an open place. <b>2</b> place where an open fire is built. <i>Duul ra sa balay ang dabuk (dabukan),</i> They built the fire too close to the house. <b>-an(->)</b> = <sc>DABUK</sc>, <i>2.</i> <b>dábuk_2</b> <i>v</i> [A; a] @crush by @pounding. <i>Dabúka ang mani pára sa kaykay,</i> Pound the peanuts for the cookies. <b>(->)</b> <i>n</i> @crushed to fine bits, @crumbled. <i>Dabuk sa pán,</i> Bread crumbs.
These are actually two entries, to demonstrate the use of the subscript numbers. Note how the Cebuano in the sample sentences ends with a comma, but the following English translation starts with a capital letter. This need not be marked or changed. Actually, I will use this feature to extract sample sentences.
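The comma-then-capital convention can be exploited roughly as follows. This Python sketch is an assumption about how such an extraction might look, not the actual program; the name `sample_sentences` is hypothetical, and the translation is cut at the first sentence-ending mark, so multi-sentence translations or abbreviations such as s.t. would need extra care.

```python
import re

# An italic run ending in a comma, followed by text starting with a capital
# letter and running up to the first ., ! or ? before whitespace or end of text.
SAMPLE = re.compile(r"<i>([^<]+),</i>\s+([A-Z][^<]*?[.!?])(?=\s|$)")

def sample_sentences(text):
    """Return (Cebuano, English) pairs found in a formatted entry."""
    return [(m.group(1), m.group(2)) for m in SAMPLE.finditer(text)]

entry = ("<b>dábuk_2</b> <i>v</i> [A; a] @crush by @pounding. "
         "<i>Dabúka ang mani pára sa kaykay,</i> Pound the peanuts for the cookies. "
         "<b>(->)</b> <i>n</i> @crushed to fine bits, @crumbled. "
         "<i>Dabuk sa pán,</i> Bread crumbs.")
for cebuano, english in sample_sentences(entry):
    print(cebuano, "=", english)
```

Italic runs that do not end in a comma, such as part-of-speech codes, are skipped.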
Please note the following:
- The pattern (->) (indicating a stress change, that is, dábuk becomes dabuk, default stress on the ultimate syllable not being indicated) is bold, including the parentheses.
- The parentheses in the example are italic.
- Punctuation follows the style of the preceding text, so commas, periods, etc. go inside the bold or italic mark-up. The same is true for the arrows in patterns, such as (->), which should be part of the bold tagging.
- The cross reference is bold small-caps, but it is enough to mark just the small-caps and omit the bold. This should NOT be converted to lowercase. (Note that the cross-reference is internal to the entry, and actually points to the second meaning. Expanded, it says dabukan = dabuk, 2.)
ispisiyal a 1 special, particularly good. Lutúan níyag sud-an nga ispisiyal ang íyang bisíta, She will fix special food for her guests. 2 especial, out of the ordinary. v 1 [APB12; c1] be, become special, particularly good. 2 [A; c1] do s.t. special, out of the ordinary. Ispisyalun (iispisiyal) ta kag tawag, I'll mention your name especially, apart from the others. 2a [A] do a particular dance at a ball where only certain people are invited to dance. Mag-ispisiyal run ug bayli, pára sa mga upisyális, The next number is a special number for the officers only. -- dilibiri n special delivery. v [A13; c6] send s.t. special delivery. †
<b>ispisiyal</b> <i>a</i> <b>1</b> special, particularly good. <i>Lutúan níyag sud-an nga ispisiyal ang íyang bisíta,</i> She will fix special food for her guests. <b>2</b> especial, out of the ordinary. <i>v</i> <b>1</b> [APB12; c1] be, become special, particularly good. <b>2</b> [A; c1] do s.t. special, out of the ordinary. <i>Ispisyalun (iispisiyal) ta kag tawag,</i> I'll mention your name especially, apart from the others. <b>2a</b> [A] do a particular dance at a ball where only certain people are invited to dance. <i>Mag-ispisiyal run ug bayli, pára sa mga upisyális,</i> The next number is a special number for the officers only. <b>-- dilibiri</b> <i>n</i> special delivery. <i>v</i> [A13; c6] send s.t. special delivery. [+]
- The em-dash here stands for the full entry word, and should be separated from the following word by a space. It is to be included in the bold mark-up.
- The dagger indicates that there is some information on this entry in the supplement. Type it as [+].
ispísu a 1 for liquids to be thick, of great density. Ispísu kaáyung sikwáti kay gidaghan níyag tablíya, The chocolate drink is thick because he puts lots of chocolate on it. 2 for colors to be intense as if thickly laid on. Ispísu kaáyu ang kaitum sa balhíbu sa ákung iring, My cat's hair is a deep black. 3 -- nga [noun] a diehard, fanatic follower or believer of. Ispísu giyud nang Katuliku, He is a devout Catholic. Ispísung Usminyista, Diehard follower of Osmeña. 4 in phrases: -- ug apdu brave (lit. having thick bile). -- ug dugù a having guts. b heartless, merciless. -- ug hambug laying bragging on thick. v [A B; c1] for liquids to become thick, cause them to do so. Muispísu (maispísu) an sabaw ug butangag úbi, The soup will become thick if you add yams.
<b>ispísu</b> <i>a</i> <b>1</b> for liquids to be thick, of great density. <i>Ispísu kaáyung sikwáti kay gidaghan níyag tablíya,</i> The chocolate drink is thick because he puts lots of chocolate on it. <b>2</b> for colors to be intense as if thickly laid on. <i>Ispísu kaáyu ang kaitum sa balhíbu sa ákung iring,</i> My cat's hair is a deep black. <b>3</b> <b>-- nga [<i>noun</i>]</b> a diehard, fanatic follower or believer of. <i>Ispísu giyud nang Katuliku,</i> He is a devout Catholic. <i>Ispísung Usminyista,</i> Diehard follower of Osmeña. <b>4</b> <i>in phrases:</i> <b>-- ug apdu</b> brave (lit. having thick bile). <b>-- ug dugù</b> <b>a</b> having guts. <b>b</b> heartless, merciless. <b>-- ug hambug</b> laying bragging on thick. <i>v</i> [A B; c1] for liquids to become thick, cause them to do so. <i>Muispísu (maispísu) an sabaw ug butangag úbi,</i> The soup will become thick if you add yams.
- Here, a meaning number is directly followed by a phrase in bold, and later on, a meaning is directly followed by a meaning counting letter. Please always format such meaning numbers separately from all other parts of the text.
- The part [noun] is part of a (sub-entry) head word. Keep it within a set of bold-markers, even though the word noun is in italics.
No need to read beyond this point if you are just helping in the formatting rounds.
Post-Processing
This information is provided for those who are curious, and to allow for collaboration between those involved in post-processing.
Deliverables.
From the dictionary data, I plan to produce the following:
- A monolithic plain text and HTML file, as per Project Gutenberg standards, about 5-6 megabytes of text.
- A printable PDF file, formatted for A4 print-out in two columns.
- A TEI tagged SGML/XML master file.
- An XDXF formatted dictionary file, so it can be used on various applications and gadgets using that format.
- A searchable database to be published on www.bohol.ph and www.gutenberg.ph, to provide students and speakers with an interactive learning and reference resource, similar to Kaufmann's Visayan-English Dictionary currently available.
Steps
Note that I start post-processing after completion of round F1. The reason for this is that my intensive post-processing steps will catch most, if not all remaining formatting issues, often by automated tools, and that the interest for this type of work is rather low, so allowing this work to complete F2 will take considerable time.
Uptag1
Perl script to add semantic tagging to various elements of the dictionary, based on the typographic tagging added during formatting.
This script is run several times, fixing the various issues found in the PGDP output, until the output is reasonably clean.
This script has the following assumptions (with some refinements):
- single number or number followed by letter formatted bold: <number>.
- single a, n, v formatted italic: <pos>.
- bold formatted words: <form>.
- small-caps formatted items: <xref>.
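The four assumptions can be pictured as an ordered list of rewrite rules. The sketch below is in Python rather than Perl and is not the actual Uptag1 script; the names `UPTAG_RULES` and `uptag` are hypothetical, and none of the refinements are included. The order matters: bold numbers must be claimed before the general bold rule turns everything else into a form.

```python
import re

UPTAG_RULES = [
    (re.compile(r"<b>(\d+[a-z]?)</b>"), r"<number>\1</number>"),  # bold number, or number + letter
    (re.compile(r"<i>([anv])</i>"), r"<pos>\1</pos>"),            # single italic a, n, v
    (re.compile(r"<b>(.*?)</b>"), r"<form>\1</form>"),            # any remaining bold
    (re.compile(r"<sc>(.*?)</sc>"), r"<xref>\1</xref>"),          # small caps
]

def uptag(text):
    """Apply the typographic-to-semantic rules in order."""
    for pattern, replacement in UPTAG_RULES:
        text = pattern.sub(replacement, text)
    return text

print(uptag("<b>dábuk_1</b> <i>v</i> [A; a] <b>1</b> make a fire. = <sc>DABUK</sc>"))
```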
Things that will need manual intervention here:
- sense-numbers not marked bold (common).
- sense-numbers marked bold, but as part of a following or preceding headword.
- pos-codes not marked italic (common).
- pos-codes marked italic as part of a preceding italic word.
- sense-numbers and pos-codes that are supposed to be part of a cross reference (common).
- head-words that are actually used as part of a sense (occasionally, but wreaks havoc on the tagging algorithm).
- italic words that are not examples (common).
Uptag2
XSLT style-sheets to add high-level structure to the entries, based on the semantic tagged elements of the previous round.
This style-sheet derives the dictionary structure in the following way.
- groups elements starting with <form> into sub-entries.
- within each sub-entry: groups elements starting with <pos> into homonyms.
- within each homonym: groups elements starting with <number> into senses.
- within each sense: groups elements starting with <i> into examples.
During this process, in some cases, the various formatting may not be relevant for the structure. In those cases, the formatting element-name is temporarily changed to the element-name followed by an x, e.g., <i> tags that do not start an example sentence are changed to <ix>.
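The grouping idea itself (start a new group at each marker element) is language-independent. The following Python sketch only illustrates that idea on a flat list of (tag, text) pairs; it is not the XSLT style-sheet, and the name `group_by_start` and the sample data layout are assumptions.

```python
def group_by_start(items, start_tag):
    """Split a flat list of (tag, text) pairs into groups that each begin at start_tag.

    Items before the first start_tag form a leading group of their own.
    """
    groups = []
    for item in items:
        if item[0] == start_tag or not groups:
            groups.append([])
        groups[-1].append(item)
    return groups

items = [
    ("form", "dábuk"), ("pos", "v"),
    ("number", "1"), ("text", "make a fire."),
    ("number", "2"), ("text", "fumigate an area."),
    ("form", "(->)"), ("pos", "n"), ("text", "fire in an open place."),
]
sub_entries = group_by_start(items, "form")
senses = group_by_start(sub_entries[0], "number")
print(len(sub_entries), len(senses))
```

Applying the same function level by level (form, then pos, then number, then i) gives the nested structure described above.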
Tools.
For post-processing and further uses of the dictionary data, the following will be needed.
0. Syllabification
Syllabify("bábuy") -> { "bá", "buy" }
A tool is needed to split Cebuano words into syllables. Since the writing used in this dictionary is highly regular, this should not be too cumbersome, and can be based on research done by Jesse S. Banks in his Lingua-Phonology package. All that is needed is a definition of the sonority scale, and of the valid onsets and codas for Cebuano.
Based on this tool, other tasks become easier, for example, shifting stress, finding default stress, etc.
See also: Perl for Linguists.
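As a rough illustration of what such a tool could do, here is a Python sketch that does not use a sonority scale at all. It assumes, as a simplification, that every vowel is its own syllable nucleus, that ng is a single consonant, that only one consonant can open a syllable, and that a mid-word hyphen is always a syllable boundary. Loan words with consonant clusters in the onset would be split incorrectly. The function names are hypothetical.

```python
# Vowels as used in this dictionary's orthography (precomposed characters assumed).
VOWELS = set("aiuáíúàìù")

def letters(part):
    """Split into letters, keeping the digraph ng together."""
    units, i = [], 0
    while i < len(part):
        step = 2 if part[i:i + 2].lower() == "ng" else 1
        units.append(part[i:i + step])
        i += step
    return units

def syllabify(word):
    syllables = []
    for part in word.split("-"):          # hyphen = syllable-initial glottal stop
        if not part:
            continue
        units = letters(part)
        nuclei = [i for i, u in enumerate(units) if u.lower() in VOWELS]
        start = 0
        for current, following in zip(nuclei, nuclei[1:]):
            # Adjacent vowels split between them; otherwise the last
            # consonant before the next vowel becomes its onset.
            boundary = following if following == current + 1 else following - 1
            syllables.append("".join(units[start:boundary]))
            start = boundary
        syllables.append("".join(units[start:]))
    return syllables

print(syllabify("bábuy"))
```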
1. A tool to expand patterns (as given in the dictionary) to full word forms, that is, a function:
ApplyPattern("bisayà", "-in-") -> "binisayà"
2. A tool to expand the verbal forms (as indicated in the dictionary with codes such as [A123S; c2]) to a full page listing all these forms in full.
ApplyVerbConjugation("isturya", "c3") -> "table with all forms"
All such expanded forms need to be included in the search interface, such that people searching for a word-form will find the corresponding entry in the dictionary.
Note that this algorithm should also take into account the rules for vowel-dropping, metathesis, and other morphophonemic alternations, as detailed in section 5.0 of the introduction.
3. Develop a web-based search interface.
I have set up an interface to Kaufmann's Hiligaynon dictionary: http://www.bohol.ph/kved.php. We will need something similar, taking care that variant spellings still locate the proper word; that is, somebody typing baboy should still find bábuy.
Search requirements:
- Fuzzy matching. The search interface should allow fuzzy matching of Cebuano words, covering the most common spelling variations, that is:
- ignore case
- ignore accents
- ignore distinction between {u, o} and {e, i}
- ignore presence or absence of dash and apostrophe
- ignore presence or absence of i before y.
- ignore operation of common variants (such as dropping the l)
- Derived forms. The search interface should allow the location of a root word when searching for an inflected form.
- All words should be searchable. There is no stop-word list, and searches for ang or the will give results (a lot of results).
- Multiple search terms
- It should be possible to search for multiple words at once.
- When a search for multiple words is made, only entries matching all search words will be returned.
- Restrict search. The search interface should allow restriction to the search fields
- head words only (Cebuano)
- definitions only (English)
- head words and examples (Cebuano)
- definitions and translations of examples (English)
- All text.
- When a restricted search leads to no results, the interface should automatically select a less restrictive search mode. The user should be notified of this.
- Result ordering
- Results should be ordered alphabetically by head word.
- Matches in the head word or definitions should sort before matches in the examples.
- Match coloring
- Matching words should be high-lighted with a contrastive color in the search results.
- Exact matches should be colored brighter than fuzzy matches.
- Different colors should be used to high-light different search words matched.
- Automatic link-following
- Cross references in entries should be returned as an active hyperlink to an appropriate search action.
- Entries consisting of only a bare cross reference should be followed automatically, and both the source entry and the entry referenced to should be returned. It should be clear to the user that this has happened.
- Binomial names should link to the relevant entry on species.wikimedia.org (for example, Tarsius syrichta).
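Most of the fuzzy-matching requirements can be met by reducing both the indexed word and the query to the same simplified key. The Python sketch below is one possible reading of those requirements; the name `search_key` is hypothetical, and l-dropping and other common variants are not covered because they need a list of variants rather than a character rule.

```python
import unicodedata

def search_key(word):
    """Reduce a Cebuano word to a simplified key for fuzzy matching."""
    # Ignore case, then strip accents by removing combining marks.
    decomposed = unicodedata.normalize("NFD", word.lower())
    key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore dash and apostrophe.
    key = key.replace("-", "").replace("'", "")
    # Ignore the distinction between {u, o} and {e, i}.
    key = key.replace("o", "u").replace("e", "i")
    # Ignore presence or absence of i before y.
    key = key.replace("iy", "y")
    return key

print(search_key("baboy"), search_key("bábuy"))
```

Exact matches can still be ranked above fuzzy ones by comparing the original spellings after the key lookup has found the candidates.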
4. Derive 'standard' orthography forms. Although Cebuano has no standardized orthography in the way that, for example, English and French have, many speakers will not accept Prof. Wolff's spelling as correct. We need to apply rules (such as given by Yap in his grammar, and used in Cabonce's dictionary) to distinguish u and o, and apply some further changes. Such forms need to be verified against a corpus, as discussed below.
ApplyOrthography("haluk") -> "halok"
5. Develop a spell checker. From this normalized list, we can develop a spelling checker, for use with open office and other products. This includes finding ways to compactly represent affixed forms.
6. Develop hyphenation patterns (similar to those used with the TeX system and Open Office), such that words can be broken if required to fit lines on a page. I have already made a small start with this.
7. Collect a corpus of Cebuano texts, such as newspapers, etc., to be used for expanding the dictionary material and bringing it up to date with current usage. A good starting point would be investigating http://borel.slu.edu/crubadan/
Any new material should be clearly distinguishable from the original Wolff dictionary, and presented as such.
Database Structure
The database structure for Wolff's dictionary will be fairly simple. Three main tables will be used:
CREATE TABLE ceb_entry ( entryid int unsigned NOT NULL, entry text NOT NULL );
which contains all the entries (as well-formed XML fragments, with the <entry> tag as root element).
CREATE TABLE ceb_wordentry ( wordid int unsigned NOT NULL, entryid int unsigned NOT NULL );
linking the words and entries together, and
CREATE TABLE ceb_word ( wordid int unsigned NOT NULL, word varchar(32) NOT NULL, normalized varchar(32) NOT NULL, type tinyint unsigned NOT NULL default '0' );
which serves as an index into the entries.
word is the word as it appears in the dictionary,
normalized is the word normalized to a simplified spelling; for Cebuano words, that means all accents and hyphens are removed. When queries are made, the query words are also simplified, and matched against this field. Exact matches are an option in the advanced search interface.
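A lookup through these three tables could work roughly as follows. The sketch uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real database server, so the column types are simplified; the function name `find_entries` and the single sample row are made up for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ceb_entry (entryid INTEGER NOT NULL, entry TEXT NOT NULL);
CREATE TABLE ceb_wordentry (wordid INTEGER NOT NULL, entryid INTEGER NOT NULL);
CREATE TABLE ceb_word (wordid INTEGER NOT NULL, word VARCHAR(32) NOT NULL,
                       normalized VARCHAR(32) NOT NULL,
                       type INTEGER NOT NULL DEFAULT 0);
""")

# Illustrative placeholder row, not real dictionary data.
conn.execute("INSERT INTO ceb_entry VALUES (1, '<entry><form>bábuy</form></entry>')")
conn.execute("INSERT INTO ceb_word (wordid, word, normalized) VALUES (1, 'bábuy', 'babuy')")
conn.execute("INSERT INTO ceb_wordentry VALUES (1, 1)")

def find_entries(conn, normalized_query):
    """Return (entryid, entry) rows whose index words match the simplified query."""
    cursor = conn.execute(
        "SELECT DISTINCT e.entryid, e.entry "
        "FROM ceb_word w "
        "JOIN ceb_wordentry we ON we.wordid = w.wordid "
        "JOIN ceb_entry e ON e.entryid = we.entryid "
        "WHERE w.normalized = ? ORDER BY e.entryid",
        (normalized_query,),
    )
    return cursor.fetchall()

print(find_entries(conn, "babuy"))
```

The query word is assumed to have been simplified in the same way as the normalized column before it is passed in.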
The type indicates how the word is used in the entry, with the following values.
- Cebuano headword
- Cebuano used otherwise
- English equivalents
- English word used otherwise
- Cebuano word derived from pattern (e.g., if the headword for a sub-entry is paN- and the main headword is abla, pangabla will be added as this type.)
- Cebuano word derived from verb codes (including those derived from patterns).
- l-dropped variant. (e.g., balay becoming báy)
I consider using a bitfield for this field, such that the following coding can be used:
- 1 Cebuano headword
- 2 Cebuano used otherwise
- 4 English equivalents
- 8 English word used otherwise
- 16 Cebuano word derived from pattern (e.g., if the headword for a sub-entry is paN- and the main headword is abla, pangabla will be added as this type.)
- 32 Cebuano word derived from verb codes (including those derived from patterns).
- 64 l-dropped variant. (e.g., balay becoming báy)
And queries can be implemented for combinations of any type. (And words can have more than one flag set, for example, an l-dropped derived form.)
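The bit-field coding maps directly onto a flag enumeration. The Python sketch below uses the values listed above; the class name, member names, and the helper `sql_filter` are hypothetical.

```python
from enum import IntFlag

class WordType(IntFlag):
    CEB_HEADWORD = 1
    CEB_OTHER = 2
    ENG_EQUIVALENT = 4
    ENG_OTHER = 8
    CEB_FROM_PATTERN = 16
    CEB_FROM_VERB_CODE = 32
    L_DROPPED = 64

def sql_filter(mask):
    """Build a WHERE fragment matching rows that have any of the given flags set."""
    return "(type & %d) != 0" % int(mask)

# An l-dropped variant of a pattern-derived form carries two flags at once.
flags = WordType.CEB_FROM_PATTERN | WordType.L_DROPPED
print(int(flags), WordType.L_DROPPED in flags)
print(sql_filter(WordType.CEB_HEADWORD | WordType.CEB_OTHER))
```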
TEI Tagging Conventions.
TEI tags used in the dictionary will derive from the TEI P4 Guidelines. Tags will be selected from the TEI lite subset when possible, and the additions for printed dictionaries when needed.
The following principles will be followed:
- Tagging is extra. Information in tags and attributes will be used to make explicit and supplement the information in the dictionary. No information will be removed or moved into tags.
- Language will be tagged only when a language change is present. The top level element lang attribute will have the value en. The attribute lang will be set to en or ceb (or other appropriate languages) at the highest level possible. Biological scientific names will have the lang attribute set to la (for Latin).
- Default language for elements:
- entry: en
- form, q: ceb
Intermediate tagging
The `uptag` scripts used to add tagging use existing (typographic) tags combined with some heuristics to determine the structure of an entry. This will not always work, as sometimes, the typographic tags are not used in the expected way. To resolve this, "offending" tags are changed to the same tag followed by an 'x', for example `<i>` becomes `<ix>` if the italics do not represent a sample, and `<form>` becomes `<formx>` when the recognized form is not in a normal position of a head-word.
Later in the process, such tags will be changed back to the original tag.
Cross references
Cross references are encoded using `<xr>` for the entire cross-referencing phrase, and `<ref>` for the exact word. When there is no cross reference phrase, the `<xr>` is not required. Note that small-caps are used to formally indicate a cross reference, but many more cross references are implied by words in italics.
The formatting of cross-references to entries, pos, and meanings is highly inconsistent. I will attempt to normalize this, and link to the proper element directly.
TEI P4 versus TEI P5
The latest TEI P5 standard removes a number of tags proposed below. To accommodate TEI P5, the following changes can be made:
<eg>Ispísung Usminyista,<trans>Diehard follower of Osmeña.</trans> </eg>
Becomes:
<cit type="example"> Ispísung Usminyista, <cit type="trans">Diehard follower of Osmeña.</cit> </cit>
The transform can be applied with XSLT automatically.
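For readers without an XSLT processor at hand, the same renaming can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration of the mapping shown above, not the planned XSLT; the function name `eg_to_cit` is hypothetical, and only `<trans>` elements directly inside an `<eg>` are touched.

```python
import xml.etree.ElementTree as ET

def eg_to_cit(root):
    """Rename <eg>/<trans> pairs to the nested <cit> form, in place."""
    for eg in list(root.iter("eg")):
        for trans in eg.findall("trans"):
            trans.tag = "cit"
            trans.set("type", "trans")
        eg.tag = "cit"
        eg.set("type", "example")
    return root

root = ET.fromstring(
    "<eg>Ispísung Usminyista,<trans>Diehard follower of Osmeña.</trans></eg>"
)
eg_to_cit(root)
print(ET.tostring(root, encoding="unicode"))
```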
References
A collection of references to projects that use TEI for the encoding of dictionaries.
- Ways of automatically tagging a dictionary, includes a discussion on Wolff's dictionary.
- Making Dictionaries.
- Corpus building for minority languages by Kevin P. Scannell.
- An interesting approach to building a corpus.
- Tools to convert a TEI dictionary to the DICT format.
- An interesting discussion on the ways to represent a dictionary in TEI (Powerpoint file).
- TEI tagged dictionaries, unfortunately just using a typographic representation.
- Another TEI tagged dictionary, using entryFree.
- Using the TEI Scheme in Compiling a Korean Dictionary by Beom-mo Kang
- Jaslo, A Japanese-Slovene Learners' Dictionary: Methods for Dictionary Enhancement by Tomaž Erjavec et al.
- The FreeDict project uses TEI as their core format, and has a collection of very interesting Perl scripts to deal with TEI dictionaries.
- Ang Dila Natong Bisaya, by Manuel Yap.
- A Cebuano grammar written in Cebuano.
Sample TEI Tagged Texts
Single Entry
A possible way of tagging the sample entry would be (not yet validated against TEI!):
<?xml version="1.0" encoding="utf-8"?>
<entry>
<form>ispísu</form>
<hom>
<gramGrp>
<pos>a</pos>
</gramGrp>
<sense n="1">
<number>1</number>
<trans>for liquids to be thick, of great density.</trans>
<eg>
Ispísu kaáyung sikwáti kay gidaghan níyag tablíya,
<trans>The chocolate drink is thick because he puts lots of chocolate on it.</trans>
</eg>
</sense>
<sense n="2">
<number>2</number>
<trans>for colors to be intense as if thickly laid on.</trans>
<eg>
Ispísu kaáyu ang kaitum sa balhíbu sa ákung iring,
<trans>My cat's hair is a deep black.</trans>
</eg>
</sense>
<sense n="3">
<number>3</number>
<form>-- nga [noun]</form>
<trans>a diehard, fanatic follower or believer of.</trans>
<eg>
Ispísu giyud nang Katuliku,
<trans>He is a devout Catholic.</trans>
</eg>
<eg>
Ispísung Usminyista,
<trans>Diehard follower of Osmeña.</trans>
</eg>
</sense>
<sense n="4">
<number>4</number>
<note>in phrases:</note>
<sense>
<form>-- ug apdu</form>
<trans>brave (lit. having thick bile).</trans>
</sense>
<sense>
<form>-- ug dugù</form>
<sense n="a">
<number>a</number>
<trans>having guts.</trans>
</sense>
<sense n="b">
<number>b</number>
<trans>heartless, merciless.</trans>
</sense>
</sense>
<sense>
<form>-- ug hambug</form>
<trans>laying bragging on thick.</trans>
</sense>
</sense>
</hom>
<hom>
<gramGrp>
<pos>v</pos>
<itype>[AB; c1]</itype>
</gramGrp>
<trans>for liquids to become thick, cause them to do so.</trans>
<eg>
Muispísu (maispísu) an sabaw ug butangag úbi,
<trans>The soup will become thick if you add yams.</trans>
</eg>
</hom>
</entry>
Double Entry
Another sample tagged according to TEI:
<superEntry>
<entry>
<form>dábuk</form>
<number>1</number>
<hom>
<gramGrp>
<pos>v</pos>
<itype>[A; a]</itype>
</gramGrp>
<sense>
<number>1</number>
<trans>make a fire.</trans>
<eg>
Pagdábuk dihà kay magdigámu ta,
<trans>Make a fire because we're going to fix dinner.</trans>
</eg>
</sense>
<sense>
<number>2</number>
<trans>fumigate an area.</trans>
<eg>
Dabúkan ta ang mangga arun mudaghan ang búnga,
<trans>Let's subject the mango tree to smoke so that there will be lots of fruit.</trans>
</eg>
</sense>
</hom>
<hom>
<form>(->)</form>
<gramGrp>
<pos>n</pos>
</gramGrp>
<sense>
<number>1</number>
<trans>fire in an open place.</trans>
</sense>
<sense>
<number>2</number>
<trans>place where an open fire is built.</trans>
<eg>
Duul ra sa balay ang dabuk (dabukan),
<trans>They built the fire too close to the house.</trans>
</eg>
</sense>
</hom>
<hom>
<form>-an(->)</form>
<xr>= <ref>dabuk, 2.</ref></xr>
</hom>
</entry>
<entry>
<form>dábuk</form>
<number>2</number>
<hom>
<gramGrp>
<pos>v</pos>
<itype>[A; a]</itype>
</gramGrp>
<trans>crush by pounding.</trans>
<eg>
Dabúka ang mani pára sa kaykay,
<trans>Pound the peanuts for the cookies.</trans>
</eg>
</hom>
<hom>
<form>(->)</form>
<gramGrp>
<pos>n</pos>
</gramGrp>
<trans>crushed to fine bits, crumbled.</trans>
<eg>
Dabuk sa pán,
<trans>Bread crumbs.</trans>
</eg>
</hom>
</entry>
</superEntry>
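Since these samples have not been validated against TEI, a quick well-formedness check is a useful first step. The following is a minimal sketch in Python, using only the standard library; the function name `summarize` and the shortened sample are illustrative assumptions and not part of the existing scripts. It parses a cut-down version of the dábuk entries and lists the head word, homograph number, part of speech, and glosses. `ET.fromstring` raises `ParseError` if the markup is not well-formed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the double-entry sample (examples and cross references omitted).
SAMPLE = """<superEntry>
<entry>
<form>dábuk</form>
<number>1</number>
<hom>
<gramGrp><pos>v</pos><itype>[A; a]</itype></gramGrp>
<sense><number>1</number><trans>make a fire.</trans></sense>
<sense><number>2</number><trans>fumigate an area.</trans></sense>
</hom>
</entry>
<entry>
<form>dábuk</form>
<number>2</number>
<hom>
<gramGrp><pos>v</pos><itype>[A; a]</itype></gramGrp>
<trans>crush by pounding.</trans>
</hom>
</entry>
</superEntry>"""


def summarize(xml_text):
    """Return (head word, number, part of speech, glosses) for each <hom>."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    rows = []
    for entry in root.iter("entry"):
        headword = entry.findtext("form")
        number = entry.findtext("number")
        for hom in entry.findall("hom"):
            pos = hom.findtext("gramGrp/pos")
            # Glosses sit either directly under <hom> or one level down in <sense>;
            # translations nested inside <eg> are deliberately not collected.
            glosses = [t.text for t in hom.findall("trans")]
            glosses += [t.text for t in hom.findall("sense/trans")]
            rows.append((headword, number, pos, glosses))
    return rows


for row in summarize(SAMPLE):
    print(row)
```

This only looks one level deep for senses; nested senses, as in the ispísu sample, would need a recursive walk.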
Cross References
Cross references are treated as additional senses. They can appear as part of a larger entry, or stand-alone, as in this example.
<entry>
<form>nan</form>
<number>1</number>
<hom>
<sense>
<number>1</number>
<xr>= <ref>DAN</ref></xr>.
</sense>
<sense>
<number>2</number>
<trans>in narrations, particle preceding a statement that is off the subject but important for the course of the story.</trans>
<eg>
Nan, kadtu si Antunyu, palainan ta lang, ákù tung ámu,
<trans>Now, this Antonio, to change the subject, was my employer.</trans>
</eg>
</sense>
</hom>
</entry>
The stand-alone cross reference.
<entry>
<form>nan</form>
<number>2</number>
<hom>
<sense>
<xr>= <ref>UG</ref>, 1 (dialectal).</xr>
</sense>
</hom>
</entry>
Phrasal Sub-entries
Sometimes, phrases are entered as sub-entries.
Issues in the dictionary structure
In a few cases, the structure of dictionary entries has some complexities.
- cross references sometimes point to an entry, sometimes to an entry in a certain role, and sometimes to a specific meaning. This is normally not an issue, except that correctly matching the typographic layout with the intended meaning is sometimes complex, and the typography is inconsistent. Note that cross references can point to sub-entries as well as other entries.
- sub-entries under a sense. Sometimes sub-entries (in bold) are part of a sense, whereas the regular location would be at the end of the entry. This typically happens with short phrases entered as entries. They are sometimes numbered, sometimes not.
- position of verb conjugation patterns. These sometimes appear before the sense numbers, and sometimes after. I take this to mean that in the first case the scope is all senses, and in the second case only the indicated sense (overriding any higher level pattern).
- numbering of senses. The numbering of senses is not always strictly 1, 2, 3, ..., but the exact semantics of the numbering system are not always clear. For verbs, they probably help to identify the scope of a verb conjugation pattern.
- additional parts of speech. Besides nouns, verbs, and adverbs/adjectives, the dictionary contains entries for particles, affixes, and phrases. These are either not identified as such, or identified only in prose.
Verb Conjugation Codes
Wolff's dictionary uses short-hand codes to indicate how a verb can be used. Those codes take some time to get used to. A computer-based dictionary does not have the space constraints of a printed dictionary, so we can present this information in a more comprehensive, template-based style.
The meaning of these codes is explained in the preface, section 7.1 and following.
| | Future | Past | Subjunctive |
---|---|---|---|
Active | |||
Punctual | mu- | mi-, ni-, ning-, ming- | mu- |
Durative | mag-, maga- | nag-, naga-, ga- | mag-, maga- |
Potential | maka-, ka- | naka-, ka- | maka-, ka- |
Direct Passive | |||
Punctual | -un | gi- | -a |
Durative | paga-un* | gina-* | paga-a* |
Potential | ma- | na- | ma- |
Local Passive | |||
Punctual | -an | gi-an | -i |
Durative | paga-an* | gina-an* | paga-i* |
Potential | ma-an, ka-an | na-an | ma-i, ka-i |
Instrumental Passive | |||
Punctual | i- | gi- | i- |
Durative | iga-* | gina-* | iga-* |
Potential | ma-, ika- | na-, gika- | ma-, ika- |
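As a rough illustration of the template-based presentation mentioned above, the affix table can be held as data and applied to a root. This is only a sketch under strong assumptions: the names `AFFIXES`, `apply_affix` and `paradigm` are hypothetical, only the punctual rows are included, only the first variant of each cell is used (for example mi- for the active past), and affixes are attached by plain concatenation, ignoring stress shifts and any sound changes at the boundary between root and affix.

```python
# Punctual rows of the table above: (voice, aspect) -> (future, past, subjunctive).
# Only the first variant listed in each cell is kept.
AFFIXES = {
    ("active", "punctual"): ("mu-", "mi-", "mu-"),
    ("direct passive", "punctual"): ("-un", "gi-", "-a"),
    ("local passive", "punctual"): ("-an", "gi-an", "-i"),
    ("instrumental passive", "punctual"): ("i-", "gi-", "i-"),
}

TENSES = ("future", "past", "subjunctive")


def apply_affix(pattern, root):
    """Insert the root at the hyphen: 'mu-' prefixes, '-un' suffixes, 'gi-an' wraps."""
    prefix, _, suffix = pattern.partition("-")
    return prefix + root + suffix


def paradigm(root, voice, aspect="punctual"):
    """Return a tense -> form mapping for one row of the table."""
    patterns = AFFIXES[(voice, aspect)]
    return {tense: apply_affix(p, root) for tense, p in zip(TENSES, patterns)}


print(paradigm("dabuk", "local passive"))
print(paradigm("dabuk", "direct passive"))
```

The unaccented forms this produces for the local passive future (dabukan) and the direct passive subjunctive (dabuka) correspond to Dabúkan and Dabúka in the sample sentences of the dábuk entries, apart from the stress marks, which this sketch does not handle.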
Verb conjugation codes can have two parts, separated by a semicolon, as in [AB; c1]. The letters and modifiers have the following meanings:
First part: active verbs
- A - Action Verbs
  - 1 without punctual
  - 2 without durative
  - 3 without potential
  - S stress shift in indicated class
  - 1S stress shift in punctual
  - P pa- can be added without change of meaning in indicated class
  - 3P maka = makapa-
  - N paN- can be added to root, mu- and maka- forms.
- B - Stative Verbs
  - 1 without mu-
  - 2 without mag-
  - 3 maka- has meaning "become [so-and-so]"
  - 3(1) maka- has meaning "become [so-and-so]" and "cause to become [so-and-so]"
  - 4 without na-
  - 5 without naka-
  - 6 without magka-
  - S, N as with A.
- C - Mutual Verbs
  - 1 without mag-
  - 2 without magka-
  - 3 without makig-
Second part: passive verbs
- a - verbs with direct passive affixes (focus is recipient of action)
  - 1 - without local passive
  - 2 - without instrumental passive
  - 3 - only potential passive
  - 4 - focus is suffering from or affected by thing referred to
- b - verbs with local passive affix (focus is recipient of action) and instrumental passive affixes
  - (1) - without instrumental passive affixes (except -i)
  - 1 - focus is place of action or recipient of action
  - 2 - focus is place of action; hi-an(->), hi-i also refers to accidental recipient of action.
  - 3 - reason of action
  - 3(1) - as 3, but only with potential affixes
  - 4 - focus is thing affected
  - 4(1) - as 4, but only with potential affixes
  - 5 - local and direct passive are synonymous
  - 6 - only local passive and instrumental -i (focus is place or beneficiary of action)
  - 7 - focus is diminished or added to
  - 8 - only potential local passives
- c - verbs with instrumental passive (focus is thing conveyed or recipient)
  - 1 - direct and instrumental passive are synonymous
  - 2 - local and instrumental passive are synonymous (focus is recipient of action)
  - 3 - only with potential affixes ika-, gika-
  - 4 - optionally take ig-
  - 5 - focus is reason for agent to become in certain state
  - 6 - without local passive affixes
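To present these codes in expanded form, they first have to be split into their parts. The sketch below shows one possible way to do this in Python; the function name `parse_code` and the returned structure are illustrative assumptions. It separates a code such as [AB; c1] into its active and passive classes and keeps each class's modifiers as an uninterpreted string, since deciding whether, say, 3P is one modifier or two requires the rules from the preface.

```python
import re


def parse_code(code):
    """Split a conjugation code into (class letter, modifier string) pairs."""
    body = code.strip().strip("[]")
    # Everything before the semicolon is the active part, everything after the passive part.
    active_part, _, passive_part = body.partition(";")
    # Active classes are A, B, C; their modifiers are digits, parentheses, S, P and N.
    active = re.findall(r"([ABC])([0-9()SPN]*)", active_part)
    # Passive classes are a, b, c; their modifiers are digits and parentheses.
    passive = re.findall(r"([abc])([0-9()]*)", passive_part)
    return {"active": active, "passive": passive}


print(parse_code("[AB; c1]"))
print(parse_code("[A; a]"))
```

A code with only an active part simply yields an empty list for the passive part.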
Encarnacion's Diccionario Español-Bisaya
Juan Felis de la Encarnacion's Diccionario Español-Bisaya first appeared in 1866, and went through several reprints. Although mostly of historical interest, we are also processing this dictionary through this site.
To translate the Spanish-based orthography to the style used by John U. Wolff, you'll need to apply the following replacements:
Encarnacion | Wolff | Note |
---|---|---|
ao | aw | |
ai | ay | |
c | k | |
e | i | |
gui | gi | |
j | h | |
ñ | ny | |
n[~g] | ng | depends on context |
ng | ngg | depends on context, not before consonants. |
o | u | |
oa | wa | depends on context, typically not when one of the pair is accented. |
ua | wa | depends on context, typically not when one of the pair is accented. |
qu | k | |
Note that the accents are used differently in this dictionary.
The glottal stop is not indicated by Encarnacion.
Encarnacion is highly inconsistent in his use of o versus u, often using different spellings for the same word in a single entry. To remedy this, I propose to do the following during PP:
1. Follow the Spanish orthography for words derived from Spanish.
2. Follow the rules as given in Cabonce's dictionary for other words.
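
The replacement table above could be applied mechanically along the following lines. This is a minimal sketch and not an existing tool: the names `SIMPLE`, `CONTEXTUAL` and `convert` are hypothetical, it handles lower-case input only, and it ignores accents. Unconditional replacements are applied in a single pass, with longer sequences tried first so that gui and qu are handled before single letters. Sequences marked "depends on context" in the table are left unchanged and reported for manual review. The o versus u normalisation proposed above is not attempted beyond the plain o to u rule.

```python
import re

# Unconditional rows of the table, longer sequences first.
SIMPLE = {
    "gui": "gi", "qu": "k", "ao": "aw", "ai": "ay",
    "ñ": "ny", "c": "k", "j": "h", "e": "i", "o": "u",
}
# Rows marked "depends on context": left as-is and flagged.
CONTEXTUAL = ["n[~g]", "ng", "oa", "ua"]

# Alternatives are tried in order at each position, so contextual
# sequences are recognised before their letters are rewritten one by one.
_PATTERN = re.compile("|".join(re.escape(s) for s in CONTEXTUAL + list(SIMPLE)))


def convert(word):
    """Return (converted word, list of context-dependent sequences found)."""
    flagged = []

    def replace(match):
        text = match.group(0)
        if text in SIMPLE:
            return SIMPLE[text]
        flagged.append(text)  # needs a human decision
        return text

    return _PATTERN.sub(replace, word), flagged


print(convert("cabao"))
print(convert("guinoo"))
print(convert("hangin"))
```

Returning the flagged sequences alongside the result keeps the context-dependent decisions with the post-processor instead of guessing them.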