The main phase of the Waima’a documentation project ran from 2002 to 2006. The team adopted a community-based approach to the documentation. The basic documentation work (recording, initial transcription, and translation into Tetum and Malay) was done by native speakers, in particular Maurício Belo. The focus here was on covering a wide variety of speech genres and associated cultural activities. The researchers from Germany and Australia prepared the documents for archiving and added information on setting, culture and linguistic structures. See Bowden & Hajek (2006) for a more detailed overview of the project work.

The multilingual nature of Timorese society and the demands of best international documentary practice mean that annotations for part of the corpus are provided in four languages: Portuguese and Tetum, which are the two official languages of East Timor, as well as English and Malay, as languages of regional and international communication. Communication within the project was carried out mostly in Tetum and Malay.

We warmly welcome further additions of any kind to the documentation: new recordings or photos, improved and expanded annotations, analytical papers. An example of the last kind is the short reflection on Waima’a culture by Father Cancio, submitted to us in 2009.

The project work also involved intensive interaction and cooperation with the East Timorese Instituto Nacional de Linguística (INL). Being the first major internationally funded research project of its kind, the project functioned as a kind of guinea pig for both the establishment of a research policy in the new nation and for testing and developing policies for the minority languages such as Waima’a, which the constitution also recognizes as national languages.

Bowden, John and John Hajek (2006) When best practice isn’t necessarily the best thing to do: dealing with capacity limits in a developing country. In Linda Barwick and Nicholas Thieberger (eds) Sustainable data from digital fieldwork. Sydney: Sydney University Press. (Available from Sydney eScholarship repository.)


This section briefly describes the corpus as of 30 November 2006 (work on the corpus continues, but this is when the main phase of work was completed).

1. Quantitative corpus description

Tapes used: 98 (78 video, 14 DAT, 6 audio cassettes)

Number of recorded sessions: 454
shortest session: < 10 sec (shot of a speaker)
longest session: 05:02:33 (part of a multi-day political meeting)

Number of annotation files (Toolbox): 78 (= close to 12 hours of recordings), 42 of them including Portuguese glossing

Lexical database: ca. 4000 entries of variable size; all provide glosses at least in Tetum, Malay, and English; few have lengthy meaning explications plus exemplification

Introductory materials: ca. 200 pages
1) brief introduction to corpus
2) phonology and orthography
3) setting sketch with notes on place names and kinship and slide show
4) grammar sketch

Literacy materials: 3 booklets

Other:
photographs: ca. 185, mostly flora and fauna, some artifacts
lists of words and phrases used in eliciting data for phonetic analysis: see sessions on sounds and intonation
laryngographs: see Hajek & Stevens 2005, Stevens & Hajek 2004
documents and maps used in village meeting: see sessions sukufoun
archeological report by Nuno Oliveira: report on cave excavation in 2005 in Portuguese, English, Tetum, and Waima’a


2. Qualitative corpus description

In its current form, the corpus is characterized by the following features:

a)     it contains a highly diverse set of naturally occurring communicative events, including rain invocation chants and mourning songs, drunken speech, taking cock fighting bets, everyday chatting while peeling peanuts or binding flowers, political discussion and, of course, folktales and personal and historical narratives. In addition, it contains a fair number of speech events elicited with the help of prompts such as the pear film, the frog story, the space games and other prompting material developed by the MPI Nijmegen. While certainly not yet ‘complete’ in the sense of including all types of speech events found in the community, this corpus goes well beyond what is normally available for a small, previously undocumented speech community.

b)    while consisting mainly of Waima’a recordings, it also contains a number of segments in the local variant of Tetum, which is quite regularly used in, for example, political discussions.

c)     to a significant degree, corpus contents have been determined by the East Timorese team members and the Waima’a community. Note that less than 5% of the total corpus was recorded by the professional linguists and that none of them was present for more than 70% of the recordings. That is, with regard to recording, the corpus is primarily an achievement of the East Timorese team members, specifically M. Belo, who made almost all the recordings.

d)    the standard annotation aimed at includes a transcription in practical orthography, segmented into intonation units and including false starts etc., together with glossing and free translation in Tetum, Malay, and English. A sizable subset of the corpus additionally includes glossing and translation in Portuguese.

e)     the quality of the annotation varies considerably across the annotated sessions. At one extreme, there are a few sessions which have only been worked on by one or two team members. These may involve major inconsistencies in intonation-based segmentation, glossing mistakes, and translations which are difficult to parse and lack coherence and cohesion across unit boundaries. At the other extreme, there are a few sessions which have been worked on repeatedly by three or four different team members in an attempt to weed out most inconsistencies and to provide a coherent translation accessible to readers not familiar with the language, culture and setting. Most sessions are at an intermediate stage of processing, having been only partially or roughly checked once by a second or third team member. Each annotation file contains information on the processing stage in its first record.

f)     the ca. 4000 lexical entries in the Toolbox lexicon file are at similarly divergent stages of processing, some having been gone over repeatedly with two or more native speakers, others having been created on the spot in order to provide a quick gloss for a word encountered in a transcript, with no attempt yet made to elucidate the full range of meanings.

In assessing the above features of the corpus, the following two facts need to be taken into account:

1) Next to nothing (i.e. fewer than 150 words) was known about Waima’a before the documentation project started. Hence the first half year of the project was mainly spent analyzing the basic phonology, developing a practical orthography on that basis, and obtaining the community’s consent to this orthography. Furthermore, the East Timorese team members had to be trained in using the orthography, in addition to learning how to make recordings and annotations.

2) Only East Timorese, none with any previous training in linguistics, worked full time or for substantial regular part-time periods on the documentation. The western team members supervised and organized the work of the East Timorese team members and did the basic analysis needed to make a documentation possible (most importantly: orthography development, as just mentioned), but had only limited time for actually working on the corpus, apart from drafting the introductory materials (i.e. there was no trained linguist working full time on this corpus for any extended period of time).