The Corpus of Early Ontario English (CONTE)
Size and sampling
CONTE pre-Confederation section (CONTE-pC)
The corpus is conceived as spanning the period from the earliest Ontarian English texts to the end of the 19th century (ca. 225,000 words). Its pre-Confederation section (CONTE-pC), i.e. 1776 - 1850, comprises about 125,000 words in three genres.
The genres are newspaper texts, diary entries, and letters. The corpus is designed so that findings may be compared to other language corpora. For instance, genres, sampling sizes, and periodization are compatible with those found in ARCHER - A Representative Corpus of Historical English Registers. Since CONTE features periods of 25 years, two CONTE periods taken together can be compared to one period in ARCHER. This design should also facilitate comparisons with other historical varieties of English.
CONTE is divided along temporal and social criteria. Chronologically, the corpus
is split into five periods, each spanning 25 years from 1776 to the end of 1899,
with the exception of period 1, which lasts only 24 years. Its design includes
three genres: diaries, (semi-)official letters, and local newspapers. Although
all three genres are non-speech-based genres, they may be classified according
to their level of formality to allow statements about written, as well as hypotheses
about spoken, Ontario English.
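The periodization described above can be written out as a small lookup. This is only an illustrative sketch: the names `CONTE_PERIODS` and `conte_period` are hypothetical, and the boundaries of period 4 (1850-1874) are inferred from the 25-year scheme rather than stated explicitly on this page.

```python
# Period boundaries: period 1 covers 1776-1799 (24 years); periods 2-5 each
# cover 25 years up to the end of 1899. Period 4 is inferred from the scheme.
CONTE_PERIODS = {
    1: (1776, 1799),
    2: (1800, 1824),
    3: (1825, 1849),
    4: (1850, 1874),
    5: (1875, 1899),
}


def conte_period(year):
    """Return the CONTE period number for a year, or None if outside 1776-1899."""
    for number, (start, end) in CONTE_PERIODS.items():
        if start <= year <= end:
            return number
    return None


# Length of each period in years (inclusive of both boundary years).
period_lengths = {n: end - start + 1 for n, (start, end) in CONTE_PERIODS.items()}
```

Under these assumptions, a text dated 1789 falls into period 1 and a text dated 1880 into period 5; a year after 1899 yields `None`.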
Following the classification by Kytö and Rissanen (1983), diaries belong to the informal register and are thus closer to spoken language. Official letters, in contrast, are formal pieces of writing. Some of the letters in CONTE were written by people with little schooling, so some influence from the informal register is to be expected. For this reason, these letters are termed semi-official letters, as opposed to the official letters by the more proficient writers. On the basis of this observation we may assume semi-official letters to be closer to spoken language than the official ones (cf. Tieken-Boon van Ostade, 1985 for the principles of this approach). Newspaper texts belong to the formal register. This genre comprises local Ontario newspaper text and is the only genre in CONTE that is entirely made up of printed texts. As a consequence, some regularization on the part of the printers may be expected.
Along social lines, CONTE is divided into two social classes, with the exception of the newspaper genre. The diary and letter sections are thus split into middle-class and lower-class writers (in the absence of an upper class in Early Ontario). Where possible, this distinction was based on external information; where this was not possible, as was the case with some letter writers, the absence of an author's name in the Dictionary of Canadian Biography, combined with unskilled handwriting and appropriate letter content, was taken as an indicator of lower-class membership.
Size and sampling
Texts from specific genres were selected differently from the general 'universe' of potential Ontario English texts: quasi-random selection was only possible in the case of letters, as these came on microfilm, allowing me to select every fifth or seventh letter, to reach the targeted number.
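The quasi-random selection of letters amounts to systematic sampling: taking every n-th item from an ordered source. The following is a minimal sketch of that idea, not the procedure actually used for CONTE; the function names, the placeholder letter labels, and the rule for choosing between a step of five and seven are assumptions made for illustration.

```python
def every_nth(items, step, start=0):
    """Systematic sample: take every `step`-th item beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return items[start::step]


def choose_step(population_size, target, candidate_steps=(5, 7)):
    """Pick the candidate step whose resulting sample size is closest to the target."""
    def sample_size(step):
        return len(range(0, population_size, step))
    return min(candidate_steps, key=lambda s: abs(sample_size(s) - target))


# Hypothetical microfilm reel holding 35 letters in archival order.
letters = [f"letter_{i:02d}" for i in range(1, 36)]

step = choose_step(len(letters), target=5)   # a step of 7 gives 5 letters
sample = every_nth(letters, step)
```

With 35 letters, a step of five would yield seven letters and a step of seven yields five, so the target of five letters selects the larger step.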
For diaries, the scarcity of verbatim editions and manuscripts ruled out any statistical method of selection. The procedure applied for the selection of texts took the holdings of the Archives of Ontario and the University of Toronto Libraries as a starting point. Anne Powell's travel diary of 1789 and the beginning of Ely Playter's diary from 1799 serve as evidence for the first period from 1776 to 1799. In this genre, what was found and proved to be reliable data is included.
With newspapers, however, we are luckily in a better position, since data are readily available for all periods except the first. Again, the holdings of the Archives of Ontario and the University of Toronto Libraries served as a starting point. Generally, newspapers from smaller communities are preferred over those from larger ones. Therefore, we find the Wingham Times and not the Toronto Star in the period from 1875 to 1899. The preference for small local newspapers, as opposed to large national ones, arises from the presumably greater amount of linguistic variation that they offer.
All in all, the corpus comprises some 225,000 words over three genres, approximately 10,000 to 20,000 words per genre and period. At least two texts are included for each genre and period. The goal of including chunks of between 5,000 and 10,000 words is not always met, but it was ensured that for diaries and newspapers one chunk of at least 2,000 words is included, which should provide a minimum for carrying out syntactic studies. For letters, the sample sizes depended on the length of the letters, as they were transcribed in full. See the appendix for a complete list of the texts included in CONTE.
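As a rough consistency check on these figures, the overall totals can be divided by the number of genre-and-period cells. The sketch below uses only the rounded totals given on this page; the variable names are illustrative, and the actual word counts per cell are not reproduced here.

```python
GENRES = 3            # diaries, letters, newspapers
PERIODS = 5           # 1776-1899
TOTAL_WORDS = 225_000

PC_PERIODS = 3        # periods 1-3, the pre-Confederation section
PC_WORDS = 125_000

# Average words per genre-and-period cell, for the whole corpus and for CONTE-pC.
mean_per_cell = TOTAL_WORDS / (GENRES * PERIODS)        # 15,000 words
pc_mean_per_cell = PC_WORDS / (GENRES * PC_PERIODS)     # roughly 13,900 words

# Both averages lie inside the stated range of 10,000 to 20,000 words.
within_range = all(10_000 <= m <= 20_000 for m in (mean_per_cell, pc_mean_per_cell))
```

The averages say nothing about how evenly the words are spread; individual cells may still sit near either end of the range.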
More information may be found in Dollinger (2005).
CONTE pre-Confederation section
CONTE-pC is a subsection of CONTE, consisting of periods 1, 2, and 3, i.e. 1776-1799, 1800-1824 and 1825-1849, in the corpus's three genres. As such, it comprises 125,000 words and may serve as a database for pre-Confederation CanE (prior to 1867). CONTE-pC was finished for internal work in July 2004 and is the empirical basis for my PhD thesis (Dollinger 2006). CONTE-pC will be made available to the research community once the necessary documentation is completed.
Dollinger, Stefan. 2006. New-dialect formation in early Canada: the modal auxiliaries in Ontario, 1776-1850. PhD thesis, University of Vienna.
Dollinger, Stefan. 2005 (forthc.). "Oh Canada! Towards the Corpus of Early Ontario English". In: Renouf, Antoinette and Andrew Kehoe (eds.). The changing face of corpus linguistics. Proceedings of the 24th ICAME Conference, Guernsey, UK, April 2003. Amsterdam: Rodopi.
Kytö, M. and M. Rissanen (1983), 'The syntactic study of Early American English. The variationist at the mercy of his corpus?', Neuphilologische Mitteilungen, 84: 470-490.
Tieken-Boon van Ostade, I. (1985), 'Do-support in the writings of Lady Mary Wortley Montagu: a change in progress', Folia Linguistica Historica, 6/1: 127-151.