Introduction to the Diachronic Section of the CNC
The diachronic section of the CNC covers texts from a total of seven centuries of the development of the Czech language. The first completed part (approximately 700 000 word forms) of the diachronic section of the Czech National Corpus (hereafter DCNC) was made accessible to the public in September 2005. Further material is being made public at a pace of about 250 000 word forms per year.
The DCNC contains texts dating from the end of the 13th century up to the beginning of the synchronic section, that is, up to and including 1989 (for journalistic and specialized texts) or up to and including 1944 (for fiction). The DCNC thus contains texts from approximately seven centuries of the development of Czech; the texts were originally written down or printed in different spelling systems (simple, digraphic and diacritical orthography) and their combinations. The heterogeneous character of the texts entering the DCNC necessarily demands somewhat different processing than is usual both in editions of older written texts (whose rules are usually closely adapted to the specific linguistic and orthographic characteristics of a certain period, or to the characteristics of one author or work) and in synchronic corpora (whose rules are oriented to the contemporary state of the language and are to some extent based on the current linguistic awareness of the corpus users).
The main goal in processing texts for the diachronic corpus is to ensure, despite the variety mentioned above, a uniform search that is as simple and as universal as possible across texts from the entire seven-hundred-year historical development of Czech, while retaining as much as possible of the relevant linguistic information contained in these texts. Two rules are applied in the diachronic corpus to meet these goals:
- The texts are transcribed, not transliterated. This rule makes it possible to search for occurrences of specific word forms in the diachronic corpus, just as in the synchronic one.
- The texts are tagged. This makes it possible to obtain various information about individual texts and their structure, as well as to preserve a substantial amount of linguistic information that is normally lost when texts are transcribed (for details see below).
In the future, the search options in the diachronic corpus will be considerably extended by lemmatisation using hyperlemmata, which will allow the user to search for all occurrences of a specific lexeme regardless of the variety of its period-specific and other forms (for instance, a search using the hyperlemma kůň will also find the older Czech forms kóň and kuoň).
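The idea behind a hyperlemma search can be sketched in a few lines of Python. This is only an illustration, not the DCNC implementation: the mapping table, the function name find_by_hyperlemma and the sample token list are all hypothetical, and the sketch assumes that the hyperlemma in the example above is kůň, with the older forms kóň and kuoň.

```python
import unicodedata

# Hypothetical table: each attested word form points to its hyperlemma.
# A real corpus would derive this from lemmatisation, not a hand-written dict.
FORM_TO_HYPERLEMMA = {
    "kůň": "kůň",
    "kóň": "kůň",
    "kuoň": "kůň",
}


def normalise(form):
    """Lower-case a form and bring it into Unicode NFC for stable lookup."""
    return unicodedata.normalize("NFC", form).lower()


def find_by_hyperlemma(tokens, hyperlemma, table=FORM_TO_HYPERLEMMA):
    """Return (position, form) pairs for every token mapped to the hyperlemma."""
    target = normalise(hyperlemma)
    return [
        (position, token)
        for position, token in enumerate(tokens)
        if table.get(normalise(token)) == target
    ]


# Illustrative token sequence mixing forms from different periods.
tokens = ["kóň", "a", "kuoň", "i", "kůň"]
hits = find_by_hyperlemma(tokens, "kůň")
print(hits)
```

A query for a single hyperlemma thus returns all period variants at once, whereas a plain word-form search would match only the form that was typed.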