LINK (originally LIngvistův Narozeninový Korpus, i.e. Linguist’s Birthday Corpus, created on the occasion of Professor František Čermák’s birthday) is a corpus consisting exclusively of linguistic texts. It is therefore designed primarily for research into the specifics of academic language (the study of terminology, the language of linguistics, etc.).
The corpus currently contains approximately 1.8 million tokens (excluding punctuation). It is lemmatized and morphologically tagged in the same way as the corpora of the SYN series (the lemmatization and tagging are at roughly the same level as in the SYN2009PUB corpus). The LINK corpus consists of 256 linguistic texts from the period 1985–2010, the vast majority of which come from the turn of the millennium. It includes both major linguistic studies (monographs, proceedings) and articles in professional periodicals and journals (esp. Slovo a slovesnost, Naše řeč).
To make it easier to work with individual text subtypes, the corpus is provided with the structural tag disciplina (in addition to the standard set of structural tags: txtype, genre, med, etc.), which divides the individual works into categories according to traditional linguistic disciplines:
- codification (and language cultivation)
- cognitive (linguistics)
- contrastive (linguistics, description)
- corpus (and corpus linguistics)
- general (linguistics)