Available Corpora

Written corpora (synchronic)

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
SYN | 1 300 mil. | YES | YES | 2010, non-reference | unification of all the SYN-series synchronic written corpora
SYN2010 | 100 mil. | YES | YES | 2010 | balanced corpus, most of the texts are from 2005–2009
SYN2009PUB | 700 mil. | YES | YES | 2010 | corpus of newspapers and magazines from 1995–2007
SYN2006PUB | 300 mil. | YES | YES | 2006 | corpus of newspapers and magazines from 1989–2004
SYN2005 | 100 mil. | YES | YES | 2005 | balanced corpus, most of the texts are from 2000–2004
SYN2000 | 100 mil. | YES | YES | 2000 | balanced corpus, most of the texts are from 1990–1999
FSC2000 | 100 mil. | YES | NO | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech
LINK | 1.8 mil. | YES | YES | 2010, non-reference | corpus of linguistic texts
KSK-DOPISY | 800 000 | NO | NO | 2006 | transcriptions of handwritten correspondence from 1990–2004
ORWELL | 80 000 | YES | YES | 2003 | Orwell's "1984", manually annotated

Spoken corpora (synchronic)

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
ORAL2008 | 1 mil. | NO | NO | 2008 | sociolinguistically balanced corpus of informal spoken Czech
ORAL2006 | 1 mil. | NO | NO | 2006 | corpus of informal spoken Czech
SCHOLA2010 | 790 000 | NO | NO | 2010 | corpus of school lessons
PMK | 675 000 | NO | NO | 2001 | Prague spoken corpus
BMK | 490 000 | NO | NO | 2002 | Brno spoken corpus

Diachronic corpora

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
DIAKORP | 1.95 mil. | NO | NO | 2005, non-reference | corpus of the diachronic section of the CNC
DOTKO | 12 mil. | NO | NO | 2010, non-reference | corpus of Lower Sorbian, most of the texts are from 1848–1933

Parallel corpus

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
InterCorp | 92 mil. | YES (partial) | YES (partial) | 2008, non-reference | parallel corpus being compiled as a part of the InterCorp project