Available Corpora

Written corpora (synchronic)

corpus name  size (# of words)  lemmatisation  morphological tags  publication date  short description
SYN 1 300 mil. YES YES 2010 non-reference; unification of all the SYN-series synchronic written corpora
SYN2010 100 mil. YES YES 2010 balanced corpus, most of the texts are from 2005 - 2009
SYN2009PUB 700 mil. YES YES 2010 corpus of newspapers and magazines from 1995 - 2007
SYN2006PUB 300 mil. YES YES 2006 corpus of newspapers and magazines from 1989 - 2004
SYN2005 100 mil. YES YES 2005 balanced corpus, most of the texts are from 2000 - 2004
SYN2000 100 mil. YES YES 2000 balanced corpus, most of the texts are from 1990 - 1999
FSC2000 100 mil. YES NO 2004 modified SYN2000, source of the Frequency Dictionary of Czech
CZESL-PLAIN 2 mil. NO NO 2012 non-reference; learner corpus of non-native Czech speakers
LINK 1.8 mil. YES YES 2010 non-reference; corpus of linguistic texts
KSK-DOPISY 800 000 NO NO 2006 transcriptions of handwritten correspondence from 1990 - 2004
ORWELL 80 000 YES YES 2003 Orwell's "1984", manually annotated

Spoken corpora (synchronic)

corpus name  size (# of words)  lemmatisation  morphological tags  publication date  short description
ORAL2008 1 mil. NO NO 2008 sociolinguistically balanced corpus of informal spoken Czech
ORAL2006 1 mil. NO NO 2006 corpus of informal spoken Czech
SCHOLA2010 790 000 NO NO 2010 corpus of school lessons
PMK 675 000 NO NO 2001 Prague spoken corpus
BMK 490 000 NO NO 2002 Brno spoken corpus

Diachronic corpora

corpus name  size (# of words)  lemmatisation  morphological tags  publication date  short description
DIAKORP 1.95 mil. NO NO 2005 non-reference; corpus of the diachronic section of the CNC

Foreign language corpora

corpus name  size (# of words)  lemmatisation  morphological tags  publication date  short description
DOTKO 12 mil. NO NO 2010 non-reference; corpus of Lower Sorbian, most of the texts are from 1848 - 1933
HOTKO 36 mil. NO NO 2013 non-reference; corpus of Upper Sorbian
deWaC 1 350 mil. YES YES 2013 web corpus of German
frWaC 1 350 mil. YES YES 2013 web corpus of French
itWaC 1 600 mil. YES YES 2013 web corpus of Italian
ukWaC 1 900 mil. YES YES 2013 web corpus of British English

Parallel corpus

corpus name  size (# of words)  lemmatisation  morphological tags  publication date  short description
InterCorp 92 mil. YES (partial) YES (partial) 2008 non-reference; parallel corpus being compiled as a part of the InterCorp project