Available Corpora
Written corpora (synchronic)
| corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description |
| SYN | 1 300 mil. | YES | YES | 2010 | non-reference |
| | 100 mil. | YES | YES | 2010 | balanced corpus, most of the texts are from 2005 - 2009 |
| | 700 mil. | YES | YES | 2010 | corpus of newspapers and magazines from 1995 - 2007 |
| | 300 mil. | YES | YES | 2006 | corpus of newspapers and magazines from 1989 - 2004 |
| | 100 mil. | YES | YES | 2005 | balanced corpus, most of the texts are from 2000 - 2004 |
| | 100 mil. | YES | YES | 2000 | balanced corpus, most of the texts are from 1990 - 1999 |
| FSC2000 | 100 mil. | YES | NO | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech |
| CZESL-PLAIN | 2 mil. | NO | NO | 2012 | non-reference |
| LINK | 1.8 mil. | YES | YES | 2010 | non-reference |
| KSK-DOPISY | 800 000 | NO | NO | 2006 | transcriptions of handwritten correspondence from 1990 - 2004 |
| ORWELL | 80 000 | YES | YES | 2003 | Orwell's "1984", manually annotated |
Spoken corpora (synchronic)
| corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description |
| ORAL2008 | 1 mil. | NO | NO | 2008 | sociolinguistically balanced corpus of informal spoken Czech |
| ORAL2006 | 1 mil. | NO | NO | 2006 | corpus of informal spoken Czech |
| SCHOLA2010 | 790 000 | NO | NO | 2010 | corpus of school lessons |
| PMK | 675 000 | NO | NO | 2001 | Prague spoken corpus |
| BMK | 490 000 | NO | NO | 2002 | Brno spoken corpus |
Diachronic corpora
| corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description |
| DIAKORP | 1.95 mil. | NO | NO | 2005 | non-reference |
Foreign language corpora
| corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description |
| DOTKO | 12 mil. | NO | NO | 2010 | non-reference |
| HOTKO | 36 mil. | NO | NO | 2013 | non-reference |
| deWaC | 1 350 mil. | YES | YES | 2013 | web corpus of German |
| frWaC | 1 350 mil. | YES | YES | 2013 | web corpus of French |
| itWaC | 1 600 mil. | YES | YES | 2013 | web corpus of Italian |
| ukWaC | 1 900 mil. | YES | YES | 2013 | web corpus of British English |
Parallel corpus
| corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description |
| InterCorp | 92 mil. | YES (partial) | YES (partial) | 2008 | non-reference |


