Available Datasets

Here is a list of all datasets currently available for German.

Datasets for all types of Inflection Analyzers, Generators and Analyzer/Generators:

  • Evaluation license: all nouns, verbs, adjectives and adverbs starting with the letters 'a' and 's', plus all other types of entries (for a total of about 40'000 entries).
  • Full license: all available German words (currently almost 300'000).

Datasets for the German Wordformation Analyzer/Generator:

  • Evaluation license: all derivation level entries concerning all nouns, verbs, adjectives and adverbs starting with the letters 'a' and 's', plus all other types of entries (a total of about 30'000 relations).
  • Full license: all derivation level entries for all entries (currently 350'000 relations).

Datasets for the Unknown Word Products:

  • Evaluation license: all German entries and ad hoc analyzed combinations beginning with the characters 'a' and 's' (where the single words used for the combination begin with 'a' or 's').
  • Full license: all relevant word formation rules and a base of all available lexicalized German words (currently more than 210'000).

Language-Specific Features

Here are some German-specific features that need to be considered by your client application, in order to make the best use of our data analyzers.

Special Characters

Each kind of analyzer tolerates input elements that do not use special characters (e.g. the German word "mögen" written as "moegen"), tracing this information with the special "Flach" feature in the delivered output.

query   -> moegen 
result  -> mögen
             (Cat V)(Flach ouml),
             (Cat N)(Flach ouml)
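
A client application will usually want to turn the parenthesized feature notation shown above into a structure it can query. The following is a minimal sketch in Python, assuming the readings arrive as text in exactly the notation of the sample (comma-separated readings, each a sequence of "(Attribute Value)" pairs). The function name parse_readings is hypothetical and not part of the product; the actual delivery format of your analyzer may differ.

```python
import re

# One "(Attribute Value)" pair, e.g. "(Cat V)" or "(Flach ouml)".
FEATURE = re.compile(r"\((\w+)\s+([^()]+)\)")


def parse_readings(text: str) -> list[dict[str, str]]:
    """Split comma-separated readings into attribute/value dictionaries.

    Assumption: values never contain commas or parentheses.
    """
    readings = []
    for chunk in text.split(","):
        features = {attr: value.strip() for attr, value in FEATURE.findall(chunk)}
        if features:
            readings.append(features)
    return readings


sample = """(Cat V)(Flach ouml),
            (Cat N)(Flach ouml)"""

for reading in parse_readings(sample):
    print(reading)
```

Each reading becomes one dictionary, so a check such as "Flach" in reading is enough to tell whether the query was matched through a flat spelling.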

The German-specific attribute "Flach" tags forms that do not exist according to the dictionary but are nevertheless recognized by the Lemmatizer, because they are what results when text is entered without a language-specific keyboard. For example, Kaese is the "Flach"-attributed version of the German word Käse.

Attribute   Values   Meaning
Flach       auml     Same meaning as HTML entities
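
Because "Flach" values carry the same meaning as HTML entity names, a client can resolve them to the character they name using a standard entity table. The sketch below uses Python's html.entities module for this. The transliteration map (ä to ae, ö to oe, ü to ue, ß to ss) reflects the conventional way of typing German without a German keyboard and is an assumption about which flat spellings the analyzers accept; the helper names are hypothetical.

```python
from html.entities import name2codepoint

# Conventional ASCII transliterations (assumed, not taken from the product).
FLAT_SPELLINGS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def flach_to_char(value: str) -> str:
    """Resolve a Flach value such as 'ouml' to the character it names."""
    return chr(name2codepoint[value])


def flatten(word: str) -> str:
    """Return the spelling of a word without special characters."""
    return "".join(FLAT_SPELLINGS.get(ch, ch) for ch in word)


def matches_flat_query(query: str, lemma: str, flach_value: str) -> bool:
    """Check that the lemma contains the tagged character and flattens to the query."""
    return flach_to_char(flach_value) in lemma and flatten(lemma) == query


print(flach_to_char("ouml"))
print(flatten("Käse"))
print(matches_flat_query("moegen", "mögen", "ouml"))
```

A check like matches_flat_query lets the client confirm that a returned "Flach" reading really corresponds to what the user typed before replacing the input with the dictionary form.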

Spelling Reform

Attribute   Meaning
OCapRule    "Spelling Rule"
ORule       "Spelling Rule"
OSepRule    "Spelling Rule"
Ortho       "Spelling Variant"

Lexemes and wordforms affected by the German spelling reform in 1998 are marked with special features. These features allow filtering - on word level - of old and new spelling variants.

Features with one of the following attributes:

  • OCapRule
  • ORule
  • OSepRule

indicate the new or changed spelling rule that causes the new spelling variants.

Features with the attribute "Ortho" indicate the type of spelling variant:

Attribute   Values     Variant                Example
Ortho       New        new, no preference     aufwändig
            Old        old, no preference     aufwendig
            New-HV     new, main              essenziell
            Old-NV     old, secondary         essentiell
            Old-HV     old, main              Delphin
            New-NV     new, secondary         Delfin
            New-Only   new, only              Tipp
            Old-Only   old, only              Telefon
            Old-Obs    old, obsolete          Tip, Telephon
            CH         Swiss                  Fuss (German standard: Fuß)
            NZZ        Neue Zürcher Zeitung   Crème

All New* values indicate variants that were introduced by the spelling reform. These variants are incorrect according to the old spelling.

All Old* values indicate variants that existed before the reform. All Old* variants except those marked Old-Obs are still correct according to the new spelling rules.
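
The two rules above are enough to filter variants by spelling standard on word level. The following Python sketch encodes them; the function name is_valid and the (word, Ortho value) pairs are hypothetical illustrations built from the examples in the table, not an interface of the product. The text above does not say how the CH and NZZ values relate to the old and new spelling, so the sketch returns None for them and leaves that decision to the client.

```python
from typing import Optional


def is_valid(ortho: str, spelling: str) -> Optional[bool]:
    """Tell whether a variant tagged with an Ortho value is correct.

    spelling is "old" or "new". Returns None for values whose status
    is not defined by the New*/Old* rules (e.g. CH, NZZ).
    """
    if spelling not in ("old", "new"):
        raise ValueError("spelling must be 'old' or 'new'")
    if ortho.startswith("New"):
        # Introduced by the reform: incorrect under the old spelling.
        return spelling == "new"
    if ortho.startswith("Old"):
        # Existed before the reform; only Old-Obs is no longer correct.
        return spelling == "old" or ortho != "Old-Obs"
    return None


variants = [
    ("Tipp", "New-Only"), ("Tip", "Old-Obs"),
    ("Telefon", "Old-Only"), ("Telephon", "Old-Obs"),
    ("Delfin", "New-NV"), ("Delphin", "Old-HV"),
]

new_spelling = [word for word, ortho in variants if is_valid(ortho, "new")]
old_spelling = [word for word, ortho in variants if is_valid(ortho, "old")]
print(new_spelling)
print(old_spelling)
```

A client that also wants to prefer main over secondary variants could additionally inspect the HV and NV suffixes of the value.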