Here is a list of all datasets currently available for German.
Datasets for all types of Inflection Analyzers, Generators and Analyzer/Generators:
- Evaluation license: all nouns, verbs, adjectives and adverbs starting with the letters 'a' and 's', plus all other types of entries (for a total of about 40'000 entries).
- Full license: all currently available German words (currently almost 300'000).
Datasets for the German Wordformation Analyzer/Generator:
- Evaluation license: all derivation level entries concerning all nouns, verbs, adjectives and adverbs starting with the letters 'a' and 's', plus all other types of entries (a total of about 30'000 relations).
- Full license: all derivation level entries for all entries (currently 350'000 relations).
Datasets for the Unknown Word Products:
- Evaluation license: all German entries and ad hoc analyzed combinations beginning with the characters 'a' and 's' (where the single words used for the combination begin with 'a' or 's').
- Full license: all relevant word formation rules and a base of all currently available lexicalized German words (currently more than 210'000).
Here are some German-specific features that need to be considered by your client application, in order to make the best use of our data analyzers.
Each kind of analyzer accepts input written without special characters (e.g. the German word "mögen" written as "moegen") and records this in the delivered output with the special "Flach" feature.
query -> moegen
result -> mögen (Cat V)(Flach ouml), (Cat N)(Flach ouml)
The German-specific attribute "Flach" tags forms which, according to the dictionary, do not exist. The Lemmatizer recognizes them nevertheless, because they correspond to valid forms typed without a language-specific keyboard. For example, Kaese is the "Flach"-attributed version of the German word Käse.
|Flach||auml||Same meaning as HTML entities|
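A client application may want to detect when a query was matched through its "Flach" form. The following is a minimal sketch in Python, assuming the result arrives as a string shaped like the example result above; the functions `parse_readings` and `was_flattened` are hypothetical helpers, not part of the product, and the real output format of the analyzers may differ.

```python
import re

# Matches one "(Name value)" feature pair, e.g. "(Cat V)" or "(Flach ouml)".
# Assumption: features are written exactly as in the documented example.
FEATURE = re.compile(r"\((\w+)\s+([^)]+)\)")

def parse_readings(result: str):
    """Split 'lemma (A x)(B y), (A z)(B y)' into the lemma and a list of feature dicts."""
    lemma, _, rest = result.partition(" ")
    readings = []
    for chunk in rest.split(","):
        features = dict(FEATURE.findall(chunk))
        if features:
            readings.append(features)
    return lemma, readings

def was_flattened(readings) -> bool:
    """True if any reading carries the Flach feature, i.e. the query lacked special characters."""
    return any("Flach" in reading for reading in readings)

lemma, readings = parse_readings("mögen (Cat V)(Flach ouml), (Cat N)(Flach ouml)")
```

With the example result, `lemma` holds the dictionary form "mögen" and each reading carries both its category and the "Flach" value, so the client can show the corrected spelling to the user.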
Lexemes and wordforms affected by the German spelling reform in 1998 are marked with special features. These features allow old and new spelling variants to be filtered at word level.
Features with feature attributes:
indicate the new or changed spelling rule that causes the new spelling variants.
Features with the attribute "Ortho" indicate the type of spelling variant:
|Ortho||New||new, no preference||aufwändig|
|Ortho||Old||old, no preference||aufwendig|
|Ortho||Old-Obs||old, obsolete||Tip, Telephon|
|Ortho||CH||Swiss||Fuss (German standard: Fuß)|
|Ortho||NZZ||Neue Zürcher Zeitung||Crème|
All New* attributes indicate that the variants have been introduced by the spelling reform. These variants are incorrect according to the old spelling.
All Old* attributes indicate that the variants existed before the reform. All Old* variants, except those marked Old-Obs, are still correct according to the new spelling rules.
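The rules for New* and Old* attributes can be turned into a simple filter. The sketch below is one possible reading of those rules in Python; the function names and the list of (form, Ortho) pairs are illustrative assumptions, and the CH and NZZ values are deliberately left undecided because the text does not say how they relate to the old and new spelling.

```python
from typing import Optional

def correct_in_new_spelling(ortho: str) -> Optional[bool]:
    """New* variants and all Old* variants except Old-Obs are correct under the new rules."""
    if ortho.startswith("New"):
        return True
    if ortho.startswith("Old"):
        return ortho != "Old-Obs"
    return None  # e.g. CH, NZZ: not covered by the rules stated above

def correct_in_old_spelling(ortho: str) -> Optional[bool]:
    """Old* variants existed before the reform; New* variants are incorrect in the old spelling."""
    if ortho.startswith("New"):
        return False
    if ortho.startswith("Old"):
        return True
    return None  # e.g. CH, NZZ: not covered by the rules stated above

# Illustrative data taken from the Ortho table; the pair layout is an assumption.
variants = [("aufwändig", "New"), ("aufwendig", "Old"), ("Tip", "Old-Obs")]

new_spelling_forms = [form for form, ortho in variants if correct_in_new_spelling(ortho)]
old_spelling_forms = [form for form, ortho in variants if correct_in_old_spelling(ortho)]
```

Returning `None` for values outside New* and Old* keeps the client from silently accepting or rejecting regional variants; how to treat them is a decision for the application.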