Available Datasets

The following datasets are available for all types of Analyzers, Generators and Analyzer/Generators currently offered for English:

  • Evaluation license: all nouns, verbs, adjectives and adverbs starting with the letters 'a' to 'd', plus all other types of entries (a total of about 11'000 entries).
  • Full license: all currently available English words (more than 50'000 at present), including the analysis of contraction elements.

The following datasets are available for the English Word Formation Analyzer/Generator:

  • Evaluation license: all derivation-level entries for nouns, verbs, adjectives and adverbs starting with the letters 'a' to 'd', plus all other types of entries (a total of about 7'000 relations).
  • Full license: all derivation-level entries for all entries (currently 43'000 relations).

Language-Specific Features

Your client application needs to take the following English-specific features into account in order to make the best use of our data analyzers.

Attribute     Meaning
Variety       English variety (regional varieties of lexical items): British Common English (BCE), British English, American English
SpellVar      Spelling variants: British Common English (standard), exclusive American English spelling variant (AE), optional American English spelling variant (ae), optional British spelling variant (be)
Contraction   Contractions of elements, usually clitics

British and American English

Our English analyzers distinguish between different spelling variants. We adopted British Common English (BCE) as the standard spelling. Special features mark American and British spelling variants.

The features are:
  • (SpellVar BCE): British Common English spelling
  • (SpellVar AE): exclusive American spelling variant, used instead of BCE spelling. Example: BCE colour, AE color
  • (SpellVar ae): optional American spelling variant, used as well as BCE spelling. Example: BCE travelled, ae traveled
  • (SpellVar be): optional British spelling variant, used as well as BCE spelling. Example: BCE realise, be realize

With this information you can set a filter to analyze your text according to your specific criteria.
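
As an illustration, such a filter could look like the following minimal sketch. It assumes that your client application has already turned each analysis into a dictionary of attribute-value pairs; the dictionary layout, the "form" key and the function name filter_by_spellvar are hypothetical and not part of our analyzers. It also assumes that analyses without a SpellVar attribute should be kept.

```python
# Illustrative sketch only: the dict layout and the function name are
# assumptions, not part of the analyzer's interface.

def filter_by_spellvar(analyses, allowed):
    """Return the analyses whose SpellVar value is in `allowed`.

    Analyses that carry no SpellVar attribute are kept (an assumption).
    """
    kept = []
    for analysis in analyses:
        spellvar = analysis.get("SpellVar")
        if spellvar is None or spellvar in allowed:
            kept.append(analysis)
    return kept


# Example results, using the word forms quoted in the list above.
analyses = [
    {"form": "colour", "SpellVar": "BCE"},
    {"form": "color", "SpellVar": "AE"},
    {"form": "traveled", "SpellVar": "ae"},
    {"form": "realize", "SpellVar": "be"},
]

# A filter that rejects both kinds of American spelling variants.
british_only = filter_by_spellvar(analyses, allowed={"BCE", "be"})
print([a["form"] for a in british_only])  # ['colour', 'realize']
```

Which values belong in the allowed set depends on your own criteria; the set shown here is only one possible choice.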

SpellVar features differ from Variety features: Variety features mark regional varieties of lexical items, such as the American word "billfold" for BCE "wallet", or "mailman" for "postman".


The English version can recognize and analyze word forms with apostrophes:

  • Possessive forms of nouns; this includes singular word forms like "entry's", as well as plural word forms like "points'", including exceptions.
  • Contractions of auxiliary + not such as "doesn't", "haven't".

Please note: If you require a single analysis of a word form with an apostrophe, do not use the apostrophe character as a separator within your application.
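
One possible way to follow this advice in a client application is a tokenizer that keeps apostrophes inside the token. The sketch below is an assumption about how such a tokenizer might be written; the pattern and the function name tokenize are hypothetical, and only ASCII letters and the ASCII apostrophe are handled.

```python
import re

# ASCII letters, optional word-internal apostrophes ("doesn't", "cat's"),
# and an optional trailing apostrophe directly after "s" ("points'").
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*(?:(?<=s)')?")


def tokenize(text):
    """Split text into word tokens without breaking at apostrophes."""
    return TOKEN.findall(text)


print(tokenize("The cat's toy doesn't match the points' total."))
# ['The', "cat's", 'toy', "doesn't", 'match', 'the', "points'", 'total']
```

Note that a trailing apostrophe after "s" may also be a closing quotation mark; this sketch does not try to tell the two apart.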

Here is an example for the Lemmatizer:

query   -> cat's
filter  -> (Cat N)
result  -> cat
             (Cat N)(Contraction N+'s/Clitic)
             (Cat N)(Contraction N+have/V)
             (Cat N)(Contraction N+be/V)

The Contraction Feature

The contraction feature specifies the contraction elements included in the answer. The example above shows the Lemmatizer results for the query "cat's". An entity is described by its category alone (example: N) if it is an "open" entity, i.e. any entry of that category could potentially fill the position (subject to specific restrictions). An entity is specified by its citation form, followed by "/" and its category (example: be/V), if it describes an element from a finite set of possibilities.

Here is a formal syntax description:

In the text representation, a contraction feature is an attribute-value pair whose attribute is "Contraction":

contraction-feature ::= "(Contraction" value ")".

The value of the pair consists of a sequence of entities separated by "+":

value               ::= entity {"+" entity}+.
entity              ::= [citation-form "/"] category.
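
The syntax above can be read by a small parser such as the following sketch. The names Entity and parse_contraction are hypothetical; the sketch assumes that citation forms and categories contain no "+" and that categories contain no "/".

```python
from typing import List, NamedTuple, Optional


class Entity(NamedTuple):
    citation_form: Optional[str]  # None for "open" entities such as N
    category: str


def parse_contraction(feature: str) -> List[Entity]:
    """Parse a string like "(Contraction N+be/V)" into its entities."""
    prefix = "(Contraction "
    if not (feature.startswith(prefix) and feature.endswith(")")):
        raise ValueError("not a contraction feature: %r" % feature)
    value = feature[len(prefix):-1].strip()

    parts = value.split("+")
    if len(parts) < 2:
        # The grammar requires at least two entities joined by "+".
        raise ValueError("expected at least two entities: %r" % value)

    entities = []
    for part in parts:
        # Split at the last "/" so that the category is always the tail.
        citation, sep, category = part.rpartition("/")
        if not category or (sep and not citation):
            raise ValueError("malformed entity: %r" % part)
        entities.append(Entity(citation if sep else None, category))
    return entities


print(parse_contraction("(Contraction N+be/V)"))
```

For the feature (Contraction N+'s/Clitic) this yields an open entity with category N, followed by the entity with citation form 's and category Clitic.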

Problem Feature

Features of the type (Problem xy) are related to the entry specification in our database. They can be ignored.
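
A client application could drop these features while reading a result line, as in the following sketch. It assumes that features appear as parenthesized attribute-value pairs as in the Lemmatizer example, and that values contain no parentheses; the function names are hypothetical, and "xy" is only the placeholder used above.

```python
import re

# One "(Attribute value)" pair; the value may contain spaces but no parentheses.
FEATURE = re.compile(r"\(([^\s()]+) ([^()]*)\)")


def parse_features(text):
    """Return the (attribute, value) pairs found in a feature string."""
    return FEATURE.findall(text)


def drop_problem_features(text):
    """Return all pairs except those whose attribute is "Problem"."""
    return [(attr, value) for attr, value in parse_features(text)
            if attr != "Problem"]


print(drop_problem_features("(Cat N)(Problem xy)(Contraction N+be/V)"))
# [('Cat', 'N'), ('Contraction', 'N+be/V')]
```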