DE-Lemma (pronounced: de:e: le:ma:) is an object-oriented lemmatizer for German texts with a focus on the (bio)medical domain.
It is based on Apache OpenNLP and provides several pre-trained, binary Maximum-Entropy models in the `models` directory. These were trained in October 2022 on freely available German treebanks.
- Apache Maven in version 3.6+
- Java / OpenJDK in version 17+
- Apache OpenNLP in version 2.1.0+
- OpenNLP releases < 2.1.0 can't reliably load the lemmatizer model files of this project! This is due to OPENNLP-1366, which was discovered during work on DE-Lemma. The bug was fixed via PR-427 and the fix is included in version 2.1.0.
- Check your classpath and make sure no older OpenNLP version is around! A quick runtime check is sketched below.
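If in doubt, the OpenNLP version actually resolved on the classpath can be printed at runtime via OpenNLP's own `Version` utility (a minimal sketch; the class name `OpenNLPVersionCheck` is just an example):

```java
import opennlp.tools.util.Version;

public class OpenNLPVersionCheck {

    public static void main(String[] args) {
        // Prints the OpenNLP version that is actually on the classpath;
        // anything below 2.1.0 cannot reliably load the DE-Lemma model files.
        System.out.println("OpenNLP on classpath: " + Version.currentVersion());
    }
}
```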
Build the project via Apache Maven. The relevant command is `mvn clean package`.
This should download all required dependencies, which are:
- Apache OpenNLP,
- Apache Commons Lang3, and
- slf4j + log4j2 bindings.
If you want to re-use the current, experimental version of DE-Lemma in your projects, execute `mvn clean install` to install the bundled jar file into your local `.m2` repository.
Note:
You have to select one or more model files and copy them over to the execution environment. Those models must reside in the `models` directory, as the current code inspects this directory name.
For a first impression, just execute `DELemmaDemo.java`, which will, by default, load the `DE-Lemma_UD-gsd-2022-maxent.bin` model resource. The loaded `Lemmatizer` instance will then find the lemmas for German (non-)inflected nouns from the (bio)medical domain.
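Under the hood, this boils down to the plain OpenNLP lemmatizer API, roughly as follows (a minimal sketch; the actual demo routes model loading through its own `OpenNLPModelServiceImpl`, and the UPOS tag `"NOUN"` matches the tagset the models were trained with):

```java
import java.io.File;
import java.io.IOException;

import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class LemmaSketch {

    public static void main(String[] args) throws IOException {
        // Load the bundled maxent model from the "models" directory.
        LemmatizerModel model = new LemmatizerModel(
                new File("models/DE-Lemma_UD-gsd-2022-maxent.bin"));
        LemmatizerME lemmatizer = new LemmatizerME(model);

        // OpenNLP's lemmatizer expects tokens plus their POS tags; the
        // DE-Lemma models were trained with the UPOS tagset, hence "NOUN".
        String[] tokens = {"Ärzte", "Herzzusatztöne", "Viren"};
        String[] tags   = {"NOUN", "NOUN", "NOUN"};

        String[] lemmas = lemmatizer.lemmatize(tokens, tags);
        for (int i = 0; i < tokens.length; i++) {
            System.out.printf("Found lemma '%s' for noun '%s'.%n", lemmas[i], tokens[i]);
        }
    }
}
```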
Important:
For reasons of limited LFS storage, only the `DE-Lemma_UD-gsd-2022-maxent.bin` model is included in the `models` directory when you clone this Git repository. You will have to download all other model files separately. Once retrieved, place those model files in the `models` directory to start experimenting with them.
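To double-check that downloaded models ended up in the right place, a short listing helps (a minimal sketch; it only assumes the `models` directory convention described above, and the class name is hypothetical):

```java
import java.io.File;

public class ModelDirCheck {

    public static void main(String[] args) {
        // The current code inspects a directory literally named "models".
        File modelsDir = new File("models");
        File[] models = modelsDir.listFiles((dir, name) -> name.endsWith(".bin"));
        if (models == null || models.length == 0) {
            System.err.println("No .bin model files found in " + modelsDir.getAbsolutePath());
            return;
        }
        for (File f : models) {
            System.out.printf("%s (%,d bytes)%n", f.getName(), f.length());
        }
    }
}
```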
In the demo example, the German nouns `List.of("Ärzte", "Herzzusatztöne", ...)` will be processed. The results are logged to standard out / console and should be similar to:
```
INFO [main] OpenNLPModelServiceImpl (OpenNLPModelServiceImpl.java:50) - Importing NLP model file 'DE-Lemma_UD-gsd-2022-maxent.bin' ...
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Virus' for noun 'Viren'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Herzzusatzton' for noun 'Herzzusatztöne'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Vorhofflattern' for noun 'Vorhofflatterns'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Arzt' for noun 'Ärzte'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Klinikum' for noun 'Klinikum'.
```
The complete set of files consists of four models, as reported in the paper:

| Model name | Size | External download required |
|---|---|---|
| `DE-Lemma_UD-gsd-2022-maxent.bin` | 861K | No |
| `DE-Lemma_UD-hdt-2022-maxent.bin` | 14M | Yes |
| `DE-Lemma_Tue-BuReg-2022-maxent.bin` | 3.9M | Yes |
| `DE-Lemma_Tue-Wiki-2022-maxent.bin` | 131M | Yes |
Note:
All trained models were evaluated for lemma prediction performance; see Table 3 in the paper.
If you use DE-Lemma models or the lemmatizer code in scientific work, please cite the GMDS 2023 paper as follows:
📝
Wiesner M. DE-Lemma: A Maximum-Entropy Based Lemmatizer for German Medical Text. Studies in Health Technology and Informatics. 2023 Sep 12;307:189-195. DOI: 10.3233/SHTI230712, PMID: 37697853
Several available treebanks (in CoNLL-U or CoNLL-X format) were identified and selected as candidates for training German lemmatizer models.
The German UD treebanks, UD-GSD and UD-HDT, are constructed from text corpora of German newspapers and other freely available text materials. The treebanks TüBa-D/DP and TüBa-D/W also qualified for training lemmatizer models. They contain information about word types, morphology, lemmas, as well as dependency relations. TüBa-D/W represents a huge corpus: it is based on Wikipedia text material comprising 36.1 million sentences.
The training of lemmatizer models was conducted with the open-source NLP toolkit Apache OpenNLP. For the generation of lemmatizer models with the smaller treebanks (UD-GSD, UD-HDT, TüBa-D/DP-political), the OpenNLP training parameters were chosen as follows:
```
training.algorithm=maxent
training.iterations=100
training.cutoff=5
training.threads=16
language=de
use.token.end=false
sentences.per.sample=5
upos.tagset=upos
```
The training for TüBa-D/W was conducted with these parameters:
```
training.algorithm=maxent
training.iterations=20
training.cutoff=5
training.threads=4
language=de
use.token.end=false
sentences.per.sample=5
upos.tagset=upos
```
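The first four parameters map directly onto OpenNLP's `TrainingParameters`; the remaining ones (`language`, `use.token.end`, `sentences.per.sample`, `upos.tagset`) appear to be consumed by the project's own training program and are not part of the stock OpenNLP API. For orientation, a plain-OpenNLP maxent training run with the first parameter set could look roughly like this (a minimal sketch; `train.txt` and the output file name are placeholders, and the training data is assumed to be already converted to OpenNLP's word/POS-tag/lemma line format):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSketch {

    public static void main(String[] args) throws IOException {
        // Mirror the parameter set used for the smaller treebanks.
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 5);
        params.put(TrainingParameters.THREADS_PARAM, 16);

        // "train.txt" (placeholder): one token per line as
        // word<TAB>postag<TAB>lemma, sentences separated by empty lines.
        try (ObjectStream<LemmaSample> samples = new LemmaSampleStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(new File("train.txt")),
                        StandardCharsets.UTF_8))) {

            LemmatizerModel model = LemmatizerME.train(
                    "de", samples, params, new LemmatizerFactory());

            // Persist the binary model for later re-use (placeholder name).
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("DE-Lemma-custom-maxent.bin"))) {
                model.serialize(out);
            }
        }
    }
}
```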
Since the training of a lemmatizer model (LM) required between ~32 GB (UD-GSD) and ~1,100 GB (TüBa-D/W) of RAM at runtime, these tasks could not be performed on conventional workstation hardware. Therefore, each model was trained on the high-performance computing environment of the bwUniCluster during October 2022. The execution environment of the training program was a Java Runtime Environment (JRE), a 64-bit OpenJDK in version 8, build 292.
The resulting binary model files were persisted for evaluation and later re-use in NLP applications with a lemmatizer component.