Tokenization and Lemmatization on German Learning Textbook Level A1 of CEFR Standard
DOI:
https://doi.org/10.33423/jhetp.v22i1.4971

Keywords:
higher education, SpaCy, lemmatizer, vocabulary type, token, word count

Abstract
This study compares the number of vocabulary types and tokens in the CEFR A1-level German language learning textbooks Themen Neu, Studio D, and Netzwerk, and describes the use of the spaCy German lemmatizer to identify the lemmas in the textbooks. spaCy lemmatizes and parses the words, and the data were validated through expert judgment and discussions with German language experts. The analysis shows that in all three books the number of vocabulary types and tokens rises steadily across the opening chapters but increases and decreases irregularly from the middle chapters to the final one. In addition, the spaCy lemmatizer is able to lemmatize word forms, parse them, and classify German word classes, although errors remain in its analysis. spaCy's underlying dataset therefore still needs improvement to analyze lemma forms and classify word classes more accurately.
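The type and token counts at the heart of the comparison can be sketched with a few lines of standard-library Python. This is a simplified illustration using whitespace tokenization and punctuation stripping, not the spaCy pipeline the study actually used; the sample sentence is hypothetical, not drawn from the textbooks.

```python
from collections import Counter

def type_token_counts(text):
    """Count tokens (all word occurrences) and types (distinct words).

    Uses a naive whitespace/punctuation tokenizer for illustration;
    the study itself relied on spaCy's German pipeline.
    """
    tokens = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    tokens = [t for t in tokens if t]          # drop empty strings
    types = Counter(tokens)                    # distinct word forms
    return len(tokens), len(types)

# Hypothetical A1-level German sentence (not from the textbooks).
sample = "Ich lerne Deutsch. Deutsch lernen macht Spaß, und ich lerne gern."
n_tokens, n_types = type_token_counts(sample)
print(n_tokens, n_types)  # → 11 8
```

Chapter-by-chapter curves like those the study reports would follow from applying such a function to each chapter's text in turn; lemma identification and word-class tagging would additionally require a morphological analyzer such as spaCy's German model.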