Tokenization and Lemmatization on German Learning Textbook Level A1 of CEFR Standard
DOI:
https://doi.org/10.33423/jhetp.v22i1.4971

Keywords:
higher education, SpaCy, lemmatizer, vocabulary type, token, word count

Abstract
This study compares the number of vocabulary types and tokens in the CEFR A1-level German language learning textbooks Themen Neu, Studio D, and Netzwerk, and describes the use of the spaCy German lemmatizer to identify the lemmas in the textbooks. spaCy lemmatizes and parses the words, and the data were validated through expert judgment and discussions with German language experts. The analysis shows that in all three books the number of vocabulary types and tokens rises steadily across the opening chapters but increases and decreases irregularly from the middle chapters to the final one. In addition, the spaCy lemmatizer is able to lemmatize word forms, parse them, and classify German word classes, although errors remain in its analysis. spaCy's underlying dataset therefore still needs improvement to analyze lemma forms and classify word classes more accurately.
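The type and token counts at the heart of the comparison can be sketched with a few lines of standard-library Python. This is a simplified illustration using whitespace tokenization and punctuation stripping, not the spaCy pipeline the study actually used; the sample sentence is hypothetical, not drawn from the textbooks.

```python
from collections import Counter

def type_token_counts(text):
    """Count tokens (all word occurrences) and types (distinct words).

    Uses a naive whitespace/punctuation tokenizer for illustration;
    the study itself relied on spaCy's German pipeline.
    """
    tokens = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    tokens = [t for t in tokens if t]          # drop empty strings
    types = Counter(tokens)                    # distinct word forms
    return len(tokens), len(types)

# Hypothetical A1-level German sentence (not from the textbooks).
sample = "Ich lerne Deutsch. Deutsch lernen macht Spaß, und ich lerne gern."
n_tokens, n_types = type_token_counts(sample)
print(n_tokens, n_types)  # → 11 8
```

Chapter-by-chapter curves like those the study reports would follow from applying such a function to each chapter's text in turn; lemma identification and word-class tagging would additionally require a morphological analyzer such as spaCy's German model.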