Tokenization and Lemmatization on German Learning Textbook Level A1 of CEFR Standard

Authors

  • M. Kharis (Universitas Negeri Surabaya, Universitas Negeri Malang)
  • Kisyani Laksono (Universitas Negeri Surabaya)
  • Suhartono (Universitas Negeri Surabaya)
  • Agus Ridwan (Universitas Negeri Surabaya)
  • Mintowati (Universitas Negeri Surabaya)
  • Yuniseffendri (Universitas Negeri Surabaya)

DOI:

https://doi.org/10.33423/jhetp.v22i1.4971

Keywords:

higher education, SpaCy, lemmatizer, vocabulary type, token, word count

Abstract

This study compares the number of vocabulary types and tokens in three German language learning textbooks at CEFR level A1 (Themen Neu, Studio D, and Netzwerk) and describes the use of a German lemmatizer to identify the lemmas in these textbooks. SpaCy was used to lemmatize and parse the words. The data were validated through expert judgment and discussions with German language experts. The analysis shows that, in all three books, the number of vocabulary types and tokens per chapter rises consistently in the early chapters but rises and falls irregularly from the middle to the final chapter. In addition, the SpaCy lemmatizer is able to lemmatize and parse word forms and classify word classes in German, although its analysis still contains errors. SpaCy's system and dataset therefore still need improvement in order to analyze lemma forms and classify word classes more accurately.
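
The distinction between vocabulary types (distinct word forms) and tokens (running words) can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a simple regular-expression tokenizer from the Python standard library in place of SpaCy, it does no lemmatization, and the chapter names and sample sentences are hypothetical, not taken from the textbooks.

```python
import re
from collections import Counter

# Hypothetical sample text, NOT taken from Themen Neu, Studio D, or Netzwerk.
CHAPTERS = {
    "Kapitel 1": "Ich heiße Anna. Ich komme aus Berlin.",
    "Kapitel 2": "Das ist Peter. Peter kommt aus Wien. Das ist Anna.",
}

# Assumption: a token is a run of letters (including ä, ö, ü, ß).
# This stands in for SpaCy's tokenizer, which the study actually used.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return the lowercased word tokens of a text."""
    return [match.group(0).lower() for match in WORD_RE.finditer(text)]


def count_types_and_tokens(text):
    """Return (number of types, number of tokens) for a text."""
    tokens = tokenize(text)
    types = Counter(tokens)  # one entry per distinct word form
    return len(types), len(tokens)


for name, text in CHAPTERS.items():
    n_types, n_tokens = count_types_and_tokens(text)
    print(f"{name}: {n_types} types, {n_tokens} tokens")
```

Because this sketch counts surface forms, "komme" and "kommt" are two separate types; a lemmatizer would map both to the lemma "kommen", which is why lemmatization changes the type count.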

Published

2022-02-10

How to Cite

Kharis, M., Laksono, K., Suhartono, Ridwan, A., Mintowati, & Yuniseffendri. (2022). Tokenization and Lemmatization on German Learning Textbook Level A1 of CEFR Standard. Journal of Higher Education Theory and Practice, 22(1). https://doi.org/10.33423/jhetp.v22i1.4971

Section

Articles