twintech.dev

Project Overview

This study tackles the classification of SAP ERP requirements into modules (FI, CO, MM, SD, PM, PS, BC, and IM in some experiments) with three methods: (1) EmbedBoost, which feeds frozen multilingual E5 sentence embeddings into LightGBM; (2) TuneWeight, a fine-tuned multilingual E5 model trained with a class-weighted loss; and (3) DocuMatch, an unsupervised method that matches requirements against SAP blueprint documents by TF-IDF cosine similarity. Using about 1.9k high-quality real requests plus 15k synthetic examples, the authors evaluate four train/test regimes: original→original, original→synthetic, synthetic→synthetic, and synthetic→original. TuneWeight leads in the aligned settings, where training and test data come from the same source (up to ~95% macro-F1 without IM; ≈93% accuracy on real data and 99% on synthetic data). EmbedBoost offers a strong accuracy/efficiency trade-off and is more robust under domain shift, while DocuMatch trails both but is interpretable. A SHAP-driven analysis motivates preprocessing steps (entity filtering, Slavic stemming, expanded stopword lists) that raise minority-class macro-F1 by up to ~8 points. The work contributes a synthetic SAP requirements dataset and argues for domain-aware, multilingual tooling to speed up fit-gap analysis and improve traceability.
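To make the DocuMatch idea concrete, here is a minimal sketch of unsupervised TF-IDF cosine matching in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the `BLUEPRINTS` texts are hypothetical one-line stand-ins for real SAP blueprint documents, the whitespace tokenizer is deliberately naive, and the function names (`classify`, `build_idf`, and so on) are invented for this example.

```python
import math
from collections import Counter

# Hypothetical stand-ins for SAP blueprint documents (not taken from the study).
BLUEPRINTS = {
    "FI": "general ledger accounts payable receivable invoice posting",
    "MM": "purchase order goods receipt vendor material stock",
    "SD": "sales order customer delivery billing pricing",
}


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for brevity)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency over a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms that never occur in the blueprints are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(requirement, blueprints=BLUEPRINTS):
    """Return (best_module, scores); best_module is None if nothing overlaps."""
    modules = list(blueprints)
    docs = [tokenize(blueprints[m]) for m in modules]
    idf = build_idf(docs)
    query = tfidf(tokenize(requirement), idf)
    scores = {m: cosine(query, tfidf(d, idf)) for m, d in zip(modules, docs)}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0.0 else None), scores


best, scores = classify("create a purchase order for a new vendor")
print(best, scores)
```

The per-module scores show why this kind of matcher is easy to inspect: each score can be traced back to the specific terms a requirement shares with a blueprint document, which is the sense in which the overview calls DocuMatch interpretable.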

ML-Powered SAP Requirements Classification

Project Overview