East Indonesia Conference on Computer and Information Technology

Developing Machine Learning Framework to Classify Harmonized System Code. Case Study: Indonesian Customs

Abstract

The Directorate General of Customs and Excise (DGCE), an Indonesian government agency under the Ministry of Finance, is responsible for ensuring that importers and exporters classify their declared goods according to the Harmonized System Code (HS Code). This study aims to find an optimal machine learning framework for classifying goods into their HS Code, given the challenges DGCE faces: goods descriptions written in mixed languages with inconsistent patterns, imbalanced multiclass HS Code labels, and several additional categorical variables. Previous studies have proposed machine learning models that predict the HS Code from goods descriptions; this study makes improvements and adjustments to address the challenges above. Preprocessing included handling abbreviations, misspellings, and the varying patterns of goods descriptions, and translating Indonesian words into English. One Hot Coding (OHC) is applied to encode nominal and categorical variables. To build features from goods descriptions, we use Term Frequency - Inverse Document Frequency (TF-IDF) combined with bigrams. Random Forest achieved an F1-score of 79.60% when classifying the first four digits of the HS Code, and Multinomial NB achieved an F1-score of 72.74% when classifying the full HS Code. Compared to the baseline paper, those scores are 11.26% and 11.36% higher, respectively.
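The feature construction named in the abstract (TF-IDF over unigrams and bigrams, plus one-hot encoding of a categorical variable) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `ngrams`, `tfidf`, and `one_hot`, the smoothed IDF formula, the whitespace tokenization, and the sample descriptions and `origin` field are all assumptions made for illustration, and the classifier step (Random Forest or Multinomial NB) is left out.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    grams = list(tokens)
    if n_max >= 2:
        grams += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return grams

def tfidf(docs):
    """Return one {term: weight} dict per document (assumed smoothed IDF)."""
    counts = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: (tf / sum(c.values())) * idf[t] for t, tf in c.items()}
        for c in counts
    ]

def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical goods descriptions and a hypothetical categorical field.
descriptions = ["cotton t-shirt men", "cotton trousers men", "steel pipe seamless"]
origin = ["CN", "CN", "JP"]

text_features = tfidf(descriptions)
categories, origin_features = one_hot(origin)
```

In a full pipeline, the per-document TF-IDF weights and the one-hot vectors would be concatenated into a single feature vector before being passed to a classifier; terms that appear in fewer descriptions receive higher weights than terms shared across many.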