Journal: Procedia Computer Science

Automated anonymization of text documents in Polish

Abstract

The anonymization of unstructured texts has become a popular and widely researched topic. This is due not only to the recent GDPR regulation but also to the development of state-of-the-art models in natural language processing. The texts required to build such models must be anonymized beforehand, and very often this has to happen on the premises of the data providers rather than those of the machine learning teams. In this work, we present the use of machine learning models, such as a part-of-speech tagger and a named entity recognizer, and their integration with regular expressions for the anonymization of unstructured texts in Polish. The goal is to create a system that recognizes many types of sensitive data and can remove, tag, and pseudo-anonymize (replace with words from the same category in an appropriate form) the detected tokens. To test the performance of this system, we prepared a manually annotated dataset containing different categories of sensitive data. The paper presents a detailed analysis of the proposed method's performance. Moreover, it discusses a deployment architecture that results in a scalable, easy-to-use tool capable of processing large amounts of data.
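
The three output modes named in the abstract (remove, tag, pseudo-anonymize) can be illustrated with a minimal Python sketch. It covers only a regular-expression layer and is not the authors' system: the patterns, the category labels (EMAIL, PHONE, PESEL), the surrogate values, and the function names are all assumptions made for illustration. The part-of-speech tagger and named entity recognizer are omitted, so the sketch finds only pattern-like identifiers and does not inflect replacements into the appropriate Polish form.

```python
import re

# Hypothetical patterns; the abstract does not list the expressions used.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # An 11-digit run, the length of a Polish PESEL number (no checksum check).
    "PESEL": re.compile(r"\b\d{11}\b"),
    # Nine digits in groups of three, optionally prefixed with +48.
    "PHONE": re.compile(r"(?<!\d)(?:\+48[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)"),
}

# Hypothetical fixed surrogates used by the "replace" mode.
SURROGATES = {
    "EMAIL": "jan.kowalski@example.com",
    "PESEL": "00000000000",
    "PHONE": "000 000 000",
}

MODES = ("remove", "tag", "replace")


def find_spans(text):
    """Return non-overlapping (start, end, label) spans in text order."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    # Earliest start first; on ties, prefer the longer match.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:
            kept.append((start, end, label))
            last_end = end
    return kept


def anonymize(text, mode="tag"):
    """Remove, tag, or replace every detected span, depending on mode."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    parts, pos = [], 0
    for start, end, label in find_spans(text):
        parts.append(text[pos:start])
        if mode == "tag":
            parts.append(f"[{label}]")
        elif mode == "replace":
            parts.append(SURROGATES[label])
        # mode == "remove": append nothing in place of the span
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


if __name__ == "__main__":
    sample = "Kontakt: anna.nowak@example.org, tel. 601 234 567, PESEL 44051401359."
    for mode in MODES:
        print(mode, "->", anonymize(sample, mode))
```

In a fuller pipeline, spans produced by a named entity recognizer (person names, addresses) could be merged into the list returned by `find_spans` before the replacement step, so that both sources share the same three output modes.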