Automated anonymization of text documents in Polish

Marcin Oleksy; Norbert Ropiak; Tomasz Walkowiak

Abstract

The anonymization of unstructured texts has become a widely researched topic. This is due not only to the GDPR, but also to the development of state-of-the-art models in natural language processing. The texts required for building such models have to be anonymized beforehand, and very often this has to happen on the premises of the data providers rather than those of the machine learning teams. In this work, we present the use of machine learning models, such as a part-of-speech tagger and a named entity recognizer, and their integration with regular expressions for the anonymization of unstructured texts in Polish. The goal is to create a system that recognizes many types of sensitive data and can remove, tag, or pseudo-anonymize (replace with words from the same category in an appropriate form) the detected tokens. To test the performance of this system, we prepared a manually annotated dataset containing different categories of sensitive data. The paper presents a detailed analysis of the proposed method's performance. Moreover, the paper discusses a deployment architecture that yields a scalable, easy-to-use tool capable of processing large amounts of data.
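
The three output modes named in the abstract (remove, tag, pseudo-anonymize) can be illustrated with a minimal Python sketch. This is not the authors' system: it covers only the regular-expression side, omits the part-of-speech tagger and named entity recognizer entirely, and does no Polish inflection of the replacement words. The patterns, the surrogate lists, and the `anonymize` function are all hypothetical names chosen for illustration.

```python
import random
import re

# Hypothetical patterns; the abstract does not list the paper's actual rules.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Nine digits, optionally grouped in threes (a common Polish phone layout).
    "phone": re.compile(r"(?<!\d)\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)"),
}

# Hypothetical surrogate values used for pseudo-anonymization.
SURROGATES = {
    "email": ["anna@example.com", "jan@example.org"],
    "phone": ["500 100 200", "600 300 400"],
}

MODES = ("remove", "tag", "pseudo")


def anonymize(text, mode="tag", seed=0):
    """Replace every regex match according to the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    rng = random.Random(seed)  # seeded so repeated runs give the same output

    def make_replacer(category):
        def replacer(match):
            if mode == "remove":
                return ""
            if mode == "tag":
                return f"[{category.upper()}]"
            # "pseudo": substitute a value from the same category
            return rng.choice(SURROGATES[category])
        return replacer

    for category, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(category), text)
    return text


sample = "Kontakt: jan.kowalski@poczta.pl, tel. 601 234 567."
print(anonymize(sample, mode="tag"))
```

In a fuller pipeline, spans found by a named entity recognizer would be merged with these regex matches before replacement, and the pseudo-anonymization step would need morphological information to put the surrogate in the right grammatical form.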


Bibliographic details

Source
Procedia Computer Science | 2021, issue a | 11 pages
Authors
Marcin Oleksy; Norbert Ropiak; Tomasz Walkowiak
Original format
PDF
Keywords
natural language processing; anonymization; Polish language; micro-services


