Overlapping Deep Web Data Source Selection Based on Stratified Sampling

江俊彦; 彭智勇; 吴小莹; 彭承晨; 王敏

Abstract

Many Web applications, such as multimedia data integration and online business data aggregation (for example, aggregating offers from group-buying sites), rely on deep Web querying to integrate information from many data sources on the Web. The success of such applications is largely determined by the efficiency and effectiveness of the methods used to query the relevant sources. Existing studies on multiple data source integration have focused on ranking data sources by their relevance to a query, without considering how overlap among the sources affects data source selection. As a result, identical results are retrieved repeatedly from different sources, which adds query processing overhead and increases the workload on the data sources. To improve query efficiency over overlapping data sources, this work proposes a tuple-level stratified sampling approach that selects data sources with high relevance and low overlap. The approach has two stages. In the offline stage, tuple-level stratified sampling is applied to each data source to obtain sample tuples. In the online stage, the samples are used to iteratively estimate the coverage of a query on the data sources and the overlap among them, and a heuristic method is used to discover data sources with low overlap efficiently. Experimental results show that the proposed approach is more efficient and effective than state-of-the-art methods for selecting overlapping data sources.
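The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the stratification criterion, the estimators, or the heuristic, so everything below is an assumption made for illustration. The function names (`stratified_sample`, `estimate_statistics`, `select_sources`), the uniform sampling rate, the scale-up estimators, and the greedy rule based on pairwise overlap penalties are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def stratified_sample(tuples, stratum_of, rate, rng):
    """Offline stage (assumed form): sample every stratum at the same rate.

    `stratum_of` maps a tuple to its stratum key; 0 < rate <= 1 is assumed.
    Each stratum contributes at least one tuple, so very small strata are
    slightly over-represented.
    """
    strata = defaultdict(list)
    for t in tuples:
        strata[stratum_of(t)].append(t)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(rate * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


def estimate_statistics(samples, rate, matches):
    """Online stage (assumed estimators): scale sample counts up.

    `samples` maps a source name to its sampled tuples (hashable).
    Coverage is scaled by 1/rate. Pairwise overlap is scaled by 1/rate**2,
    because a shared tuple lands in two independent samples with
    probability of roughly rate * rate.
    """
    hits = {s: {t for t in ts if matches(t)} for s, ts in samples.items()}
    coverage = {s: len(h) / rate for s, h in hits.items()}
    overlap = {}
    for a, b in combinations(sorted(hits), 2):
        overlap[frozenset((a, b))] = len(hits[a] & hits[b]) / (rate * rate)
    return coverage, overlap


def select_sources(coverage, overlap, budget):
    """Greedy heuristic (assumed): pick the source with the largest
    estimated coverage after subtracting its estimated pairwise overlap
    with the sources already selected."""
    selected = []
    remaining = set(coverage)
    while remaining and len(selected) < budget:
        def gain(s):
            penalty = sum(overlap.get(frozenset((s, t)), 0.0) for t in selected)
            return max(0.0, coverage[s] - penalty)

        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0.0:
            break  # no remaining source is expected to add new tuples
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a source with high coverage can still be skipped when most of its tuples are already expected from the selected sources, which is the behaviour the abstract motivates. Summing pairwise penalties ignores three-way and higher overlaps, so the gain is only a rough approximation.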

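Bibliographic details

Source
《软件学报》 (Journal of Software) | 2017, No. 5 | pp. 1271-1295 | 25 pages
Authors
江俊彦; 彭智勇; 吴小莹; 彭承晨; 王敏
Affiliations
School of Computer Science, Wuhan University, Wuhan, Hubei 430072
State Key Laboratory of Software Engineering (Wuhan University), Wuhan, Hubei 430072
Original format: PDF
Language of full text: Chinese
Chinese Library Classification: Programming; software engineering
Keywords
Data source selection; stratified sampling; data source overlap estimation; regression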

