Approximation Algorithms for Submodular Data Summarization with a Knapsack Constraint

Kai Han; Enpei Zhang; Tong Xu; Shuang Cui; Benwei Wu; Shaojie Tang; Tianshuai Zhu; Zhizhuo Yin; He Huang

摘要

Data summarization, a fundamental methodology aimed at selecting a representative subset of data elements from a large pool of ground data, has found numerous applications in big data processing, such as social network analysis , crowdsourcing , clustering , network design , and document/corpus summarization . Moreover, it is well acknowledged that the "representativeness" of a dataset in data summarization applications can often be modeled by submodularity-a mathematical concept abstracting the "diminishing returns" property in the real world. Therefore, a lot of studies have cast data summarization as a sub-modular function maximization problem (e.g.,). There is a long research history for submodular function maximization . Among the various NP-hard problems studied in this area, Submodular function Maximization subject to a Knapsack constraint (SMK) is perhaps one of the most fundamental problems, and it has attracted tremendous attention since the 1980s . This problem has numerous applications in data summarization regardless whether the considered submodular function f(•) is monotone or non-monotone. For example, identifying a diverse set of movies with high ratings from a video library can be cast as a non-monotone SMK problem . To address the SMK problem, previous studies have proposed approximation algorithms under both the offline and streaming settings. In the offline setting, it is assumed that the algorithm has access to the whole dataset, while such an assumption is abandoned in the streaming setting as the data volume could be too large to fit into memory. Undoubtedly, it is desirable for both offline and streaming algorithms to achieve accuracy and time efficiency, while it is also important for a streaming algorithm to reduce the space complexity and to minimize the number of passes that it has to make over the stream. Compared to the studies on the monotone SMK problem, there are relatively few proposals for the non-monotone SMK problem. The existing algorithms for non-monotone SMK, however, suffer from the following deficiencies. First, most offline algorithms have a high time complexity, which implies that they are inappropriate for data summarization, as algorithms for big data processing usually have to run in nearly-linear or sub-linear time. Second, it is still unclear whether there exist streaming algorithms with constant approximation ratios for the non-monotone SMK problem. Motivated by the deficiencies of the existing studies, the goal of our paper is to design practical, efficient, and effective algorithms for the non-monotone SMK problem, under both the offline and streaming settings. To achieve this goal, we propose (1) four offline algorithms-which consist of two deterministic algorithms (SMKDER and SMKDETACC) and two randomized ones (SMKRAN and SMKRANACC), as well as (2) one streaming algorithm (SMKSTREAM). As a by-product, we also show that the SMKSTREAM algorithm can be used for the monotone SMK problem and achieves better performance bounds than the existing algorithms. The following is a list of the theoretical performance bounds of our algorithms and their advantages: 1. SMKDET (resp. SMKDETACC) achieves an approximation ratio of 6 (resp. 6+e) using at most O(kn) (resp. O(ε/n log ε/k)) oracle queries to the objective function, where k is the maximum cardinality of any feasible solution to the SMK problem, and n is the size of the ground set. Therefore, SMKDET runs in orders of magnitude faster than the algorithm in, which achieved the best-known deterministic approximation ratio of 6 but using O(kn~4) oracle queries for the non-monotone SMK problem. 2. SMKRAN (resp. SMKRANACC) achieves an expected approximation ratio of 4 (resp. 4 + ε) using at most O(kn) (resp. O(ε/n log ε/k)) oracle queries. The existing fastest randomized algorithm proposed in only achieves an expected ratio of 5.83 + ε using O(ε/n log ε/n) oracle queries. Therefore, SMKRANACC achieves a better approximation ratio without sacrificing efficiency compared to. 3. SMKSTREAM achieves an approximation ratio of 6+ε (resp. 8+ ε) by making two passes (resp. one pass) over the stream with space complexity of O(ε/k log B), using at most O(ε/n log B) oracle queries in total. To the best of our knowledge, SMKSTREAM is the first streaming algorithm to achieve a constant approximation ratio for the non-monotone SMK problem. 4. As a by-product, SMKSTREAM can also achieve an approximation ratio of 2 + e by making only two passes over the stream for the monotone SMK problem, with the same complexities listed above. Therefore, SMKSTREAM achieves better performance bounds than the streaming algorithm in with the best-known approximation ratio for the monotone SMK problem. The algorithm in achieves an approximation ratio of 2 + ε by making O (ε/1) passes over the stream, and it incurs O(nε~(-8) log~2 B) space complexity and O(Bε~(-7) log~2 B) oracle queries. The practical performance of our algorithms is also evaluated i

机译：数据摘要，旨在从大型地面数据中选择代表性数据元素的基本方法，发现了大数据处理中的许多应用，例如社交网络分析，众群，聚类，网络设计和文档/语料库摘要。此外，它得到了很好的承认数据摘要应用中数据集的“代表性”通常由子骨折 - 一个数学概念提出了现实世界中的“减少返回”属性的数学概念。因此，许多研究将数据摘要作为子模块函数最大化问题（例如，）。子模型函数最大化存在很长的研究历史。在该领域研究的各种NP难题中，患有背包约束（SMK）的子模具函数最大化可能是最基本的问题之一，自20世纪80年代以来它引起了巨大的关注。此问题在数据摘要中具有许多应用程序，而不管是否被审判的子模块函数f（•）是单调或非单调的。例如，识别来自视频库的高额定值的各种电影可以作为非单调的SMK问题。为了解决SMK问题，之前的研究在离线和流传输设置下提出了近似算法。在离线设置中，假设算法可以访问整个数据集，而在流式设置中放弃这样的假设，因为数据量太大以适合存储器。毫无疑问，希望离线和流算法均可实现精度和时间效率，而流化算法也很重要，以降低空间复杂度，并最小化它必须在流上制作的传递数量。与单调SMK问题的研究相比，非单调SMK问题的提案相对较少。然而，非单调SMK的现有算法遭受以下缺陷。首先，大多数离线算法具有很高的时间复杂度，这意味着它们不适合数据摘要，因为大数据处理的算法通常必须在近乎线性或子线性时间内运行。其次，尚不清楚是否存在具有恒定近似比对于非单调SMK问题的流算法算法。通过现有研究的缺陷，我们的论文的目标是在离线和流式设置下设计非单调SMK问题的实用，高效和有效的算法。为了实现这一目标，我们提出（1）四个离线算法 - 由两个确定性算法（SMKDER和SMKDECC）和两个随机（SMKRAN和SMKRANACC）组成，以及（2）一个流算法（SMKSTREAM）。作为副产品，我们还表明，SMKStream算法可用于单调SMK问题，并实现比现有算法更好的性能界限。以下是我们算法的理论性能范围的列表及其优势：1。SMKDET（SMKDETCC）最多实现了6（REN）的近似比为6（kn）（REA）（RESP.O（ ε/ n日志ε/ k））Oracle查询到目标函数，其中k是任何可行解决方法的最大基数，并且n是地面集的大小。因此，SMKDET以比算法更快的速度运行，这实现了最熟知的确定性近似比为6，而是使用O（kn〜4）Oracle查询非单调SMK问题。 2. SMKRAN（RESCH。SMKRANACC）最多可以实现4（REN）的预期近似比为4（kn）（REN）（RESC.O（ε/ n logε/ k））ORACLE查询。所提出的现有最快的随机算法仅使用O（ε/ n logε/ n）oracle查询实现了5.83 +ε的预期比率。因此，与之相比，SMKranACC实现了不牺牲效率的更好的近似比率。 3. SMKSTREAM通过将两个通过O（ε/ k log b）的空间复杂度，通过最多o（ε/ k log b）的空间复杂度，实现了近似值为6 +ε（RESP.8+ε）的近似比率。 / n log b）总共查询Oracle查询。据我们所知，SMKStream是实现非单调SMK问题的恒定近似比的第一流算法。 4.作为副产物，SMKSTREAM也可以通过在单调SMK问题上仅进行两个通过的流，以达到2 + e的近似比，以便在上面列出的相同复杂性。因此，SMKSTREAM比单调SMK问题的最佳已知近似比的流算法实现了比流媒体算法更好的性能界限。通过使O（ε/ 1）通过流来实现2 +ε的近似比，并且它会引起O（n∈〜（-8）log〜2 b）空间复杂度和o（b∈〜（-7 ）log〜2 b）Oracle查询。我们的算法的实际表现也在评估我

Approximation Algorithms for Submodular Data Summarization with a Knapsack Constraint

摘要

著录项

相似文献

相关主题

期刊订阅