In many situations, the target population may be a small community within a massive graph such as a Twitter friendship graph discussing a particular current topic. Search has to be done on the entire network to select a small group of relevance, or targeted members can be found through link tracing. Graph connections are informative to identify community membership. The sampling starts from a seed node that belongs to the target community. A personalized page rank (PPR) can be considered as an alternative to snowball sampling which is the most commonly used method. For some d ≥ 0, snowball sampling collects all friends who are d friends away from the seed. This approach has two drawbacks compared to PPR. First, it does not account for the density of common friendship. Second, the snowball sample size increases fast with d. PPR on the other hand, gives a sample around the seed. The PPR vector is defined as the stationary distribution of personalized random walk. At each step, the random walker moves to the seed node with probability a and goes to the adjacent node with probability 1 - α. Consider a stationary distribution giving inclusion probability for sample size 1. This is a PPR vector. PPR leads to a cluster which is made up of nodes with large inclusion probabilities. To approximate the PPR vector quickly, Berkhin (Ref. 1) proposed an algorithm that examines only nodes with large inclusion probabilities. PPR is also computationally efficient based on the running time and data volume.
展开▼