## Sketching Algorithm

In this article, I discuss a fundamental algorithm to find out the most influential nodes in a social network.

The sketching algorithm was developed by Borgs`[1]`, and tries to solve the fundamental influence maximization problem: given a network G, how to find out which nodes to activate which can further activate maximum number of nodes. In addition, there is a limit `k` on the number of nodes to target(due to advertizing constraints). Hence the goal is to efficiently find an appropriate set of `k` nodes with which to start a diffusion process.

The graphs, which represent the real life social media networks, can have massive size. For example, a graph which represents Facebook will have billions of nodes and edges. This makes the running time of the influence maximization algorithm an important factor. To make things complicated, these social networks are constantly changing. A link is formed between two nodes when they befriend each other on Facebook, or follow on Twitter and can be removed in a similar way. This requires re-computation of solutions over time. As a result, the algorithm should have near-linear run time to work with massive datasets.

This speed requirement has encouraged a good amount of work to develop fast, heuristic methods of finding influential individuals in social networks. This line of work has mostly been experimental and observational, rather than based on theory.

Sketching algorithm tries to bridge this gap by developing a constant-factor approximation algorithm. Furthermore, the runtime is independent of the number of seeds `k`. It is close to optimal, as the authors provide a lower bound of Ω(m+n) on the time needed to obtain a constant approximation. The algorithm is randomized, succeeding with the probability 3/5, and the failure is detectable, which means the success probability can be amplified through repetition. It uses the standard independent cascade model of influence spread.

Developed by Kempe et. al. `[2]`. This model uses a directed, edge-weighted graph `G` with `n` nodes and `m` edges. The graph represents the underlying network. Influence spreds via a random process, beginning at a set `S` of nodes. As each node is activated, it can activate its neighbours. Whether it will actually activate its neighbours depends on the activation probability associated on the edge between these two nodes.

The weight of the edge `e = (u, v)` represents the probability that the process spreads along edge `e` from `u` to `v`. Let `I(S)` denote the total number of nodes eventually activated by this process. Then `E[I(S)]`, which is the expected value of `I(S)` can be considered as the influence of set `S`. IC model has become one of the important models of influence spread.

Motivation

To describe the sketching algorithm, it’s necessary to understand how the problem can be solved in a greedy, brute-force way. The problem is to find the set of nodes with highest influence. This boils down to, in its essence finding one node which has the highest influence. And this process can be repeated `k` times, to find the sees set `S`, of size `k`.

One strategy would be to estimate the influence of every node, and then take top `k` nodes which have highest influence. But this approach is of course, computationally expensive.

The basic idea behind the sketching algorithm is ‘polling’, where a node `v` is selected uniformly at random. Then, the set of nodes that would influence `v` is determined. Intuitively, if this process is repeated multiple times, and a certain node appears often as an influencer, there is a good chance that `u` is the most influential node. The authors show that the probability a node `u` appears in a set of influencers is proportional ot `E[I(u)]`, which can be estimated accurately with realatively few repetitions of the polling process.

Algorithm: The algorithm proceeds in two steps:

1. The random sampling technique is repeatedly applied to generate a sparse hypergraph representation(H) of the network. Each hypergraph edge corresponds to a set of individuals that was influenced by a randomly selected node in the transpose graph. This hypergraph encodes the influence estimates. For a set of nodes `S`, the total degree of `S` in the hypergraph is approximately proportional to the influence of `S` in the original graph.

2. Run a standard greedy algorithm on `H` to return a set of size `k` of approximately maximal total degree.

Computational Steps

1. Choose a random vertex `z_i∈V` and an activation function `xi:E→[0,1]` uniformly at random.
2. An edge from a node `a` to `b` is said to be active w.r.t. `x_i` if `x_i(a,b)<=pab`, otherwise it’s inactive.
3. Repeatedly construct a subgraph `Hi` that consists of all vertices that can reach `z_i`. This is done by performing a reverse BFS from `z_i`, or performing a BFS from `z_i` on `G^T`.
4. Stop this process when the total number of traversed edges exceeds `W`. Intuitively, most influential vertices seem likely to appear in `H_i`’s.

References

1. C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 946–957. SIAM, 2014.
1. Kempe, D., Kleinberg, J., and Tardos, E. 2003. Maximizing the spread of influence through a social network. In KDD. 137–146.