Papers by Muhammad Irfan Yousuf

arXiv, Oct 18, 2019
Graph sampling provides an efficient yet inexpensive solution for analyzing large graphs. The challenge in extracting small representative subgraphs from large graphs is to capture the properties of the original graph. Several sampling algorithms have been proposed in previous studies, but they fall short of extracting good samples. In this paper, we propose a new sampling method called Weighted Edge Sampling. In this method, all edges initially receive equal weight. During the sampling process, an edge is sampled with probability proportional to its weight. When an edge is sampled, we increase the weight of its neighboring edges, which increases their probability of being sampled. Our method extracts the neighborhood of a sampled edge more efficiently than previous approaches. We evaluate the efficacy of our sampling approach empirically using several real-world data sets and compare it with some of the previous approaches. We find that our method produces samples that better match the original graphs. We also calculate the Root Mean Square Error and Kolmogorov-Smirnov distance to compare the results quantitatively.
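The weighting step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the `boost` increment, and uniform initial weights of 1 are all assumptions for the sketch.

```python
import random

def weighted_edge_sample(edges, sample_size, boost=1.0, seed=0):
    """Sketch: every edge starts with weight 1; after an edge is sampled,
    the weights of its neighboring edges (those sharing an endpoint) are
    increased, raising their chance of being picked next."""
    rng = random.Random(seed)
    weights = {e: 1.0 for e in edges}
    # index edges by endpoint so neighbors of a sampled edge are easy to find
    incident = {}
    for u, v in edges:
        incident.setdefault(u, []).append((u, v))
        incident.setdefault(v, []).append((u, v))
    sampled = []
    remaining = set(edges)
    while remaining and len(sampled) < sample_size:
        pool = list(remaining)
        e = rng.choices(pool, weights=[weights[x] for x in pool], k=1)[0]
        sampled.append(e)
        remaining.discard(e)
        u, v = e
        for nb in incident[u] + incident[v]:
            if nb in remaining:
                weights[nb] += boost  # neighboring edges become more likely
    return sampled
```

Because sampled edges boost their neighbors, the walk tends to stay in and exhaust a neighborhood before drifting away, which is the intuition behind the method.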

Uniform Preferential Selection Model for Generating Scale-free Networks
Methodology and Computing in Applied Probability
It has been observed in real networks that the fraction of nodes P(k) with degree k satisfies the power law P(k) ∝ k^(−γ) for k > k_min > 0. However, the degree distribution of nodes in these networks before k_min varies slowly, to the extent of being uniform, compared to the degree distribution after k_min. Most previous studies focus on the degree distribution after k_min and ignore the initial flatness in the distribution of degrees. In this paper, we propose a model that describes the degree distribution for the whole range k > 0, i.e., before and after k_min. The network evolution consists of two steps. In the first step, a new node is connected to the network through a preferential attachment method. In the second step, a certain number of edges between the existing nodes are added such that the end nodes of an edge are selected either uniformly or preferentially. The model has a parameter to control the uniform or preferential selection of nodes for creating edges in the network. We perform a comprehensive mathematical analysis of our proposed model in the discrete domain and prove that the model exhibits an asymptotically power-law degree distribution after k_min and a flat-ish distribution before k_min. We also develop an algorithm that guides us in determining the model parameters in order to fit the model output to the node degree distribution of a given real network. Our simulation results show that the degree distributions of the graphs generated by this model match well with those of the real-world graphs.
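The two-step evolution can be sketched as follows. This is an illustrative simulation under assumed defaults (one extra edge per step, a seed edge between nodes 0 and 1, a mixing parameter `p_uniform`), not the paper's calibrated model.

```python
import random

def grow_network(n, m_extra=1, p_uniform=0.5, seed=0):
    """Sketch of the two-step growth: (1) attach each new node to an
    existing node preferentially by degree; (2) add m_extra edges between
    existing nodes, choosing endpoints uniformly with probability
    p_uniform and preferentially otherwise."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}
    edges = [(0, 1)]  # seed edge so preferential weights are nonzero

    def pick_preferential():
        nodes = list(degree)
        return rng.choices(nodes, weights=[degree[x] for x in nodes], k=1)[0]

    for new in range(2, n):
        # step 1: preferential attachment of the new node
        target = pick_preferential()
        degree[new] = 1
        degree[target] += 1
        edges.append((new, target))
        # step 2: extra edges among existing nodes
        for _ in range(m_extra):
            if rng.random() < p_uniform:
                u, v = rng.sample(list(degree), 2)   # uniform selection
            else:
                u, v = pick_preferential(), pick_preferential()
                if u == v:
                    continue  # skip self-loops
            degree[u] += 1
            degree[v] += 1
            edges.append((u, v))
    return degree, edges
```

Sweeping `p_uniform` from 0 to 1 moves the second step from purely preferential to purely uniform edge creation, which is the knob that shapes the distribution below k_min.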

Graph sampling allows mining a small representative subgraph from a big graph. Sampling algorithms deploy different strategies to replicate the properties of a given graph in the sampled graph. In this study, we provide a comprehensive empirical characterization of five graph sampling algorithms on six properties of a graph: degree, clustering coefficient, path length, global clustering coefficient, assortativity, and modularity. We extract samples from fifteen graphs grouped into five categories: collaboration, social, citation, technological, and synthetic graphs. We provide both qualitative and quantitative results. We find that no single method extracts true samples from a given graph with respect to the properties tested in this work. Our results show that the sampling algorithm that aggressively explores the neighborhood of a sampled node performs better than the others.

IEEE Access
We present a new graph compression scheme that intrinsically exploits the similarity and locality of references in a graph by first ordering the nodes and then merging the contiguous adjacency lists of the graph into blocks to create a pool of nodes. The nodes in the adjacency lists are encoded by their position in the pool. This simple yet powerful scheme achieves compression ratios better than previous methods for many of the datasets tested in this paper and, on average, surpasses all previous methods. The scheme also provides easy and efficient access for neighbor queries, e.g., finding the neighbors of a node, and reachability queries, e.g., determining whether node u is reachable from node v. We test our scheme on publicly available graphs of different sizes and show a significant improvement in compression ratio and query access time compared to previous approaches.
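The pool-and-position idea can be illustrated in miniature. This sketch only shows position encoding over a deduplicated pool; the actual scheme additionally orders nodes, blocks adjacency lists, and applies variable-length codes, none of which is modeled here.

```python
def build_pool(adj_lists):
    """Merge adjacency lists into a single pool of nodes and encode each
    neighbor by its first position in that pool (illustrative only)."""
    pool = []
    first_pos = {}
    for neighbors in adj_lists:
        for v in neighbors:
            if v not in first_pos:
                first_pos[v] = len(pool)
                pool.append(v)
    encoded = [[first_pos[v] for v in neighbors] for neighbors in adj_lists]
    return pool, encoded

def decode_neighbors(pool, encoded, u):
    # neighbor query: map stored positions back through the pool
    return [pool[p] for p in encoded[u]]
```

When node ordering places similar adjacency lists next to each other, their neighbors land at nearby pool positions, so the position indices are small and compress well.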

Transport and Telecommunication Journal
This paper reports the underexplored potential of a broad range of information and communication technologies (ICTs) that remained effective yet unnoticed in different flood phases for exchanging traffic, travel, and evacuation-related information. The objective was to identify convenient ICTs that people found operational in the life cycle of a flood. For this purpose, ICTs were tested in relation to 18 different variables based on personal capabilities, demographics, and vehicle-based information. Samples of 105 and 102 subjects were recruited from flood-prone communities of developing and developed case studies, respectively, through random sampling and analyzed through multinomial logistic regression. Categories of independent variables that showed a p-value ≥ 0.05 were considered to model the results. The main findings showed that in developed countries TV, mobile phone subscriptions, and international news channels were prominent sources of information, whilst in developing countries mu...
A generative model for time evolving networks
Knowledge and Information Systems, 2021

Discret. Math. Theor. Comput. Sci., 2020
It is commonly believed that real networks are scale-free and that the fraction of nodes $P(k)$ with degree $k$ satisfies the power law $P(k) \propto k^{-\gamma} \text{ for } k > k_{min} > 0$. Preferential attachment is the mechanism that has been considered responsible for such organization of these networks. In many real networks, the degree distribution before $k_{min}$ varies very slowly, to the extent of being uniform, compared with the degree distribution for $k > k_{min}$. In this paper, we propose a model that describes this particular degree distribution for the whole range of $k > 0$. We adopt a two-step approach. In the first step, at every time stamp we add a new node to the network and attach it to an existing node using the preferential attachment method. In the second step, we add edges between existing pairs of nodes with the node selection based on the uniform probability distribution. Our approach generates weakly scale-free networks that closely follow the degree...
arXiv, 2021
Real-world graphs are massive in size and require a huge amount of space to store. Graph compression allows us to compress a graph so that fewer bits per link are needed to store it. Of the many techniques to compress a graph, a typical approach is to find clique-like caveman or traditional communities in a graph and encode those cliques to compress the graph. An alternative approach is to consider graphs as a collection of hubs connecting spokes and exploit this to arrange the nodes so that the resulting adjacency matrix of the graph can be compressed more efficiently. We perform an empirical comparison of these two approaches and show that both methods can yield good results under favorable conditions. We perform our experiments on ten real-world graphs and define two cost functions to present our findings.

Data Mining and Knowledge Discovery, 2020
Large real-world graphs claim lots of resources in terms of memory and computational power, and this makes their full analysis extremely challenging. In order to understand the structure and properties of these graphs, we intend to extract a small representative subgraph from a big graph while preserving its topology and characteristics. In this work, we aim at producing good samples with a sample size as low as 0.1% while maintaining the structure and some of the key properties of a network. We exploit the fact that the average degree and clustering coefficient of a graph can be estimated accurately and efficiently. We use the estimated values to guide the sampling process and extract tiny samples that preserve the properties of the graph and closely approximate their distributions in the original graph. The distinguishing feature of our work is that we apply traversal-based sampling that uses only the local information of nodes, as opposed to the global information of the network, and this makes our approach a practical choice for crawling online networks. We evaluate the effectiveness of our sampling technique using real-world datasets and show that it surpasses existing methods.

Journal of Statistical Physics, 2020
The study and analysis of real-world social, communication, information, and citation networks for understanding their structure and identifying interesting patterns have cultivated the need for designing generative models for such networks. A generative model generates an artificial but realistic-looking network with the same characteristics as the real network under study. In this paper, we propose a new generative model for generating realistic networks. Our proposed model is a blend of three key ideas, namely preferential attachment, assortativity of social links, and randomness in real networks. We present a framework that first tests these ideas separately and then blends them into a mixed model, based on the idea that a real-world graph could be formed by a mixture of these concepts. Our model can be used for generating static as well as time-evolving graphs, and this feature distinguishes it from previous approaches. We compare our model with previous methods for generating graphs and show that it outperforms them in several aspects. We compare our graphs with real-world graphs across many metrics such as degree, clustering coefficient and path length distributions, assortativity, eigenvector centrality, and modularity. In addition, we give both qualitative and quantitative results for clarity.

Intelligent Data Analysis, 2018
Real-world graphs are massive in size and often prohibitively expensive to analyze. Of the possible solutions, sampling extracts a representative subgraph from a large graph that faithfully represents the actual graph. Prior research has developed several sampling methods, but the samples produced by these methods fail to match important properties of the original graph and work poorly in maintaining its topology. We observed that the existing methods do not explore the neighborhood of sampled nodes fairly and hence yield suboptimal samples. In this paper, we introduce a novel approach in which we keep a list of candidate nodes that is populated with all the neighbors of nodes that have been sampled so far. With this approach, we can balance the depth and breadth of graph exploration to produce better samples. We evaluate the effectiveness of our approach using several real-world datasets and show that it surpasses existing state-of-the-art approaches in maintaining the properties of the original graph and retaining its structure. We also calculate the Kolmogorov-Smirnov distance and Jensen-Shannon distance for quantitative evaluation of our approach.
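The candidate-list traversal can be sketched as follows. The uniform draw from the candidate list is an assumption of this sketch; the paper's actual selection policy over the list may differ.

```python
import random

def candidate_list_sample(adj, n_sample, seed=0):
    """Sketch: maintain a list of all neighbors of nodes sampled so far
    and draw the next node from that list, so exploration balances depth
    (recent neighbors) and breadth (older, unvisited neighbors)."""
    rng = random.Random(seed)
    start = rng.choice(list(adj))
    sampled = {start}
    candidates = [v for v in adj[start] if v not in sampled]
    while candidates and len(sampled) < n_sample:
        nxt = candidates.pop(rng.randrange(len(candidates)))
        if nxt in sampled:
            continue  # a node may appear in the list more than once
        sampled.add(nxt)
        for v in adj[nxt]:
            if v not in sampled:
                candidates.append(v)
    return sampled
```

Unlike a pure random walk (depth-biased) or breadth-first crawl (breadth-biased), the shared candidate list lets either kind of neighbor be chosen at every step.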

Coping with bad-mouthing in peer-to-peer file sharing networks
2015 IEEE International Conference on Peer-to-Peer Computing (P2P), 2015
In recent years, P2P file sharing systems have adopted rating systems in the hope of stopping the propagation of bad files. In a rating system, users rate files after downloading, and a file with positive feedback is considered a good file. However, a dishonest rater can undermine the rating system by giving positive ratings to bad files and negative ratings to good files. In this paper, we design two filters based on probabilistic models such that good files with negative feedback are not completely kept out of the system. The first filter is based on the binomial distribution of the ratings of a file; the second filter considers the confidence of the downloading peer and the difference between positive and negative ratings of a file to calculate the probability of taking a risk to download the file or rejecting it. Our filters need only the ratings of a file, which makes them suitable for popular torrent sharing websites that rank files using a binary rating system without any information about raters. In addition, the filters can be implemented entirely on the client side without any modification to the content sharing sites.
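A binomial filter of the first kind might look like the sketch below. The parameter `p_good` (the chance a genuinely good file receives a positive rating) and the use of the binomial tail as the download-risk probability are assumptions for illustration, not the paper's exact formulation.

```python
from math import comb

def accept_probability(pos, neg, p_good=0.8):
    """Sketch of a binomial filter: compute the probability that a
    genuinely good file would still collect `neg` or more negative
    ratings out of pos+neg total, and use that tail probability as the
    chance of risking the download despite the negative feedback."""
    n = pos + neg
    if n == 0:
        return 1.0  # unrated files are not filtered out
    q = 1.0 - p_good
    # P[X >= neg] where X ~ Binomial(n, q) counts negatives for a good file
    return sum(comb(n, k) * q**k * p_good**(n - k) for k in range(neg, n + 1))
```

A file with a few negatives among many positives keeps a high acceptance probability, while a mostly-negative file is rarely downloaded, so good files with some bad ratings are not completely kept out of the system.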

MTree: Reliable routing for machine-to-machine systems
International Conference on ICT for Smart Society, 2013
This paper discusses the design and evaluation of MTree, a reliable routing scheme for distributed data storage and retrieval in machine-to-machine systems. MTree's ability to route efficiently even when a large number of nodes are joining and/or leaving the system makes it an ideal choice for different sorts of applications, including data sharing and content location. Each machine in MTree has a unique node identifier belonging to an identifier space. In MTree, we partition the identifier space into levels and segments and fix the manager of every segment. A node in MTree maintains links with a constant number of nodes at the next level to forward queries. A node also creates a link with a node at the top level to get a global view of the system. This way, MTree traverses a logarithmic number of nodes to route a query to its destination. A prototype implementation of MTree on PeerSim demonstrates its reliability and efficiency.

2013 21st IEEE International Conference on Network Protocols (ICNP), 2013
This paper discusses the design and evaluation of Kistree, a reliable, fault-tolerant, and self-configuring constant-degree distributed hash table (DHT) for peer-to-peer systems. The Kistree topology can be thought of as log(n) vertically stacked layers or levels. At each level, we divide the whole identifier space into segments to form an n-ary tree structure. The nodes and keys belong to a particular segment at a level in the Kistree network depending on the node/key identifier. A node in Kistree contacts a constant number of nodes at the next level to forward queries. A node also creates a link with a node at the topmost level to get a global view of the system. This way, Kistree keeps a constant number of neighbors in the routing table and traverses a logarithmic number of nodes to route a query to its destination. An insert operation stores a key on a number of diverse nodes of the concerned segment, while the lookup operation retrieves a stored key efficiently and reliably. The prototype implementation of Kistree on PeerSim verifies its scalability, reliability, and efficiency. The experimental results, obtained with a network of 50,000 nodes, confirm its self-configurability and ability to route messages even under a high rate of churn.
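The level-and-segment addressing behind such tree DHTs can be illustrated with a toy example. The identifier-space size and arity below are made-up parameters; this only shows why a key's route has logarithmic length, not Kistree's actual wire protocol.

```python
def segment_path(key, id_space=2**16, arity=4):
    """Toy sketch of tree-DHT addressing: at level L the identifier space
    is split into arity**L equal segments, so a key's route is the
    sequence of segment indices from the coarsest level down."""
    path = []
    level = 1
    while arity ** level <= id_space:
        seg_size = id_space // (arity ** level)
        path.append(key // seg_size)  # which segment holds the key at this level
        if seg_size == 1:
            break
        level += 1
    return path
```

Each index refines its parent (segment i at one level splits into children i*arity … i*arity+arity-1 below), so routing a query means following at most log_arity(id_space) hops, one per level.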