Graph Stream Sampling and Learning

The electronic systems and services that underpin daily life—the internet, social networks, banking systems, and online retailers—continuously generate vast amounts of operational data. A key analytical task for service providers is to understand the relation between different components, e.g. to provide friend recommendations based on patterns of social interaction, or to understand how a set of related transactions may collectively signify a bank fraud. The volume of these data presents enormous challenges, both for storage and processing, and for analysis and learning. While sampling and other methods of data reduction can make these data more manageable, without careful design they hinder the ability to discern patterns for forensics and other retrospective analyses.

This work develops a transformative framework for graph stream sampling and its applications that addresses outstanding gaps in knowledge and practice. First, whereas current methods usually focus on computing global graph properties, applications often require a representative sample for rapid or retrospective analysis without having to reprocess the entire stream, even when it is available. Second, current methods are typically optimized for specific subgraph targets and problems, but lack the capability to optimize for analysis accuracy and resource requirements. Third, it is challenging for domain researchers to generalize or apply these results beyond the original problem because of the expert tuning required for their creation. In response, this project will develop a framework for graph data stream sampling that is applicable to a wide class of application problems and settings, allows a tunable trade-off between accuracy and computational resources (space and time), and is implementable as a mapping from problem specification to working application code.
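
To make the idea of a reusable sample with a tunable accuracy/space trade-off concrete, the sketch below applies classical uniform reservoir sampling to a stream of edges. It is an illustration only and is not the project's framework or any of the adaptive methods in the publications listed on this page; the function names `reservoir_sample_edges` and `estimate_degree`, and the parameter `m` (the sample size), are assumptions introduced for this example.

```python
import random


def reservoir_sample_edges(edge_stream, m, seed=None):
    """Keep a uniform random sample of at most m edges from a stream.

    Classical reservoir sampling: after `seen` edges have arrived,
    each of them is in the reservoir with probability min(1, m / seen).
    Returns the reservoir and the number of edges seen.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for edge in edge_stream:
        seen += 1
        if len(reservoir) < m:
            # Reservoir not yet full: always keep the edge.
            reservoir.append(edge)
        else:
            # Replace a random slot with probability m / seen.
            j = rng.randrange(seen)
            if j < m:
                reservoir[j] = edge
    return reservoir, seen


def estimate_degree(reservoir, seen, node):
    """Estimate a node's degree by scaling up its sampled edge count.

    Each sampled edge stands for seen / len(reservoir) stream edges,
    which makes the estimate unbiased under uniform sampling.
    """
    if not reservoir:
        return 0.0
    scale = seen / len(reservoir)
    hits = sum(1 for u, v in reservoir if node in (u, v))
    return hits * scale


if __name__ == "__main__":
    # Hypothetical stream: 10,000 random edges over 100 nodes.
    gen = random.Random(0)
    stream = [(gen.randrange(100), gen.randrange(100, 200)) for _ in range(10_000)]
    sample, seen = reservoir_sample_edges(stream, m=500, seed=1)
    print(len(sample), seen, estimate_degree(sample, seen, node=7))
```

Here `m` is the single tuning knob: a larger reservoir costs more memory but lowers the variance of estimates computed from it, and the stored sample can be queried later without reprocessing the stream.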

NSF-funded PhD research assistantship

An NSF-funded PhD research assistantship is available to support a student working in the area of graph data stream sampling and learning. This position would be a good match for someone with:

Bachelor's or Master's degree in data science, computer science, or electrical engineering
Strong analytic background in applied probability or statistics
Coding experience (Python, C) and some exposure to algorithm design
Interest in interacting with researchers in other disciplines

If you are interested in this position, please email Nick Duffield with your resume, transcripts, and a statement of the reasons for your interest.

Team and Collaborators

Dr. Nesreen Ahmed (Intel)
Prof. Nick Duffield (Texas A&M Electrical and Computer Engineering)
Xi Liu (Texas A&M Electrical and Computer Engineering, PhD Student)
Liangzhen Xia (Facebook, Texas A&M MS graduate)
Yunhong Xu (Texas A&M Electrical and Computer Engineering, PhD Student)
Prof. Minlan Yu (Harvard)

Funding

NSF Grant 1848596 Adaptive Sampling of Massive Graph Streams, PI N. Duffield, 09/01/2018 to 08/21/2020, $200,488
Approximation Methods for Massive Graph Analytics, Intel Corporation, PI Nick Duffield, 2016, $30,000

Publications

Micro- and Macro-Level Churn Analysis of Large-Scale Mobile Games, by Xi Liu, Muhe Xie, Xidao Wen, Rui Chen, Yong Ge, Nick Duffield, Na Wang, arXiv:1901.06247
Streaming Network Embedding through Local Actions, by Xi Liu, Ping-Chun Hsieh, Nick Duffield, Rui Chen, Muhe Xie, Xidao Wen, arXiv:1811.05932
A Semi-Supervised and Inductive Embedding Model for Churn Prediction of Large-Scale Mobile Games, by Xi Liu, Muhe Xie, Xidao Wen, Rui Chen, Yong Ge, Nick Duffield, and Na Wang, accepted to ICDM 2018
Sampling for Approximate Bipartite Network Projection, Nesreen Ahmed, Nick Duffield, Liangzhen Xia, IJCAI-ECAI 2018
Stream Aggregation Through Order Sampling, Nick Duffield, Yunhong Xu, Liangzhen Xia, Nesreen Ahmed, Minlan Yu, CIKM 2017
On Sampling from Massive Graph Streams, Nesreen K. Ahmed, Nick Duffield, Theodore Willke, Ryan A. Rossi, PVLDB 2017
Graphlet Decomposition: Framework, Algorithms, and Applications, Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, Nick Duffield, Theodore L. Willke, Knowledge and Information Systems 2016
Efficient Sampling for Better OSN Data Provisioning, N. Duffield and B. Krishnamurthy, 54th Annual Allerton Conference 2016
Efficient Graphlet Counting for Large Networks, N. Ahmed, J. Neville, R. Rossi, N. Duffield, IEEE ICDM 2015
Graph Sample and Hold: A Framework for Big-Graph Analytics, N. Ahmed, N. Duffield, J. Neville, R. Kompella, ACM SIGKDD 2014

External Resources

www.KDnuggets.com – Big Data, Data Mining, Data Science, and Machine Learning Resources