The electronic systems and services that underpin daily life—the internet, social networks, banking systems, and online retailers—continuously generate vast amounts of operational data. A key analytical task for service providers is to understand the relation between different components, e.g. to provide friend recommendations based on patterns of social interaction, or to understand how a set of related transactions may collectively signify a bank fraud. The volume of these data presents enormous challenges, both for storage and processing, and for analysis and learning. While sampling and other methods of data reduction can make these data more manageable, without careful design they hinder the ability to discern patterns for forensics and other retrospective analyses.
This work develops a transformative framework for graph stream sampling and its applications that addresses outstanding gaps in knowledge and practice. First, whereas current methods usually focus on computing global graph properties, applications often require a representative sample for rapid or retrospective analysis without having to reprocess the entire stream, even if it is available. Second, current methods are typically optimized for specific subgraph targets and problems, but lack the capability to optimize for analysis accuracy and resource requirements. Third, it is challenging for domain researchers to generalize or apply these results beyond the original problem because of the expert tuning required for their creation. In response, this project will develop a framework for graph data stream sampling that is applicable to a wide class of application problems and settings, allows a tunable trade-off between accuracy and the space and time resources consumed, and is implementable as a mapping between problem specification and working application code.
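To make the idea of a fixed-memory sample of an edge stream concrete, the following is a minimal sketch of classical uniform reservoir sampling. It is a textbook baseline and not the project's own method. The names `reservoir_sample_edges`, `estimate_count`, `capacity`, and `seed` are hypothetical and chosen only for illustration. The `capacity` parameter plays the role of the accuracy versus space knob: a larger reservoir costs more memory and yields lower-variance estimates.

```python
import random
from typing import Callable, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def reservoir_sample_edges(
    stream: Iterable[Edge], capacity: int, seed: int = 0
) -> Tuple[List[Edge], int]:
    """Keep a uniform random sample of at most `capacity` edges in one pass.

    Returns the sample and the number of edges seen in the stream.
    """
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    reservoir: List[Edge] = []
    seen = 0
    for edge in stream:
        seen += 1
        if len(reservoir) < capacity:
            # Fill phase: the first `capacity` edges are always kept.
            reservoir.append(edge)
        else:
            # Replacement phase: the new edge enters with probability
            # capacity / seen, evicting a uniformly chosen resident edge.
            j = rng.randrange(seen)
            if j < capacity:
                reservoir[j] = edge
    return reservoir, seen


def estimate_count(
    reservoir: List[Edge], seen: int, predicate: Callable[[Edge], bool]
) -> float:
    """Scale the sample count up to an estimate for the whole stream."""
    if not reservoir:
        return 0.0
    hits = sum(1 for edge in reservoir if predicate(edge))
    return hits * seen / len(reservoir)


if __name__ == "__main__":
    # Hypothetical toy stream: a path graph on 1001 nodes.
    edges = [(i, i + 1) for i in range(1000)]
    sample, n = reservoir_sample_edges(edges, capacity=50, seed=1)
    # Estimate how many edges touch a node with id below 100.
    print(estimate_count(sample, n, lambda e: e[0] < 100))
```

Because the sample is retained after the stream has passed, questions that were not anticipated at collection time can still be answered approximately from the sample alone, which is the retrospective use case described above.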
NSF-funded PhD research assistantship
An NSF-funded PhD research assistantship is available to support a student working in the area of graph data stream sampling and learning. This position would be a good match for someone with:
- Bachelor's or Master's degree in data science, computer science, or electrical engineering
- Strong analytic background in applied probability or statistics
- Coding experience (Python, C) and some exposure to algorithm design
- Interest in interacting with researchers in other disciplines
If you are interested in this position, please email Nick Duffield with your resume and transcripts, and state the reasons for your interest.
Team and Collaborators
- Dr. Nesreen Ahmed (Intel)
- Prof. Nick Duffield (Texas A&M Electrical and Computer Engineering)
- Xi Liu (Texas A&M Electrical and Computer Engineering, PhD Student)
- Liangzhen Xia (Facebook, Texas A&M MS graduate)
- Yunhong Xu (Texas A&M Electrical and Computer Engineering, PhD Student)
- Prof. Minlan Yu (Harvard)
Funding
- NSF Grant 1848596 Adaptive Sampling of Massive Graph Streams, PI N. Duffield, 09/01/2018 to 08/21/2020, $200,488
- Approximation Methods for Massive Graph Analytics, Intel Corporation, PI Nick Duffield, 2016, $30,000
Publications
- Micro- and Macro-Level Churn Analysis of Large-Scale Mobile Games, by Xi Liu, Muhe Xie, Xidao Wen, Rui Chen, Yong Ge, Nick Duffield, Na Wang, arXiv:1901.06247
- Streaming Network Embedding through Local Actions, by Xi Liu, Ping-Chun Hsieh, Nick Duffield, Rui Chen, Muhe Xie, Xidao Wen, arXiv:1811.05932
- A Semi-Supervised and Inductive Embedding Model for Churn Prediction of Large-Scale Mobile Games, by Xi Liu, Muhe Xie, Xidao Wen, Rui Chen, Yong Ge, Nick Duffield, and Na Wang, accepted to ICDM 2018
- Sampling for Approximate Bipartite Network Projection, Nesreen Ahmed, Nick Duffield, Liangzhen Xia, IJCAI-ECAI 2018
- Stream Aggregation Through Order Sampling, Nick Duffield, Yunhong Xu, Liangzhen Xia, Nesreen Ahmed, Minlan Yu, CIKM 2017
- On Sampling from Massive Graph Streams, Nesreen K. Ahmed, Nick Duffield, Theodore Willke, Ryan A. Rossi, PVLDB 2017
- Graphlet Decomposition: Framework, Algorithms, and Applications, Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, Nick Duffield, Theodore L. Willke, Knowledge and Information Systems 2016
- Efficient Sampling for Better OSN Data Provisioning, N. Duffield and B. Krishnamurthy, 54th Annual Allerton Conference 2016
- Efficient Graphlet Counting for Large Networks, N. Ahmed, J. Neville, R. Rossi, N. Duffield, IEEE ICDM 2015
- Graph Sample and Hold: A Framework for Big-Graph Analytics, N. Ahmed, N. Duffield, J. Neville, R. Kompella, ACM SIGKDD 2014
External Resources
- www.KDnuggets.com – Big Data, Data Mining, Data Science, and Machine Learning Resources