Graph Stream Sampling and Learning

The electronic systems and services that underpin daily life—the internet, social networks, banking systems, and online retailers—continuously generate vast amounts of operational data. A key analytical task for service providers is to understand the relation between different components, e.g. to provide friend recommendations based on patterns of social interaction, or to understand how a set of related transactions may collectively signify a bank fraud. The volume of these data presents enormous challenges, both for storage and processing, and for analysis and learning. While sampling and other methods of data reduction can make these data more manageable, without careful design they hinder the ability to discern patterns for forensics and other retrospective analyses.

This work develops a transformative framework for graph stream sampling and its applications that address outstanding gaps in knowledge and practice. First, current methods usually focus on computation of global graph properties, applications often require a representative sample for rapid or retrospective analysis without having to reprocess the entire stream, even if available. Second, current methods are typically optimized for specific subgraph targets and problems, but lack the capability to optimize for analysis accuracy and resource requirements. Third, it is challenging for domain researchers to generalize or apply these results beyond the original problem because of the expert tuning required for their creation. In response, this project will develop a framework for graph data stream sampling that is applicable to a wide class of application problems and settings, allows tunable trade-off between accuracy and space and time computational resources, and is implementable as a mapping between problem specification and working application code.

NSF-funded PhD research assistantship

An NSF funded PhD research assistantship is available to support a student working in the area of graph data stream sampling and learning. This position would be a good match for someone with:

  • Bachelors or Masters degree in data science, computer science, or electrical engineering
  • Strong analytic background in applied probability or statistics
  • Coding experience (Python, C) and some exposure to algorithm design
  • Interest in interacting with researchers in other disciplines

If you are interested in this position, please send email to Nick Duffield including resume, transcripts, and stating the reasons for your interest.

Team and Collaborators

  • Dr. Nesreen Ahmed (Intel)
  • Prof. Nick Duffield (Texas A&M Electrical and Computer Engineering)
  • Xi Liu (Texas A&M Electrical and Computer Engineering, PhD Student)
  • Liangzhen Xia (Facebook, Texas A&M MS graduate)
  • Yunhong Xu (Texas A&M Electrical and Computer Engineering, PhD Student)
  • Prof. Minlan Yu (Harvard)

Funding

  • NSF Grant 1848596 Adaptive Sampling of Massive Graph Streams, PI N. Duffield, 09/01/2018 to 08/21/2020, $200,488
  • Approximation Methods for Massive Graph Analytics, Intel Corporation, PI Nick Duffeld, 2016, $30,000

Publications

External Resources

  • www.KDnuggets.com – Big Data, Data Mining, Data Science, and Machine Learning Resources