Today Udayan Khurana presented his PhD Defense presentation on Historical Graph Data Management at 3258 A.V. Williams Building (AVW). Here is a brief overview:

Abstract

Over the last decade, we have witnessed an increasing interest in temporal analysis of information networks such as social networks or citation networks. Finding temporal interaction patterns, visualizing the evolution of graph properties, or even simply com- paring them across time, has proven to add significant value in reasoning over networks. However, because of the lack of underlying data management support, much of the work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. Unfortunately, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. In the absence of appropriate support, an analyst today has to manually navigate the added temporal complexity of large evolving graphs, making the process cumbersome and ineffective.

In this dissertation, I address the key challenges in storing, retrieving, and analyzing large historical graphs. In the first part, I present DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compact recording of the historical information, and that supports efficient retrieval of historical graph snapshots. I present analytical models for estimating required storage space and snapshot retrieval times which aid in choosing the right parameters for a specific scenario. I also present optimizations such as partial materialization and columnar storage to speed up snapshot retrieval.

In the second part, I present Temporal Graph Index that builds upon DeltaGraph to support version-centric retrieval such as a node’s 1-hop neighborhood history, along with snapshot reconstruction. It provides high scalability, employing careful partitioning, distribution, and replication strategies that effectively deal with temporal and topological skew, typical of temporal graph datasets. In the last part of the dissertation, I present Temporal Graph Analysis Framework that enables analysts to effectively express a variety of complex historical graph analysis tasks using a set of novel temporal graph operators and to execute them in an efficient and scalable manner on a cloud. My proposed solutions are engineered in the form of a framework called the Historical Graph Store, designed to facilitate a wide variety of large-scale historical graph analysis.

Examining Committee:

Committee Chair:                                              Dr. Amol Deshpande

Dean’s Representative:                                      Dr. Jimmy Lin

Committee Member(s)                                       Dr. Mihai PopDr. Tudor Dumitras and Dr. Catherine Plaisant

Outline

Introduction

Over the last decade, we have witnessed an increasing interest in temporal analysis of information networks such as social networks or citation networks. Finding temporal interaction patterns, visualizing the evolution of graph properties, or even simply com- paring them across time, has proven to add significant value in reasoning over networks. However, because of the lack of underlying data management support, much of the work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. Unfortunately, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. In the absence of appropriate support, an analyst today has to manually navigate the added temporal complexity of large evolving graphs, making the process cumbersome and ineffective.

Reconstruction of Historical Graph States

Historical Graph Retrieval at Scale

IMG_8010

Temporal Graph Analytics

Temporal graph Analysis Framework (TAF)

Node-centric graph extraction and analytical logic

Set of Nodes (SON) refers to a collection of temporal nodes

Operations
  • Extract: Timeslice (\tao), Select (\sigma), Filter (\pi)
  • Compute: NodeCompute (\mu), NodeComputeTemporal (\rho), etc.
  • Analyze: Compare (\kai), Evolution (\eta), other aggregates

Based on Apache Spark, an in-memory cluster compute framework, uses Resillient Distributed Datasets (RDD) as a data abstraction.

Dataset

Wikipedia citation dataset with 267M edge.

NodeComputeTemporal provides large benefits for incremental computation.

Conclusion

Identified key problems in historical graph data management.
Presented solutions for storage, retrieval and analytics:
  • DeltaGraph: Snapshot reconstruction
  • GraphPool: Snapshot-based analysis
  • TGI: Version, snapshot retrieval in the cloud
  • TAF: Expressing and executing scalable analytics

Historical Graph Store:

  • Is useful for conducting temporal graph analysis at scale
  • Would help analysts, scientists in various domains