Structure Assessment in Streaming Data

The widespread use of Internet of Things (IoT) technologies, smartphones, and social media services generates huge amounts of data streaming at high velocity. Automatic interpretation of these rapidly arriving data streams is required for the timely detection of interesting events, which usually emerge in the form of clusters. At present, no technique is available for visual assessment of evolving cluster structure in high-velocity, massive data streams.

The visual assessment of cluster tendency (VAT) model produces a record of structural evolution in the data stream by building a cluster heat map of the entire processing history of the stream. Existing VAT-based algorithms for streaming data are not suitable for high-velocity, high-volume streams because their memory requirements grow and their processing slows as the accumulated data increases. The scalable iVAT (siVAT) algorithm can handle big batch data, but for streaming data it must be reapplied every time a new datapoint arrives, which is not feasible because of the associated computational complexity. The aim is to develop an online algorithm for tracking evolving cluster structures in high-velocity, high-dimensional data streams. To this end, an incremental version of the scalable iVAT algorithm is developed for change detection and structural assessment in high-velocity data streams.

The developed algorithm is illustrated with a 2D synthetic dataset that evolves significantly over time. It produces a reordered dissimilarity image (RDI), or cluster heat map (a square digital image), for cluster assessment, which is updated after every new chunk of a pre-specified number of datapoints. The intensity of each pixel in an RDI reflects the dissimilarity between the corresponding row and column objects. A "useful" RDI highlights potential clusters as a set of "dark blocks" along the diagonal of the image. This video demonstrates the algorithm’s ability to visualize changing cluster structure in streaming data.
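To make the idea of an RDI concrete, the following is a minimal sketch of the classic batch VAT reordering, not the incremental siVAT algorithm described above. It assumes Euclidean dissimilarity and NumPy; the function names `vat_order`, `reordered_dissimilarity_image`, and `rdi_per_chunk` are hypothetical and chosen for illustration only. The last function is the naive baseline that recomputes the image from scratch after every chunk, which is the approach the text identifies as too costly for high-velocity streams.

```python
import numpy as np


def vat_order(D):
    """Return the VAT ordering of objects for a square dissimilarity matrix D.

    Classic batch VAT: start at one end of the largest dissimilarity, then
    repeatedly append the unselected object closest to any selected object
    (a Prim-style minimum spanning tree traversal).
    """
    n = D.shape[0]
    start, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(start)]
    remaining = np.ones(n, dtype=bool)
    remaining[start] = False
    # Smallest dissimilarity from each object to the selected set so far.
    dist_to_selected = D[start].copy()
    for _ in range(n - 1):
        masked = np.where(remaining, dist_to_selected, np.inf)
        nxt = int(np.argmin(masked))
        order.append(nxt)
        remaining[nxt] = False
        dist_to_selected = np.minimum(dist_to_selected, D[nxt])
    return np.array(order)


def reordered_dissimilarity_image(X):
    """Compute the RDI of the rows of X (assumption: Euclidean dissimilarity)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    order = vat_order(D)
    # Permute rows and columns identically so clusters form diagonal blocks.
    return D[np.ix_(order, order)], order


def rdi_per_chunk(stream, chunk_size):
    """Naive baseline: rebuild the RDI from scratch after every full chunk.

    This is NOT the incremental algorithm; its cost grows with the
    accumulated data, which is what an incremental method aims to avoid.
    """
    seen = []
    for count, point in enumerate(stream, start=1):
        seen.append(point)
        if count % chunk_size == 0:
            yield reordered_dissimilarity_image(np.asarray(seen))[0]
```

Displaying a returned matrix as a grayscale image with small values drawn dark (for example with a plotting library of your choice) gives the "dark blocks" along the diagonal when the data contain well-separated clusters.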