Big Data Cluster Analysis

Every day, large volumes of data are generated by sources such as IoT networks, smartphones, and social network activity. Making sense of data at this unprecedented scale is essential for many businesses, services, and applications. Currently, there is little domain expertise available to automate big data analysis, and traditional supervised machine learning techniques suffer from a lack of labelled training data in this setting. The aim of this work is to develop scalable and efficient unsupervised algorithms to manage big data and extract actionable information from it.

Cluster analysis is an unsupervised approach for discovering the underlying groups and patterns in data. For any data set, it consists of three problems: (P1) cluster assessment, which asks "Do the data have clusters? If yes, how many?"; (P2) clustering, i.e., partitioning the data into clusters; and (P3) cluster validity, which asks "Are the clusters found useful? Is there a better partition we did not find?" Traditional cluster analysis algorithms are not suitable for big data owing to its volume, variety, and velocity.
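The three problems can be illustrated on a small synthetic data set. The following is a minimal sketch and not the algorithms developed in this work: it assumes plain k-means with farthest-first initialisation for P2 and uses the mean silhouette score as a stand-in for both P1 (choosing the number of clusters) and P3 (scoring the resulting partition). The helper names (`make_blobs`, `farthest_first`, `kmeans`, `silhouette`) and the data are hypothetical, and the quadratic-cost silhouette computation would not scale to big data.

```python
import math
import random


def make_blobs(centers, n_per, spread, seed=0):
    """Synthetic 2-D data: n_per Gaussian points around each given center."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(n_per)]


def farthest_first(points, k):
    """Deterministic initialisation: each new center is the point farthest
    from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """P2 (clustering): Lloyd's k-means; returns one label per point."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means tighter,
    better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


points = make_blobs([(0, 0), (10, 0), (5, 9)], n_per=30, spread=0.5)

# P1 (assessment): score each candidate k and keep the best one.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)

# P2 (clustering): partition the data with the chosen k.
labels = kmeans(points, best_k)

# P3 (validity): a score near 1 suggests compact, well-separated clusters.
validity = scores[best_k]
print(best_k, round(validity, 3))
```

A low best score across all candidate values of k would hint that the data have little cluster structure, which is the "Do the data have clusters?" half of P1. Using the same index for P1 and P3 is a simplification made here for brevity.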

A suite of novel scalable algorithms was developed to solve each of these three problems (cluster assessment, clustering, and cluster validity) for big data that may be high-dimensional, anomalous, or streaming.