TY - JOUR
T1 - CDF Transform-and-Shift
T2 - An effective way to deal with datasets of inhomogeneous cluster densities
AU - Zhu, Ye
AU - Ting, Kai Ming
AU - Carman, Mark J.
AU - Angelova, Maia
PY - 2021/9
Y1 - 2021/9
AB - The problem of inhomogeneous cluster densities has been a long-standing issue for distance-based and density-based algorithms in clustering and anomaly detection. These algorithms implicitly assume that all clusters have approximately the same density. As a result, they often exhibit a bias towards dense clusters in the presence of sparse clusters. Many remedies have been suggested; yet, we show that they are partial solutions which do not address the issue satisfactorily. To match the implicit assumption, we propose to transform a given dataset such that the transformed clusters have approximately the same density while all regions of locally low density become globally low density—homogenising cluster density while preserving the cluster structure of the dataset. We show that this can be achieved by using a new multi-dimensional Cumulative Distribution Function in a transform-and-shift method. The method can be applied to every dataset, before the dataset is used in many existing algorithms to match their implicit assumption without algorithmic modification. We show that the proposed method performs better than existing remedies.
KW - Density-based clustering
KW - Density-ratio
KW - Inhomogeneous cluster densities
KW - kNN anomaly detection
KW - Scaling
KW - Shift
UR - https://www.sciencedirect.com/science/article/pii/S0031320321001643
UR - http://www.scopus.com/inward/record.url?scp=85104352798&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2021.107977
DO - 10.1016/j.patcog.2021.107977
M3 - Article
AN - SCOPUS:85104352798
SN - 0031-3203
VL - 117
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 107977
ER -