dataset - How to generate multidimensional data with specific clustering properties?


In section 5.a of a research paper, the researchers used the following synthetic datasets:

  1. Gauss consisted of 6 Gaussian clusters with identity covariance, each with 500 points in 5 dimensions. The means were randomly assigned a value between 0 and 10 in each dimension. Cluster means were required to be at least 4 Euclidean distance apart, and points were required to be within 2 Euclidean distance of their cluster mean.
  2. Paired consisted of 3 pairs of Gaussian clusters with identity covariance, each with 500 points in 5 dimensions. Each pair of Gaussians was placed around a mean randomly assigned a value between 0 and 20 in each dimension, such that the Euclidean distance between paired Gaussian clusters was between 4 and 8, and the Euclidean distance between non-paired Gaussians was at least 12. Additionally, points were required to be within 2 Euclidean distance of their cluster mean.
  3. Elong consisted of 5 Gaussian clusters with identity covariance, each with 300 points in 5 dimensions. The means were randomly assigned a value between 0 and 50 in each dimension. To create elongated clusters in different dimensions, the values of a single, distinct dimension for each cluster were multiplied by 15. Cluster means were required to be at least 5 Euclidean distance apart.
  4. Uniform consisted of 8 clusters, each with 300 points in 3 dimensions. Each cluster had points uniformly distributed in a 3x3x3 box around a randomly assigned center in a 10x10x10 cube. Cluster centers were required to be 5 Euclidean distance apart.
  5. Rings consisted of 2 ring clusters centered around (0,0), a larger outer ring of radius 2 and a smaller inner ring of radius 1. 400 points were evenly spaced by degrees on the inner ring.

http://postimg.org/image/jo4rjztjz/


I don't have these datasets. I tried to contact the researcher, but to no avail.

How can I create these datasets? Is there any kind of tool to create them?

The original paper can be found here

Documentation and examples on the ELKI data set generator can be found here: http://elki.dbs.ifi.lmu.de/wiki/datasetgenerator

The generator in ELKI cannot produce ring-shaped clusters (only spherical ones), and it does not support clipping points at a maximum distance. It generates the samples for each dimension independently. The only supported operation that uses more than one dimension at a time is the rotation operation. Generating ring-shaped clusters, or clipping clusters based on the distance from the mean, requires a form of dependence between the values that is not supported.
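To illustrate the kind of dependence between coordinates that a ring needs, here is a minimal Python sketch using only the standard library. The function name `make_ring` and the number of points on the outer ring are my own assumptions; the description in the question only gives 400 points for the inner ring.

```python
import math


def make_ring(radius, n, center=(0.0, 0.0)):
    """Return n points evenly spaced by angle on a circle of the given radius."""
    cx, cy = center
    step = 2.0 * math.pi / n  # angular spacing between neighbouring points
    return [
        (cx + radius * math.cos(i * step), cy + radius * math.sin(i * step))
        for i in range(n)
    ]


# Inner ring: radius 1, 400 points (as described in the question).
inner = make_ring(1.0, 400)
# Outer ring: radius 2; the point count here is an assumption, not from the paper.
outer = make_ring(2.0, 400)

# Label 0 = inner ring, label 1 = outer ring.
rings = [(x, y, 0) for x, y in inner] + [(x, y, 1) for x, y in outer]
```

Both coordinates are computed from the same angle, which is exactly the dependence that a per-dimension generator cannot express.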

You will need to either contact the authors of the publication, or write a program to generate such data yourself. It's not hard, but it may not be worth the effort to generate such synthetic data - it's not a realistic scenario in my opinion.
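As a rough idea of what such a program could look like, here is a sketch for the "gauss" data set using rejection sampling in plain Python. This is my reading of the description in the question, not the authors' code: the function names, the seed, and the `max_tries` safeguard are all assumptions for illustration.

```python
import math
import random


def sample_means(k, dim, low, high, min_dist, rng, max_tries=10_000):
    """Draw k centers uniformly in [low, high]^dim, pairwise at least min_dist apart."""
    means = []
    tries = 0
    while len(means) < k:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place centers; relax the constraints")
        candidate = [rng.uniform(low, high) for _ in range(dim)]
        # Keep the candidate only if it is far enough from every accepted center.
        if all(math.dist(candidate, m) >= min_dist for m in means):
            means.append(candidate)
    return means


def sample_clipped_gaussian(mean, n, max_dist, rng):
    """Draw n points from N(mean, I), rejecting any point farther than max_dist."""
    points = []
    while len(points) < n:
        p = [rng.gauss(mu, 1.0) for mu in mean]
        if math.dist(p, mean) <= max_dist:
            points.append(p)
    return points


def make_gauss(k=6, n=500, dim=5, low=0.0, high=10.0,
               min_dist=4.0, max_dist=2.0, seed=0):
    """Build the 'gauss' data set: returns (means, points, labels)."""
    rng = random.Random(seed)
    means = sample_means(k, dim, low, high, min_dist, rng)
    data, labels = [], []
    for label, mean in enumerate(means):
        data.extend(sample_clipped_gaussian(mean, n, max_dist, rng))
        labels.extend([label] * n)
    return means, data, labels


means, data, labels = make_gauss()
```

The other Gaussian data sets should follow the same pattern with different constraints. For "elong", for example, one coordinate of each cluster's samples would be scaled by 15 before adding the mean, and the clipping step would be dropped.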

