Cluster data using the k-means algorithm. Can use either the Euclidean distance (default) or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than mean. For more information see:
D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
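The difference between the two centroid rules can be illustrated with a small, standalone sketch. This is not the library's implementation; `update_centroid` and its arguments are hypothetical names chosen for illustration, and only the mean-versus-median rule is taken from the description above.

```python
from statistics import mean, median

def update_centroid(points, distance_function="Euclidean"):
    """Return the centroid of a non-empty list of equal-length numeric tuples.

    Hypothetical helper: uses the component-wise median for "Manhattan"
    and the component-wise mean otherwise.
    """
    if not points:
        raise ValueError("points must not be empty")
    combine = median if distance_function == "Manhattan" else mean
    # zip(*points) groups the values of each component (column) together.
    return tuple(combine(column) for column in zip(*points))

cluster = [(1.0, 2.0), (2.0, 4.0), (9.0, 6.0)]
euclidean_centroid = update_centroid(cluster)               # component-wise mean
manhattan_centroid = update_centroid(cluster, "Manhattan")  # component-wise median
```

The outlier value 9.0 pulls the mean of the first component toward it, while the median stays at the middle value, which is why the median pairs naturally with the Manhattan distance.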
Use canopies to reduce the number of distance calculations.
maxCandidates
Integer, default: 100
Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, together with the characteristics of the data, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excessive memory consumption. This setting prevents large numbers of candidate canopies from consuming memory.
periodicPruning
Integer, default: 10000
How often to prune low-density canopies when using canopy clustering.
minDensity
Integer, default: 2
Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning.
t1
Float, default: -1.5
The T1 distance to use when using canopy clustering. A value < 0 is interpreted as a positive multiplier for T2; for example, the default of -1.5 sets T1 to 1.5 times T2.
t2
Float, default: -1
The T2 distance to use when using canopy clustering. Values < 0 cause a heuristic based on the attribute standard deviation to be used.
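The sign convention for t1 can be sketched as follows. This is only an illustration of the rule stated above: `resolve_t1` is a hypothetical helper, and it assumes T2 has already been resolved to a positive distance, because the standard-deviation heuristic used for negative t2 values is not specified here.

```python
def resolve_t1(t1, t2):
    """Return the effective T1 distance for a given, already positive, T2.

    Hypothetical helper: a negative t1 is treated as a positive
    multiplier of t2; any other value is used as the distance itself.
    """
    if t2 <= 0:
        raise ValueError("t2 must already be resolved to a positive distance")
    return -t1 * t2 if t1 < 0 else t1

default_t1 = resolve_t1(-1.5, 2.0)   # multiplier form: 1.5 * 2.0
explicit_t1 = resolve_t1(5.0, 2.0)   # explicit distance, used unchanged
```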
distanceFunction
String, default: "Euclidean"
Distance function to use. Options are: Euclidean and Manhattan.
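The choice of distance function can change which centroid a point is assigned to. The following standalone sketch shows the two standard distance definitions; the function names and the `DISTANCES` lookup are hypothetical and are not part of the API.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical lookup keyed by the option names listed above.
DISTANCES = {"Euclidean": euclidean, "Manhattan": manhattan}

def nearest_centroid(point, centroids, distance_function="Euclidean"):
    """Return the index of the centroid closest to the point."""
    dist = DISTANCES[distance_function]
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

centroids = [(3, 3), (0, 5)]
by_euclidean = nearest_centroid((0, 0), centroids)               # sqrt(18) < 5
by_manhattan = nearest_centroid((0, 0), centroids, "Manhattan")  # 6 > 5
```

Here the same point is assigned to the first centroid under the Euclidean distance and to the second under the Manhattan distance.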
maxIterations
Integer, default: null
Maximum number of iterations.
preserveOrder
Boolean, default: false
Preserve order of instances.
fast
Boolean, default: false
Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
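One common way to use a cut-off value is to abandon a distance computation as soon as its partial sum exceeds the best distance found so far. The sketch below illustrates that general technique only; it is an assumption about what such an optimization can look like, not a description of the library's internals, and both function names are hypothetical.

```python
def squared_distance_with_cutoff(a, b, cutoff=float("inf")):
    """Return the squared Euclidean distance, or None once it exceeds cutoff."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > cutoff:
            # The partial sum already exceeds the cut-off, so the full
            # distance cannot be the smallest; stop early.
            return None
    return total

def nearest_centroid_fast(point, centroids):
    """Return the index of the nearest centroid, using early exit."""
    best_index, best = -1, float("inf")
    for i, centroid in enumerate(centroids):
        d = squared_distance_with_cutoff(point, centroid, best)
        if d is not None and d < best:
            best_index, best = i, d
    return best_index

nearest = nearest_centroid_fast((0, 0), [(3, 3), (1, 1), (10, 10)])
```

With this kind of early exit, the full distance to most centroids is never computed, which may be why complete squared errors and distances are unavailable when this option is enabled.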
Last updated 2024-09-19 UTC.