Initial commit for GitHub
nunofachada committed Apr 28, 2014
0 parents commit 39b0796
Showing 4 changed files with 254 additions and 0 deletions.
13 changes: 13 additions & 0 deletions .gitignore
@@ -0,0 +1,13 @@
*~
.~*#
.nfs*
*.mat
*.fig
*.aux
*.log
*.blg
*.out
*.pdf
*.gz
*.ods
*.eps
66 changes: 66 additions & 0 deletions README.md
@@ -0,0 +1,66 @@
# User Manual

### Summary

A MATLAB/Octave script that generates 2D data for clustering. Data is
created along straight lines, which can be more or less parallel
depending on the selected input parameters.

### Synopsis

    [data, clustPoints, idx, centers, slopes, lengths] =
        generateData(slope, slopeStd, numClusts, xClustAvgSep,
                     yClustAvgSep, lengthAvg, lengthStd, lateralStd,
                     totalPoints)

### Input parameters

Parameter | Description
-------------- | ------------------------------------------------------------------------------------------------------
*slope* | Base direction of the lines on which clusters are based
*slopeStd* | Standard deviation of the slope; used to obtain a random slope variation from the normal distribution, which is added to the base slope in order to obtain the final slope of each cluster
*numClusts* | Number of clusters (and therefore of lines) to generate
*xClustAvgSep* | Average separation of line centers along the X axis
*yClustAvgSep* | Average separation of line centers along the Y axis
*lengthAvg* | The base length of lines on which clusters are based
*lengthStd* | Standard deviation of line length; used to obtain a random length variation from the normal distribution, which is added to the base length in order to obtain the final length of each line
*lateralStd* | "Cluster fatness", i.e., the standard deviation of the distance from each point to the respective line, in both *x* and *y* directions; this distance is obtained from the normal distribution
*totalPoints* | Total points in generated data (will be randomly divided among clusters)
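
A minimal sketch of how the `slopeStd` and `lengthStd` parameters might act, assuming each variation is drawn with `randn` and added to the base value, as the descriptions in the table state. The numeric values are illustrative assumptions, not defaults of the script.

```matlab
% Illustrative values (assumptions, not defaults of generateData)
slope = 1; slopeStd = 0.5;
lengthAvg = 5; lengthStd = 1;
numClusts = 5;

% Final slope of each cluster: base slope plus a normally
% distributed variation with standard deviation slopeStd
slopes = slope + slopeStd * randn(numClusts, 1);

% Final length of each line: base length plus a normally distributed
% variation; abs() keeps the resulting length non-negative
lengths = abs(lengthAvg + lengthStd * randn(numClusts, 1));
```

With `slopeStd = 0` every line would share the base slope, so the clusters would be parallel; larger values make them progressively less parallel.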

### Return values

Value | Description
------------- | --------------------------------------------------------------------------------------
*data* | Matrix (*totalPoints* x *2*) with the generated data
*clustPoints* | Vector (*numClusts* x *1*) containing number of points in each cluster
*idx* | Vector (*totalPoints* x *1*) containing the cluster indices of each point
*centers* | Matrix (*numClusts* x *2*) containing centers from where clusters were generated
*slopes* | Vector (*numClusts* x *1*) containing the effective slopes used to generate clusters
*lengths* | Vector (*numClusts* x *1*) containing the effective lengths used to generate clusters
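
The return values can be combined with ordinary indexing. The sketch below uses a tiny hand-made stand-in for `data`, `idx` and `clustPoints` (hypothetical values, not actual output of `generateData`) to show how the points of one cluster might be selected and how `clustPoints` relates to `idx`.

```matlab
% Hand-made stand-in for the outputs of generateData (hypothetical values)
data = [0 0; 1 1; 2 2; 10 10; 11 12];
idx = [1; 1; 1; 2; 2];
clustPoints = [3; 2];

% Points belonging to cluster 2, selected by logical indexing on idx
cluster2 = data(idx == 2, :);

% The number of selected rows should match the entry in clustPoints
numInCluster2 = size(cluster2, 1);

% Per-cluster point counts recomputed from idx
counts = accumarray(idx, 1);
```

Here `counts` and `clustPoints` should agree, since both describe how many points carry each cluster index.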

### Usage example

    [data, cp, idx] = generateData(1, 0.5, 5, 15, 15, 5, 1, 2, 200);

The previous command creates 5 clusters with a total of 200 points, with
a base slope of 1 (*std*=0.5), separated on average by 15 units in both
*x* and *y* directions, with an average length of 5 units (*std*=1) and a
"fatness" or spread of 2 units.

To take a quick look at the clusters, run:

    scatter(data(:,1), data(:,2), 8, idx);
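
For intuition about what length, slope and "fatness" mean geometrically, here is a hedged sketch of how the points of a single cluster could be placed. It restates the per-point loop of `generateData.m` in vectorized form rather than reproducing it; the names `center`, `clustSlope`, `clustLength` and `n` and all numeric values are assumptions introduced for illustration.

```matlab
% Illustrative values for a single cluster (assumptions)
center = [0 0];       % cluster center
clustSlope = 1;       % effective slope of this cluster's line
clustLength = 5;      % effective length of this cluster's line
lateralStd = 2;       % spread ("fatness") around the line
n = 40;               % number of points in this cluster

% Position of each point along the line, uniform in [-length/2, length/2]
position = clustLength * rand(n, 1) - clustLength / 2;

% Projection of each position onto the x and y axes
delta_x = cos(atan(clustSlope)) * position;
delta_y = delta_x * clustSlope;

% Add normally distributed noise in both directions and shift to the center
points = [center(1) + delta_x + lateralStd * randn(n, 1), ...
          center(2) + delta_y + lateralStd * randn(n, 1)];
```

Plotting `points` with `scatter(points(:,1), points(:,2))` should show a cloud stretched along the line direction; a smaller `lateralStd` makes it thinner.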

### Reference

If you use this script in your work, please cite the following reference:

- Fachada, N., Figueiredo, M.A.T., Lopes, V.V., Martins, R.C., Rosa,
A.C., [Spectrometric differentiation of yeast strains using minimum volume
increase and minimum direction change clustering criteria](http://www.sciencedirect.com/science/article/pii/S0167865514000889),
Pattern Recognition Letters, vol. 45, pp. 55-61 (2014), doi: http://dx.doi.org/10.1016/j.patrec.2014.03.008

### License

This script is made available under the [Simplified BSD License](license.txt).

151 changes: 151 additions & 0 deletions generateData.m
@@ -0,0 +1,151 @@
function [data, clustPoints, idx, centers, slopes, lengths] = ...
generateData( ...
slope, ...
slopeStd, ...
numClusts, ...
xClustAvgSep, ...
yClustAvgSep, ...
lengthAvg, ...
lengthStd, ...
lateralStd, ...
totalPoints ...
)
% GENERATEDATA Generates 2D data for clustering; data is created along
% straight lines, which can be more or less parallel depending
% on slopeStd argument.
%
% [data clustPoints idx centers slopes lengths] =
% GENERATEDATA(slope, slopeStd, numClusts, xClustAvgSep, yClustAvgSep, ...
% lengthAvg, lengthStd, lateralStd, totalPoints)
%
% Inputs:
% slope - Base direction of the lines on which clusters are based.
% slopeStd - Standard deviation of the slope; used to obtain a random
% slope variation from the normal distribution, which is
% added to the base slope in order to obtain the final slope
% of each cluster.
% numClusts - Number of clusters (and therefore of lines) to generate.
% xClustAvgSep - Average separation of line centers along the X axis.
% yClustAvgSep - Average separation of line centers along the Y axis.
% lengthAvg - The base length of lines on which clusters are based.
% lengthStd - Standard deviation of line length; used to obtain a random
% length variation from the normal distribution, which is
% added to the base length in order to obtain the final
% length of each line.
% lateralStd - "Cluster fatness", i.e., the standard deviation of the
% distance from each point to the respective line, in both x
% and y directions; this distance is obtained from the
% normal distribution.
% totalPoints - Total points in generated data (will be
% randomly divided among clusters).
%
% Outputs:
% data - Matrix (totalPoints x 2) with the generated data
% clustPoints - Vector (numClusts x 1) containing number of points in each
% cluster
% idx - Vector (totalPoints x 1) containing the cluster indices of
% each point
% centers - Matrix (numClusts x 2) containing centers from where
% clusters were generated
% slopes - Vector (numClusts x 1) containing the effective slopes
% used to generate clusters
% lengths - Vector (numClusts x 1) containing the effective lengths
% used to generate clusters
%
% ----------------------------------------------------------
% Usage example:
%
% [data cp idx] = GENERATEDATA(1, 0.5, 5, 15, 15, 5, 1, 2, 200);
%
% This creates 5 clusters with a total of 200 points, with a base slope
% of 1 (std=0.5), separated on average by 15 units in both x and y
% directions, with an average length of 5 units (std=1) and a "fatness" or
% spread of 2 units.
%
% To take a quick look at the clusters, run:
%
% scatter(data(:,1), data(:,2), 8, idx);

% N. Fachada
% Instituto Superior Técnico, Lisboa, Portugal

% Make sure totalPoints >= numClusts
if totalPoints < numClusts
    error('Number of points must be equal to or larger than the number of clusters.');
end;

% Determine number of points in each cluster
clustPoints = abs(randn(numClusts, 1));
clustPoints = clustPoints / sum(clustPoints);
clustPoints = round(clustPoints * totalPoints);

% Make sure totalPoints is respected
while sum(clustPoints) < totalPoints
    % If points are missing, add one to the smallest cluster
    [C,I] = min(clustPoints);
    clustPoints(I(1)) = C + 1;
end;
while sum(clustPoints) > totalPoints
    % If there are extra points, remove one from the largest cluster
    [C,I] = max(clustPoints);
    clustPoints(I(1)) = C - 1;
end;

% Make sure there are no empty clusters
emptyClusts = find(clustPoints == 0);
if ~isempty(emptyClusts)
    % If there are empty clusters...
    numEmptyClusts = size(emptyClusts, 1);
    for i=1:numEmptyClusts
        % ...get a point from the largest cluster and assign it to the
        % empty cluster
        [C,I] = max(clustPoints);
        clustPoints(I(1)) = C - 1;
        clustPoints(emptyClusts(i)) = 1;
    end;
end;

% Initialize data matrix
data = zeros(sum(clustPoints), 2);

% Initialize idx (vector containing the cluster indices of each point)
idx = zeros(totalPoints, 1);

% Initialize lengths vector
lengths = zeros(numClusts, 1);

% Determine cluster centers
xCenters = xClustAvgSep * numClusts * (rand(numClusts, 1) - 0.5);
yCenters = yClustAvgSep * numClusts * (rand(numClusts, 1) - 0.5);
centers = [xCenters yCenters];

% Determine cluster slopes
slopes = slope + slopeStd * randn(numClusts, 1);

% Create clusters
for i=1:numClusts
    % Determine length of line where this cluster will be based
    lengths(i) = abs(lengthAvg + lengthStd*randn);
    % Determine how many points have been assigned to previous clusters
    sumClustPoints = 0;
    if i > 1
        sumClustPoints = sum(clustPoints(1:(i - 1)));
    end;
    % Create points for this cluster
    for j=1:clustPoints(i)
        % Determine where in the line the next point will be projected
        position = lengths(i) * rand - lengths(i) / 2;
        % Determine x coordinate of point projection
        delta_x = cos(atan(slopes(i))) * position;
        % Determine y coordinate of point projection
        delta_y = delta_x * slopes(i);
        % Get point distance from line in x coordinate
        delta_x = delta_x + lateralStd * randn;
        % Get point distance from line in y coordinate
        delta_y = delta_y + lateralStd * randn;
        % Determine the actual point
        data(sumClustPoints + j, :) = [(xCenters(i) + delta_x) (yCenters(i) + delta_y)];
    end;
    % Update idx
    idx(sumClustPoints + 1 : sumClustPoints + clustPoints(i)) = i;
end;
24 changes: 24 additions & 0 deletions license.txt
@@ -0,0 +1,24 @@
Copyright (c) 2012, Nuno Fachada
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
