Feb 24
Java: Power set and set partitioning using bitwise operations
Mikkel Meyer Andersen | February 24, 2011 at 18:32 (UTC)

Two tasks are quite common, especially when writing math programs: computing the power set of a set, and partitioning a set into two disjoint parts.

These two operations are of course interesting for Set<T> and cousins, but from a mathematical point of view, it is sufficient to solve this for index sets, i.e. sets {0, 1, ..., n - 1}. For List<T> and cousins (allowing duplicates), it can be a slightly different story depending on your view on the duplicates. So the rest of this post deals with index sets of size n, i.e. {0, 1, ..., n - 1}.
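To illustrate why index sets are sufficient, here is a minimal sketch of how a subset of indices could be mapped back to the elements of a List<T>. The class IndexSetMapping and its select method are hypothetical helpers for this illustration, not part of the code in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original Utils class.
class IndexSetMapping {
    /** Picks the elements of items whose positions are listed in indices. */
    static <T> List<T> select(List<T> items, int[] indices) {
        List<T> subset = new ArrayList<>(indices.length);
        for (int index : indices) {
            subset.add(items.get(index));
        }
        return subset;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c");
        // {2, 0} is one subset of the index set {0, 1, 2}
        System.out.println(select(items, new int[] {2, 0})); // prints [c, a]
    }
}
```

Any int[] produced for the index set can be passed through such a helper, so the combinatorics only has to be solved once.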

Especially for power sets, numerous implementations use strings, which I do not fancy. I looked for implementations that avoid strings and rely only on bitwise operations (which is also what the string-based implementations do underneath; they just bring strings into the picture, maybe because it makes the programming a bit easier).
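The core idea of the bitwise approach can be sketched as follows: every integer from 0 to 2^n - 1 encodes one subset, where a set bit means that the corresponding element is included. This is only an illustration under some assumptions: the class and method names are hypothetical, bit b is mapped to element b (a different convention from the code in this post), and Integer.bitCount replaces the explicit counting loop.

```java
import java.util.Arrays;

// Hypothetical illustration, not the implementation from this post.
class BitmaskSubsets {
    /**
     * Returns the elements b in {0, ..., n - 1} whose bit b is set in mask,
     * in ascending order. Assumes 1 <= n <= 30.
     */
    static int[] subsetOf(int mask, int n) {
        // Only the lowest n bits count towards the subset size
        int[] set = new int[Integer.bitCount(mask & ((1 << n) - 1))];
        int next = 0;
        for (int b = 0; b < n; ++b) {
            if ((mask & (1 << b)) != 0) {
                set[next++] = b;
            }
        }
        return set;
    }

    public static void main(String[] args) {
        int n = 3;
        // Masks 0 .. 2^n - 1 enumerate all subsets of {0, 1, 2}
        for (int mask = 0; mask < (1 << n); ++mask) {
            System.out.println(mask + " -> " + Arrays.toString(subsetOf(mask, n)));
        }
    }
}
```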

If you use some of this code in commercial software, you have to contact me first and make an agreement of usage. If your use is not commercial, feel free to use the code as long as you don't take credit for it yourself.

My implementations of the operations are shown below. Please contact me for questions or comments.

/**
 * Get the power set of {0, 1, ..., n - 1}.
 * 
 * @param n the size of the index set {0, 1, ..., n - 1} to get the power set of (1 <= n <= 30)
 * @return the power set, with the empty set as the first element and the full set as the last
 * @author Mikkel Meyer Andersen, scienco [at] mikl [dot] dk
 */
public static int[][] getPowerset(int n) {
    if (n <= 0) {
        throw new IllegalArgumentException("n must be at least 1");
    }
 
    // stop at (2^n) - 1, so calculate 2^n
    final int stop = 1 << n;
 
    int i0 = 0, i1;
    int[][] powerset = new int[stop][];
 
    // i = 0 encodes the empty set, which belongs to the power set, so start at i = 0
    for (int i = 0; i < stop; ++i) {
        int j = i;
        int k = i;
 
        int size = 0;
 
        // First calculate size of the array needed
        for (int r = n - 1; r >= 0; --r) {
            if ((j & 1) == 1) {
                size++;
            }
 
            j >>>= 1;
        }
 
        i1 = 0;
        int[] set = new int[size];
 
        for (int r = n - 1; r >= 0; --r) {
            if ((k & 1) == 1) {
                set[i1++] = r;
            }
 
            k >>>= 1;
        }
 
        powerset[i0++] = set;
    }
 
    return powerset;
}
 
/**
 * Partition an index set {0, 1, ..., n - 1} into two non-empty parts A and B
 * such that A union B is the whole set and A and B are disjoint. If you use
 * this in commercial software, do contact me first. If not, feel free to use
 * this code as long as you don't take credit for it yourself.
 * 
 * @param n the size of the index set {0, 1, ..., n - 1} to get pairs of
 * @return the pairs; each unordered pair {A, B} appears exactly once
 * @author Mikkel Meyer Andersen, scienco [at] mikl [dot] dk
 */
public static int[][][] getPairs(int n) {
    if (n <= 1) {
        throw new IllegalArgumentException(
            "n must be at least 2 (it takes at least two elements to create a pair)");
    }
 
    // stop at 2^(n-1) - 1, so calculate 2^(n-1)
    final int stop = 1 << (n - 1);
 
    int i0 = 0, i1, i2;
    int[][][] pairs = new int[stop - 1][][];
 
    // Starting at i = 0 gives an empty first pair, so start at i = 1
    for (int i = 1; i < stop; ++i) {
        int j = i;
        int k = i;
 
        int size = 0;
 
        // First calculate size of the array needed
        for (int r = n - 1; r >= 0; --r) {
            if ((j & 1) == 1) {
                size++;
            }
 
            j >>>= 1;
        }
 
        i1 = 0;
        i2 = 0;
        int[] set1 = new int[size];
        int[] set2 = new int[n - size];
 
        for (int r = n - 1; r >= 0; --r) {
            if ((k & 1) == 1) {
                set1[i1++] = r;
            } else {
                set2[i2++] = r;
            }
 
            k >>>= 1;
        }
 
        pairs[i0++] = new int[][] {
            set1, set2
        };
    }
 
    return pairs;
}
 
/**
 * Formats the power set of {0, 1, ..., n - 1} obtained by {@link #getPowerset}
 * as a nice-looking string, one subset per line.
 * 
 * @param powerset the powerset to format nicely
 * @author Mikkel Meyer Andersen, scienco [at] mikl [dot] dk
 */
public static String powersetToString(int[][] powerset) {
    StringBuilder sb = new StringBuilder();
 
    for (int i = 0; i < powerset.length; ++i) {
        int[] set = powerset[i];
 
        for (int j = 0; j < set.length; ++j) {
            sb.append(set[j]);
 
            if (j < set.length - 1) {
                sb.append(", ");
            }
        }
 
        if (i < powerset.length - 1) {
            sb.append("\n");
        }
    }
 
    return sb.toString();
}
 
/**
 * Formats the different pairs partitioning the index set {0, 1, ..., n - 1}
 * obtained by {@link #getPairs} as a nice-looking string, one pair per line.
 * 
 * @param pairs the pairs to format nicely
 * @author Mikkel Meyer Andersen, scienco [at] mikl [dot] dk
 */
public static String pairsToString(int[][][] pairs) {
    StringBuilder sb = new StringBuilder();
 
    for (int i = 0; i < pairs.length; ++i) {
        int[] set1 = pairs[i][0];
        int[] set2 = pairs[i][1];
 
        sb.append("(");
 
        for (int j = 0; j < set1.length; ++j) {
            sb.append(set1[j]);
 
            if (j < set1.length - 1) {
                sb.append(", ");
            }
        }
 
        sb.append("), (");
 
        for (int j = 0; j < set2.length; ++j) {
            sb.append(set2[j]);
 
            if (j < set2.length - 1) {
                sb.append(", ");
            }
        }
 
        sb.append(")");
 
        if (i < pairs.length - 1) {
            sb.append("\n");
        }
    }
 
    return sb.toString();
}

The following example shows the usage:

int[][][] pairs = Utils.getPairs(4);
System.out.println(Utils.pairsToString(pairs));
 
int[][] powerset = Utils.getPowerset(4);
System.out.println(Utils.powersetToString(powerset));

The output is:

(3), (2, 1, 0)
(2), (3, 1, 0)
(3, 2), (1, 0)
(1), (3, 2, 0)
(3, 1), (2, 0)
(2, 1), (3, 0)
(3, 2, 1), (0)
 
3
2
3, 2
1
3, 1
2, 1
3, 2, 1
0
3, 0
2, 0
3, 2, 0
1, 0
3, 1, 0
2, 1, 0
3, 2, 1, 0
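
As a sanity check on the output above, the number of lines follows from simple counting (a sketch, with n = 4 as in the example). Every non-empty proper subset A determines its complement B, and each unordered pair is then counted twice:

```latex
\begin{align*}
\#\{\text{subsets of } \{0, \dots, n-1\}\} &= 2^n = 2^4 = 16, \\
\#\{\text{unordered pairs } \{A, B\} \text{ with } A, B \neq \emptyset\}
  &= \frac{2^n - 2}{2} = 2^{n-1} - 1 = 2^3 - 1 = 7.
\end{align*}
```

This matches the 7 pairs and the 16 power set lines (the first of which is the empty line for the empty set).
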
Aug 28
qqmultinorm.R version 1.1 - more intelligent plot size
Mikkel Meyer Andersen | August 28, 2009 at 07:20 (UTC)

I've updated the qqmultinorm.R script a bit so that it's now capable of picking a better plot layout (less unused space).

The idea is to decide the side lengths (dimensions) of a rectangle when its area (the number of plots) is known, i.e. to optimise the dimensions of a rectangle for a given area. We want the perimeter to be as small as possible (for better viewing), preferably wider than tall if a square isn't possible. A given number n is factorised into pairs of numbers whose product is within a given error of n. For example, if the allowed error is 2, then 7 can be factorised as (1, 7), (2, 4), (4, 2), and (7, 1), among others (redundant factorisations are included here). Although 2 * 4 = 8 rather than 7, abs(7 - 8) = 1 <= 2, so it is acceptable. The default error is 10% of the input area.

The proper factorisation is chosen by ordering the pairs by ascending difference of the components, i.e. by how much the width and height differ, and then by ascending height, so that the plot becomes widescreen-like rather than poster-like.
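
The selection rule described above can be sketched as follows. This is a hedged Java port of the idea and not the R function itself: the class GridSize and its choose method are hypothetical names, and the allowed error is passed in explicitly instead of being derived as 10% of n.

```java
// Hypothetical sketch of the grid selection rule, not the original R code.
class GridSize {
    /**
     * Returns {rows, cols} for n plots, allowing at most maxError unused cells.
     * Candidates are ranked by smallest |rows - cols| first, then by fewest rows.
     */
    static int[] choose(int n, int maxError) {
        int w0 = (int) Math.ceil(Math.sqrt(n));
        int[] best = {w0, w0};            // fallback: a square grid always fits
        int bestDiff = Integer.MAX_VALUE;
        for (int w = w0; w >= 1; --w) {
            int h = (n + w - 1) / w;      // ceil(n / w), so h * w >= n
            if (h * w > n + maxError) {
                continue;                 // too much unused space
            }
            int diff = Math.abs(w - h);
            if (diff < bestDiff || (diff == bestDiff && h < best[0])) {
                best = new int[] {h, w};
                bestDiff = diff;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 28 plots with an allowed error of 3 (roughly 10% of 28)
        int[] grid = choose(28, 3);
        System.out.println(grid[0] + " x " + grid[1]); // prints 5 x 6
    }
}
```

With 28 plots, both 5 x 6 and 6 x 5 waste two cells and differ by one, so the tie is broken in favour of fewer rows.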

The new script is as follows (only find.optimal.mfrow.size(...) and a bit of logic in qqmultinorm(...) were added):

# File name: qqmultinorm.R
# Version: 1.1
# Last updated: 2009-08-28
#
# This R-code is made by:
# Mikkel Meyer Andersen, Denmark
# mikl [funny-a] math [.] aau [.] dk or 
# mikl [funny-a] mikl [.] dk
#
# Licence: GPLv2
#
# Feel free to use it, but if you do I'd like to hear about it (just for fun).
# If you make corrections, please submit them back so others can enjoy them as well.
 
qqchisq <- function(y, main, df=2, continuity.correction = 0.5)
{
  n <- length(y)
  y <- sort(y)
  c <- numeric(n)
 
  for (i in 1:n)
    c[i] <- qchisq((i-continuity.correction)/n, df=df)
 
  plot(c, y, xlab="Theoretical Quantiles", ylab="Sample Quantiles", main=main)
  lines(c(c[1], c[n]), c(c[1], c[n]), type="l")
}
 
dec2bin <- function(x)
{
  if (!is.vector(x) || length(x) != 1 || x < 0)
    stop("x must be a non-negative integer")
 
  N <- length(x)
  ndigits <- floor(log2(x)) + 1
  bin <- numeric(ndigits)
 
  for (i in (ndigits-1):0)
  {
    tmp <- 2^i
 
    if (x %/% tmp >= 1)
    {
      bin[i+1] <- 1
      x <- x - tmp
    }
  }
 
  return(rev(bin))
}
 
# Returns the power set without the empty set
power.set <- function(v)
{
  n <- length(v)  
  N <- 2^n - 1
  ps <- vector("list", N)
 
  # seq_len avoids the backwards sequence 1:0 when v has a single element
  for (i in seq_len(N - 1))
  {
    Nbin <- dec2bin(i)
    Nbin <- c(numeric(n-length(Nbin)), Nbin)
    Nbin <- rev(Nbin)
    ps[[i]] <- v[which(Nbin == 1)]
  }
 
  ps[[N]] <- v
 
  return(ps)
}
 
# Input: n
#  - here the number of plots
# Returns c(h, w)
#  - the optimal choice of rows and cols to use in mfrow
find.optimal.mfrow.size <- function(n)
{  
  w0 <- ceiling(sqrt(n))
 
  # The grid may contain at most 10% unused space
  max.error <- round(n*0.1)
 
  n.minus.error <- n - max.error
  n.plus.error <- n + max.error
 
  # If n is a square, fine!
  if (w0^2 == n)
    return(c(w0, w0))
 
  # Col 1 and 2: h and w (rows and columns)
  # Col 3: The difference between w and h: this should be as small as possible
  candidates <- matrix(ncol=3)
 
  for (w in w0:1)  
  {    
    h <- ceiling(n / w)
    n0 <- w*h
 
    # Because of ceiling we know that n0 >= n
    if (n0 <= n.plus.error)
      candidates <- rbind(candidates, c(h, w, abs(w-h)))
  }
 
  # First row is NA; drop = FALSE keeps a matrix even if only one candidate is left
  candidates <- candidates[-1, , drop = FALSE]
 
  # Oops, something went wrong - well, don't panic
  if (nrow(candidates) == 0)
    return(c(w0, w0))
 
  # First order by abs(w-h) and then by h to get a widescreen-look 
  # instead of a poster-look
  candidates <- candidates[order(candidates[,3], candidates[,1]), , drop = FALSE]
 
  return(candidates[1, c(1,2)])
}
 
# dataset: variables in columns and observations as rows
# subset.min.size, subset.max.size: inclusive limits
# filename: if specified, the plots are saved as a png file with this filename
qqmultinorm <- function(dataset, subset.min.size = 1, subset.max.size = 4, filename = NULL, use.optimale.size = F)
{
  p <- ncol(dataset)
  n <- nrow(dataset)
 
  if (subset.min.size < 1) stop("subset.min.size < 1")
  if (subset.min.size > p) stop("subset.min.size > p")
  if (subset.max.size < 1) stop("subset.max.size < 1")
  if (subset.max.size > p) stop("subset.max.size > p")
  if (subset.min.size > subset.max.size) stop("subset.min.size > subset.max.size")
 
  if (is.null(colnames(dataset)))
    colnames(dataset) <- 1:p
 
  # We have p variables. If all is to be checked against each other,
  # then we have a power-set with 2^p subsets (including the empty set)
 
  # Here we get subset containing indexes of the variables to include
  # Note that power.set doesn't include the empty set.
  subsets <- power.set(1:p)
  subsets.len <- length(subsets)
 
  # To get the plots with the fewest variables first, we do a little trick:
  # While we find out which plots to include, we build a list
  # where each element of the list holds the indexes of the subsets of one size,
  # and the index of the element corresponds to the size of the subset.
  # (The +1 is because the limits are inclusive!)
  s.included <- vector("list", subset.max.size - subset.min.size + 1)
 
  plots <- 0
 
  for (i in 1:subsets.len)
  {
    s <- subsets[[i]]
    s.len <- length(s)
 
    if (s.len >= subset.min.size && s.len <= subset.max.size)
    {
      plots <- plots + 1
      s.included[[s.len - subset.min.size + 1]] <- c(s.included[[s.len - subset.min.size + 1]], i)
    }
  }
 
  # Now it's possible to build the subset index vector; 
  # we disregard the size of each subset; no more need to know it.
  s.indexes <- c()
 
  for (s in s.included)
  {
    s.indexes <- c(s.indexes, s)
  }
 
  # We want the best view  
  plot.per.row <- ceiling(plots^(1/2))
  plot.per.column <- ceiling(plots^(1/2))
 
  if (use.optimale.size)
  {
    mfrow.parameters <- find.optimal.mfrow.size(plots)
    plot.per.row <- mfrow.parameters[1]
    plot.per.column <- mfrow.parameters[2]
  }
 
  plot.width <- 300
  plot.height <- 200
 
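  # mfrow = c(rows, columns): plot.per.row holds the number of rows and
  # plot.per.column the number of columns, so the width scales with the columns
  if (!is.null(filename))
    png(file=paste(filename, ".png", sep=""), bg="white", width = plot.per.column * plot.width, height = plot.per.row * plot.height)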
 
  par(mfrow = c(plot.per.row, plot.per.column))  
 
  # There's no need to calculate a whole lot several times:
  ybar <- as.vector(colMeans(dataset))
 
  S <- as.matrix(var(dataset))
 
  current <- 1  
  for (i in s.indexes)
  {
    s <- subsets[[i]]
    s.len <- length(s)
 
    if (s.len > subset.max.size)
      next
 
    cat("Processing subset no.", current, "out of", plots, "\n")
 
    # Container for our values
    squared.dist <- numeric(n)
 
    # qr.solve(A) = A^(-1)  
    Sinv <- qr.solve(S[s,s])
 
    # Then calculate the squared distance for each datapoint
    for (i in 1:n)
    {
      c <- dataset[i,s] - ybar[s]
      squared.dist[i] <- t(c) %*% Sinv %*% c
    }
 
    # Finding the order statistic
    squared.dist <- sort(squared.dist)
 
    qqchisq(squared.dist, paste(colnames(dataset)[s], collapse=", "), s.len)
 
    current <- current + 1
  }
 
  if (!is.null(filename))
    dev.off()
}
 
# Example:
A <- matrix(rnorm(2000, mean=3, sd=2), ncol=8)
qqmultinorm(A, 2, 2, "multinorm-optimal.size", use.optimale.size = T)
qqmultinorm(A, 2, 2, "multinorm", use.optimale.size = F)
Aug 27
qqmultinorm.R - evaluating the normality of a sample
Mikkel Meyer Andersen | August 27, 2009 at 11:44 (UTC)

I wrote an R script that can be used to evaluate a sample for multivariate normality. It's a kind of generalisation of qqnorm: for each observation it uses the squared statistical distance to the sample mean, which under normality is approximately chi-squared distributed with degrees of freedom equal to the number of variables included.
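
In formulas, the quantities being plotted can be sketched as follows. The notation is an assumption made for this illustration: m is the number of variables in the chosen subset, y_i is the i-th observation restricted to those variables, and \bar{y} and S are the corresponding sample mean and sample covariance matrix. The plotting positions use the continuity correction 0.5 that is the default in qqchisq.

```latex
\begin{align*}
d_i^2 &= (y_i - \bar{y})^\top S^{-1} (y_i - \bar{y}), \qquad i = 1, \dots, n, \\
d_i^2 &\overset{\text{approx.}}{\sim} \chi^2_m \quad \text{under multivariate normality}, \\
q_i &= F^{-1}_{\chi^2_m}\!\left(\frac{i - 0.5}{n}\right), \qquad
\text{plot } \bigl(q_i,\; d^2_{(i)}\bigr) \text{ with } d^2_{(1)} \le \dots \le d^2_{(n)}.
\end{align*}
```

If the sample is multivariate normal, the points should lie close to the identity line drawn in the plot.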

The useful thing about this script is that it's able to plot all possible combinations of the variables. It's possible to specify the minimum and the maximum number of variables to compare. All possible combinations are found by calculating the power set. This uses the binary representation from the decimal-to-binary conversion dec2bin: if a set has k elements, then its power set has 2^k elements, and the binary representations of the numbers from 1 to 2^k - 1 pick out the non-empty subsets.
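
To get a feeling for how many plots a given choice of limits produces, the subsets can be counted with the same binary trick. The following is a small Java sketch under stated assumptions (the class SubsetCount and the method countPlots are hypothetical names; the script itself is written in R and counts the plots while filtering the power set):

```java
// Hypothetical counting sketch, not part of qqmultinorm.R.
class SubsetCount {
    /**
     * Counts the non-empty subsets of p variables whose size lies in
     * [minSize, maxSize]. Assumes 1 <= p <= 30.
     */
    static int countPlots(int p, int minSize, int maxSize) {
        int count = 0;
        // Masks 1 .. 2^p - 1 enumerate every non-empty subset exactly once
        for (int mask = 1; mask < (1 << p); ++mask) {
            int size = Integer.bitCount(mask);
            if (size >= minSize && size <= maxSize) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 8 variables, subsets of size 1 to 3: 8 + 28 + 56 plots
        System.out.println(countPlots(8, 1, 3)); // prints 92
    }
}
```

With 8 variables and subset sizes from 1 to 3, as in the example at the end of the script, this gives 92 plots.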

Please notice that it's possible to specify a filename so that the plots are written to a PNG file. This is by far the easiest option, and in practice the only one, if there are more than a few plots.

I haven't written much documentation besides this, but feel free to ask if you are in doubt about anything!

# File name: qqmultinorm.R
#
# This R-code is made by:
# Mikkel Meyer Andersen, Denmark
# mikl [funny-a] math [.] aau [.] dk or 
# mikl [funny-a] mikl [.] dk
#
# Licence: GPLv2
#
# Feel free to use it, but if you do I'd like to hear about it (just for fun).
# If you make corrections, please submit them back so others can enjoy them as well.
 
qqchisq <- function(y, main, df=2, continuity.correction = 0.5)
{
  n <- length(y)
  y <- sort(y)
  c <- numeric(n)
 
  for (i in 1:n)
    c[i] <- qchisq((i-continuity.correction)/n, df=df)
 
  plot(c, y, xlab="Theoretical Quantiles", ylab="Sample Quantiles", main=main)
  lines(c(c[1], c[n]), c(c[1], c[n]), type="l")
}
 
dec2bin <- function(x)
{
  if (!is.vector(x) || length(x) != 1 || x < 0)
    stop("x must be a non-negative integer")
 
  N <- length(x)
  ndigits <- floor(log2(x)) + 1
  bin <- numeric(ndigits)
 
  for (i in (ndigits-1):0)
  {
    tmp <- 2^i
 
    if (x %/% tmp >= 1)
    {
      bin[i+1] <- 1
      x <- x - tmp
    }
  }
 
  return(rev(bin))
}
 
# Returns the power set without the empty set
power.set <- function(v)
{
  n <- length(v)  
  N <- 2^n - 1
  ps <- vector("list", N)
 
  # seq_len avoids the backwards sequence 1:0 when v has a single element
  for (i in seq_len(N - 1))
  {
    Nbin <- dec2bin(i)
    Nbin <- c(numeric(n-length(Nbin)), Nbin)
    Nbin <- rev(Nbin)
    ps[[i]] <- v[which(Nbin == 1)]
  }
 
  ps[[N]] <- v
 
  return(ps)
}
 
# dataset: variables in columns and observations as rows
# subset.min.size, subset.max.size: inclusive limits
# filename: if specified, the plots are saved as a png file with this filename
qqmultinorm <- function(dataset, subset.min.size = 1, subset.max.size = 4, filename = NULL)
{
  p <- ncol(dataset)
  n <- nrow(dataset)
 
  if (subset.min.size < 1) stop("subset.min.size < 1")
  if (subset.min.size > p) stop("subset.min.size > p")
  if (subset.max.size < 1) stop("subset.max.size < 1")
  if (subset.max.size > p) stop("subset.max.size > p")
  if (subset.min.size > subset.max.size) stop("subset.min.size > subset.max.size")
 
  if (is.null(colnames(dataset)))
    colnames(dataset) <- 1:p
 
  # We have p variables. If all is to be checked against each other,
  # then we have a power-set with 2^p subsets (including the empty set)
 
  # Here we get subset containing indexes of the variables to include
  # Note that power.set doesn't include the empty set.
  subsets <- power.set(1:p)
  subsets.len <- length(subsets)
 
  # To get the plots with the fewest variables first, we do a little trick:
  # While we find out which plots to include, we build a list
  # where each element of the list holds the indexes of the subsets of one size,
  # and the index of the element corresponds to the size of the subset.
  # (The +1 is because the limits are inclusive!)
  s.included <- vector("list", subset.max.size - subset.min.size + 1)
 
  plots <- 0
 
  for (i in 1:subsets.len)
  {
    s <- subsets[[i]]
    s.len <- length(s)
 
    if (s.len >= subset.min.size && s.len <= subset.max.size)
    {
      plots <- plots + 1
      s.included[[s.len - subset.min.size + 1]] <- c(s.included[[s.len - subset.min.size + 1]], i)
    }
  }
 
  # Now it's possible to build the subset index vector; 
  # we disregard the size of each subset; no more need to know it.
  s.indexes <- c()
 
  for (s in s.included)
  {
    s.indexes <- c(s.indexes, s)
  }
 
  # We want a squared view
  plot.per.row <- ceiling(plots^(1/2))
  plot.per.column <- ceiling(plots^(1/2))
  plot.width <- 200
  plot.height <- 200
 
  if (!is.null(filename))
    png(file=paste(filename, ".png", sep=""), bg="white", width = plot.per.row * plot.width, height = plot.per.column * plot.height)
 
  par(mfrow = c(plot.per.row, plot.per.column))  
 
  # There's no need to calculate a whole lot several times:
  ybar <- as.vector(colMeans(dataset))
 
  S <- as.matrix(var(dataset))
 
  current <- 1  
  for (i in s.indexes)
  {
    s <- subsets[[i]]
    s.len <- length(s)
 
    if (s.len > subset.max.size)
      next
 
    cat("Processing subset no.", current, "out of", plots, "\n")
 
    # Container for our values
    squared.dist <- numeric(n)
 
    # qr.solve(A) = A^(-1)  
    Sinv <- qr.solve(S[s,s])
 
    # Then calculate the squared distance for each datapoint
    for (i in 1:n)
    {
      c <- dataset[i,s] - ybar[s]
      squared.dist[i] <- t(c) %*% Sinv %*% c
    }
 
    # Finding the order statistic
    squared.dist <- sort(squared.dist)
 
    qqchisq(squared.dist, paste(colnames(dataset)[s], collapse=", "), s.len)
 
    current <- current + 1
  }
 
  if (!is.null(filename))
    dev.off()
}
 
# Example:
A <- matrix(rnorm(2000, mean=3, sd=2), ncol=8)
qqmultinorm(A, 1, 3, "multinorm-1-3")
#qqmultinorm(A, 4, 4, "multinorm-4")
#qqmultinorm(A, 5, 5, "multinorm-5")
#qqmultinorm(A, 6, 8, "multinorm-6-8")
[Figure: multinorm-1-3]

Oct 27

Although R supports a lot of different probability distributions, it does not support the scaled inverse chi-square distribution - well, not directly. But it can of course be implemented.
