On finding the union of sets in PHP

July 30th, 2010

How do we find the union of two sets in PHP? Suppose we have two sets, a = {1, 2, 5, 6}, and b = {1, 2, 4}, and we want to determine a ∪ b = {1, 2, 4, 5, 6}. PHP doesn’t really have a data type that represents a set, but we can instead use the array type. There are a couple of different ways we can achieve this:

1) Use array_merge and array_unique
The array_merge function will append the values of one array to the end of another. In our case, we will end up with some duplicate values as some elements of set b are also contained in set a. To filter out these duplicates, we can use the array_unique function.

$a = array(1, 2, 5, 6);
$b = array(1, 2, 4);
$union = array_unique(array_merge($a, $b));
print_r($union);
//Array
//(
//    [0] => 1
//    [1] => 2
//    [2] => 5
//    [3] => 6
//    [6] => 4
//)

Since a set is not an ordered list, this method produces the correct result.

2) Another way of calculating the union is to use the array union operator (+)
We need to be careful here, as the array union operator acts on the keys of an array, and not it’s values. In order to calculate the union of the array values, we can use the array_fill_keys function:

$a = array(1, 2, 5, 6);
$b = array(1, 2, 4);
$union = array_keys(array_fill_keys($a, true) + array_fill_keys($b, true));
print_r($union);
//Array
//(
//    [0] => 1
//    [1] => 2
//    [2] => 5
//    [3] => 6
//    [4] => 4
//)

Again, this technique produces the correct result.

So which is the best method to use? If you’re only looking at small sets such as the above, I don’t suppose it makes much difference. However for larger sets, array_merge followed by array_unique is by a long way the slower of the two methods. Try this simple test to see the difference:

  // Set up two arrays, each with 10,000 (not necessarily unique) values
  $a = array();
  $b = array();
  for ($i=0; $i < 10000; $i++)
  {
    $a[] = rand(1, 50000);
    $b[] = rand(1, 50000);
  }

  $startTime = microtime(true);
  $union = array_unique(array_merge($a, $b));
  $arrayMergeDuration = microtime(true) - $startTime;

  $startTime = microtime(true);
  $union = array_keys(array_fill_keys($a, true) + array_fill_keys($b, true));
  $arrayFillDuration = microtime(true) - $startTime;

  echo <<<eot
array_merge method took $arrayMergeDuration s
array_fill_keys method took $arrayFillDuration s

eot;

// Output from my test machine:
array_merge method took 0.330797195435 s
array_fill_keys method took 0.0230300426483 s

This simple test indicates that the array_fill_keys technique is about 14 times quicker when working with arrays of non-trivial size. Obviously, when dealing with large sets, you should consider using a tool custom built for the job (ie a relational database), but there may be some cases where this could come in handy. YMMV.

Leave a Reply