In this post, I would like to start a topic about functional algorithm design. I will use OCaml language to implement the designs. Each post is about a “functional pearl” from the book “Pearls of Functional Algorithm Design” by Richard Bird.

Introduction

Given a set X of n natural numbers, we want to find the smallest natural number such that it is not in X. In practice, this problem can be considered as the following common programming task. There is a number of object ids and X is the set of objects currently in use. The task is to find some object not in use, say the one with the smallest id.

Our objective is to design an algorithm whose asymptotic complexity is linear, where n is the number of elements in X.

Non-linear Solution

We assume that X is a list without duplicates in no particular order. Otherwise, let’s say X is in increasing order, then the linear solution is trivial: simply look for the first gap and check the first element is different from 0. For example,

X = [1; 2; 3; 6; 9], then we first check 1 is different from 0, thus 0 is the smallest number we are looking for. Consider an other example, X = [0; 1; 2; 3; 6; 9], then we check the gap between two continuous elements from the beginning of the list. In this case, 6 - 3 > 1, thus 3+1 is the number we are looking for.

The implementation is given as follows.

(* l is in increasing order *)
let trivial_smallest l =
  let rec aux l res =
  match l with
  | [] -> res
  | x::[] -> aux [] (x + 1)
  | x1::x2::xs -> if x2 - x1 > 1 then (x1 + 1) else aux (x2::xs) res
  in 
  if (List.length l = 0 ) || (List.hd l <> 0) then 0
  else aux l 0;; 

Consider the following quadratic algorithm where filter : int list -> int list -> int list takes the list [0;1;...;n] and input list X and returns the list of elements in the first input list in which any occurrence of x in X is removed. Then, the number we are looking for is the head of the returned list. The implementation is given below.

(* use the List's mem and filter functions *)
let quadratic_smallest l =
  let n = List.length l in
  let rec loop i l =
    if i = 0 then (i::l)
    else
      loop (pred i) (i::l)
  in
  let pre_l = loop n [] in
  let filter l1 l2 =
    List.filter (fun x -> not (List.mem x l2)) l1
  in
  List.hd (filter pre_l l)
(* test function *)
let remove_duplicate l =
  let rec aux l res =
    match l with
    | [] -> List.rev res
    | x::xs -> let filtered = List.filter (fun y -> not (x = y)) xs in
      aux filtered (x::res)
  in
  aux l [];;
  
let create_input n =  
  let rec loop i l =
    if i = 0 then (i::l)
    else
      loop (pred i) (i::l)
  in
  loop n [];;

let test f n =
  let l = create_input n in
  let t = Sys.time() in
  let s = f l in
  Printf.printf "Execution time: %fs\n" (Sys.time() -. t);
  s

The funtion quadratic_smallest works, however whose asymptotic complexity is quadratic because List.mem and List.filter are where n is the length of the input list X.

One can think about the other solution is that we first sort the input list X, then employ the above linear trivial algorithm. However, the complexity of the sort algorithm is at least . Hence, this solution is . In the following, I will discuss two linear algorithms based on array data structure and divide and conquer.

The function test is used to estimate the execution time of the quadratic_smallest function. The following results show the runtime with different lengths of the input list X. For example, with n = 10000, the runtime is almost 1 second. If we increase the length 10 times, then the runtime is 101 seconds. Roundly speaking, the runtime is squared, .

test quadratic_smallest 10000;;
Execution time: 0.953102s
- : int = 10001

test quadratic_smallest 100000;;
Execution time: 100.936317s
- : int = 100001

An Array-based Solution

The key idea of the array-based and divide and conquer solutions is that not every natural number in the range [0...len] where len is the length of the input list X can be in X. Thus, the smallest number not in X is the number we are looking for. The array-based solution uses an array with n + 1 Boolean value (0: not present and 1: present) elements as a checklist of those numbers present in X. Then the index of the first 0 element in the checklist array is the finding smallest number. The array elements are initialized to 0.

The algorithm can be implemented with Array data structure in OCaml as follows. In which, we assume that the access function of array is constant.

(* returns a checklist array *)
let array_filter a l len =
  if (Array.length a <> (len + 1)) then
    failwith "Input is not consistent";
  let rec aux a l len =
  match l with
  | [] -> a
  | x::xs ->
    if x < 0 || x > len then
      aux a xs len
    else
      begin
        a.(x) <- 1;
	aux a xs len
      end
  in
  aux a l len
in
(* returns the first `0` index *)
let get_smallest a =
  let n = Array.length a in
  let rec loop i =
    if i = n then -1
    else if a.(i) = 0 then i
    else loop (succ i)
  in
  loop 0
in
let linear_smallest l =
  let n = List.length l in
  let a = Array.make (n + 1) 0 in
  get_smallest (array_filter a l n)

Because the array access function is , thus the asymptotic complexity of the array_filter function is where n is the length of the input list. The function get_smallest is where n is the length of the array. Hence, the complexity of the linear_smallest function is , in other words, it is linear on the length of the input list.

We run the function test with linear_smallest as parameter to see its execution time. The results show that if we increase the length of the input list 10 times (from 10000 to 100000), the the runtime is 0.01 compared to 0.001. Thus it is almost 10 time longer that reflects the linear complexity. We also see that the algorithm scales much better than the quadratic version. For instance, for the list of 100000000 elements, it takes only 17.5 seconds where the quadratic version seems to run forever (in fact, it is more than one day)! The experiments were run on a machine with an Intel Core i5 2.4 GHz processor and 8GB of RAM under Mac OS X 10.12.5.

test linear_smallest 10000;;
Execution time: 0.001119s
- : int = 10001

test linear_smallest 100000;;
Execution time: 0.010943s
- : int = 100001

test linear_smallest 100000000;;
Execution time: 17.463857s
- : int = 100000001

Divide and Conquer

As I’ve mentioned before, the key idea of divide and conquer solution is that not every natural number in the range [0...len] where len is the length of the input list X can be in X. We first observe the following properties of filter (denoted \\) and append (denoted @) functions. Recall that filter u v returns the first list u in which the occurrences of elements in the second list v are filtered out. append u v concatenates two lists.

(u @ v) \\ t = (u \\ t) @ (v \\ t)

u \\ (v @ t) = (u \\ v) \\ t

(u \\ v) \\ t = (u \\ t) \\ v

If we have u1, u2, v1, and v2 such that u1 \\ v2 = u1 and u2 \\ v1 = u2. In other words, u1 and v2 (respectively u2 and v1) are disjoint. Then, following the properties above, we get

(u1 @ u2) \\ (v1 @ v2) = (u1 \\ v1) @ (u2 \\ v2)

Given a natural number a and an input list X whose elements are not duplicated and bigger than a. For any natural number b that is strictly bigger than a, we have [a; ... ; a + len] \\ X = [a; ... ; b - 1] \\ X1 @ [b; ... ; a + len] \\ X2 where len is the length of X, X1 is the sublist of X containing all elements that are smaller than b, and X2 is the sublist of X containing all elements that are bigger than b, that is, X is divided into two partitions. Hence, the smallest number from a is List.hd [a; ... ; a + len] \\ X = List.hd ([a; ... ; b - 1] \\ X1 @ [b; ...; a + len] \\ X2). If [a; ... ; b - 1] \\ X1 is empty then the smallest number is List.hd [b; ... ; a + len] \\ X2). Otherwise, it is List.hd [a; ... ; b - 1] \\ X1. In case that X is empty, the smallest number is a.

The next questions are: Can we implement the checking that [a; ... ; b - 1] \\ X1 is empty with a linear runtime complexity algorithm (the direct computation, e.g., using List.mem, takes quadratic runtime in the length of X1? And how to choose b such that the recurrence relation of _divice and conquer__ gives us a linear solution?

Because X1 is a list of distinct natural numbers less than b and [a; ... ; b - 1] contains b - a numbers from a to b - 1. Hence, to check that [a; ... ; b - 1] \\ X1 is empty, we only need to compute the length of X1 and show that it equals to b - a. This algorithm takes linear runtime in the length of X1 in the worst-case. For the second question, if we choose b = a + 1 + len div 2 where len is the length of X, then we have length of the first partition is smaller or equal to len div 2, the length of the second partition is len - len div 2 - 1 and smaller than len div 2. Hence, we get the following recurrence relation:

T(len) = T(len div 2) + O(len)

By the Master Theorem, the solution is T(len) = O(len). The implementation using List.partition function is given as follows.

let rec smallest_from a l = 
  match l with
  | [] -> a
  | list -> 
    let b = a + 1 + List.length l / 2 in
    let x1, x2 = List.partition (fun y -> (y < b)) list in
    if (List.length x1 = b - a) then 
      smallest_from b x2
    else
      smallest_from a x1
in
let divide_smallest l = 
  smallest_from 0 l

We can avoid to compute the length of the input list in the function smallest_from by giving it as an argument like

let rec smallest_from a (len, l) = 
  if len = 0 then a
  else 
    begin
      let b = a + 1 + len / 2 in
      let x1, x2 = List.partition (fun y -> (y < b)) l in
      let len1 = List.length x1 in
      if len1 = (b - a) then 
        smallest_from b ((len - len1), x2)
      else
        smallest_from a (len1, x1)
    end
in
let divide_smallest l = 
  smallest_from 0 (List.length l, l)

As before we run the function test with divide_smallest as parameter to see its execution time. The results show that if we increase the length of the input list 10 times (from 1000000 to 10000000), the the runtime is 4.570747 compared to 0.461642. Thus it is almost 10 time longer that reflects the linear complexity. We also see that this algorithm scales much better than the quadratic version. For instance, for the list of 100000000 elements, it takes only 94.211405 seconds. It turns out that it is a bit less efficient than the array-based solution.

test divide_smallest 1000000;;
Execution time: 0.461642s
- : int = 10001

test divide_smallest 10000000;;
Execution time: 4.570747s
- : int = 100001

test divide_smallest 100000000;;
Execution time: 47.099375s
- : int = 100000001