In this post, I would like to start a topic about functional algorithm design. I will use OCaml language to implement the designs. Each post is about a “functional pearl” from the book “Pearls of Functional Algorithm Design” by Richard Bird.
Introduction
Given a set X
of n
natural numbers, we want to find the smallest natural number such that it is not in X
. In practice, this problem can
be considered as the following common programming task. There is a number of object ids and X
is the set of objects currently in use. The task
is to find some object not in use, say the one with the smallest id.
Our objective is to design an algorithm whose asymptotic complexity is linear, \(O(n)\) where n
is the number of elements in X
.
Non-linear Solution
We assume that X
is a list without duplicates in no particular order. Otherwise, let’s say X
is in increasing order, then the linear solution is trivial: simply
look for the first gap and check the first element is different from 0
. For example,
X
= [1; 2; 3; 6; 9], then we first check 1
is different from 0
, thus 0
is the smallest number we are looking for.
Consider an other example, X
= [0; 1; 2; 3; 6; 9], then we check the gap between two continuous elements from the beginning of the list. In this case,
6 - 3 > 1
, thus 3+1
is the number we are looking for.
The implementation is given as follows.
(* l is in increasing order *)
let trivial_smallest l =
let rec aux l res =
match l with
| [] -> res
| x::[] -> aux [] (x + 1)
| x1::x2::xs -> if x2 - x1 > 1 then (x1 + 1) else aux (x2::xs) res
in
if (List.length l = 0 ) || (List.hd l <> 0) then 0
else aux l 0;;
Consider the following quadratic algorithm where filter : int list -> int list -> int list
takes the list [0;1;...;n]
and input list X
and
returns the list of elements in the first input list in which any occurrence of x
in X
is removed. Then, the number we are looking for is the head of
the returned list. The implementation is given below.
(* use the List's mem and filter functions *)
let quadratic_smallest l =
let n = List.length l in
let rec loop i l =
if i = 0 then (i::l)
else
loop (pred i) (i::l)
in
let pre_l = loop n [] in
let filter l1 l2 =
List.filter (fun x -> not (List.mem x l2)) l1
in
List.hd (filter pre_l l)
(* test function *)
let remove_duplicate l =
let rec aux l res =
match l with
| [] -> List.rev res
| x::xs -> let filtered = List.filter (fun y -> not (x = y)) xs in
aux filtered (x::res)
in
aux l [];;
let create_input n =
let rec loop i l =
if i = 0 then (i::l)
else
loop (pred i) (i::l)
in
loop n [];;
let test f n =
let l = create_input n in
let t = Sys.time() in
let s = f l in
Printf.printf "Execution time: %fs\n" (Sys.time() -. t);
s
The funtion quadratic_smallest
works, however whose asymptotic complexity is quadratic \(O(n^2)\) because List.mem
and List.filter
are \(O(n)\) where n
is the length of the input list X
.
One can think about the other solution is that we first sort the input list X
, then employ the above linear trivial algorithm. However, the complexity
of the sort algorithm is at least \(O(nlogn)\). Hence, this solution is \(O(nlogn)\). In the following, I will discuss two linear algorithms based on array data structure and divide and conquer.
The function test
is used to estimate the execution time of the quadratic_smallest
function. The following results show the runtime with different lengths of the input list X
. For example, with n = 10000
, the runtime is almost 1
second. If we increase the length 10
times, then the runtime is 101
seconds. Roundly speaking, the runtime is squared, \(10^2 = 100\).
test quadratic_smallest 10000;;
Execution time: 0.953102s
- : int = 10001
test quadratic_smallest 100000;;
Execution time: 100.936317s
- : int = 100001
An Array-based Solution
The key idea of the array-based and divide and conquer solutions is that not every natural number in the range [0...len]
where len
is the length of the input list X
can be in X
. Thus, the smallest number not in X
is the number we are looking for. The array-based solution uses an array with n + 1
Boolean value (0: not present and 1: present) elements as a checklist of those numbers present in X
. Then the index of the first 0
element in the checklist array is the finding smallest number. The array elements are initialized to 0
.
The algorithm can be implemented with Array
data structure in OCaml as follows. In which, we assume that the access function of array is constant.
(* returns a checklist array *)
let array_filter a l len =
if (Array.length a <> (len + 1)) then
failwith "Input is not consistent";
let rec aux a l len =
match l with
| [] -> a
| x::xs ->
if x < 0 || x > len then
aux a xs len
else
begin
a.(x) <- 1;
aux a xs len
end
in
aux a l len
in
(* returns the first `0` index *)
let get_smallest a =
let n = Array.length a in
let rec loop i =
if i = n then -1
else if a.(i) = 0 then i
else loop (succ i)
in
loop 0
in
let linear_smallest l =
let n = List.length l in
let a = Array.make (n + 1) 0 in
get_smallest (array_filter a l n)
Because the array access function is \(O(1)\), thus the asymptotic complexity of the array_filter
function is \(O(n)\) where n
is the length of the input list. The function get_smallest
is \(O(n)\) where n
is the length of the array. Hence, the complexity of the linear_smallest
function is \(O(n)\), in other words, it is linear on the length of the input list.
We run the function test
with linear_smallest
as parameter to see its execution time. The results show that if we increase the length of the input list 10
times (from 10000
to 100000
), the the runtime is 0.01
compared to 0.001
. Thus it is almost 10
time longer that reflects the linear complexity.
We also see that the algorithm scales much better than the quadratic version. For instance, for the list of 100000000
elements, it takes only 17.5
seconds where the quadratic version seems to run forever (in fact, it is more than one day)! The experiments were run on a machine with an Intel Core i5 2.4 GHz
processor and 8GB
of RAM under Mac OS X 10.12.5
.
test linear_smallest 10000;;
Execution time: 0.001119s
- : int = 10001
test linear_smallest 100000;;
Execution time: 0.010943s
- : int = 100001
test linear_smallest 100000000;;
Execution time: 17.463857s
- : int = 100000001
Divide and Conquer
As I’ve mentioned before, the key idea of divide and conquer solution is that not every natural number in the range [0...len]
where len
is the length of the input list X
can be in X
. We first observe the following properties of filter
(denoted \\
) and append
(denoted @
) functions. Recall that filter u v
returns the first list u
in which the occurrences of elements in the second list v
are filtered out. append u v
concatenates two lists.
(u @ v) \\ t = (u \\ t) @ (v \\ t)
u \\ (v @ t) = (u \\ v) \\ t
(u \\ v) \\ t = (u \\ t) \\ v
If we have u1
, u2
, v1
, and v2
such that u1 \\ v2 = u1
and u2 \\ v1 = u2
. In other words, u1
and v2
(respectively u2
and v1
) are disjoint. Then, following the properties above, we get
(u1 @ u2) \\ (v1 @ v2) = (u1 \\ v1) @ (u2 \\ v2)
Given a natural number a
and an input list X
whose elements are not duplicated and bigger than a
. For any natural number b
that is strictly bigger than a
, we have [a; ... ; a + len] \\ X = [a; ... ; b - 1] \\ X1 @ [b; ... ; a + len] \\ X2
where len
is the length of X
, X1
is the sublist of X
containing all elements that are smaller than b
, and X2
is the sublist of X
containing all elements that are bigger than b
, that is, X
is divided into two partitions. Hence, the smallest number from a
is List.hd [a; ... ; a + len] \\ X = List.hd ([a; ... ; b - 1] \\ X1 @ [b; ...; a + len] \\ X2)
. If [a; ... ; b - 1] \\ X1
is empty then the smallest number is List.hd [b; ... ; a + len] \\ X2)
. Otherwise, it is List.hd [a; ... ; b - 1] \\ X1
. In case that X
is empty, the smallest number is a
.
The next questions are: Can we implement the checking that [a; ... ; b - 1] \\ X1
is empty with a linear runtime complexity algorithm (the direct computation, e.g., using List.mem
, takes quadratic runtime in the length of X1
? And how to choose b
such that the recurrence relation of _divice and conquer__ gives us a linear solution?
Because X1
is a list of distinct natural numbers less than b
and [a; ... ; b - 1]
contains b - a
numbers from a
to b - 1
. Hence, to check that [a; ... ; b - 1] \\ X1
is empty, we only need to compute the length of X1
and show that it equals to b - a
. This algorithm takes linear runtime in the length of X1
in the worst-case. For the second question, if we choose b = a + 1 + len div 2
where len
is the length of X
, then we have length of the first partition is smaller or equal to len div 2
, the length of the second partition is len - len div 2 - 1
and smaller than len div 2
. Hence, we get the following recurrence relation:
T(len) = T(len div 2) + O(len)
By the Master Theorem, the solution is T(len) = O(len)
. The implementation using List.partition
function is given as follows.
let rec smallest_from a l =
match l with
| [] -> a
| list ->
let b = a + 1 + List.length l / 2 in
let x1, x2 = List.partition (fun y -> (y < b)) list in
if (List.length x1 = b - a) then
smallest_from b x2
else
smallest_from a x1
in
let divide_smallest l =
smallest_from 0 l
We can avoid to compute the length of the input list in the function smallest_from
by giving it as an argument like
let rec smallest_from a (len, l) =
if len = 0 then a
else
begin
let b = a + 1 + len / 2 in
let x1, x2 = List.partition (fun y -> (y < b)) l in
let len1 = List.length x1 in
if len1 = (b - a) then
smallest_from b ((len - len1), x2)
else
smallest_from a (len1, x1)
end
in
let divide_smallest l =
smallest_from 0 (List.length l, l)
As before we run the function test
with divide_smallest
as parameter to see its execution time. The results show that if we increase the length of the input list 10
times (from 1000000
to 10000000
), the the runtime is 4.570747
compared to 0.461642
. Thus it is almost 10
time longer that reflects the linear complexity.
We also see that this algorithm scales much better than the quadratic version. For instance, for the list of 100000000
elements, it takes only 94.211405
seconds. It turns out that it is a bit less efficient than the array-based solution.
test divide_smallest 1000000;;
Execution time: 0.461642s
- : int = 10001
test divide_smallest 10000000;;
Execution time: 4.570747s
- : int = 100001
test divide_smallest 100000000;;
Execution time: 47.099375s
- : int = 100000001