Understanding Hashcons
Motivation
Aside. Data, value, object, component in this post means the same thing but from different perspectives.
Folklore. Hash-consing, as the name may suggest, is a hash value cons-ed (appended) after the immutable data (hash key). The immediate benefit is to reuse the hash value for the unchanged data. When used in recursive datatypes, the hash value of a data can be computed from the new payload part and the recursive part whose hash value is already cons-ed.
One-step further. It's obvious hash-consing is straightforward for immutable data. However, with immutable data, more aggresive designs can be made. Hash-consing libraries in real-world usually coincide with Flyweight pattern. They have the same targets:
- Encapsulate the object creation so that any distinct object is created just once.
- Generate and save a unique id to identify distinct objects.
- Save the hash value in the objects and choose a hash function that is aware of the saved hash values for its recursive components.
Some Concept (or Implementation) Details
Hash functions. With a unique id cons-ed, the datatype can provide a new equal
function and a new hash
function based on this id. In whole scenarios, three hash
functions can be used.
- Global
hash
. Language provide e.g.Hashtbl.hash
. It's used internally for convenience. - Data's
hash
. The hash function for your hash-consed data provided by users. - Hash-consing library
hash
. The hash function for your hash-consed data provided by the library.
The existence and difference between 2 and 3 is not so clear depending on the library design. However, the ultimate target is to provide a better 2 with or without 3 than the old structural hash
.
Weak references. Hash-consing library needs to provide an internal store to save the unique objects. This storage can be a weak array or a weak hash table. The internal storage should be weak because it should not prevent the garbage collection if any elements are not used outside.
Hash collision. Hash collision and hash resize is handled during the process to generate the unique id with the internal storage. The data's equal
and data's hash
function will be used. Data's hash
collision outside of the internal storage is not concerned here.
Structural equality. A natural consequence of unique objects and ids is the structural equality check between two values can be replaced by physical equality. Refresh: structural equality checks whether two values per-component of their structures. Physical equality checks whether two values are at the same memory address. If all the objects is created inside the hash consing library, strutural equality can be replaced by phisical equality. =
in OCaml approximates a structural equality check. ==
in OCaml is phisical equality check.
Reading library code
Now we are ready to look closely into two OCaml libraries backtracking/ocaml-hashcons
(hashcons
on opam) and fpottier/fix
with Fix.HashCons
(fix
on opam).
backtracking/ocaml-hashcons
The repo is backtracking/ocaml-hashcons.
hashcons.mli
defines the type:
type +'a hash_consed = private {
hkey: int;
tag : int;
node: 'a;
}
It's a bit subtle since hkey
is computed value from the user-provided H.hash
on the data before adding to the internal store. tag
is the unique id which is either old from a previous object or new if added. node
is the data before hash-consed.
The following code snippets are from test.ml
:
open Hashcons
(* a quick demo of Hashcons using lambda-terms *)
type node =
| Var of string
| App of term * term
| Lam of string * term
and term = node hash_consed
(* the key here is to make a O(1) equal and hash functions, making use of the fact that sub-terms are already hash-consed and thus we can
1. use == on sub-terms to implement equal
2. use .tag from sub-terms to implement hash
*)
module X = struct
type t = node
let equal t1 t2 = match t1, t2 with
| Var s1, Var s2 -> s1 = s2
| App (t11, t12), App (t21, t22) -> t11 == t21 && t12 == t22
| Lam (s1, t1), Lam (s2, t2) -> s1 = s2 && t1 == t2
| _ -> false
let hash = function
| Var s -> Hashtbl.hash s
| App (t1, t2) -> t1.tag * 19 + t2.tag
| Lam (s, t) -> Hashtbl.hash s * 19 + t.tag
end
module H = Make(X)
let ht = H.create 17
let var s = H.hashcons ht (Var s)
let app t1 t2 = H.hashcons ht (App (t1,t2))
let lam s t = H.hashcons ht (Lam (s,t))
let x = var "x"
let delta = lam "x" (app x x)
let omega = app delta delta
let () = assert (var "x" == x)
let () = assert (app x x == app x x)
X.hash
is the data's hash
. Global hash
is used both in X.hash
inside of H.hashcons
. X.equal
uses physical equality for objects. X.hash
also uses the unique ids for components. The last two assert
s check the objects created at different application shares the same memory addresses.
Module Hashcons
also provides Hset
and Hmap
. They're external containers which is aware of your hash-consed data. Don't confuse them with the internal storage, which is also a hash-based container.
fpottier/fix
The repo is fpottier/fix.
Fix.HashCons
(ml,mli) looks very lightweighted because the internal storage is achieved by another module Fix.MEMOIZER
.
type 'data cell =
{ id: int; data: 'data }
id
is the unique id while data
is your datatype to hash-cons. The another difference worth mentioning is with backtracking/ocaml-hashcons
the user is in change of the objects pool getting from H.create
while with Fix.HashCons
the pool is shared. It's explained in HashCons.ml.
(* M : MEMOIZER *)
let make =
M.memoize (fun data -> { id = gensym(); data })
MEMOIZER
is a module relying on the user-provide Map.S
which contains find
and add
. No data's hash
is required, but making MEMOIZER
requires a HashedType
. The result module of HashCons.Make
provides a hash
which relies on the unique id.
The demo code demos/hco
HashConsDemo.ml is challenging to read if one is not aware of these hash
functions in use.
open Fix
module MySkeleton = struct
type 'a t =
| Leaf
| Node of int * 'a * 'a
let equal equal sk1 sk2 =
match sk1, sk2 with
| Leaf, Leaf ->
true
| Node (x1, l1, r1), Node (x2, l2, r2) ->
x1 = x2 && equal l1 l2 && equal r1 r2
| Node _, Leaf
| Leaf, Node _ ->
false
let hash hash sk =
match sk with
| Leaf ->
0
| Node (x, l, r) ->
x + hash l + hash r
end
type tree =
skeleton HashCons.cell
and skeleton =
S of tree MySkeleton.t [@@unboxed]
module M =
HashCons.ForHashedTypeWeak(struct
type t = skeleton
let equal (S sk1) (S sk2) =
MySkeleton.equal HashCons.equal sk1 sk2
let hash (S sk) =
MySkeleton.hash HashCons.hash sk
end)
let leaf () : tree =
M.make (S MySkeleton.Leaf)
let node x l r : tree =
M.make (S (MySkeleton.Node (x, l, r)))
let example() =
node 0
(leaf())
(leaf())
let () =
assert (example() == example());
Printf.printf "Size of example tree is %d.\n" (size (example()));
print_endline "Success."
MySkeleton
provides equal
and hash
, which require another equal
and hash
as arguments respectively. The hash-consing aware equal
and hash
is provided in HashCons.ForHashedTypeWeak
. The resulting code shares the same motivation as the previous demo.
Summary and Questions (To-do)
I am short on time to polish the post and complete all my to-do now. My motivation is the fixed point computation runs very slow on several test cases, and the profiling results show it's doing too much repeated structural hashing. The post mainly shares how to understand hash-consing libraries. If I may, I will rename both hash consing
or flyweight
to with_unique
or with_uid
.
Besides these two OCaml libraries, I also refer to wiki/Hashconing and papers
- Sylvain Conchon and Jean-Christophe FilliĆ¢tre. Type-Safe Modular Hash-Consing. In ACM SIGPLAN Workshop on ML, Portland, Oregon, September 2006.
- Implementing and reasoning about hash-consed data structures in Coq
How to bring hash-consing to Core
is my immediate problem.
Furthermore, as string intern is widely used for many languages, why is hash-consing not the default implementation inside OCaml compilers?
The functions on the raw types need to be reimplemented with the hash cons-ed type with just tedious fmap. Is it another case for the Expression Problem or Open Recursion?
I am also interested in the implementation of camlp5/pa_ppx_hashcons and coq-core/Hashcons.