DIY fast random tree generation

DIY fast random tree generation



Berlin - Feb 24, 2018
;; A live-coding presentation made with klipse ;; (thank you Yehonathan/viebel!) (defn showme [s] [:h3 (str s)]) [:div (showme (js/Date.))]







Want to play with the presentation ?
   ⇒ https://fredokun.github.io/talk-clojureD-2018
      (repo: https://github.com/fredokun/talk-clojureD-2018)

(... thanks to Antoine & Mathieu (maths guys), Hiram and the Paris Clojure Meetup, and Yeonathan for Klipse...)
  • associate professor at Sorbonne University (ex-UPMC)
  • researcher at the Lip6 computer science laboratory
  • (live) programming & maths geek
  • long-time Lisper (scheme, CL, clojure(script))

Warning! this presentation is code heavy! The whole source code is in the slides
⇒ the final generator is about 100 lines of Clojure …​

The random source …​

Our goal is to generate (large) trees…​

It all starts with a good random source.
(goodness: uniformity, reproducibilty, efficiency
  ⇒ cf. https://www.gigabytes.it/data/prng/slides.pdf)

For data generation we’ll go a long way with a good Pseudo Random Number Generator (PRNG).
   ⇒ there’s a good one in test.check

(require '[clojure.test.check.random
         :refer [make-random  ;; create source with seed
	         split        ;; two generators from one
		 rand-double  ;; uniform double in range [0.0;1.0[
		 ]])
""

Our source: an immutable PRNG

;; generate a double between 0.0 (inclusive) and 1.0 (exclusive)
(defn next-double [src]
  (let [[src' src''] (split src)] ;; XXX: throw src?
    [(rand-double src') src'']))

(next-double (make-random 424242))
""

Generating trees

Why ?
- trees everywhere:
  → elements/compounds (files/directories, shapes/groups,…​)
  → structured documents (sections/subsections,…​)
  → tree-shaped datastructures
  → expression trees (generating programs?),
  → etc.
- a non-trivial case study for uniform random generation
- beautiful maths and algorithms

How ?
- use spec and test.check?
- ad-hoc algorithms for specific cases: binary trees and general trees
- generic algorithms: recursive method, boltzmann sampling, …​

Tree generation with spec

A spec of binary trees …​

(require '[clojure.spec :as s])

;; a spec for binary trees (with int labels)
(s/def ::bintree
  (s/or :tip nil?
        :node (s/tuple ::label ::bintree ::bintree)))
(s/def ::label int?)

;; example
(def ex-btree [1,
               [2 nil nil],
               [3 [4 nil,
                   [5 nil nil]],
               [6 nil nil]]])

(s/valid? ::bintree ex-btree)

Random generation with spec (via test.check):

(s/exercise ::bintree 1)
""

Observations
- non-uniform generation: seems to be biased for small-ish trees
- lack of control: what if I want a bigger tree?

Generating binary trees with test.check

test.check has dedicated support for recursive structures

(require '[clojure.test.check.generators :as gen])

(def node-gen (fn [inner-gen]
                (gen/tuple gen/int inner-gen inner-gen)))

(def bt-gen (gen/recursive-gen node-gen (gen/return nil)))

(gen/generate bt-gen 10)
""

Observations
- non-uniform generation: also small-ish trees
- lack of control: the size parameter does not seem to "work" (for trees)…​

Uniformity?

Unbiased sampling means sampling in the uniform distribution.

Defined for a combinatorial class ⇒ e.g. the integers in interval [1;100]

  • each object has a finite size n ⇒ all integers in the interval have size 1
  • there is a finite number Cn of objects of size n ⇒ there are 100 integers of size 1

Uniform distribution: the probability of sampling an object
of size n is (/ 1 Cn)
⇒ the probability of sampling an integer in the interval is (/ 1 100)

Step 1: Tree grammars

We spec a simple (map-based) DSL for tree grammars.

(s/def ::grammar (s/map-of keyword? ::elem))

(s/def ::elem (s/or :neutral ::neutral
                    :atom ::atom
                    :sum ::sum
                    :prod ::prod
                    :ref ::ref))

(s/def ::neutral #{1}) ;; ≅ empty
(s/def ::atom #{'z})   ;; an atom has size 1
;; (+ <e1> <e2> ...) either ... or ...
(s/def ::sum (s/cat :sum #{'+} :elems (s/+ ::elem)))
;; (* <e1> <e2> ...) tupling
(s/def ::prod (s/cat :prod #{'*} :elems (s/+ ::elem)))
;; recursion
(s/def ::ref keyword?)
""

Example 1: binary trees

(def bt-gram '{:btree (+ :tip :node)
               :node (* z :btree :btree)
               :tip (* 1)})

(s/valid? ::grammar bt-gram)

Example 2: general trees

(def gt-gram '{:gtree (* z :gtrees)
               :gtrees (+ 1 :forest)
	       :forest (* z :gtree :gtrees)})

(s/valid? ::grammar gt-gram)

Non-trivial tree grammars

Example:

(def tt-gram '{:ttree (+ :one :two :three)
               :one (* z)
               :two (* z z :ttree :ttree)
	       :three (* z z z :ttree :ttree :ttree)})

(s/valid? ::grammar tt-gram)

Remark: z is "zed" …​ power of z = size

thus in english: trees with internal nodes of arity either 2 and 3
such that:
- the leaves have size 1,
- the binary nodes have size 2
- and the ternary nodes size 3.

Step 2: From tree grammars to functional equations

The tree grammars look suspiciously like equations…​

'{:btree (+ :tip (* z :btree :btree))
  :tip (* 1)}
""

… something like btree(z) = 1 + z * btree(z)²

Incremental evaluation of tree expressions …​

(defn eval-elem [elem z prev]
  (cond
    (= elem 1) 1.0
    (= elem 'z) z
    (keyword? elem) (get prev elem)
    :else (let [[op & args] elem]
      (case op
        + (apply + (map #(eval-elem % z prev) args))
        * (apply * (map #(eval-elem % z prev) args))))))

(eval-elem '(* z :btree :btree) 0.25 {:btree 2.0, :tip 1.0, :node 2.0})
""

… and tree grammars

(defn mapkv [f m]
  (into {} (map (fn [[k v]] (f k v)) m)))

(defn eval-grammar [grammar z prev]
  (mapkv (fn [ref elem] [ref (eval-elem elem z prev)]) grammar))

(eval-grammar bt-gram 0.25 {:btree 2.0 :tip 1.0 :node 2.0})
""

Boltzmann sampling

Sampling of an object with a Boltmann model (tree expression) C of parameter z:


  • if C is 1 (a constant) then return the empty object (of size 0)
  • if C is z (an atom) then return the corresponding object of size 1
  • if C is (+ A B) (disjoint sum) then:
            ⇒ return an object of A with probability (/ (A z) (B z))
            ⇒ otherwise return an object of B
  • if C is (* A B) generate a pair [a b] with a an object of A and b an object of B

Guarantee: the sampling is uniform


Remark: the control parameter is z, not (directly) the size of the tree.

Questions:
- what should be the value of z?
- how to compute efficiently the probabilities for disjoint sums?

Singular sampling

What the underlying maths tell us about the functional equations
corresponding to tree grammars:
- they converge (i.e. are analytic) in 0
- there exists a radius of convergence between 0 (inclusive) and 1.0 (exclusive)

⇒ The singularity is the limit of convergence


Theorem: By choosing the singularity as the Boltzmann parameter,
the expected size of an object generated by a Boltmann sampler is infinite.


Singular Boltzmann sampling = generating objects near or at the singularity.

Question: how to find the singularity for a given tree grammar?

      ⇒ numerical computation

Step 3 : Singularity oracle

(declare iter)     ;; newton iteration
(declare diverge?) ;; divergence (we went too far)

;; Mr Oracle, please find the singularity
(defn oracle [class zmin zmax eps-iter eps-div debug?]
  (when debug? (println "[search] zmin=" zmin "zmax=" zmax))
  (if (< (- zmax zmin) eps-iter)
    [zmin (iter class zmin eps-div true)]
    (let [z (/ (+ zmin zmax)
               2.0)
          v (iter class z eps-div false)]
      (when debug? (println "  => z=" z "v=" v))
      (if (diverge? v eps-div)
        (recur class zmin z eps-iter eps-div debug?)
        (recur class z zmax eps-iter eps-div debug?)))))



;; (oracle tt-gram 0.0 1.0 0.00001 0.00000001 true)

;; (oracle gt-gram 0.0 1.0 0.001 0.000001 true)

;; (oracle bt-gram 0.0 1.0 0.001 0.000001 true)

""

Convergence vs. divergence: the details

⇒ Newton iteration, cf. SICP section 1.1.7

;; distance between vectors v1 and v2
(defn norm [v1 v2]
  (reduce-kv (fn [norm elem y1]
               (let [y2 (get v2 elem)
                     y (Math/abs (- y1 y2))]
                 (if (> y norm) y norm))) 0.0 v1))

(norm {:a 0.1 :b 0.3} {:a 0.2 :b -0.2})
""
;; iteration until distance is less than `eps` (thank you Mr. Newton)
(defn iter [gram z eps debug]
  (loop [v1 (mapkv (fn [k _] [k 0.0]) gram)]
    (let [v2 (eval-grammar gram z v1)]
      ;; (when debug (println "[iter] v2=" v2 "norm=" (norm v1 v2)))
      (if (<= (norm v1 v2) eps)
        v2
        (recur v2)))))

;; (oracle bt-gram 0.0 1.0 0.001 0.01 false)
""
(defn some-kv [pred m]
  (reduce-kv (fn [res k v]
               (let [r (pred k v)]
                 (if r
                   (reduced r)
                   res))) nil m))

;; vector has diverged wrt. eps?
(defn diverge? [v eps]
  (some-kv (fn [_ w] (or (< w 0.0) (> w (/ 1.0 eps)))) v))

""

Step 5: Weighted grammars

An interesting pre-computation is to calculate the weights of
the operands of disjoint sums (+ operator)

(defn weighted-args [args z weights]
  (let [eargs (mapv (fn [arg] [arg (eval-elem arg z weights)]) args)
        total (apply + (map second eargs))]
    (loop [eargs eargs, acc 0.0, wargs []]
      (if (seq eargs)
        (let [[arg weight] (first eargs)
              acc' (+ acc weight)]
          (recur (rest eargs) acc' (conj wargs [arg (/ acc' total)])))
        ;; no more arg
        wargs))))

(defn weighted-elem [elem z weights]
  (cond
    (#{1 'z} elem) elem
    (keyword? elem) elem
    :else (let [[kind & args] elem]
            (case kind
              * elem
              + (cons '+ (weighted-args args z weights))))))

(defn weighted-gram [class z weights]
  (mapkv (fn [ref elem] [ref (weighted-elem elem z weights)]) class))
""

Tree grammar examples

;; binary trees
(let [[z v] (oracle bt-gram 0.0 1.0 0.00001 0.000001 false)]
  (def bt-sing z)
  (def bt-wgram (weighted-gram bt-gram z v)))

;; general trees
(let [[z v] (oracle gt-gram 0.0 1.0 0.00001 0.000001 false)]
  (def gt-sing z)
  (def gt-wgram (weighted-gram gt-gram z v)))

;; one-two-three trees
(let [[z v] (oracle tt-gram 0.0 1.0 0.00001 0.000001 false)]
  (def tt-sing z)
  (def tt-wgram (weighted-gram tt-gram z v)))
""
tt-gram

Non-deterministic choice

(defn choose [src choices]
  (let [[x src'] (next-double src)]
    (some (fn [[elem proba]]
            (and (<= x proba) [src' elem]))
	  choices)))

(choose (make-random) [[:one 0.5] [:two 0.8] [:three 1.0]])
""

Step 5: Boltzmann sampler

Based on the weighted grammer, the random generator is rather simple
(but there’s an issue lurking, can you find it?)

(defn gentree [src wgram label elem]
  (cond
    (= elem 1) [src nil]
    (= elem 'z) [src 'z]
    (keyword? elem) (recur src wgram elem (get wgram elem))
    :else (let [[op & args] elem]
      (case op
        + (let [[src' choice] (choose src args)]
            (recur src' wgram label choice))
        * (reduce (fn [[src v] arg]
            (let [[src' res] (gentree src wgram label arg)]
	      (if (#{nil 'z} res)
                [src' v]
                [src' (conj v res)]))) [src [label]] args)))))

;; (gentree (make-random) tt-wgram :ttree :ttree)
""

Problem: the generated tree is very often very small (e.g. size 1)
and could be rather infrequently very (very very …​) large because:
- the Boltzmann distribution of sizes is largely biased ("peaked") towards small trees
- the expected size is infinite

   ⇒ how to control the size?

Idea 1: Computing the size without building the tree

but be careful with (very very …​) large trees

;; (tail recursive) generation of size
(defn gensize
  [src wgram maxsize elem]
   (loop [src src, elem elem, size 0, cont '()]
     (if (>= size maxsize) ;; <== important!
       [-1 src]
       (cond
         (nil? elem) (if (seq cont)
                        (recur src (first cont) size (rest cont))
                        [size src])
         (= elem 1) (recur src nil size cont)
         (= elem 'z) (recur src nil (inc size) cont)
	 (keyword? elem) (recur src (get wgram elem) size cont)
	 :else
         (let [[op & args] elem]
           (case op
             + (let [[src' elem'] (choose src args)]
                 (recur src' elem' size cont))
             * (recur src (first args) size (concat (rest args) cont))))))))

;; (gensize (make-random) tt-wgram 1000 :ttree)
""

Problem: many trees are small, the random source must be "heated" to
obtain better results

Heating the source

(defn gensizes [src wgram maxsize elem]
  (let [[size src'] (gensize src wgram maxsize elem)]
    (if (> size 0)
      (lazy-seq (cons [size src] (gensizes src' wgram maxsize elem)))
      (recur src' wgram maxsize elem))))

(take 20 (map first
  (gensizes (make-random) tt-wgram 1000 :ttree)))
""

Idea 2: reject small-ish trees

⇒ this is called rejection sampling …​ but that’s just filter!

(take 15 (map first (filter (fn [[size _]] (> size 100))
  (gensizes (make-random) tt-wgram 1000 :ttree))))
""

Final step (?): The Boltzmann sampler

(defn boltzmann [src wgram sing elem min-size max-size]
  (let [[size src'] (first (filter (fn [[size _]] (>= size min-size))
	                      (gensizes src wgram max-size elem)))
	[src'' tree] (gentree src' wgram elem elem)]
    [src'' size tree]))
""

but wait! There’s something’s fishy …​

    ⇒ Our generator cannot generate a tree with one million nodes because …​

because ???

because gentree is not tail-recursive!

(bummer!)

Tree generation (tail recursive)

Looking for a Clojure Kata?

(defn gentree' [src wgram maxsize elem]
  (loop [src src, label elem, elem elem, size 0, cont '()]
    (cond
      (>= size maxsize) [nil src]
      (keyword? elem) (recur src elem (get wgram elem) size cont)
      (= elem 1) (recur src nil nil size cont)
      (seq? elem)
      (let [[op & args] elem]
        (case op
          + (let [[src' elem'] (choose src args)]
              (recur src' label elem' size cont))
          * (recur src label (first args) size (cons [label (rest args)] cont))))
      :else
      (let [[[node rst] & cont'] cont]
        (cond
          (= elem 'z) (if (vector? node)
                        (recur src nil nil (inc size) cont)
                        (recur src nil nil (inc size) (cons [[node] rst] cont')))
          (nil? elem)
          (cond
            (seq rst) (recur src nil (first rst) size (cons [node (rest rst)] cont'))
            (seq cont')  (let [[[node' rst'] & cont''] cont']
                           (recur src nil nil size (cons [(conj node' node) rst'] cont'')))
            :else [src node]))))))
""

Final final step: A generic uniform random tree generator

(defn boltzmann' [src wgram sing elem min-size max-size]
  (let [[size src'] (first (filter (fn [[size _]] (>= size min-size))
	                      (gensizes src wgram max-size elem)))
	[src'' tree] (gentree' src' wgram elem elem)]
    [src'' size tree]))
""

Generating one-two-three’s

;; (boltzmann' (make-random) tt-wgram tt-sing :ttree 9 10)
""

Generating binary trees

;; (boltzmann' (make-random) bt-wgram bt-sing :btree 10 30)
""

Generating general trees

;; (boltzmann' (make-random) gt-wgram gt-sing :gtree 4 100)
""

Wrapup

  • random generation is (beautiful) science, art, and craft
  • unbiased sampling should be the default (bias should be intentional)
  • better automated testing is possible
  • …​ the missing part: generation by the recursive method

(a lot) More on the topic?

Boltzmann Samplers for the Random Generation of Combinatorial Structures

Duchon, Flajollet, Louchard and Shaeffer, 2004

Thank you!

powered by Klipse /