The Demonstration Language

Delphyne includes a demonstration language for writing and maintaining few-shot prompting examples, in the form of coherent scenarios of navigating search trees. The demonstration language is amenable to a test-driven development workflow. This workflow is supported by a dedicated VSCode extension, which is described in the next chapter.

Demonstration Files

Demonstrations can be written in demonstration files with a .demo.yaml extension. A demonstration file features a list of demonstrations (Demo). Each demonstration can be separately evaluated. Many short examples can be found in the demonstration file from Delphyne's test suite:

Source for tests/example_strategies.demo.yaml

#####
##### Unit Tests for the Demonstration Language
#####

# Note: some demonstrations feature errors or failing tests.
# Some demonstrations also contain an additional `expect` section
# describing parts of the expected demonstration interpreter feedback.


##### Used for testing the UI

- strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run
  queries: []


##### Testing standalone query demos


- demonstration: MakeSum_demo
  query: MakeSum
  args:
    allowed: [9, 6, 2]
    goal: 11
  answers:
    - answer: "[9, 2]"
    - label: "wrong_sum"
      answer: "[9, 6]"

- demonstration: Unknown_query
  query: Unknown
  args: {}
  answers: []


##### Testing make_sum


- demonstration: make_sum_demo
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run | success
    - take '' | success  # same result as above but more fragile
    - run 'wrong_sum' | success  # error
    - run 'alt_order'
    - take cands{'alt_order'} | success
    - failure | run  # error
  queries:
    - query: MakeSum
      args:
        allowed: [9, 6, 2]
        goal: 11
      answers:
        - answer: "[9, 2]"
        - label: "wrong_sum"
          answer: "[9, 6]"
        - label: "alt_order"
          answer: "[2, 9]"
  expect:
    trace:
      nodes:
        1: {kind: Branch}
        2: {kind: Success}
        3: {kind: Fail}
        4: {kind: Success}
    test_feedback:
      - node_id: 2
        diagnostics: __empty__
      - node_id: 2
        diagnostics: __empty__
      - node_id: 3
        diagnostics:
          - severity: "error"
            message: "Success check failed."
      - node_id: 4
        diagnostics: __empty__
      - node_id: 4
        diagnostics: __empty__
      - node_id: 1
        diagnostics:
          - severity: "error"
            message: "Failure check failed."


- demonstration: make_sum_selectors
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run 'wrong_sum'
    - run 'unknown_hint'
  queries:
    - query: MakeSum
      args:
        allowed: [9, 6, 2]
        goal: 11
      answers:
        - answer: "[9, 2]"
        - label: "wrong_sum"
          answer: "[9, 6]"
  expect:
    trace:
      nodes:
        1: {kind: Branch}
        2: {kind: Fail}
        3: {kind: Success}
    test_feedback:
      - node_id: 2
      - node_id: 3
        diagnostics:
          - severity: "warning"
            message: "Unused hints: 'unknown_hint'."


- demonstration: make_sum_at
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - at Unknown
    - at MakeSum | run
  queries:
    - query: MakeSum
      args:
        allowed: [9, 6, 2]
        goal: 11
      answers:
        - answer: "[9, 2]"
        - label: "wrong_sum"
          answer: "[9, 6]"
  expect:
    trace:
      nodes:
        1: {kind: Branch}
        2: {kind: Success}
    test_feedback:
      - node_id: 2
        diagnostics:
          - severity: "warning"
            message: "Leaf node reached before 'Unknown'."
      - diagnostics: __empty__


- demonstration: make_sum_stuck
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run
  queries: []
  expect:
    trace:
      nodes: {1: {kind: Branch}}
    test_feedback:
      - diagnostics:
          - severity: "warning"
            message: "Test is stuck."
        node_id: 1


- demonstration: make_sum_test_parse_error
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - bad_command
  queries: []
  expect:
    test_feedback:
      - diagnostics:
          - severity: "error"
            message: "Syntax error."


- demonstration: trivial_strategy
  strategy: trivial_strategy
  args: {}
  tests: [run]
  queries: []
  expect: {trace: {nodes: {1: {kind: Success}}}}


- demonstration: buggy_strategy
  strategy: buggy_strategy
  args: {}
  tests: [run]
  queries: []
  expect:
    global_diagnostics:
      - severity: "error"


- demonstration: strategy_not_found
  strategy: unknown_strategy
  args: {}
  tests: [run]
  queries: []
  expect:
    global_diagnostics:
      - severity: "error"


- demonstration: invalid_arguments
  strategy: make_sum
  args:
    bad_arg: "foo"
  tests: [run]
  queries: []
  expect:
    global_diagnostics:
      - severity: "error"


- demonstration: unknown_query
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run
  queries:
    - query: MakeSum
      args:
        allowed: [9, 6, 2]
        goal: 11
      answers: []
    - query: UnknownQuery
      args: {}
      answers: []
  expect:
    query_diagnostics:
      - [1, severity: "error"]


- demonstration: invalid_answer
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run
  queries:
    - query: MakeSum
      args:
        allowed: [9, 6, 2]
        goal: 11
      answers:
        - answer: "'foo'"
  expect:
    answer_diagnostics:
      - [[0, 0], severity: "error"]


##### Testing synthetize_fun


- demonstration: synthetize_fun_demo
  strategy: synthetize_fun
  args:
    vars: &vars_1 ["x", "y"]
    prop: &prop_1 [["a", "b"], "F(a, b) == F(b, a) and F(0, 1) == 2"]
  tests:
    - run | success
    - run 'invalid' | failure
    - at conjecture_expr | go disprove('wrong1') | save wrong1
    - load wrong1 | run | success
    - load wrong1 | run 'bad_cex' | failure
    - load wrong1 | run 'malformed_cex' | failure
    - at conjecture_expr | go aggregate(['', 'wrong1', 'wrong2'])
    - at conjecture_expr | go aggregate(['', 'unknown', 'wrong2'])
    - at conjecture_expr | answer aggregate(['wrong1', 'wrong2'])
    - at conjecture_expr | answer aggregate(['', 'wrong1', 'wrong2'])
  queries:
    - query: ConjectureExpr
      args: {vars: *vars_1, prop: *prop_1}
      answers:
        - label: right
          answer: "2*(x + y)"
        - label: wrong1
          answer: "x + 2*y"
        - label: wrong2
          answer: "2*y + x"
        - label: invalid
          answer: "sys.exit()"
    - query: ProposeCex
      args: {prop: *prop_1, fun: [[x, y], "x + 2*y"]}
      answers:
        - answer: "{a: 0, b: 1}"
        - label: malformed_cex
          answer: "{x: 1, y: 1}"
        - label: bad_cex
          answer: "{a: 0, b: 0}"
    - query: RemoveDuplicates
      args:
        exprs: ["2*(x + y)", "x + 2*y", "2*y + x"]
      answers:
        - answer: '["2*(x + y)", "x + 2*y"]'
  expect:
    test_feedback:
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics:
        - severity: "error"
          message: "Not a nested tree: aggregate(['', 'wrong1', 'wrong2'])."
      - diagnostics:
        - severity: "warning"
          message: "Unused hints: 'unknown'."
      - diagnostics:
        - severity: "error"
      - diagnostics: __empty__

    global_diagnostics: __empty__
    saved_nodes: {wrong1: __any__}


##### Testing pick_nice_boy_name


- demonstration: test_iterate
  strategy: pick_nice_boy_name
  args:
    names: ["Adeline", "Noah", "Julia", "Jonathan"]
  tests:
    - run | success
    - run 'girl_name' | failure
    - run 'other_boy_name' | failure
    - go cands | go next(nil)
    - go cands | go next(next(nil){'other_boy_name'}[1]) | run | success
    - go cands | go next(next(next(nil){'other_boy_name'}[1]){''}[1]) | run | failure
    # Valid selectors
    - at iterate
    - at pick_boy_name  # synonym because `iterate` uses `inherit_tags`
    - at iterate&pick_boy_name
    - at iterate/pick_boy_name/PickBoyName
    # Invalid selectors
    - at iterate/pick_boy_name  # `Iteration` node has no tags or primary space

    # Mistakenly send the wrong value to `next` by using index 0 instead of 1. The strategy raises
    # an exception because it explicitly checks the type of its arguments but without this
    # assertion, we would get stuck on a query with ill-typed arguments.
    - go cands | go next(next(nil){'other_boy_name'}[0]) | run | success
  queries:
    - query: PickBoyName
      args:
        names: [Adeline, Noah, Julia, Jonathan]
        picked_already: []
      answers:
      - answer: "Jonathan"
      - label: girl_name
        answer: "Julia"
      - label: other_boy_name
        answer: "Noah"
    - query: PickBoyName
      args:
        names: [Adeline, Noah, Julia, Jonathan]
        picked_already: [Noah]
      answers:
        - answer: "Jonathan"
    - query: PickBoyName
      args:
        names: [Adeline, Noah, Julia, Jonathan]
        picked_already: [Noah, Jonathan]
      answers:
        - answer: "Sigmund"

  expect:
    test_feedback:
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics:
          - severity: "warning"
      - diagnostics:
          - severity: "error"


##### Testing generate_pairs and advanced selectors

- demonstration: test_generate_pairs
  strategy: generate_pairs
  args: {}
  tests:
    - run | success
    - at PickPositiveInteger#1
    - at PickPositiveInteger#2
    - at PickPositiveInteger#3  # error
  queries:
    - query: PickPositiveInteger
      args: {prev: null}
      answers:
        - answer: "1"
    - query: PickPositiveInteger
      args: {prev: 1}
      answers:
        - answer: "2"
  expect:
    test_feedback:
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics:
          - severity: "warning"


##### Testing cached computations

- demonstration: comp_result_in_cache
  strategy: test_cached_computations
  args: {n: 2}
  tests:
    - run | success
  queries:
    - query: __Computation__
      args:
        fun: expensive_computation
        args: {n: 2}
      answers:
        - answer: "[2, 3]"
    - query: __Computation__
      args:
        fun: expensive_computation
        args: {n: 3}
      answers:
        - answer: "[3, 4]"
  expect:
    test_feedback:
      - diagnostics: __empty__

- demonstration: comp_result_outside_cache
  strategy: test_cached_computations
  args: {n: 2}
  tests:
    - run | success
  queries: []
  expect:
    test_feedback:
      - diagnostics: __empty__
    implicit_answers:
      computations:
        - query_name: __Computation__
          query_args: {fun: expensive_computation}
        - query_name: __Computation__
          query_args: {fun: expensive_computation}


- demonstration: structured_output
  query: StructuredOutput
  args:
    topic: "Music"
  answers:
    - answer:
        title: "Understanding Bach"
        authors: ["Brigitte Mouterde"]


- demonstration: tool_use
  strategy: propose_article
  args:
    user_name: Jonathan
  tests:
    - run | success
  queries:
    - query: ProposeArticle
      args:
        user_name: Jonathan
        prefix: []
      answers:
        - answer: ""
          call:
            - tool: GetUserFavoriteTopic
              args: {user_name: Jonathan}
    - query: ProposeArticle
      args:
        user_name: Jonathan
        prefix:
          - kind: oracle
            answer:
              mode: null
              content: ""
              tool_calls:
                - name: GetUserFavoriteTopic
                  args:
                    user_name: Jonathan
          - kind: tool
            call:
              name: GetUserFavoriteTopic
              args:
                user_name: Jonathan
            result: Soccer
      answers:
      - answer: ""
        call: [{tool: Article, args: {title: "All about Messi", authors: ["Raf"]}}]
  expect:
    test_feedback:
      - diagnostics: __empty__


- demonstration: flags
  strategy: pick_flag
  args: {}
  tests:
    - run | success
    - run '#alt' | success
    - run '#unk'
  queries: []
  expect:
    test_feedback:
      - diagnostics: __empty__
      - diagnostics: __empty__
      - diagnostics:
        - severity: "warning"


- demonstration: flags_global
  strategy: pick_flag
  args: {}
  tests:
    - run | success
  queries:
    - query: MethodFlag
      args: {}
      answers: [answer: alt]
  expect:
    test_feedback:
      - diagnostics: __empty__


- demonstration: abduction
  strategy: obtain_item
  args:
    market: &market
      - name: Joe
        asked_items: [apple, cherry]
        offered_item: banana
      - name: Eric
        asked_items: []
        offered_item: apple
      - name: Alice
        asked_items: []
        offered_item: cherry
    goal: banana
  tests:
    - run | success
  queries: 
    - query: ObtainItem
      args:
        market: *market
        possessed_items: []
        item: banana
      answers:
        - answer: {items: [apple, cherry]}
  expect:
    test_feedback:
      - diagnostics: __empty__


- demonstration: trivial_untyped_strategy
  strategy: trivial_untyped_strategy
  args:
    string: "hello"
    integer: 42
  tests:
    - run | success
  queries: []
  expect:
    test_feedback:
      - diagnostics: __empty__


- demonstration: make_sum_fetched_answers
  strategy: make_sum
  args:
    allowed: [9, 6, 2]
    goal: 11
  tests:
    - run | success
  using:
    - command: "commands/run_make_sum"
  queries: []
  expect:
    test_feedback:
      - diagnostics: __empty__


# In this demonstration, we use the hindsight feedback from a command 
# to reach a success node using only one answer.
- demonstration: using_hindsight_feedback
  strategy: get_magic_number
  args: {}
  tests:
    - run | success
  using:
    - command: "commands/run_get_magic_number"
      backprop_with: [first]
  queries: []
  expect:
    test_feedback:
      - diagnostics: __empty__
    implicit_answers:
      fetched:
        - answer_content: "83"


# This is a variant of `using_hindsight_feedback`, where this time
# hindsight feedback is explicitly excluded.
- demonstration: without_hindsight_feedback
  strategy: get_magic_number
  args: {}
  tests:
    - run | success
  using:
    - command: "commands/run_get_magic_number"
  queries: []
  expect:  # two implicit answers are used this time
    test_feedback:
      - diagnostics: __empty__
    implicit_answers:
      fetched:
        - query_name: AskNumber
        - query_name: AskNumber


- demonstration: loading_data
  strategy: strategy_loading_data
  args: {key: "yang25"}
  tests:
    - run | success
  queries: []
  expect:  # two implicit answers are used this time
    test_feedback:
      - diagnostics: __empty__
    implicit_answers:
      data: []


- demonstration: loading_bad_data
  strategy: strategy_loading_data
  args: {key: "bazin23"}
  tests:
    - run | success
  queries: []
  expect:  # two implicit answers are used this time
    test_feedback:
      - diagnostics:
        - severity: "error"
    implicit_answers:
      data: []

On Reading Demonstration Files

Demonstration files are much easier to read and understand using Delphyne's VSCode extension. Standard shortcuts can be used to fold and unfold sections. The additional Cmd+D+Cmd+K shortcut can be used to automatically fold all large sections. Demonstrations can be evaluated and the path followed by each test inspected in the extension's Tree View.

A demonstration is either a standalone query demonstration¹ or a strategy demonstration. A query demonstration describes a query instance along with one or several associated answers. A strategy demonstration bundles multiple query demonstrations with unit tests that describe tree navigation scenarios.

Warning

It is possible to specify few shot examples using one standalone query demonstration per example and nothing else. However, doing so is not recommended. Indeed, such demonstrations are harder to write since tooling cannot be leveraged to generate query descriptions automatically. More importantly, they are harder to read and maintain because individual examples are presented without proper context. Strategy demonstrations allow grounding examples in concrete scenarios, while enforcing this relationship through unit tests.

Strategy demonstrations have the following shape:

- demonstration: ...    # optional demonstration name
  strategy: ...         # name of a strategy function decorated with @strategy
  args: ...             # dictionary of arguments to pass to this strategy
  tests:
    - ...
    - ...
  queries:
    - query: ...       # Query name
      args: ...        # Query arguments
      answers:
        - label: ...   # Optional label (to be referenced in tests)
          example: ... # Whether to use as an example (optional boolean) 
          tags: ...    # Optional set of tags
          answer: |
            ...
        - ...

The Delphyne VSCode extension automatically checks the syntactic well-formedness of demonstrations (in addition to allowing their evaluation). For explanations on specific fields, see the API Reference. Tests are expressed using a custom DSL that we describe below.

Demonstration Tests

Evaluating a demonstration consists in evaluating all its tests in sequence. Each test describes a path through the tree, starting from the root. The Delphyne VSCode extension allows visualizing this path. A test can succeed, fail, or be stuck. A test is said to be stuck if it cannot terminate due to a missing query answer. In this case (and as demonstrated in the Overview), the extension allows locating such a query and adding it to the demonstration.

Each test is composed of a sequence of instructions separated by |. The most common sequence by far is run | success, which we describe next.

Walking through the Tree

Starting at the current node, the run instruction uses answers from the queries section to walk through the tree until either a leaf node is reached or an answer is missing (in which case the test is declared as stuck). Each node type (e.g. Branch) defines a navigation function that describes how the node should be traversed.

Navigation Functions

A node's navigation function returns a generator that yields local spaces, receives corresponding elements and ultimately returns an action. This is best understood through examples:

Example: Navigation function for Branch nodes

@dataclass(frozen=True)
class Branch(dp.Node):
    cands: OpaqueSpace[Any, Any]

    @override
    def navigate(self):
        return (yield self.cands)

Example: Navigation function for Join nodes

@dataclass(frozen=True)
class Join(dp.Node):
    subs: Sequence[dp.EmbeddedTree[Any, Any, Any]]

    @override
    def navigate(self):
        ret: list[Any] = []
        for sub in self.subs:
            ret.append((yield sub))
        return tuple(ret)

Whenever run needs to select an element from a space defined by a query, it looks for this query in the demonstration’s queries section and picks the first provided answer. If no answer is found, it gets stuck at the current node. When run encounters a space defined by a tree, it recursively navigates this tree. The run command stops when a leaf is reached. It is often composed with the success command, which ensures that the current node is a success leaf.

Nothing more than run | success is needed to demonstrate taking a direct path to a solution. The more advanced instructions we discuss next are useful to describe more complex scenarios.

Advanced Tests

This section describes more advanced test instructions. In addition to the examples from the test suite, demonstrations with advanced tests are featured in the find_invariants example:

Extract from examples/find_invariants/abduct_and_branch.demo.yaml

- strategy: prove_program_via_abduction_and_branching
  args:
    prog: ... # (1)!
  tests:
    - run | success
    - run 'partial' | success
    # Demonstrating `EvaluateProofState`
    - at EvaluateProofState#1 'partial' | answer eval
    - at EvaluateProofState#2 'partial propose_same' | answer eval
    # Demonstrating `IsProposalNovel`
    - at iterate#1 | go cands | go next(next(nil){'partial'}[1]) | save second_attempt
    - load second_attempt | at IsProposalNovel 'blacklisted' | answer cands
    - load second_attempt | at IsProposalNovel 'not_blacklisted' | answer cands
  queries: ... # (2)!

See details in original file.
See details in original file.

Exploring alternative paths with hints

The run function can be passed a sequence of answer labels as hints, specifying alternate paths through the tree. Whenever a query is encountered, it checks if an answer is available whose label matches the first provided hint. If so, this answer is used and the hint is consumed. For example, instruction run 'foo bar' can be interpreted as:

Walk through the tree, using answer foo whenever applicable and then bar.²

This design allows describing paths concisely, by only specifying the few places in which they differ from a default path. This works well for demonstrations, which typically describe shallow traces centered around a successful scenario, with side explorations (e.g., showing how a bad decision leads to a low value score, or demonstrating how redundant candidates can be removed at a particular step).

Stopping at particular nodes

The at instruction works like run, except that it takes as an additional argument a node selector that specifies a node at which the walk must stop. The simplest form of node selector consists in a tag to match. For example, instruction at EvalProg 'wrong' behaves similarly to run 'wrong', except that it stops when encountering a node tagged with EvalProg. By default, all spaces are tagged with the name of the associated query or strategy, and each node inherits the tags of its primary space if it has one. Custom space tags can be added using the SpaceBuilder.tagged method³. Finally, the #n operator can be used to match the \(n^{th}\) instance of a tag. For example, at PickPositiveInteger#2 stops at the second encountered node tagged with PickPositiveInteger⁴.

Warning

Importantly, at can only stop within the same tree that it started in, and not inside a nested tree. In order to stop at a node tagged with bar within a space tagged with foo, you can use at foo/bar. This design choice is mandated by modularity: individual strategies can be made responsible for setting unambiguous tags for nodes that they control, but cannot ensure the absence of clashing tags in other strategies.

Entering nested spaces

The go instruction allows entering a tree nested within the current node. For example, if the current node is a Conjecture node (defined in tests/example_strategies.py), go cands enters the tree that defines the cands space or errors if cands is defined by a query. This instruction can be shortened as go, since cands is the primary space of Conjecture nodes.

More interestingly, suppose the demonstration already explores two paths within cands that reach different success leaves and thus correspond to two different candidates. Each of these paths can be described through a sequence of hints: the first candidate is identified by '' (i.e. default path) and the second by 'foo' (i.e. use answer 'foo' when appropriate). Then, instruction go aggregate([cands{''}, cands{'foo'}]) can be used to enter the strategy tree comparing those two candidates. It can be shortened to go aggregate(['', 'foo']) since cands is a primary space.

In general, any element of a local space can be referenced via a (possibly empty) sequence of hints. For spaces defined by queries, at most one hint is expected that indicates which answer to use. For spaces defined by trees, a sequence of hints is expected that leads to a success leaf by calling run recursively.

The answer instruction is similar to go. It takes a space selector as an argument but expects to find a query instead of a tree when entering this space. It succeeds if the corresponding query is answered in the demonstration and fails otherwise.

Entering a child

The take instruction can be used to move to a child of a current node. It takes as an argument a hint-based value reference denoting an action.

For example, at a branching node, take 'foo' (a synonym for take cands{'foo'} since cands is the primary space of Branch nodes) goes to the child corresponding to the candidate associated with hint foo. At a Join node with two subgoals, take [subs[0]{''}, subs[1]{''}] goes to the default child (the same one that would be visited using run).

An example of a standalone query demonstration is MakeSum_demo in tests/example_strategies.demo.yaml. ↩
A warning is issued if the run command reaches a leaf node while unused hints remain. ↩
See tests/example_strategies.py:dual_number_generation for an example. ↩
See tests/example_strategies.demo.yaml:test_generate_pairs. ↩