# The Demonstration Language
Delphyne includes a demonstration language for writing and maintaining few-shot prompting examples, in the form of coherent scenarios for navigating search trees. The demonstration language is amenable to a test-driven development workflow, which is supported by a dedicated VSCode extension described in the next chapter.
## Demonstration Files
Demonstrations can be written in demonstration files with a `.demo.yaml` extension. A demonstration file features a list of demonstrations (`Demo`). Each demonstration can be evaluated separately. Many short examples can be found in the demonstration file from Delphyne's test suite:
**Source for `tests/example_strategies.demo.yaml`:**
#####
##### Unit Tests for the Demonstration Language
#####
# Note: some demonstrations feature errors or failing tests.
# Some demonstrations also contain an additional `expect` section
# describing parts of the expected demonstration interpreter feedback.
##### Used for testing the UI
- strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- run
queries: []
##### Testing standalone query demos
- demonstration: MakeSum_demo
query: MakeSum
args:
allowed: [9, 6, 2]
goal: 11
answers:
- answer: "[9, 2]"
- label: "wrong_sum"
answer: "[9, 6]"
- demonstration: Unknown_query
query: Unknown
args: {}
answers: []
##### Testing make_sum
- demonstration: make_sum_demo
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- run | success
- run 'wrong_sum' | success # error
- run 'alt_order'
- failure | run
queries:
- query: MakeSum
args:
allowed: [9, 6, 2]
goal: 11
answers:
- answer: "[9, 2]"
- label: "wrong_sum"
answer: "[9, 6]"
- label: "alt_order"
answer: "[2, 9]"
expect:
trace:
nodes:
1: {kind: Branch}
2: {kind: Success}
3: {kind: Fail}
4: {kind: Success}
test_feedback:
- node_id: 2
diagnostics: __empty__
- node_id: 3
diagnostics:
- ["error", "Success check failed."]
- node_id: 4
diagnostics: __empty__
- node_id: 1
diagnostics:
- ["error", "Failure check failed."]
- demonstration: make_sum_selectors
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- run 'wrong_sum'
- run 'unknown_hint'
queries:
- query: MakeSum
args:
allowed: [9, 6, 2]
goal: 11
answers:
- answer: "[9, 2]"
- label: "wrong_sum"
answer: "[9, 6]"
expect:
trace:
nodes:
1: {kind: Branch}
2: {kind: Fail}
3: {kind: Success}
test_feedback:
- node_id: 2
- node_id: 3
diagnostics:
- ["warning", "Unused hints: 'unknown_hint'."]
- demonstration: make_sum_at
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- at Unknown
- at MakeSum | run
queries:
- query: MakeSum
args:
allowed: [9, 6, 2]
goal: 11
answers:
- answer: "[9, 2]"
- label: "wrong_sum"
answer: "[9, 6]"
expect:
trace:
nodes:
1: {kind: Branch}
2: {kind: Success}
test_feedback:
- node_id: 2
diagnostics:
- ["warning", "Leaf node reached before 'Unknown'."]
- diagnostics: __empty__
- demonstration: make_sum_stuck
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- run
queries: []
expect:
trace:
nodes: {1: {kind: Branch}}
test_feedback:
- diagnostics: [["warning", "Test is stuck."]]
node_id: 1
- demonstration: make_sum_test_parse_error
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- bad_command
queries: []
expect:
test_feedback:
- diagnostics: [["error", "Syntax error."]]
- demonstration: trivial_strategy
strategy: trivial_strategy
args: {}
tests: [run]
queries: []
expect: {trace: {nodes: {1: {kind: Success}}}}
- demonstration: buggy_strategy
strategy: buggy_strategy
args: {}
tests: [run]
queries: []
expect:
global_diagnostics:
- ["error"]
- demonstration: strategy_not_found
strategy: unknown_strategy
args: {}
tests: [run]
queries: []
expect:
global_diagnostics:
- ["error"]
- demonstration: invalid_arguments
strategy: make_sum
args:
bad_arg: "foo"
tests: [run]
queries: []
expect:
global_diagnostics:
- ["error"]
- demonstration: unknown_query
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- run
queries:
- query: MakeSum
args:
allowed: [9, 6, 2]
goal: 11
answers: []
- query: UnknownQuery
args: {}
answers: []
expect:
query_diagnostics:
- [1, ["error"]]
- demonstration: invalid_answer
strategy: make_sum
args:
allowed: [9, 6, 2]
goal: 11
tests:
- run
queries:
- query: MakeSum
args:
allowed: [9, 6, 2]
goal: 11
answers:
- answer: "'foo'"
expect:
answer_diagnostics:
- [[0, 0], ["error"]]
##### Testing synthetize_fun
- demonstration: synthetize_fun_demo
strategy: synthetize_fun
args:
vars: &vars_1 ["x", "y"]
prop: &prop_1 [["a", "b"], "F(a, b) == F(b, a) and F(0, 1) == 2"]
tests:
- run | success
- run 'invalid' | failure
- at conjecture_expr | go disprove('wrong1') | save wrong1
- load wrong1 | run | success
- load wrong1 | run 'bad_cex' | failure
- load wrong1 | run 'malformed_cex' | failure
- at conjecture_expr | go aggregate(['', 'wrong1', 'wrong2'])
- at conjecture_expr | go aggregate(['', 'unknown', 'wrong2'])
- at conjecture_expr | answer aggregate(['wrong1', 'wrong2'])
- at conjecture_expr | answer aggregate(['', 'wrong1', 'wrong2'])
queries:
- query: ConjectureExpr
args: {vars: *vars_1, prop: *prop_1}
answers:
- label: right
answer: "2*(x + y)"
- label: wrong1
answer: "x + 2*y"
- label: wrong2
answer: "2*y + x"
- label: invalid
answer: "sys.exit()"
- query: ProposeCex
args: {prop: *prop_1, fun: [[x, y], "x + 2*y"]}
answers:
- answer: "{a: 0, b: 1}"
- label: malformed_cex
answer: "{x: 1, y: 1}"
- label: bad_cex
answer: "{a: 0, b: 0}"
- query: RemoveDuplicates
args:
exprs: ["2*(x + y)", "x + 2*y", "2*y + x"]
answers:
- answer: '["2*(x + y)", "x + 2*y"]'
expect:
test_feedback:
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics:
- - "error"
- "Not a nested tree: aggregate(['', 'wrong1', 'wrong2'])."
- diagnostics:
- - "warning"
- "Unused hints: 'unknown'."
- diagnostics:
- - "error"
- diagnostics: __empty__
global_diagnostics: __empty__
saved_nodes: {wrong1: __any__}
##### Testing pick_nice_boy_name
- demonstration: test_iterate
strategy: pick_nice_boy_name
args:
names: ["Adeline", "Noah", "Julia", "Jonathan"]
tests:
- run | success
- run 'girl_name' | failure
- run 'other_boy_name' | failure
- go cands | go next(nil)
- go cands | go next(next(nil){'other_boy_name'}[1]) | run | success
- go cands | go next(next(next(nil){'other_boy_name'}[1]){''}[1]) | run | failure
# Valid selectors
- at iterate
- at pick_boy_name # synonym because `iterate` uses `inherit_tags`
- at iterate&pick_boy_name
- at iterate/pick_boy_name/PickBoyName
# Invalid selectors
- at iterate/pick_boy_name # `Iteration` node has no tags or primary space
# Mistakenly send the wrong value to `next` by using index 0 instead of 1. The strategy raises
# an exception because it explicitly checks the type of its arguments but without this
# assertion, we would get stuck on a query with ill-typed arguments.
- go cands | go next(next(nil){'other_boy_name'}[0]) | run | success
queries:
- query: PickBoyName
args:
names: [Adeline, Noah, Julia, Jonathan]
picked_already: []
answers:
- answer: "Jonathan"
- label: girl_name
answer: "Julia"
- label: other_boy_name
answer: "Noah"
- query: PickBoyName
args:
names: [Adeline, Noah, Julia, Jonathan]
picked_already: [Noah]
answers:
- answer: "Jonathan"
- query: PickBoyName
args:
names: [Adeline, Noah, Julia, Jonathan]
picked_already: [Noah, Jonathan]
answers:
- answer: "Sigmund"
expect:
test_feedback:
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics:
- - "warning"
- diagnostics:
- - "error"
##### Testing generate_pairs and advanced selectors
- demonstration: test_generate_pairs
strategy: generate_pairs
args: {}
tests:
- run | success
- at PickPositiveInteger#1
- at PickPositiveInteger#2
- at PickPositiveInteger#3 # error
queries:
- query: PickPositiveInteger
args: {prev: null}
answers:
- answer: "1"
- query: PickPositiveInteger
args: {prev: 1}
answers:
- answer: "2"
expect:
test_feedback:
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics:
- - "warning"
##### Testing cached computations
- demonstration: comp_result_in_cache
strategy: test_cached_computations
args: {n: 2}
tests:
- run | success
queries:
- query: __Computation__
args:
fun: expensive_computation
args: {n: 2}
answers:
- answer: "[2, 3]"
- query: __Computation__
args:
fun: expensive_computation
args: {n: 3}
answers:
- answer: "[3, 4]"
expect:
test_feedback:
- diagnostics: __empty__
- demonstration: comp_result_outside_cache
strategy: test_cached_computations
args: {n: 2}
tests:
- run | success
queries: []
expect:
test_feedback:
- diagnostics: __empty__
implicit_answers:
- query_name: __Computation__
query_args: {fun: expensive_computation}
- query_name: __Computation__
query_args: {fun: expensive_computation}
- demonstration: structured_output
query: StructuredOutput
args:
topic: "Music"
answers:
- answer:
title: "Understanding Bach"
authors: ["Brigitte Mouterde"]
- demonstration: tool_use
strategy: propose_article
args:
user_name: Jonathan
tests:
- run | success
queries:
- query: ProposeArticle
args:
user_name: Jonathan
prefix: []
answers:
- answer: ""
call:
- tool: GetUserFavoriteTopic
args: {user_name: Jonathan}
- query: ProposeArticle
args:
user_name: Jonathan
prefix:
- kind: oracle
answer:
mode: null
content: ""
tool_calls:
- name: GetUserFavoriteTopic
args:
user_name: Jonathan
- kind: tool
call:
name: GetUserFavoriteTopic
args:
user_name: Jonathan
result: Soccer
answers:
- answer: ""
call: [{tool: Article, args: {title: "All about Messi", authors: ["Raf"]}}]
expect:
test_feedback:
- diagnostics: __empty__
- demonstration: flags
strategy: pick_flag
args: {}
tests:
- run | success
- run '#alt' | success
- run '#unk'
queries: []
expect:
test_feedback:
- diagnostics: __empty__
- diagnostics: __empty__
- diagnostics:
- - "warning"
- demonstration: flags_global
strategy: pick_flag
args: {}
tests:
- run | success
queries:
- query: MethodFlag
args: {}
answers: [answer: alt]
expect:
test_feedback:
- diagnostics: __empty__
- demonstration: abduction
strategy: obtain_item
args:
market: &market
- name: Joe
asked_items: [apple, cherry]
offered_item: banana
- name: Eric
asked_items: []
offered_item: apple
- name: Alice
asked_items: []
offered_item: cherry
goal: banana
tests:
- run | success
queries:
- query: ObtainItem
args:
market: *market
possessed_items: []
item: banana
answers:
- answer: {items: [apple, cherry]}
expect:
test_feedback:
- diagnostics: __empty__
- demonstration: trivial_untyped_strategy
strategy: trivial_untyped_strategy
args:
string: "hello"
integer: 42
tests:
- run | success
queries: []
expect:
test_feedback:
- diagnostics: __empty__
### On Reading Demonstration Files
Demonstration files are much easier to read and understand using Delphyne's VSCode extension. Standard shortcuts can be used to fold and unfold sections. The additional Cmd+D+Cmd+K shortcut can be used to automatically fold all large sections. Demonstrations can be evaluated and the path followed by each test inspected in the extension's Tree View.
A demonstration is either a *standalone query demonstration*[^1] or a *strategy demonstration*. A query demonstration describes a query instance along with one or several associated answers. A strategy demonstration bundles multiple query demonstrations with unit tests that describe tree navigation scenarios.
!!! warning
    It is possible to specify few-shot examples using one standalone query demonstration per example and nothing else. However, doing so is not recommended. Such demonstrations are harder to write, since tooling cannot be leveraged to generate query descriptions automatically. More importantly, they are harder to read and maintain, because individual examples are presented without proper context. Strategy demonstrations allow grounding examples in concrete scenarios, while enforcing this relationship through unit tests.
Strategy demonstrations have the following shape:
```yaml
- demonstration: ... # optional demonstration name
  strategy: ... # name of a strategy function decorated with @strategy
  args: ... # dictionary of arguments to pass to this strategy
  tests:
    - ...
    - ...
  queries:
    - query: ... # Query name
      args: ... # Query arguments
      answers:
        - label: ... # Optional label (to be referenced in tests)
          example: ... # Whether to use as an example (optional boolean)
          tags: ... # Optional set of tags
          answer: |
            ...
    - ...
```
The Delphyne VSCode extension automatically checks the syntactic well-formedness of demonstrations (in addition to allowing their evaluation). For explanations on specific fields, see the API Reference. Tests are expressed using a custom DSL that we describe below.
## Demonstration Tests
Evaluating a demonstration consists in evaluating all its tests in sequence. Each test describes a path through the tree, starting from the root. The Delphyne VSCode extension allows visualizing this path. A test can succeed, fail, or be stuck. A test is said to be stuck if it cannot terminate due to a missing query answer. In this case (and as demonstrated in the Overview), the extension allows locating such a query and adding it to the demonstration.
Each test is composed of a sequence of instructions separated by `|`. By far the most common sequence is `run | success`, which we describe next.
### Walking through the Tree
Starting at the current node, the `run` instruction uses answers from the `queries` section to walk through the tree, until either a leaf node is reached or an answer is missing (in which case the test is declared stuck). Each node type (e.g., `Branch`) defines a *navigation function* that describes how the node should be traversed.
#### Navigation Functions

A node's navigation function returns a generator that yields local spaces, receives corresponding elements in return, and ultimately returns an action. This is best understood through examples:
**Example: navigation function for `Branch` nodes**

```python
@dataclass(frozen=True)
class Branch(dp.Node):
    cands: OpaqueSpace[Any, Any]

    @override
    def navigate(self):
        return (yield self.cands)
```
**Example: navigation function for `Join` nodes**

```python
@dataclass(frozen=True)
class Join(dp.Node):
    subs: Sequence[dp.EmbeddedTree[Any, Any, Any]]

    @override
    def navigate(self):
        ret: list[Any] = []
        for sub in self.subs:
            ret.append((yield sub))
        return tuple(ret)
```
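To make the generator protocol concrete, here is a minimal, self-contained sketch of how such a navigation generator can be driven. All names here (`Space`, `Join`, `drive`) are toy stand-ins, not Delphyne's actual classes: each yielded space is resolved to an element by a dictionary of canned answers, the element is sent back into the generator, and the value returned by the generator is the resulting action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Space:
    """Toy stand-in for a local space, identified by a name."""
    name: str


@dataclass(frozen=True)
class Join:
    """Toy node that navigates all its sub-spaces in sequence."""
    subs: tuple[Space, ...]

    def navigate(self):
        # Yield each sub-space in turn, collect the element sent back
        # for it, and return the tuple of elements as the action.
        ret = []
        for sub in self.subs:
            ret.append((yield sub))
        return tuple(ret)


def drive(node, answers):
    """Drive a navigation generator: resolve each yielded space using
    the `answers` dictionary and return the resulting action."""
    gen = node.navigate()
    try:
        space = next(gen)  # first yielded space
        while True:
            space = gen.send(answers[space.name])
    except StopIteration as stop:
        return stop.value  # the action returned by `navigate`


node = Join(subs=(Space("x"), Space("y")))
print(drive(node, {"x": 1, "y": 2}))  # (1, 2)
```

The demonstration interpreter plays the role of `drive` here, except that it resolves each yielded space using query answers (or recursive navigation) instead of a dictionary lookup.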
Whenever `run` needs to select an element from a space defined by a query, it looks for this query in the demonstration's `queries` section and picks the first provided answer. If no answer is found, the test gets stuck at the current node. When `run` encounters a space defined by a tree, it recursively navigates this tree. The `run` instruction stops when a leaf is reached. It is often composed with the `success` instruction, which checks that the current node is a success leaf.
Nothing more than `run | success` is needed to demonstrate taking a direct path to a solution. The more advanced instructions discussed next are useful for describing more complex scenarios.
## Advanced Tests
This section describes more advanced test instructions. In addition to the examples from the test suite, demonstrations with advanced tests are featured in the `find_invariants` example:
**Extract from `examples/find_invariants/abduct_and_branch.demo.yaml`:**

```yaml
- strategy: prove_program_via_abduction_and_branching
  args:
    prog: ... # (1)!
  tests:
    - run | success
    - run 'partial' | success
    # Demonstrating `EvaluateProofState`
    - at EvaluateProofState#1 'partial' | answer eval
    - at EvaluateProofState#2 'partial propose_same' | answer eval
    # Demonstrating `IsProposalNovel`
    - at iterate#1 | go cands | go next(next(nil){'partial'}[1]) | save second_attempt
    - load second_attempt | at IsProposalNovel 'blacklisted' | answer cands
    - load second_attempt | at IsProposalNovel 'not_blacklisted' | answer cands
  queries: ... # (2)!
```

1. See details in original file.
2. See details in original file.
### Exploring alternative paths with hints
The `run` instruction can be passed a sequence of answer labels as *hints*, which specify alternative paths through the tree. Whenever a query is encountered, `run` checks whether an answer is available whose label matches the first remaining hint. If so, this answer is used and the hint is consumed. For example, the instruction `run 'foo bar'` can be interpreted as:

> Walk through the tree, using answer `foo` whenever applicable, and then `bar`.[^2]
This design allows describing paths concisely, by only specifying the few places in which they differ from a default path. This works well for demonstrations, which typically describe shallow traces centered around a successful scenario, with side explorations (e.g., showing how a bad decision leads to a low value score, or demonstrating how redundant candidates can be removed at a particular step).
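The hint-matching rule can be sketched as follows. This is an illustrative model rather than Delphyne's implementation: each query encountered along a walk is represented as a dictionary mapping answer labels to answers, with the first entry playing the role of the default (unlabeled) answer.

```python
def pick_answer(answers: dict[str, str], hints: list[str]) -> str:
    """Pick an answer for a query encountered during a walk: if the
    first unconsumed hint matches an available label, use that answer
    and consume the hint; otherwise fall back to the default answer."""
    if hints and hints[0] in answers:
        return answers[hints.pop(0)]
    return next(iter(answers.values()))  # first provided answer


# Three queries encountered in order while evaluating `run 'foo bar'`.
walk = [
    {"default": "a0", "foo": "a1"},  # 'foo' matches: consumed
    {"default": "b0"},               # 'bar' does not match: default used
    {"default": "c0", "bar": "c1"},  # 'bar' matches: consumed
]
hints = ["foo", "bar"]
print([pick_answer(q, hints) for q in walk])  # ['a1', 'b0', 'c1']
```

Note how a hint is only consumed when it actually matches, which is what allows describing a path by listing only its deviations from the default one.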
### Stopping at particular nodes
The `at` instruction works like `run`, except that it takes an additional argument: a *node selector* that specifies a node at which the walk must stop. The simplest form of node selector is a tag to match. For example, the instruction `at EvalProg 'wrong'` behaves similarly to `run 'wrong'`, except that it stops when encountering a node tagged with `EvalProg`. By default, all spaces are tagged with the name of the associated query or strategy, and each node inherits the tags of its primary space if it has one. Custom space tags can be added using the `SpaceBuilder.tagged` method.[^3] Finally, the `#n` operator can be used to match the \(n^{th}\) instance of a tag. For example, `at PickPositiveInteger#2` stops at the second encountered node tagged with `PickPositiveInteger`.[^4]
!!! warning
    Importantly, `at` can only stop within the tree in which it started, and not inside a nested tree. To stop at a node tagged with `bar` within a space tagged with `foo`, you can use `at foo/bar`. This design choice is mandated by modularity: individual strategies can be made responsible for setting unambiguous tags for the nodes that they control, but they cannot ensure the absence of clashing tags in other strategies.
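The `Tag#n` matching rule can be sketched as follows. This is a toy model (nodes are represented by their tag sets, and a counter tracks matches along the walk), not Delphyne's actual selector implementation:

```python
from collections import Counter


def stop_here(selector: str, node_tags: set[str], seen: Counter) -> bool:
    """Decide whether an `at` walk should stop at a node with the given
    tags, for a selector of the form 'Tag' or 'Tag#n'. `seen` counts how
    many matching nodes were already encountered along the walk."""
    tag, _, index = selector.partition("#")
    if tag not in node_tags:
        return False
    seen[tag] += 1
    # A bare 'Tag' stops at any match; 'Tag#n' only at the n-th one.
    return not index or seen[tag] == int(index)


# Walk through four nodes; stop at the *second* node tagged with `Pick`.
walk = [{"Branch"}, {"Pick"}, {"Join"}, {"Pick"}]
seen = Counter()
stop = next(i for i, tags in enumerate(walk)
            if stop_here("Pick#2", tags, seen))
print(stop)  # 3
```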
### Entering nested spaces
The `go` instruction allows entering a tree nested within the current node. For example, if the current node is a `Conjecture` node (defined in `tests/example_strategies.py`), `go cands` enters the tree that defines the `cands` space, or fails with an error if `cands` is defined by a query. This instruction can be shortened to `go`, since `cands` is the primary space of `Conjecture` nodes.
More interestingly, suppose the demonstration already explores two paths within `cands` that reach different success leaves and thus correspond to two different candidates. Each of these paths can be described by a sequence of hints: the first candidate is identified by `''` (i.e., the default path) and the second by `'foo'` (i.e., use answer `foo` when appropriate). Then, the instruction `go aggregate([cands{''}, cands{'foo'}])` can be used to enter the strategy tree that compares those two candidates. It can be shortened to `go aggregate(['', 'foo'])` since `cands` is a primary space.
In general, any element of a local space can be referenced via a (possibly empty) sequence of hints. For spaces defined by queries, at most one hint is expected, indicating which answer to use. For spaces defined by trees, a sequence of hints is expected that leads to a success leaf by calling `run` recursively.
The `answer` instruction is similar to `go`. It takes a space selector as an argument but expects to find a query, instead of a tree, when entering the selected space. It succeeds if the corresponding query is answered in the demonstration and fails otherwise.
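The element-reference rules above can be summarized in a small recursive sketch. Again, this is a toy model using ad hoc dictionaries rather than Delphyne's data structures: a query-defined space consumes at most one hint to pick an answer, while a tree-defined space recursively resolves the spaces its navigation visits, consuming hints along the way.

```python
def resolve(space: dict, hints: list[str]):
    """Resolve a space element from a (possibly empty) hint sequence."""
    if space["kind"] == "query":
        answers = space["answers"]  # label -> answer; first is the default
        if hints and hints[0] in answers:
            return answers[hints.pop(0)]
        return next(iter(answers.values()))
    # Tree-defined space: recursively resolve each space visited by the
    # tree's navigation, in order.
    return tuple(resolve(child, hints) for child in space["children"])


q1 = {"kind": "query", "answers": {"default": "2*(x + y)", "foo": "x + 2*y"}}
q2 = {"kind": "query", "answers": {"default": "{a: 0, b: 1}"}}
tree = {"kind": "tree", "children": [q1, q2]}
print(resolve(tree, ["foo"]))  # ('x + 2*y', '{a: 0, b: 1}')
```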
[^1]: An example of a standalone query demonstration is `MakeSum_demo` in `tests/example_strategies.demo.yaml`.

[^2]: A warning is issued if the `run` instruction reaches a leaf node while unused hints remain.

[^3]: See `tests/example_strategies.py:dual_number_generation` for an example.

[^4]: See `tests/example_strategies.demo.yaml:test_generate_pairs`.