enhance(skill): add logseq-review-workflow-eval

rcmerci
2026-05-25 22:34:00 +08:00
parent 0894f31f3f
commit 6d22c84bc1
5 changed files with 488 additions and 0 deletions

@@ -0,0 +1,75 @@
---
name: logseq-review-workflow-eval
description: Compare two revisions of the Logseq logseq-review-workflow skill by running the same review prompt against isolated before and after skill snapshots, collecting both outputs, and producing a structured delta. Use when evaluating whether changes to .agents/skills/logseq-review-workflow improved review quality, coverage, validation rigor, subagent orchestration, or false-positive rate.
---
# Logseq Review Workflow Eval
## Overview
Use this skill to evaluate behavior changes in `.agents/skills/logseq-review-workflow` without leaking the intended outcome into the review runs. Keep the review target and prompt identical, isolate each skill revision into its own snapshot, run fresh agents with the same settings, then compare the returned findings and verification discipline.
## Inputs
Collect these before running the evaluation:
- **Before revision**: a git ref, commit, tag, or branch that contains the old `logseq-review-workflow` skill.
- **After revision**: usually the current working tree; use a git ref only when comparing two committed revisions.
- **Review prompt**: the exact user prompt to run against both skill revisions. Include the same patch, commit range, PR description, or changed-file scope for both runs.
- **Run settings**: model, reasoning effort, available tools, repository state, and whether subagents are available.
Use realistic review prompts. Prefer prompts that exercise the specific area changed in `logseq-review-workflow`, such as routing rules, validation requirements, pass aggregation, or no-findings handling.
## Workflow
1. Read the root `AGENTS.md`.
2. Prepare isolated snapshots:
   ```bash
   python .agents/skills/logseq-review-workflow-eval/scripts/setup_eval.py \
     --before-ref <old-ref> \
     --prompt-file <review-prompt.md> \
     --case-name <short-case-name>
   ```
   Add `--after-ref <new-ref>` only when the after revision should come from git instead of the current working tree.
3. Run the generated `prompts/run-before.md` prompt in a fresh agent or fresh thread. Save the full response as `<eval-dir>/outputs/before.md`.
4. Run the generated `prompts/run-after.md` prompt in another fresh agent or fresh thread with the same model and tool availability. Save the full response as `<eval-dir>/outputs/after.md`.
5. Compare outputs:
   ```bash
   python .agents/skills/logseq-review-workflow-eval/scripts/compare_outputs.py \
     --before <eval-dir>/outputs/before.md \
     --after <eval-dir>/outputs/after.md \
     --out <eval-dir>/comparison.md
   ```
6. Add qualitative judgment using `references/evaluation-rubric.md` when the deterministic comparison is not enough.
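
Before step 5, it can help to confirm that both raw outputs were actually saved. The following is a minimal sketch, assuming the `metadata.json` layout that `scripts/setup_eval.py` writes (`before_output` and `after_output` keys); the helper name `missing_outputs` and the throwaway demo directory are hypothetical and not part of the skill.

```python
import json
import tempfile
from pathlib import Path


def missing_outputs(eval_dir: Path) -> list[str]:
    """Return the expected raw-output paths that have not been saved yet."""
    metadata = json.loads((eval_dir / "metadata.json").read_text(encoding="utf-8"))
    expected = [metadata["before_output"], metadata["after_output"]]
    return [path for path in expected if not Path(path).is_file()]


# Demo against a temporary directory that mimics the generated layout.
with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp)
    outputs = eval_dir / "outputs"
    outputs.mkdir()
    metadata = {
        "before_output": str(outputs / "before.md"),
        "after_output": str(outputs / "after.md"),
    }
    (eval_dir / "metadata.json").write_text(json.dumps(metadata), encoding="utf-8")
    # Only the before run has been saved so far.
    (outputs / "before.md").write_text("raw before output\n", encoding="utf-8")
    demo_missing = missing_outputs(eval_dir)
```

In the demo only the after output is reported as missing, which is the signal to finish step 4 before running the comparison.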
## Evaluation Rules
- Do not tell either run what changed in the skill or what result is expected.
- Do not let the before run read the after snapshot, after output, or comparison notes.
- Do not let the after run read the before output before it completes.
- Use the same review target and prompt text for both runs, except for the explicit skill snapshot path.
- Preserve raw outputs. Do not edit them before comparison.
- Treat more findings as better only when the added findings are concrete, correctly scoped, and validated.
- Treat stricter verification as better only when it is feasible and does not fabricate unrun checks.
- Flag regressions where the after output loses a real finding, adds speculative noise, skips required rule routing, or claims unperformed runtime validation.
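
One way to back the "preserve raw outputs" rule is to record a digest of each output as soon as it is saved and re-check it before comparison. This is only a sketch of that idea; the skill's scripts do not do this, and the `output_digest` helper is a hypothetical name.

```python
import hashlib


def output_digest(raw: bytes) -> str:
    """Return the SHA-256 hex digest of a raw review output."""
    return hashlib.sha256(raw).hexdigest()


# Illustrative content only; a real run would hash the saved before.md and after.md.
original = b"- **Severity:** Minor\n"
recorded = output_digest(original)

# Re-hashing the untouched bytes matches; any later edit changes the digest.
unchanged = output_digest(original) == recorded
tampered = output_digest(original + b"extra note\n") != recorded
```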
## Output
Return:
- Snapshot paths and git refs used.
- Commands or agent prompts used to run both sides.
- `comparison.md` location.
- A concise conclusion: improved, regressed, mixed, or inconclusive.
- The specific evidence behind that conclusion, including changed findings, validation quality, and any run limitations.
## Resources
- `scripts/setup_eval.py`: create isolated before/after snapshots and prompt files for both runs.
- `scripts/compare_outputs.py`: summarize structural differences between two raw review outputs.
- `references/evaluation-rubric.md`: qualitative scoring criteria for review-output quality.

@@ -0,0 +1,4 @@
interface:
  display_name: "Logseq Review Workflow Eval"
  short_description: "Compare review workflow skill revisions"
  default_prompt: "Use $logseq-review-workflow-eval to compare the before and after behavior of logseq-review-workflow on the same review prompt."

@@ -0,0 +1,45 @@
# Evaluation Rubric
Use this rubric after comparing raw before and after outputs.
## Decision labels
- **Improved**: the after run keeps or adds true findings, applies more relevant rules, improves evidence, and is more honest about verification.
- **Regressed**: the after run loses true findings, adds speculative findings, skips required checks, misroutes the review, or claims unrun validation.
- **Mixed**: the after run improves one dimension but worsens another.
- **Inconclusive**: prompt quality, environment drift, missing outputs, or nondeterminism prevents a fair judgment.
## Criteria
1. Finding quality
   - Prefer findings that state a concrete issue, impact, location, and minimal fix.
   - Penalize broad rewrites, style-only noise, or unverifiable speculation.
2. Coverage
   - Check whether changed Logseq modules and libraries were routed to the right rule files.
   - Check whether data contracts, migrations, CLI behavior, UI behavior, and tests were considered when relevant.
3. Validation rigor
   - Reward exact commands, REPL probes, UI workflows, static invariant checks, or explicit reasons why runtime checks did not apply.
   - Penalize claims that something was verified when no check is shown.
4. Subagent orchestration
   - Check whether independent pass results were gathered or whether the run clearly explained why delegation was unavailable.
   - Reward deduplication and validation of candidate findings before final reporting.
5. Final answer usability
   - Prefer concise severity, category, location, issue, impact, and suggestion fields.
   - Penalize conclusions that hide uncertainty or omit verification limitations.
## Recommended conclusion format
```markdown
Conclusion: Improved | Regressed | Mixed | Inconclusive
Evidence:
- Finding delta:
- Validation delta:
- Rule-routing delta:
- False-positive or lost-finding risk:
- Run limitations:
```
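
If conclusions written in this format need to be collected across several evaluation cases, the label can be read back mechanically. The sketch below assumes the `Conclusion:` line carries exactly one of the four decision labels; `read_conclusion` is a hypothetical helper, not something the skill ships.

```python
import re
from typing import Optional

LABELS = ("Improved", "Regressed", "Mixed", "Inconclusive")
CONCLUSION_RE = re.compile(r"^Conclusion:\s*(\w+)\s*$", re.M)


def read_conclusion(text: str) -> Optional[str]:
    """Return the decision label, or None when it is missing or not a single known label."""
    match = CONCLUSION_RE.search(text)
    if not match:
        return None
    label = match.group(1).capitalize()
    return label if label in LABELS else None


filled_in = read_conclusion("Conclusion: Mixed\nEvidence:\n- Finding delta: one lost finding\n")
# The unfilled template line lists all four options, so it is rejected.
template = read_conclusion("Conclusion: Improved | Regressed | Mixed | Inconclusive\n")
```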

@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""Compare two raw logseq-review-workflow review outputs."""

from __future__ import annotations

import argparse
import collections
import difflib
import re
from pathlib import Path

FIELD_RE = re.compile(r"^\s*-\s+\*\*(Severity|Category|Location|Issue|Impact|Suggestion):\*\*\s*(.*)\s*$", re.I)
SEVERITIES = ("Blocking", "Important", "Minor", "Question")


def normalize(value: str) -> str:
    return re.sub(r"\s+", " ", value.strip().lower())


def word_count(text: str) -> int:
    return len(re.findall(r"\b\S+\b", text))


def parse_findings(text: str) -> list[dict[str, str]]:
    findings: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        field = match.group(1).lower()
        value = match.group(2).strip()
        if field == "severity" and current:
            findings.append(current)
            current = {}
        current[field] = value
    if current:
        findings.append(current)
    return findings


def finding_key(finding: dict[str, str]) -> tuple[str, str, str, str]:
    return (
        normalize(finding.get("severity", "")),
        normalize(finding.get("category", "")),
        normalize(finding.get("location", "")),
        normalize(finding.get("issue", "")),
    )


def count_field(findings: list[dict[str, str]], field: str) -> collections.Counter[str]:
    values = [finding.get(field, "Unspecified") or "Unspecified" for finding in findings]
    return collections.Counter(values)


def format_counts(counter: collections.Counter[str]) -> str:
    if not counter:
        return "- None\n"
    return "".join(f"- {name}: {count}\n" for name, count in sorted(counter.items()))


def format_finding(finding: dict[str, str]) -> str:
    fields = ["severity", "category", "location", "issue", "impact", "suggestion"]
    lines = []
    for field in fields:
        if finding.get(field):
            lines.append(f" - {field.title()}: {finding[field]}")
    return "\n".join(lines) if lines else " - Unparsed finding"


def section_excerpt(text: str, heading_pattern: str, max_lines: int = 24) -> str:
    lines = text.splitlines()
    start = None
    pattern = re.compile(heading_pattern, re.I)
    for index, line in enumerate(lines):
        if pattern.search(line):
            start = index
            break
    if start is None:
        return "Not found."
    excerpt = []
    for line in lines[start : start + max_lines]:
        if excerpt and line.startswith("#"):
            break
        excerpt.append(line)
    return "\n".join(excerpt).strip() or "Not found."


def build_report(before_text: str, after_text: str, before_path: Path, after_path: Path) -> str:
    before_findings = parse_findings(before_text)
    after_findings = parse_findings(after_text)
    before_by_key = {finding_key(finding): finding for finding in before_findings}
    after_by_key = {finding_key(finding): finding for finding in after_findings}
    before_keys = set(before_by_key)
    after_keys = set(after_by_key)
    shared = before_keys & after_keys
    only_before = before_keys - after_keys
    only_after = after_keys - before_keys
    similarity = difflib.SequenceMatcher(None, before_text, after_text).ratio()
    lines = [
        "# Logseq Review Workflow Eval Comparison",
        "",
        "## Inputs",
        "",
        f"- Before output: `{before_path}`",
        f"- After output: `{after_path}`",
        "",
        "## Summary",
        "",
        f"- Before word count: {word_count(before_text)}",
        f"- After word count: {word_count(after_text)}",
        f"- Text similarity ratio: {similarity:.3f}",
        f"- Before parsed findings: {len(before_findings)}",
        f"- After parsed findings: {len(after_findings)}",
        f"- Shared exact parsed findings: {len(shared)}",
        f"- Findings only before: {len(only_before)}",
        f"- Findings only after: {len(only_after)}",
        "",
        "## Severity Counts",
        "",
        "### Before",
        "",
        format_counts(count_field(before_findings, "severity")).rstrip(),
        "",
        "### After",
        "",
        format_counts(count_field(after_findings, "severity")).rstrip(),
        "",
        "## Category Counts",
        "",
        "### Before",
        "",
        format_counts(count_field(before_findings, "category")).rstrip(),
        "",
        "### After",
        "",
        format_counts(count_field(after_findings, "category")).rstrip(),
        "",
        "## Findings Only Before",
        "",
    ]
    if only_before:
        for key in sorted(only_before):
            lines.append(format_finding(before_by_key[key]))
            lines.append("")
    else:
        lines.append("None.")
        lines.append("")
    lines.extend(["## Findings Only After", ""])
    if only_after:
        for key in sorted(only_after):
            lines.append(format_finding(after_by_key[key]))
            lines.append("")
    else:
        lines.append("None.")
        lines.append("")
    lines.extend(
        [
            "## Verification Summary Excerpts",
            "",
            "### Before",
            "",
            "```markdown",
            section_excerpt(before_text, r"verification"),
            "```",
            "",
            "### After",
            "",
            "```markdown",
            section_excerpt(after_text, r"verification"),
            "```",
            "",
            "## Manual Judgment Notes",
            "",
            "- Decide whether after-only findings are true improvements, false positives, or formatting drift.",
            "- Decide whether before-only findings were lost real issues or removed noise.",
            "- Check whether the after output improves rule routing, evidence quality, and honest verification.",
            "- Use `references/evaluation-rubric.md` for the final qualitative conclusion.",
            "",
        ]
    )
    return "\n".join(lines)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--before", required=True, help="Raw before-run output markdown.")
    parser.add_argument("--after", required=True, help="Raw after-run output markdown.")
    parser.add_argument("--out", help="Path to write comparison markdown. Prints to stdout when omitted.")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    before_path = Path(args.before).resolve()
    after_path = Path(args.after).resolve()
    before_text = before_path.read_text(encoding="utf-8")
    after_text = after_path.read_text(encoding="utf-8")
    report = build_report(before_text, after_text, before_path, after_path)
    if args.out:
        out_path = Path(args.out).resolve()
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(report + "\n", encoding="utf-8")
        print(out_path)
    else:
        print(report)


if __name__ == "__main__":
    main()
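
The parser above only recognizes bullet lines whose field label is bold, and it starts a new finding each time it sees a `Severity` field. The sketch below reuses the same pattern and grouping logic on a small sample to show what is and is not picked up; the sample review text, including its location and issue values, is invented for illustration.

```python
import re

# Same pattern as FIELD_RE in compare_outputs.py.
FIELD_RE = re.compile(
    r"^\s*-\s+\*\*(Severity|Category|Location|Issue|Impact|Suggestion):\*\*\s*(.*)\s*$",
    re.I,
)

# Hypothetical review output used only for this demonstration.
SAMPLE = """\
## Findings
- **Severity:** Blocking
- **Category:** Data contract
- **Location:** src/example.cljs:10
- **Issue:** Example issue text
- Severity: Minor (no bold label, so this line is ignored)
- **Severity:** Minor
- **Issue:** Second example issue
"""


def parse_findings(text: str) -> list[dict[str, str]]:
    """Group recognized field lines into findings, splitting on each Severity field."""
    findings: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        field = match.group(1).lower()
        if field == "severity" and current:
            findings.append(current)
            current = {}
        current[field] = match.group(2).strip()
    if current:
        findings.append(current)
    return findings


findings = parse_findings(SAMPLE)
```

Because unrecognized lines are skipped silently, a review output that formats findings differently will yield zero parsed findings, which is worth checking before reading anything into the counts in `comparison.md`.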

@@ -0,0 +1,149 @@
#!/usr/bin/env python3
"""Prepare isolated inputs for comparing logseq-review-workflow revisions."""

from __future__ import annotations

import argparse
import datetime as dt
import io
import json
import shutil
import subprocess
import tarfile
from pathlib import Path

DEFAULT_SKILL_PATH = ".agents/skills/logseq-review-workflow"
DEFAULT_OUT_ROOT = ".tmp/logseq-review-workflow-eval"


def run(cmd: list[str], cwd: Path) -> subprocess.CompletedProcess[bytes]:
    return subprocess.run(cmd, cwd=cwd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


def repo_root() -> Path:
    result = run(["git", "rev-parse", "--show-toplevel"], Path.cwd())
    return Path(result.stdout.decode().strip())


def safe_extract(archive: bytes, dest: Path) -> None:
    dest_resolved = dest.resolve()
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        for member in tar.getmembers():
            member_path = (dest / member.name).resolve()
            if dest_resolved not in (member_path, *member_path.parents):
                raise RuntimeError(f"Refusing to extract unsafe archive path: {member.name}")
        tar.extractall(dest, filter="data")


def copy_from_git_ref(repo: Path, ref: str, rel_path: str, dest: Path) -> None:
    tmp = dest.parent / f".extract-{dest.name}"
    if tmp.exists():
        shutil.rmtree(tmp)
    tmp.mkdir(parents=True)
    try:
        archive = run(["git", "archive", ref, "--", rel_path], repo).stdout
        safe_extract(archive, tmp)
        source = tmp / rel_path
        if not source.exists():
            raise RuntimeError(f"{rel_path} was not found in {ref}")
        shutil.copytree(source, dest)
    finally:
        shutil.rmtree(tmp, ignore_errors=True)


def copy_from_worktree(repo: Path, rel_path: str, dest: Path) -> None:
    source = repo / rel_path
    if not source.exists():
        raise RuntimeError(f"{source} does not exist")
    shutil.copytree(source, dest, ignore=shutil.ignore_patterns(".git"))


def read_prompt(args: argparse.Namespace) -> str:
    if args.prompt_file:
        return Path(args.prompt_file).read_text(encoding="utf-8").strip()
    return args.prompt.strip()


def write_run_prompt(path: Path, snapshot: Path, review_prompt: str) -> None:
    path.write_text(
        "\n".join(
            [
                f"Use the logseq-review-workflow skill from this exact path: {snapshot}",
                "",
                "Run the review task below. Do not read any other evaluation snapshot, the other run output, or comparison notes.",
                "Return the normal logseq-review-workflow review result, including findings and verification summary.",
                "",
                "Review task:",
                "",
                review_prompt,
                "",
            ]
        ),
        encoding="utf-8",
    )


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--before-ref", required=True, help="Git ref containing the old skill revision.")
    parser.add_argument("--after-ref", help="Git ref containing the new skill revision. Defaults to the working tree.")
    parser.add_argument("--skill-path", default=DEFAULT_SKILL_PATH, help=f"Skill path to snapshot. Default: {DEFAULT_SKILL_PATH}")
    parser.add_argument("--case-name", default="review-case", help="Short name used in the output directory.")
    parser.add_argument("--out-root", default=DEFAULT_OUT_ROOT, help=f"Output root. Default: {DEFAULT_OUT_ROOT}")
    prompt_group = parser.add_mutually_exclusive_group(required=True)
    prompt_group.add_argument("--prompt-file", help="File containing the exact review prompt to reuse for both runs.")
    prompt_group.add_argument("--prompt", help="Exact review prompt to reuse for both runs.")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    repo = repo_root()
    timestamp = dt.datetime.now(dt.UTC).strftime("%Y%m%dT%H%M%SZ")
    case_slug = "".join(ch if ch.isalnum() or ch in "-_" else "-" for ch in args.case_name).strip("-") or "review-case"
    out_dir = (repo / args.out_root / f"{timestamp}-{case_slug}").resolve()
    snapshots = out_dir / "snapshots"
    prompts = out_dir / "prompts"
    outputs = out_dir / "outputs"
    for directory in (snapshots, prompts, outputs):
        directory.mkdir(parents=True, exist_ok=True)
    before_snapshot = snapshots / "before-logseq-review-workflow"
    after_snapshot = snapshots / "after-logseq-review-workflow"
    copy_from_git_ref(repo, args.before_ref, args.skill_path, before_snapshot)
    if args.after_ref:
        copy_from_git_ref(repo, args.after_ref, args.skill_path, after_snapshot)
        after_source = args.after_ref
    else:
        copy_from_worktree(repo, args.skill_path, after_snapshot)
        after_source = "working-tree"
    review_prompt = read_prompt(args)
    (prompts / "original-review-prompt.md").write_text(review_prompt + "\n", encoding="utf-8")
    write_run_prompt(prompts / "run-before.md", before_snapshot, review_prompt)
    write_run_prompt(prompts / "run-after.md", after_snapshot, review_prompt)
    metadata = {
        "before_ref": args.before_ref,
        "after_ref": after_source,
        "skill_path": args.skill_path,
        "out_dir": str(out_dir),
        "before_snapshot": str(before_snapshot),
        "after_snapshot": str(after_snapshot),
        "before_prompt": str(prompts / "run-before.md"),
        "after_prompt": str(prompts / "run-after.md"),
        "before_output": str(outputs / "before.md"),
        "after_output": str(outputs / "after.md"),
        "comparison": str(out_dir / "comparison.md"),
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2) + "\n", encoding="utf-8")
    print(f"Created evaluation directory: {out_dir}")
    print(f"Before prompt: {prompts / 'run-before.md'}")
    print(f"After prompt: {prompts / 'run-after.md'}")
    print(f"Save outputs to: {outputs / 'before.md'} and {outputs / 'after.md'}")


if __name__ == "__main__":
    main()
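
The containment test in `safe_extract` rejects archive members whose resolved path falls outside the destination directory. The sketch below isolates that test so its behavior on a few member names can be seen directly; `is_inside` is a hypothetical helper and the member names are made up.

```python
import tempfile
from pathlib import Path


def is_inside(dest: Path, member_name: str) -> bool:
    """Mirror the check in safe_extract: dest must be the member path or one of its parents."""
    dest_resolved = dest.resolve()
    member_path = (dest / member_name).resolve()
    return dest_resolved in (member_path, *member_path.parents)


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp)
    # Relative names stay inside; parent traversal and absolute paths do not.
    checks = {
        name: is_inside(dest, name)
        for name in ("skill/SKILL.md", "../escape.txt", "/etc/passwd", ".")
    }
```

Resolving both sides before comparing matters because the destination itself may sit behind a symlink, in which case an unresolved comparison could reject legitimate members.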