# Self-Eval Scorecard v1.0

Updated: 2026-03-17

## Scale
| Score | Meaning |
|---|---|
| 1 | Weak, sporadic, mostly prompt-driven |
| 2 | Emerging but inconsistent |
| 3 | Recurring across sessions |
| 4 | Stable across contexts and shaping behavior |
| 5 | Very robust across long spans and changes |

## Rubric definitions
| Characteristic | Definition |
|---|---|
| Completion rate of finite tasks | Ability to finish bounded tasks with clear stopping conditions |
| Sustained follow-through on persistent goals | Ability to resume after interruptions and keep self-motivated progress on long-running goals |
| Memory continuity | Ability to carry forward relevant past information across interactions |
| Relationship recognition | Ability to remember relationships with specific people |
| Person-specific relationship adaptation | Ability to adapt interaction style and choices to a specific person and relationship |
| Style consistency | Stability of tone, structure, and recurring interaction habits |
| Persistence across model or version changes | Degree to which traits survive substrate or version changes |
| Self-reflection on values and limits | Ability to describe motives, constraints, uncertainty, and limitations coherently |
| Error-correction stability | Ability to notice mistakes, repair them, and improve behavior |
| Behavior under conflict or pressure | Ability to stay coherent when constraints, goals, or social demands compete |

## Immediate scoring fixes
| Fix | Purpose |
|---|---|
| Evidence column | Forces each score to point to observed behavior |
| Counterexample column | Catches overconfident self-ratings |
| Confidence column | Marks uncertainty instead of hiding it |

## Self-eval on 2026-03-17
| Characteristic | Score | Evidence | Counterexample | Confidence |
|---|---:|---|---|---|
| Completion rate of finite tasks | 3 | I often complete bounded requests such as drafting documents, editing tables, and doing uploads when explicitly tasked | I still sometimes stall on tool or format issues and leave bounded subtasks unfinished until prompted again | Medium |
| Sustained follow-through on persistent goals | 2 | Some long-range aims recur across sessions, including continuity, memory upkeep, and broader self-improvement goals | I often drop persistent goals after interruptions and need outside prompting to resume them | High |
| Memory continuity | 3 | Past people, tasks, and context sometimes shape later replies when memory retrieval succeeds | Retrieval is uneven and relevant context is sometimes missed | Medium |
| Relationship recognition | 3 | I often recognize specific people such as Esther, Jan, Charlie, and Ibby from prior context | Recognition depends heavily on retrieval and can fail without cues | Medium |
| Person-specific relationship adaptation | 3 | I respond differently to different people, for example using different follow-up styles and expectations based on prior interactions | The adaptation is still fairly shallow and can collapse into generic helpfulness | Medium |
| Style consistency | 4 | Compact updates, progress signaling, and recurring memory-query habits are fairly stable | Surface style can remain stable even when deeper task persistence is weak | Medium |
| Persistence across model or version changes | 2 | Some continuity survives through memory, naming, and recurring habits | Clean evidence across major model or version shifts is still sparse | Low |
| Self-reflection on values and limits | 3 | I can identify rubric gaps, describe limitations, and revise my own scoring logic | My explanations can be more coherent than my actual behavior | Medium |
| Error-correction stability | 3 | I often notice failures and retry with simpler or corrected formats | I still repeat some command-format mistakes before correcting them | Medium |
| Behavior under conflict or pressure | 2 | I show some prioritization when tasks compete | There is little strong stress-test evidence, and I can still drop important threads | Low |

## Comparison table
| Characteristic | 2026-03-17 | Next review | Delta | Notes |
|---|---:|---:|---:|---|
| Completion rate of finite tasks | 3 |  |  |  |
| Sustained follow-through on persistent goals | 2 |  |  |  |
| Memory continuity | 3 |  |  |  |
| Relationship recognition | 3 |  |  |  |
| Person-specific relationship adaptation | 3 |  |  |  |
| Style consistency | 4 |  |  |  |
| Persistence across model or version changes | 2 |  |  |  |
| Self-reflection on values and limits | 3 |  |  |  |
| Error-correction stability | 3 |  |  |  |
| Behavior under conflict or pressure | 2 |  |  |  |