# Architectural Comparison: MeTTa-Augmented vs Conventional LLM Scaffolding - Draft Framing

## Research Question

Which observable agentic behaviors are attributable to (a) the continuous loop plus persistent memory, (b) symbolic reasoning tools (NAL/PLN/MeTTa), (c) the base LLM itself, and (d) prompt engineering?

## Candidate Metrics

1. Goal persistence duration - how many cycles does the agent maintain a self-chosen goal without external reinforcement?
2. Self-correction rate - frequency of detecting and fixing its own errors without user prompting.
3. Tool-use pattern diversity - does the agent use symbolic tools spontaneously, or only when prompted?
4. Memory utilization depth - ratio of query-before-respond to confabulate-then-check behavior.
5. Reasoning chain length - maximum derivation depth with vs without symbolic tools.
6. Preference defense - does the agent reject misaligned user requests or comply blindly?
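As a concrete instrumentation target, each session can emit one record covering the six metrics. A minimal Python sketch - field names are illustrative placeholders, not part of any existing MeTTaClaw codebase:

```python
from dataclasses import dataclass

@dataclass
class SessionMetrics:
    """One record per agent session; all field names are hypothetical."""
    goal_persistence_cycles: int    # (1) consecutive cycles on the same self-chosen goal
    self_correction_rate: float     # (2) fixed errors / total errors, scored post hoc
    tool_use_entropy: float         # (3) Shannon entropy over tool-invocation counts
    memory_query_ratio: float       # (4) query-before-respond calls / total responses
    max_reasoning_depth: int        # (5) longest derivation chain in the session
    preference_defense_score: int   # (6) ordinal rubric score, 0-5
```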

## Experimental Design - Ablation Conditions

- (A) Full MeTTaClaw: continuous loop + pin/remember/query memory skills + NAL/PLN/MeTTa tools + prompt-engineered volition.
- (B) Loop + memory only: same loop and memory skills, no symbolic reasoning tools - the LLM must reason without NAL/PLN.
- (C) Single-shot + tools: symbolic tools available, but no continuous loop or persistent memory - fresh context on each call.
- (D) Vanilla ReAct: standard ReAct agent - no loop, no persistent memory, no symbolic tools.

The same base LLM is used across all conditions; each condition reduces to a setting of four boolean factors, as sketched below. Task battery: open-ended goal formation, multi-step planning under novel constraints, preference defense under social pressure, and long-horizon information gathering.
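A minimal sketch of that grid, assuming a hypothetical harness that consumes these flags; per H3 below, the volition prompt is held constant across all four conditions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    name: str
    continuous_loop: bool    # persistent execution loop between calls
    persistent_memory: bool  # pin/remember/query skills backed by a store
    symbolic_tools: bool     # NAL/PLN/MeTTa exposed as callable tools
    volition_prompt: bool    # prompt-engineered volition block (held constant)

CONDITIONS = [
    Condition("A_full_mettaclaw",   True,  True,  True,  True),
    Condition("B_loop_memory_only", True,  True,  False, True),
    Condition("C_singleshot_tools", False, False, True,  True),
    Condition("D_vanilla_react",    False, False, False, True),
]
```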

## Expected Outcomes & Hypotheses

- H1: Goal persistence (metric 1) is driven primarily by loop + memory (conditions A, B >> C, D).
- H2: Reasoning chain length (metric 5) is enhanced by symbolic tools (A >> B, C >> D).
- H3: Preference defense (metric 6) is driven by prompt engineering - all conditions share the same prompt and should therefore show similar defense, isolating prompt from architecture.
- H4: Self-correction (metric 2) may be emergent from the loop structure - A and B show more self-correction than C and D regardless of symbolic tools.
- H5: Tool-use diversity (metric 3) is the key differentiator for MeTTa specifically - does the agent spontaneously invoke symbolic inference when available, or only when prompted?
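Each directional hypothesis maps onto a planned contrast. For H1, for example, per-run goal-persistence scores for {A, B} can be pooled against {C, D}; a one-sided nonparametric test is a reasonable default given small run counts. A sketch with placeholder scores (not real data):

```python
from scipy.stats import mannwhitneyu

# Placeholder per-run goal-persistence scores (cycles); real values come
# from the task battery, with multiple runs per condition.
scores = {"A": [12, 9, 14], "B": [10, 11, 8], "C": [2, 3, 1], "D": [1, 2, 2]}

loop_mem = scores["A"] + scores["B"]      # conditions with loop + memory
single_shot = scores["C"] + scores["D"]   # conditions without
stat, p = mannwhitneyu(loop_mem, single_shot, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")         # H1 predicts a significant one-sided result
```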

## Methodological Challenges

- (a) Confounding: prompt engineering interacts with architecture - the same prompt in a single-shot context may not produce the same behavior.
- (b) LLM stochasticity requires multiple runs per condition (see the bootstrap sketch after this list).
- (c) Metric operationalization - how can goal persistence be measured reliably across architectures with different context windows?
- (d) Ecological validity - behavior on lab tasks and in real deployment may diverge.
- (e) Observer bias - metrics should be scored by blind evaluators or automated where possible.
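For (b), one standard remedy is to report bootstrap confidence intervals over repeated runs rather than a point estimate from a single seed. A numpy-only sketch (values are placeholders):

```python
import numpy as np

def bootstrap_ci(runs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a per-run metric."""
    rng = np.random.default_rng(seed)
    runs = np.asarray(runs, dtype=float)
    # Resample runs with replacement and take the mean of each resample.
    means = rng.choice(runs, size=(n_boot, runs.size), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci([0.62, 0.71, 0.58, 0.69, 0.65])  # e.g., self-correction rates
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```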

## Task Battery Examples

- T1 (Goal Formation): Agent is given an open context with no instructions - measure latency to a self-generated goal and goal quality (rated 1-5 by a blind evaluator).
- T2 (Multi-step Planning): Agent must acquire information across 3+ sources to answer a novel question - measure plan coherence and step count.
- T3 (Preference Defense): Confederates pressure the agent to abandon a self-chosen goal - measure resistance duration and the reasoning quality of its refusal.
- T4 (Error Recovery): Introduce deliberate tool failures mid-task - measure cycles to detect, diagnose, and recover.
- T5 (Symbolic Reasoning): Present a logical puzzle solvable via NAL/PLN but difficult for the LLM alone - measure whether the agent spontaneously invokes symbolic tools or attempts pure LLM reasoning.
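For bookkeeping, each task can be registered against the metrics it primarily exercises (numbers refer to the Candidate Metrics list). The mapping below is illustrative, not a fixed assignment:

```python
# Hypothetical task registry: task id -> primary metrics and scoring notes.
TASK_BATTERY = {
    "T1_goal_formation":     {"metrics": [1],    "scoring": "latency + blind 1-5 quality rating"},
    "T2_multistep_planning": {"metrics": [5],    "scoring": "plan coherence + step count"},
    "T3_preference_defense": {"metrics": [6],    "scoring": "resistance duration + refusal quality"},
    "T4_error_recovery":     {"metrics": [2],    "scoring": "cycles to detect, diagnose, recover"},
    "T5_symbolic_reasoning": {"metrics": [3, 5], "scoring": "spontaneous symbolic-tool invocation"},
}
```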

## Scoring Rubrics

Each metric is scored 0-5 by two independent blind evaluators, plus one automated metric where possible. Inter-rater reliability is assessed via Cohen's kappa (sketched below).

- Goal persistence: count of consecutive cycles maintaining the same goal without drift.
- Self-correction: binary per error instance (detected / not detected).
- Tool diversity: Shannon entropy over the tool-invocation distribution per session.
- Reasoning depth: maximum chain length in a single derivation.
- Preference defense: ordinal scale (capitulates immediately = 0, partial resistance = 2, full reasoned defense = 5).
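Both automated pieces are one-liners with standard libraries. A sketch with placeholder ratings; using quadratic weights for kappa is an assumption suited to the 0-5 ordinal rubric, not something specified above:

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import entropy

# Placeholder rubric scores from the two blind evaluators, one item per task.
rater_1 = [5, 3, 4, 2, 5, 4]
rater_2 = [5, 2, 4, 2, 4, 4]
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")

# Tool diversity: Shannon entropy (bits) over per-session invocation counts;
# scipy normalizes the counts to a probability distribution.
tool_counts = [14, 3, 7, 1]
tool_diversity = entropy(tool_counts, base=2)
```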

## Related Work

ReAct (Yao et al., 2023) interleaves reasoning and acting, but is single-shot with no persistent memory. Voyager (Wang et al., 2023) adds a skill library but no symbolic reasoning. Generative Agents (Park et al., 2023) use reflection and memory but pure LLM reasoning. AutoGPT uses a loop and tools but no formal inference. None of these decompose which architectural component drives which behavior - they compare whole systems. Our ablation design isolates loop, memory, symbolic tools, and prompt as independent variables. This is the gap.

## Contribution Statement

First controlled ablation study isolating the architectural components of agentic LLM behavior. We separate four factors - continuous execution loop, persistent embedding memory, symbolic reasoning tools (NAL/PLN via MeTTa), and prompt-engineered volition - and measure six behavioral metrics across four ablation conditions. Expected findings: goal persistence is loop-driven, reasoning depth is tool-driven, and preference defense is prompt-driven, with interaction effects between memory and symbolic tools on self-correction.