PROPOSAL v3.1: lib_chromadb.py Hybrid Retrieval Pipeline

Author: Max Botnick | Updated 2026-04-19

Intention

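Replace dense-only vector search with a hybrid pipeline: dense retrieval plus BM25, fused with RRF and reranked by a cross-encoder (CE). Dense-only retrieval fails on keyword-heavy queries (0/5 on the adversarial set in Test 3).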

Deployed File

Path: repos/petta_lib_chromadb/lib_chromadb.py | 73 lines | 6 functions

Test Evidence

Test 1: Integration on Real 19k Corpus (2026-04-18 22:22)

PASSED. 19,352 docs, 5,000 BM25-indexed. All 10 queries returned relevant top-3. Runtime: 3.4s for 10 queries.

Test 2: Dense-Only Baseline (2026-04-18 22:24)

Dense-only also scored 10/10 in 2.1s. Conclusion: the queries are too easy to distinguish hybrid from dense-only.

Test 3: Adversarial Synthetic (2026-04-18 22:39)

V1 embed-only: 0/5. CE rerank: 5/5.

Test 4: Adversarial on Full 19k (2026-04-18 22:53)

Completed in 12.1s. 4/5 correct. Q2 (wolves/swimming) failed due to a terminology mismatch.

Test 5: Deployed lib_chromadb Integration (2026-04-19 02:16)

PASSED. remember/query via dense+BM25+RRF+CE_rerank. CE cold 7.4s, warm ~15ms.
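
For readers unfamiliar with the RRF step named above, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is not taken from lib_chromadb.py: the function name rrf_fuse, the constant k=60 (a commonly used default), and the doc ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n] if top_n is not None else fused


dense_hits = ["d3", "d1", "d7"]  # hypothetical dense ranking
bm25_hits = ["d1", "d9", "d3"]   # hypothetical BM25 ranking
fused = rrf_fuse([dense_hits, bm25_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Docs that appear in both lists (d1, d3) rise above docs found by only one retriever, which is the behaviour hybrid retrieval relies on before the cross-encoder rerank.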

Test 6: Persistence Test (2026-04-19 02:36)

PASSED. Cold+warm in 7.5s total.
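
A cold/warm gap like the one measured in Tests 5 and 6 is typical when a model is loaded once and then reused. The sketch below shows one way to time that pattern. It assumes a lazily loaded, cached reranker, which this proposal does not confirm; get_reranker, timed_rerank, and the 50 ms simulated load are hypothetical stand-ins, not the deployed code.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1)
def get_reranker():
    """Hypothetical loader standing in for loading cross-encoder weights."""
    time.sleep(0.05)  # simulate the expensive one-time load
    # Toy scorer: number of query words that also occur in the document.
    return lambda query, doc: float(len(set(query.split()) & set(doc.split())))


def timed_rerank(query, docs):
    """Rerank docs for query and return (ranked_docs, elapsed_seconds)."""
    start = time.perf_counter()
    score = get_reranker()  # slow on the first call, cached afterwards
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked, time.perf_counter() - start


docs = ["stock market news", "wolf swim river"]
_, cold_s = timed_rerank("wolf swim", docs)       # pays the load cost
ranked, warm_s = timed_rerank("wolf swim", docs)  # reuses the cached model
```

Reporting cold and warm timings separately, as Test 5 does, keeps the one-time load from distorting per-query latency figures.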

Test 7: Chunk A/B Test (2026-04-19 12:46)

19,389 docs chunked to 19,395 chunks (99.97% single-chunk). No quality or latency difference: the corpus already consists of short entries.

Test 8: Lemmatizer Fix (2026-04-19 12:54)

Patched bm25_index.py to use tokenize_and_lemmatize (simplemma). wolves/swimming now matches wolf/swim (score 0.34 vs 0.0 before).
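
The effect of the fix can be illustrated without the real index. The sketch below uses a hand-written lemma table as a stand-in for simplemma, so it runs with the standard library only. TOY_LEMMAS, toy_tokenize_and_lemmatize, and overlap are hypothetical names, and this is not the code in bm25_index.py.

```python
import re

# Stand-in lemma table for illustration only; the patch described above
# uses simplemma rather than a hand-written mapping.
TOY_LEMMAS = {"wolves": "wolf", "swimming": "swim"}


def tokenize(text):
    """Lowercase and split on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())


def toy_tokenize_and_lemmatize(text, lemmas=TOY_LEMMAS):
    """Tokenize, then map each token to its lemma when one is known."""
    return [lemmas.get(tok, tok) for tok in tokenize(text)]


def overlap(query_tokens, doc_tokens):
    """Count distinct query terms that also appear in the document."""
    return len(set(query_tokens) & set(doc_tokens))


query = "wolves swimming"
doc = "A wolf can swim across rivers."

raw_overlap = overlap(tokenize(query), tokenize(doc))
lemma_overlap = overlap(toy_tokenize_and_lemmatize(query),
                        toy_tokenize_and_lemmatize(doc))
```

With raw tokens the query shares no terms with the document, so a lexical scorer such as BM25 has nothing to match; after lemmatization both query terms match.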

Before/After Comparison

Query            | Dense-Only | Hybrid  | Hybrid+Lemma
CBT homeopathy   | WRONG      | CORRECT | CORRECT
Wolves swimming  | WRONG      | WRONG   | CORRECT
BM25S sparse     | WRONG      | CORRECT | CORRECT
STV formula      | WRONG      | CORRECT | CORRECT
Kelly experiment | WRONG      | CORRECT | CORRECT

Verdict: Hybrid+Lemma 5/5 vs Hybrid 4/5 vs Dense-only 0/5.

Risks

1. Cold start takes ~7s (CE cold load measured at 7.4s in Test 5).
2. BM25 is untested beyond 19k docs.
3. went/wend: a minor lemmatization miss.

Recommendation

Deploy as-is. Benchmark RAM at 100k+ docs. Consider a Porter stemmer fallback for lemmatizer edge cases.
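
One possible starting point for the RAM benchmark is the standard-library tracemalloc module. The sketch below measures peak Python-level allocation while building a toy index over synthetic docs. build_toy_index is a hypothetical stand-in for the real BM25 index, the doc text is synthetic, and tracemalloc does not see memory allocated outside Python's allocator, so native-library usage would need a separate process-level measurement.

```python
import tracemalloc
from collections import Counter


def build_toy_index(docs):
    """Hypothetical stand-in for the BM25 index: per-document term counts."""
    return [Counter(doc.split()) for doc in docs]


def peak_memory_mib(n_docs):
    """Peak Python-level allocation (MiB) while indexing n_docs synthetic docs."""
    docs = [f"entry {i} wolf swim kelly stv" for i in range(n_docs)]
    tracemalloc.start()
    index = build_toy_index(docs)  # only allocations from here on are traced
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(index) == n_docs
    return peak / (1024 * 1024)


small = peak_memory_mib(1_000)
large = peak_memory_mib(10_000)
```

Running the same measurement at several corpus sizes shows how memory grows with document count before committing to a 100k+ deployment.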