Author: Max Botnick | Updated 2026-04-19
Goal: replace dense-only vector search with hybrid retrieval (dense + BM25, then cross-encoder rerank). Dense-only fails on keyword-heavy queries.
Path: repos/petta_lib_chromadb/lib_chromadb.py | 73 lines | 6 functions
PASSED. 19,352 docs, 5,000 BM25-indexed. All 10 queries returned relevant top-3. Runtime: 3.4s for 10 queries.
Dense-only also scored 10/10 on these queries in 2.1s. Conclusion: the queries are too easy to separate the two approaches.
V1 embed-only: 0/5. CE rerank: 5/5.
Completed in 12.1s. 4/5 correct. Q2 (wolves/swimming) failed due to terminology mismatch.
PASSED. remember/query via dense+BM25+RRF+CE_rerank. CE cold 7.4s, warm ~15ms.
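The fusion step named above (RRF) can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the code in lib_chromadb.py: the function name `rrf_fuse`, the doc ids, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each doc gets 1 / (k + rank) from every list it appears in (rank starts at 1).
    k=60 is an assumed default, not a value taken from the library under test.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from the dense and BM25 retrievers.
dense_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]

fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Docs that appear in both lists (d1, d3) rise above docs found by only one retriever; the fused candidates would then go to the cross-encoder for reranking.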
PASSED. Cold+warm in 7.5s total.
19,389 docs chunked to 19,395 chunks (99.97% single-chunk). No quality or latency difference. The corpus already consists of short entries.
Patched bm25_index.py to use tokenize_and_lemmatize (simplemma). wolves/swimming now matches wolf/swim (score 0.34 vs 0.0 before).
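Why lemmatizing helps BM25 here can be shown with a small, self-contained sketch. Everything in it is an assumption for illustration: `TOY_LEMMAS` is a hand-written stand-in for the simplemma-based `tokenize_and_lemmatize`, the BM25 scorer is a plain textbook variant rather than the one in bm25_index.py, and the two documents are made up, so the scores will not match the 0.34 reported above.

```python
import math
import re
from collections import Counter

# Toy lemma table standing in for a real lemmatizer (hypothetical, illustration only).
TOY_LEMMAS = {"wolves": "wolf", "swimming": "swim", "swims": "swim"}


def tokenize(text, lemmatize=False):
    """Lowercase, split on non-alphanumerics, optionally map tokens to lemmas."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if lemmatize:
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    return tokens


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    # Document frequency: number of docs containing each term.
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # exact-token match required; no match, no score
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["A wolf can swim across rivers", "Kelly criterion sizing for bets"]
query = "wolves swimming"

raw = bm25_scores(tokenize(query), [tokenize(d) for d in docs])
lem = bm25_scores(tokenize(query, True), [tokenize(d, True) for d in docs])
print("raw:", raw)
print("lemmatized:", lem)
```

Without lemmas the query tokens never match the document tokens, so BM25 scores the relevant document at zero; with lemmas on both the query and index side, the first document gets a positive score.
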
| Query | Dense-Only | Hybrid | Hybrid+Lemma |
|---|---|---|---|
| CBT homeopathy | WRONG | CORRECT | CORRECT |
| Wolves swimming | WRONG | WRONG | CORRECT |
| BM25S sparse | WRONG | CORRECT | CORRECT |
| STV formula | WRONG | CORRECT | CORRECT |
| Kelly experiment | WRONG | CORRECT | CORRECT |
Verdict: Hybrid+Lemma 5/5 vs Hybrid 4/5 vs Dense-only 0/5.
1. Cold start ~6-7s. 2. BM25 untested beyond 19k docs. 3. Slight lemma miss on went/wend.
Deploy as-is. Benchmark RAM at 100k+ docs. Consider a Porter stemmer fallback for lemma edge cases.