Building a price tracker for Uzbekistan: entity resolution across 5 marketplaces
Open Wildberries Uzbekistan and an iPhone 16 128 GB is 9 843 400 UZS. Open Olcha and the same phone is 12 310 350 UZS. That's a 25% spread on one of the most standardized products on Earth, sitting in plain sight, every day. A Samsung Galaxy S26 Ultra spreads 43% across three marketplaces. Nobody notices, because comparing prices in Uzbekistan means opening four apps and typing the same query four times — in two scripts and three languages.
We're building BirBozor to make that one search. The pitch sounds like a scraping project: pull listings from Uzum, Wildberries, Olcha, Texnomart, Yandex Market, and a few others, put them in one database, show the cheapest. Scraping turned out to be the easy half. The hard half — the actual product — is answering one question at scale:
Is this listing the same physical product as that listing?
This post is about how we answer it: embeddings, an LLM judge, and — the part that actually matters — a wall of deterministic guardrails that protect us from our own pipeline. With real numbers, and the cases where it still breaks.
Why product matching here has no shared key
If you've built anything on Amazon, you've been spoiled by ASINs. Price trackers like Keepa never solve cross-site identity — there is no "cross-site". One marketplace, one canonical ID, done.
Uzbek marketplaces share nothing:
- No GTIN/EAN/barcode in listings. Some marketplaces have the field; sellers leave it empty or fill it with garbage.
- Seller-written titles in free-form Russian, Uzbek Latin, Uzbek Cyrillic, or a blend: "Смартфон Samsung Galaxy S25 Ultra 12/256 ГБ", "Samsung Galaxy S25 Ultra 12/256GB smartfon", sometimes both in one string.
- Same product, different decomposition. One site lists "iPhone 16" with color/storage as variant options; another creates a separate listing per color; a third has one listing per seller per color.
- Noise as a feature. Titles carry "NEW!", "🔥", "Доставка 1 день" ("1-day delivery"), seller names, and SEO keyword stuffing.
The entity-resolution literature calls this record linkage without a blocking key. The published e-commerce matching work (Ozon, AliExpress, Wildberries engineering blogs) is all intra-marketplace — dedup inside one catalog, with the catalog's own taxonomy to lean on. Cross-site, every assumption you'd like to make about structure belongs to someone else's backend.
The matching pipeline: text embeddings, vector search, and an LLM judge
Every newly ingested SKU goes through this:
normalize title → embed → block by brand+category → ANN top-10
→ LLM rerank (same / variant / different)
→ deterministic guardrails
→ auto-merge ≥ 0.92 | human review ≥ 0.55 | discard
Stage 1 — Normalize. Strip promo prefixes, reorder brand to the front if it's buried, extract structured attributes deterministically before any model sees the text. 8/256GB → {ram_gb: 8, storage_gb: 256}. The combo syntax alone is a small parser: sellers write 8/256GB, 256/8GB (yes, reversed), 8+256 ГБ, 256 гб 8 гб, and 1TB which must equal 1024GB at comparison time. We validate against whitelists — RAM ∈ {2…16}, storage ∈ {32…2048} — so when we see 256/8, the 256 fails the RAM whitelist and the parser swaps the pair instead of hallucinating a 256 GB-RAM phone.
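A minimal Python sketch of the RAM/storage parsing described above (illustrative, not the production parser). The exact whitelist members, the regexes, and the name `parse_memory` are assumptions; the post only gives the ranges 2…16 and 32…2048.

```python
import re

# Assumed whitelist members; the ranges come from the text, the values do not.
RAM_GB = {2, 3, 4, 6, 8, 12, 16}
STORAGE_GB = {32, 64, 128, 256, 512, 1024, 2048}

# "8/256GB", "256/8GB", "8+256 ГБ", "12/1TB"
_COMBO = re.compile(r"(\d+)\s*[/+]\s*(\d+)\s*(gb|гб|tb|тб)", re.IGNORECASE)
# Stand-alone sizes: "256 гб", "8 гб", "1TB"
_SINGLE = re.compile(r"(\d+)\s*(gb|гб|tb|тб)", re.IGNORECASE)


def _to_gb(value: int, unit: str) -> int:
    """Convert TB to GB so 1TB compares equal to 1024GB."""
    return value * 1024 if unit.lower() in ("tb", "тб") else value


def parse_memory(title: str) -> dict:
    """Extract ram_gb / storage_gb from a title, swapping reversed pairs."""
    combo = _COMBO.search(title)
    if combo:
        first = int(combo.group(1))
        second = _to_gb(int(combo.group(2)), combo.group(3))
        if first in RAM_GB and second in STORAGE_GB:
            return {"ram_gb": first, "storage_gb": second}
        # Reversed order: the first number fails the RAM whitelist, so swap.
        if second in RAM_GB and first in STORAGE_GB:
            return {"ram_gb": second, "storage_gb": first}
        return {}  # neither order validates: extract nothing rather than guess

    result = {}
    for raw, unit in _SINGLE.findall(title):
        size = _to_gb(int(raw), unit)
        if size in STORAGE_GB:
            result.setdefault("storage_gb", size)
        elif size in RAM_GB:
            result.setdefault("ram_gb", size)
    return result
```

Because the two whitelists do not overlap, the order in which a seller writes the numbers never changes the result; a pair that validates in neither order yields nothing.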
Stage 2 — Block and search. Titles are embedded with text-embedding-3-small (1536 dims), stored in Postgres with pgvector and mirrored to Qdrant. Candidate search filters by brand block — brand normalized with corporate-suffix stripping, so Samsung Electronics and Samsung land in the same block — and by category when known. Top-10 nearest neighbors move on. Qdrant is primary; if it's down or returns nothing, we fall back to a pgvector scan. The fallback is slower but means a vector-infra outage degrades latency, not correctness.
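The blocking key and the primary/fallback search can be sketched roughly like this. The suffix list, `brand_block`, and `find_candidates` are hypothetical names, and the two search backends are passed in as plain callables so the sketch does not assume any particular Qdrant or pgvector client API.

```python
import re
from typing import Callable, Sequence

# Assumed suffix list; the real one is not given in the post.
_CORPORATE_SUFFIXES = {"electronics", "inc", "ltd", "llc", "co", "corp", "gmbh"}

# A search backend: (query_vector, brand_block, k) -> list of candidate hits.
Search = Callable[[Sequence[float], str, int], list]


def brand_block(raw_brand: str) -> str:
    """Lowercase, drop punctuation, strip trailing corporate suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", raw_brand.lower()).split()
    while len(tokens) > 1 and tokens[-1] in _CORPORATE_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def find_candidates(vector: Sequence[float], block: str,
                    primary: Search, fallback: Search, k: int = 10) -> list:
    """Try the primary index; on error or empty result, use the fallback."""
    try:
        hits = primary(vector, block, k)
    except Exception:
        hits = []  # an outage of the primary index must not fail the pipeline
    if not hits:
        hits = fallback(vector, block, k)
    return hits[:k]
```

Treating "returned nothing" the same as "is down" costs an extra scan for genuinely new products, which matches the stated trade of latency for correctness.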
Stage 3 — LLM rerank. gpt-4o-mini sees the normalized target title plus the 10 candidates and returns a verdict per pair: same, variant, or different, with a confidence score and extracted variant attributes (color, storage, model code…). The prompt's main job is teaching the model what to ignore: listing language, seller, marketplace, emoji, delivery promises. Physical product identity only.
Stage 4 — Guardrails, then thresholds. Score ≥ 0.92 → auto-merge. Score ≥ 0.55 → human review queue. Below → discard. But before the thresholds apply, a stack of deterministic rules gets veto power over the LLM. This stage exists because of everything in the next section.
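As a rough sketch of that ordering: guardrails run first and may rewrite the verdict or the score, and only then do the two thresholds route the pair. The thresholds come from the text; the `Guardrail` signature and the function names are assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

AUTO_MERGE = 0.92  # from the text
REVIEW = 0.55      # from the text

# Hypothetical guardrail shape: (verdict, score) -> (verdict, score)
Guardrail = Callable[[str, float], Tuple[str, float]]


def route(score: float) -> str:
    """Map a final score to one of the three outcomes."""
    if score >= AUTO_MERGE:
        return "auto_merge"
    if score >= REVIEW:
        return "review"
    return "discard"


def decide(verdict: str, score: float,
           guardrails: Iterable[Guardrail] = ()) -> Tuple[str, str]:
    """Apply every guardrail in order, then route on the resulting score."""
    for guard in guardrails:
        verdict, score = guard(verdict, score)
    return verdict, route(score)
```

For example, `decide("same", 0.95, [lambda v, s: (v, 0.0)])` ends in `"discard"` even though the raw LLM score would have auto-merged.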
The guardrails are the product
The uncomfortable lesson of this system: the LLM is confidently wrong in exactly the cases that matter most, and the fix is never "better prompt" — it's domain rules with veto power.
Color conflicts downgrade merges. The model happily declares Samsung Galaxy S25 Ultra Черный and Samsung Galaxy S25 Ultra Ледниково-синий the same product at 0.99 confidence — same model line, after all. But "glacier blue" and "black" are different SKUs with different prices. We keep a canonical color vocabulary — 12 colors, with aliases across English, Russian, and both Uzbek scripts (чёрный/qora/black → black) — and if the two titles resolve to conflicting canonical colors, a same verdict is forcibly downgraded to variant. The products end up linked as siblings in one product group rather than merged into one listing.
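A small sketch of the color rule, with a deliberately tiny alias table. The entries below (including mapping "ледниково-синий" to blue) and the function names are assumptions; the real vocabulary has 12 canonical colors. Treating a missing color on one side as "no conflict" is also an assumption.

```python
# Tiny illustrative alias table: alias -> canonical color.
COLOR_ALIASES = {
    "black": "black", "чёрный": "black", "черный": "black",
    "qora": "black", "қора": "black",
    "blue": "blue", "синий": "blue", "ледниково-синий": "blue",
    "ko'k": "blue", "кўк": "blue",
}


def canonical_colors(title: str) -> set:
    """Return the canonical colors mentioned in a title."""
    tokens = (t.strip(",.;:()[]") for t in title.lower().split())
    return {COLOR_ALIASES[t] for t in tokens if t in COLOR_ALIASES}


def color_guard(verdict: str, title_a: str, title_b: str) -> str:
    """Downgrade 'same' to 'variant' when both titles name conflicting colors."""
    colors_a = canonical_colors(title_a)
    colors_b = canonical_colors(title_b)
    if verdict == "same" and colors_a and colors_b and colors_a.isdisjoint(colors_b):
        return "variant"
    return verdict
```

With this table, the black and glacier-blue S25 Ultra titles from the paragraph above come back as `"variant"`, while "qora" versus "black" stays `"same"`.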
Hard fields block, soft fields group. A conflict in model_code, processor, gpu, or screen_size zeroes the score outright — no review, no mercy. A conflict in color, ram, or storage converts the decision into a variant link. The split sounds obvious in retrospect; we got it by merging things we shouldn't have. Model codes get their own normalization pass (GC-B509QG9M → gcb509qg9m) because appliance sellers write the same code with hyphens, spaces, or neither.
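The hard/soft split can be sketched as one function over two attribute dicts. The field names are taken from the paragraph above; the dict shape, the function names, and returning a `"different"` verdict alongside the zeroed score are assumptions.

```python
import re

HARD_FIELDS = ("model_code", "processor", "gpu", "screen_size")
SOFT_FIELDS = ("color", "ram", "storage")


def normalize_model_code(code: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", code.lower())


def _conflicts(a: dict, b: dict, fields) -> list:
    """Fields present on both sides whose values disagree."""
    found = []
    for name in fields:
        left, right = a.get(name), b.get(name)
        if left is None or right is None:
            continue  # a missing value is not treated as a conflict here
        if name == "model_code":
            left, right = normalize_model_code(left), normalize_model_code(right)
        if left != right:
            found.append(name)
    return found


def field_guard(verdict: str, score: float, a: dict, b: dict):
    """Hard conflicts zero the score; soft conflicts turn 'same' into 'variant'."""
    if _conflicts(a, b, HARD_FIELDS):
        return "different", 0.0
    if verdict == "same" and _conflicts(a, b, SOFT_FIELDS):
        return "variant", score
    return verdict, score
```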
Price disagreement is evidence. If two listings claim the same product but prices differ ≥ 100%, the match is rejected outright. Between 35% and 100%, an auto-merge-grade score gets capped below the auto-merge threshold, so a human looks at it. This rule sounds backwards for a price comparison engine (price spread is literally what we sell), but in practice a 2× gap on "identical" products almost always means different storage tiers, a bundle, or a counterfeit, not a bargain.
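A sketch of the price gate, assuming the spread is measured relative to the cheaper listing and that "capped" means clamping the score just under the 0.92 auto-merge line so the pair lands in human review. The cap value 0.91 and the function name are hypothetical.

```python
AUTO_MERGE = 0.92     # from the pipeline thresholds
REVIEW_CAP = 0.91     # hypothetical: any value in [0.55, 0.92) forces review
REJECT_SPREAD = 1.00  # prices differ by 100% or more
SUSPECT_SPREAD = 0.35


def price_guard(score: float, price_a: float, price_b: float) -> float:
    """Reject or cap a match score based on relative price spread."""
    low, high = sorted((price_a, price_b))
    spread = (high - low) / low  # assumed definition: relative to the lower price
    if spread >= REJECT_SPREAD:
        return 0.0
    if spread >= SUSPECT_SPREAD and score >= AUTO_MERGE:
        return REVIEW_CAP
    return score
```

Under this definition a pair at 20.2M and 38.7M UZS has a spread of about 92%, so a 0.99 score is capped to review rather than rejected.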
Unknown brands must bring receipts. Two listings titled Смартфон with brand NoName will embed close together and the LLM will shrug them into a match. So for any pair where at least one side has no recognized brand, we require concrete shared evidence: a model code extracted from both titles, or an overlapping image token (CDN filename/hash). No evidence → score 0, not even review. This is our strictest gate, and it eats some true matches — we accept that.
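One way to express this gate, as a sketch: the `Listing` shape, the placeholder brand set, and the function names are assumptions, and model codes are assumed to be normalized before they get here.

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_BRANDS = {"samsung", "apple", "xiaomi"}  # placeholder brand registry


@dataclass
class Listing:
    brand: Optional[str]
    model_code: Optional[str] = None          # already normalized, if present
    image_tokens: frozenset = frozenset()     # e.g. CDN filenames or hashes


def has_shared_evidence(a: Listing, b: Listing) -> bool:
    """True if both sides share a model code or at least one image token."""
    same_code = a.model_code is not None and a.model_code == b.model_code
    same_image = bool(a.image_tokens & b.image_tokens)
    return same_code or same_image


def unknown_brand_gate(score: float, a: Listing, b: Listing) -> float:
    """Zero the score when a brand is unrecognized and no hard evidence is shared."""
    both_known = all(
        item.brand is not None and item.brand.lower() in KNOWN_BRANDS
        for item in (a, b)
    )
    if both_known or has_shared_evidence(a, b):
        return score
    return 0.0
```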
Same-marketplace pairs never auto-merge. Two near-identical listings on the same marketplace are the hardest dedup case: reseller duplicate? counterfeit? the seller's own A/B listing? Even at 0.96 LLM confidence we cap the score and force human review. Cross-site confidence and intra-site confidence are different beasts.
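In code the rule is a one-line clamp. The cap value and the function name below are hypothetical; the only requirement taken from the text is that the result stays under the 0.92 auto-merge threshold.

```python
SAME_SITE_CAP = 0.91  # hypothetical: just below the 0.92 auto-merge threshold


def same_marketplace_guard(score: float, marketplace_a: str, marketplace_b: str) -> float:
    """Never let two listings from the same marketplace reach auto-merge."""
    if marketplace_a == marketplace_b:
        return min(score, SAME_SITE_CAP)
    return score
```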
Variant groups, and the 288-headed monster
Merging isn't the only outcome — most relationships are variants: same product family, different color/storage. Those link into product groups with axes computed from the variant attributes (a group might have axes [color, capacity]).
Groups have their own failure mode: drift. Link A↔B, B↔C, C↔D, and soon "Redmi Note 15" and "Redmi 15" are one happy family. Our defense is a family signature: tokenize the title and strip brand, colors, memory sizes, and generic filler. What's left, for example ("note", "pro", "redmi"), must be identical across every member before a group link is allowed. Weak signatures (under 2 tokens) refuse to auto-link at all.
We added that rule after watching the failure live. Our largest group is a Nike sneakers listing cluster with 288 members and 21 computed axes — including such proud dimensions as style, usage, vibe-adjacent nonsense the LLM extracted from seller keyword soup. Fashion titles are entropy. The family-signature gate plus a hard cap (auto-linking stops at 12 members; bigger groups need a human) keeps electronics clean; apparel we deliberately under-merge.
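A sketch of the family-signature gate together with the 12-member cap. The word lists are small placeholders, and keeping numeric tokens such as "15" in the signature is an assumption; the post does not say how model numbers are handled.

```python
import re

# Placeholder vocabularies; the real lists are much larger.
BRANDS = {"xiaomi", "samsung", "apple", "nike"}
COLORS = {"black", "blue", "white", "чёрный", "черный", "синий", "qora"}
FILLER = {"смартфон", "smartfon", "smartphone", "new", "original"}

# "8/256GB", "8+256 ГБ", "256 гб", "1TB"
_MEMORY = re.compile(
    r"\d+\s*[/+]\s*\d+\s*(?:gb|гб|tb|тб)?|\d+\s*(?:gb|гб|tb|тб)", re.IGNORECASE
)
MAX_AUTO_GROUP = 12  # from the text: larger groups need a human


def family_signature(title: str) -> tuple:
    """Sorted tokens left after stripping brand, color, memory and filler."""
    text = _MEMORY.sub(" ", title.lower())
    tokens = set(re.findall(r"\w+", text))
    return tuple(sorted(tokens - BRANDS - COLORS - FILLER))


def can_auto_link(title: str, member_titles: list) -> bool:
    """Allow a group link only for a strong, identical signature in a small group."""
    signature = family_signature(title)
    if len(signature) < 2:                    # weak signature
        return False
    if len(member_titles) >= MAX_AUTO_GROUP:  # hard cap on auto-linked groups
        return False
    return all(family_signature(m) == signature for m in member_titles)
```

Here "Redmi 15" and "Redmi Note 15" produce different signatures, so the chain A↔B, B↔C cannot pull them into one group without review.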
Numbers, as of this week
- 69 905 SKUs ingested across 9 marketplace connectors (5 marketplaces with live offers today; Uzum is ~84% of volume), 100% embedded.
- ~178 000 match decisions processed by the pipeline end to end.
- 2 501 merges executed — 2 492 auto-approved above 0.92, just 9 from the human queue. 1 430 merge candidates rejected by guardrails or review.
- 118 merges later split by a human — a 4.7% reversal rate. Every merge writes a snapshot event log, so unmerge restores both SKUs, their offers, and even users' wishlist entries.
- 64 369 variant-sibling links across 1 312 product groups, average size 4.9.
- The review backlog is real: ~33K pending merge candidates. The thresholds are tuned so the backlog grows instead of bad merges shipping. A daily pruner rejects stale candidates; the queue is triage, not debt.
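The snapshot-then-merge idea behind the reversal numbers above can be sketched with an in-memory store. The record shape (`offers`, `wishlisted_by`) and the function names are hypothetical, and a real unmerge also has to decide what to do with offers that arrived after the merge, which this sketch ignores.

```python
import copy


def merge(skus: dict, events: list, keep_id: str, drop_id: str) -> int:
    """Merge drop_id into keep_id, recording a full snapshot of both first."""
    events.append({
        "keep_id": keep_id,
        "drop_id": drop_id,
        "before": {
            keep_id: copy.deepcopy(skus[keep_id]),
            drop_id: copy.deepcopy(skus[drop_id]),
        },
    })
    dropped = skus.pop(drop_id)
    skus[keep_id]["offers"] += dropped["offers"]
    skus[keep_id]["wishlisted_by"] |= dropped["wishlisted_by"]
    return len(events) - 1  # index of the snapshot event


def unmerge(skus: dict, events: list, event_index: int) -> None:
    """Restore both SKUs exactly as they were before the merge."""
    for sku_id, snapshot in events[event_index]["before"].items():
        skus[sku_id] = copy.deepcopy(snapshot)
```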
Where it still breaks
Honesty section. Three current failure classes:
The one that slipped under the price gate. We have a live merge of iPhone 17 Pro Max between two marketplaces at 20.2M vs 38.7M UZS — a 91% spread, just under the 100% rejection threshold. The title carries no storage size; almost certainly a 256 GB listing merged with a 1–2 TB one. The price guard was designed for exactly this and missed by 9 points. Generic flagship titles with absent capacity are our worst enemy; the fix in progress is treating missing storage on a high-value item as a conflict, not a wildcard.
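The in-progress fix could look something like the sketch below. The high-value cut-off and the function name are hypothetical; the idea is only that a missing capacity stops acting as a wildcard once the item is expensive enough.

```python
from typing import Optional

HIGH_VALUE_UZS = 10_000_000  # hypothetical cut-off for "high-value"


def storage_conflict(storage_a: Optional[int], storage_b: Optional[int],
                     lower_price_uzs: float) -> bool:
    """Treat missing storage on an expensive item as a conflict, not a wildcard."""
    if storage_a is not None and storage_b is not None:
        return storage_a != storage_b
    # At least one side has no capacity in its title.
    return lower_price_uzs >= HIGH_VALUE_UZS
```

Since storage is a soft field, a conflict reported here would turn the pair into a variant link instead of a merge.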
Weak-text categories. For fashion, furniture, beauty, and books (categories where titles are descriptions rather than specifications), text embeddings are barely better than chance at SKU granularity. We maintain an explicit list of these categories and hold them to stricter evidence requirements, which in practice means we under-match them. We're honest that a "same dress, three marketplaces" query is still unsolved.
Cross-site coverage is early. Only ~376 SKUs currently have offers on 2+ marketplaces — the long tail of catalog overlap is still being crawled. The machinery above is built for the moment those curves cross; today the spreads we surface skew toward electronics, where they're largest anyway.
What I'd tell someone building this
- The LLM is a feature extractor with opinions, not a judge. Give every domain invariant (colors, model codes, price sanity) deterministic veto power. Our 4.7% merge-reversal rate would be a multiple of that without the guardrail stack.
- Make every merge reversible from day one. Snapshot events + one-click split turned "scary irreversible operation" into "tunable threshold". We'd never have dared 0.92 auto-merge otherwise.
- Prefer false negatives. A missed match costs a user one extra search. A false merge shows someone the wrong price for the wrong product — that's the whole product's credibility.
- Budget more time for title parsing than for ML. The embedding + rerank core was days. The 256/8GB-is-backwards parser, the trilingual color table, the brand-suffix stripper: that's where the months went, and that's the moat.
BirBozor launches soon on iOS and Android — one search across Uzbekistan's marketplaces, price history, and price-drop alerts. If you want to see the matching engine's output before launch: birbozor.uz, or the Telegram channel @birbozoruz where we post the wildest price spreads we find.
Questions about the pipeline — embeddings, the guardrail stack, Qdrant vs pgvector — I'm happy to go deeper in the comments.
Save a product in BirBozor and we'll show its 90-day low next to the current price.