How we match SKUs across 5 marketplaces (and why it's harder than it looks)

It was March, the third month of development. I opened the admin panel, typed "iPhone 15 Pro Max 256" into search, and got five different cards from five different stores. The only reason I was sure it was the same phone was that someone had recommended it to me the day before and I knew the exact model. The algorithm had no idea. To the algorithm these were five different strings, and none of them matched any other.
On Uzum it was "iPhone 15 Pro Max 256ГБ Чёрный". On Olcha — "Apple iPhone 15 Pro Max (256GB, Space Black)". On Wildberries UZ — "iPhone 15 Pro Max 256GB Космический серый". On Asaxiy — "Смартфон Apple iPhone 15 Pro Max, 256 Gb, Black Titanium". On Texnomart — "iPhone 15 Pro Max 256 Gb (чёрный титан)". One product. Five spellings. Five different words for the very same shade of casing. English, Russian, Cyrillic, Latin, parentheses and no parentheses, "Gb" and "ГБ", "Pro Max" and "ProMax".
At that moment I realized that matching SKUs across marketplaces isn't a Friday afternoon regex puzzle. It's the engineering core of Birbozor. Below is how we solve it today, what doesn't work, what does, and the numbers we're getting on a production catalog of 340 thousand SKUs.
What should match
Before writing the algorithm, you have to agree on what counts as "the same product." We count two SKUs as a match only when they agree on four axes: brand and model, storage capacity, color, and condition (new, refurbished, display unit).
This sounds trivial until you hit reality. "iPhone 15 Pro Max 256GB Black" and "iPhone 15 Pro Max 256GB Black Titanium" are the same product. But "iPhone 15 Pro Max 256GB Black" and "iPhone 15 Pro 256GB Black" are different products, even though the strings differ by four characters. The algorithm has to understand that "Pro Max" is a watershed word, while "titanium" at the end is just a clarification of the same color.
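
One way to make this definition concrete is to compare structured keys instead of raw strings. The sketch below is only an illustration: the `SkuKey` type and its field names are assumptions, not Birbozor's actual schema, and it presumes the color has already been mapped to a canonical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkuKey:
    """Hypothetical canonical key: the four axes that define 'the same product'."""
    model: str        # brand + model family + modifier
    storage_gb: int
    color: str        # canonical color, after synonym lookup
    condition: str    # "new", "refurbished" or "display"


def same_product(a: SkuKey, b: SkuKey) -> bool:
    # Every axis must agree; a near-miss on any single axis is a different product.
    return a == b


black = SkuKey("apple iphone 15 pro max", 256, "black", "new")
# "Black Titanium" is assumed to have been canonicalized to "black" upstream.
black_titanium = SkuKey("apple iphone 15 pro max", 256, "black", "new")
pro = SkuKey("apple iphone 15 pro", 256, "black", "new")

print(same_product(black, black_titanium))
print(same_product(black, pro))
```

With keys like these, the "Pro" versus "Pro Max" difference stops being a four-character string edit and becomes a mismatch on a whole axis.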
On top of that there are "equivalent" colors: Apple renames shades from generation to generation, and marketplaces put into the card either the official name or whatever the content manager thinks the color looks like. "Space Gray" on the 14th generation and "Black Titanium" on the 15th generation are different colors on different models. But "Space Black" and "Black" on the same model are one color.
Why ML alone isn't enough
The first idea, and the most popular one, which we tried and discarded, was to take a multilingual sentence-transformers model (paraphrase-multilingual-MiniLM-L12-v2), compute embeddings of the titles, and match by cosine similarity. On paper it looks clean.
In practice the embedding model gives 0.85 cosine similarity between "iPhone 15 Pro 128GB Black" and "iPhone 15 Pro Max 128GB Black". 0.91 between "Samsung Galaxy S24" and "Samsung Galaxy S25". 0.88 between "AirPods Pro" and "AirPods Pro 2". In other words, for the model "Pro" and "Pro Max" are near-synonyms, because in the training corpus they sit in identical contexts. That's not a bug — it reflects how the model was trained. But for price matching it's a catastrophe: confuse S24 and S25 and we'll show the buyer a "discount" where there is none.
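
To see why a similarity-only matcher is risky, it helps to count how many of the pairs above it would accept at a given threshold. This is a small illustration that uses only the three scores quoted in the text; the function name is made up for the example.

```python
# Cosine similarities quoted above. All three pairs are DIFFERENT products.
DIFFERENT_PRODUCTS = {
    ("iPhone 15 Pro 128GB Black", "iPhone 15 Pro Max 128GB Black"): 0.85,
    ("Samsung Galaxy S24", "Samsung Galaxy S25"): 0.91,
    ("AirPods Pro", "AirPods Pro 2"): 0.88,
}


def false_matches(threshold: float) -> list:
    """Pairs that a similarity-only matcher would wrongly accept at this threshold."""
    return [pair for pair, score in DIFFERENT_PRODUCTS.items() if score >= threshold]


for threshold in (0.85, 0.90, 0.95):
    print(threshold, len(false_matches(threshold)))
```

Raising the threshold removes these three pairs, but nothing guarantees that other different-product pairs score lower, so the threshold alone cannot be the deciding check.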
The second limitation is that cards on Uzbek marketplaces almost never carry a GTIN (barcode). On Amazon or eBay, matching by GTIN does roughly 80% of the work. We have a GTIN in the Firecrawl data for roughly 6% of cards, almost always on brand-name models. For the other 94% it simply doesn't exist.
Conclusion: embeddings are a necessary but insufficient signal. They give us candidates. The final decision is always made by a rule-based layer with strict checks on memory, size, color, and the "Pro / Pro Max" hierarchy.
Pipeline
Our pipeline currently looks like this — four stages with two exit points into the manual-review queue.
The first stage is normalization. The title is cleaned of junk: emoji, repeated spaces, marketing add-ons ("Bestseller", "-30%", "Top"), parentheses with warranty info. Cyrillic brands are converted to Latin (Эппл → Apple, Самсунг → Samsung). Storage capacity notations are brought to a single format (256ГБ, 256GB, 256 Gb, 256 gigabytes → 256gb). Colors are run through a synonym dictionary of about 400 pairs that we accumulate by hand. The output is a canonical token vector: brand, model family, modifier (Pro / Pro Max / Plus / Ultra), memory, color, year.
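
A minimal sketch of the cleaning part of this stage, assuming tiny illustrative dictionaries (the real ones are far larger and curated by hand). It covers only junk removal, brand transliteration, and storage normalization; the color dictionary and the split into structured fields are left out. The names `normalize_title`, `BRAND_TRANSLIT`, `MARKETING`, and `STORAGE` are invented for the example.

```python
import re

# Illustrative subsets only.
BRAND_TRANSLIT = {"эппл": "apple", "самсунг": "samsung"}
# Marketing add-ons: whole words like "bestseller" / "top", and discounts like "-30%".
MARKETING = re.compile(r"\b(?:bestseller|top)\b|-\d+%")
# Storage notations: 256ГБ, 256GB, 256 Gb, 256 gigabytes (matched after lowercasing).
STORAGE = re.compile(r"(\d+)\s*(?:гб|gb|gigabytes)\b")


def normalize_title(title: str) -> str:
    text = title.lower()
    text = MARKETING.sub(" ", text)
    text = STORAGE.sub(r"\1gb", text)
    # \w+ keeps letters and digits in any script; emoji and punctuation fall away.
    tokens = re.findall(r"\w+", text)
    return " ".join(BRAND_TRANSLIT.get(token, token) for token in tokens)


print(normalize_title("Смартфон Эппл iPhone 15 Pro Max, 256 Gb 🔥 Bestseller -30%"))
```

The word boundaries around "top" matter: without them the same rule would also eat the tail of "laptop".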
The second stage is embedding and top-K retrieval. The canonical token vector is encoded by the multilingual model, and we search for the top-50 nearest candidates in pgvector. This isn't the final answer — it's narrowing the search space from 340 thousand down to 50.
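
The retrieval step can be illustrated without a database. The sketch below is a brute-force stand-in for the pgvector search, not the production query: it scores every catalog vector by cosine similarity and keeps the best `k`. The toy two-dimensional vectors and the names `cosine` and `top_k` are assumptions for the example.

```python
import heapq
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list, catalog: dict, k: int = 50) -> list:
    """Return (sku_id, similarity) for the k most similar catalog entries, best first."""
    scored = ((sku_id, cosine(query, vector)) for sku_id, vector in catalog.items())
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Toy embeddings; real ones come from the multilingual model.
catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(top_k([1.0, 0.0], catalog, k=2))
```

In pgvector the same idea is usually written as an `ORDER BY embedding <=> query LIMIT 50`, where `<=>` is the cosine-distance operator and `embedding` is a hypothetical column name.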
The third stage is the rule-based filter. Out of the 50 candidates we keep only those whose brand matches (exactly), whose modifier matches (Pro and Pro Max are never the same thing), whose storage capacity matches (exactly), and whose color either matches exactly or sits in the same cluster of synonym colors. Here, out of 50, usually 1–3 remain.
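
Roughly, the filter is a conjunction of exact checks plus one cluster lookup for color. The sketch below assumes a simplified `Card` record and a single hand-written color cluster; both are hypothetical. It also treats color clusters as global, whereas in practice color equivalence depends on the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    brand: str
    modifier: str      # "", "pro", "pro max", "plus", "ultra"
    storage_gb: int
    color: str


# Hypothetical synonym cluster: each set groups names for one physical color.
COLOR_CLUSTERS = [{"black", "space black", "black titanium"}]


def same_color(a: str, b: str) -> bool:
    return a == b or any(a in cluster and b in cluster for cluster in COLOR_CLUSTERS)


def passes_rules(query: Card, candidate: Card) -> bool:
    return (
        query.brand == candidate.brand                # exact
        and query.modifier == candidate.modifier      # "pro" never equals "pro max"
        and query.storage_gb == candidate.storage_gb  # exact
        and same_color(query.color, candidate.color)  # exact or same cluster
    )


query = Card("apple", "pro max", 256, "black")
candidates = [
    Card("apple", "pro max", 256, "black titanium"),
    Card("apple", "pro", 256, "black"),
    Card("apple", "pro max", 512, "black"),
]
print([c for c in candidates if passes_rules(query, c)])
```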
The fourth stage is the confidence score. If exactly one candidate remains after the filter and its cosine similarity is at least 0.95, the match goes to production automatically. If several candidates remain, or the similarity falls between 0.80 and 0.95, or the color matched only through a fuzzy synonym, the pair goes to the manual-review queue. The queue is processed by humans (right now that's me and one contractor, an hour a day each).
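
The routing logic fits in a few lines. This is a sketch under one explicit assumption: the text does not say what happens to a single candidate below 0.80, so the example treats it as no match. The function name and the returned labels are invented.

```python
def route(candidates: list, fuzzy_color: bool = False) -> str:
    """candidates: (sku_id, cosine similarity) pairs that survived the rule filter."""
    if not candidates:
        return "no_match"
    best = max(score for _, score in candidates)
    if len(candidates) == 1 and best >= 0.95 and not fuzzy_color:
        return "auto"
    if len(candidates) > 1 or best >= 0.80 or fuzzy_color:
        return "manual_review"
    # Assumption: a lone candidate below 0.80 is dropped rather than queued.
    return "no_match"


print(route([("sku-1", 0.97)]))
print(route([("sku-1", 0.97), ("sku-2", 0.96)]))
print(route([("sku-1", 0.85)]))
```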
Edge cases
The most painful class of cases is multilingual titles. Uzbek marketplaces freely mix Uzbek in Latin, Uzbek in Cyrillic, and Russian in a single field. "Muzlatkich Samsung 350L" and "Холодильник Samsung 350Л" and "Samsung Refrigerator 350L" are the same product, and the normalizer has to understand that. We keep a dictionary of category words in three languages, about 1200 pairs, and expand it when we find a new discrepancy in the logs.
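
In its simplest form the category dictionary is a lookup from any spelling to one canonical word. The three entries below come from the example above; the function name and the choice of English as the canonical language are assumptions.

```python
from typing import Optional

# Illustrative entries only; the real dictionary holds about 1200 pairs.
CATEGORY_WORDS = {
    "muzlatkich": "refrigerator",   # Uzbek, Latin script
    "холодильник": "refrigerator",  # Russian
    "refrigerator": "refrigerator",
}


def canonical_category(title: str) -> Optional[str]:
    """Return the canonical category for the first known category word, if any."""
    for word in title.lower().split():
        if word in CATEGORY_WORDS:
            return CATEGORY_WORDS[word]
    return None


for title in ("Muzlatkich Samsung 350L", "Холодильник Samsung 350Л", "Samsung Refrigerator 350L"):
    print(canonical_category(title))
```

A title that returns `None` here is exactly the kind of discrepancy worth logging, since it points at a missing dictionary entry.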
The second class is bundle cards. A seller sells a phone together with headphones and a case, and the title looks like an ordinary phone SKU. If you match such a bundle against a standalone phone SKU on another marketplace, the price will look "400 thousand more expensive" for no apparent reason. We detect bundles by keywords in the title and description ("bundle", "free gift", "+ case", "+ headphones") and tag such cards with a separate flag.
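
A keyword check like this is easy to sketch. The marker list below is just the four examples from the text in English; a real list would presumably also hold Russian and Uzbek variants. The function name is hypothetical.

```python
BUNDLE_MARKERS = ("bundle", "free gift", "+ case", "+ headphones")


def is_bundle(title: str, description: str = "") -> bool:
    """Flag a card as a bundle if any marker appears in the title or description."""
    text = f"{title} {description}".lower()
    return any(marker in text for marker in BUNDLE_MARKERS)


print(is_bundle("iPhone 15 Pro Max 256GB + Case"))
print(is_bundle("iPhone 15 Pro Max 256GB"))
```

Plain substring matching is deliberately crude: it will miss bundles that are described only in photos, which is one reason a separate flag is safer than silently dropping the card.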
The third class is refurbished and display units. A refurbished iPhone and a new iPhone can have identical titles, especially if the seller on the marketplace didn't state the status explicitly. We parse the description and the "condition" field, but in 4% of cards this signal is missing, and we have to leave the match in limbo until manual review.
The fourth class is counterfeits. A card is called "Adidas Originals Ozweego", but judging by the photo and the price it's clearly a fake for 280 thousand som. The matching algorithm pairs such cards with the genuine Adidas, and as a result an abnormally cheap "deal" appears on the chart. We filter these cases with a price rule: if a SKU's price in one store is more than 2.5 times lower than the median across the other four (that is, below 40% of that median), the match is sent to manual review rather than published automatically. It's a tradeoff: we lose some genuine discounts, but we protect the user from a false signal.
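
The price rule reduces to one comparison against the median. In the sketch below the function name is hypothetical, the prices of the "genuine" cards are made up for illustration, and the behavior with no other prices is an assumption.

```python
from statistics import median


def needs_price_review(price: float, other_prices: list, ratio: float = 2.5) -> bool:
    """True when `price` is more than `ratio` times lower than the median of the other stores."""
    if not other_prices:
        return False  # assumption: with nothing to compare against, the rule does not fire
    return price * ratio < median(other_prices)


# 280 thousand som against four invented prices for the genuine product.
others = [900_000, 950_000, 1_000_000, 1_100_000]
print(needs_price_review(280_000, others))
print(needs_price_review(800_000, others))
```

Using the median rather than the mean keeps one mispriced card among the other four from shifting the reference point.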
Numbers on our prod catalog
At the moment the Birbozor catalog holds 340 thousand SKUs, collected via Firecrawl from five marketplaces: Uzum, Olcha, Asaxiy, Texnomart, Mediapark.
Of these, about 62 thousand SKUs have at least one confirmed cross-marketplace pair — that is, the same product found on at least two platforms. That's 18% of the catalog. The number seems small, but it's explainable: the long tail of the catalog is regional sellers and one-off cards that simply don't exist anywhere except a single marketplace.
The manual-review queue receives about 3% of all matching attempts. On a 500-pair audit set that we labeled by hand in February, the false-positive rate (false matches that made it to production) is under 0.5%, and the false-negative rate (missed matches that should have been found) is about 7%. The asymmetry is deliberate: better to miss a match than to show the user a wrong "discount".
One full-catalog price-update cycle costs about $18 in Firecrawl credits and takes 4 hours. We run it twice a day.
What's next
The next big bet is image-based matching. Most cards have several product photos, and hash-comparison of images with pHash + segmentation solves several problems at once: it helps with bundles (if a photo shows three boxes, it's not a standalone SKU), it helps with counterfeits (the photos of a genuine and a fake Adidas hash differently), and it helps with colors the seller didn't put into the title. The prototype already works on 5 thousand SKUs.
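
Perceptual hashes are compared by counting differing bits, not by equality. The sketch below shows only that comparison step and assumes 64-bit hashes have already been computed elsewhere; the hash values and the `max_bits` threshold are invented and would need tuning on labeled pairs.

```python
def hamming(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two perceptual hashes stored as integers."""
    return bin(hash_a ^ hash_b).count("1")


def looks_like_same_photo(hash_a: int, hash_b: int, max_bits: int = 8) -> bool:
    # max_bits is a hypothetical threshold, not a measured one.
    return hamming(hash_a, hash_b) <= max_bits


print(hamming(0b1010, 0b0110))
print(looks_like_same_photo(0xFFFF0000FFFF0000, 0xFFFF0000FFFF0001))
```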
In parallel we're launching OCR on packaging photos to pull a GTIN off the box photo in cases where the seller didn't specify it. That will give us one more point of strict verification and let us lower the embedding threshold without losing accuracy.
If you're building something similar
We're gradually publishing the pipeline components (the normalizer, the synonym dictionaries, the audit set) in public repositories — the easiest way to follow progress is through github.com/birbozor and the Telegram channel @birbozor_dev, where once a week I write about what broke and what we fixed. If you have a similar task in another country — drop by, let's talk.