Comments (14)
I have also spent some time thinking about this and had a similar thought process. I was also thinking about letting an LLM do the categorization (probably via the GPT API rather than ChatGPT), but yeah, it's probably too expensive. Maybe you could get better results via the API and a custom prompt; I don't know what you have tried so far, and I haven't tried anything yet.
I also had the idea of using the product categories. Billa returns the categories in data.articleGroupIds, Spar has similar multi-level categories in mastervalues.categories, and PENNY and MPREIS seem to have a similar system. I've thought about mapping those categories to a canonical category system, but Hofer, for example, doesn't have any categories in its online store as far as I can see. But maybe we can combine this approach with your clustering approach?
One other thing that could improve clustering or search is to give more weight to tokens that also appear in those category names or in keyword-like data fields such as category-facet in SPAR (this is one of the best keyword sources from what I've seen and pretty much gives each product a very good generic name) or SEOName in Hofer.
from heissepreise.
Are Spar's categories actually hierarchical? I checked briefly today. They seemed to be unordered bags of phrases.
Yeah, I'm not sure; I haven't looked that closely into them. Maybe I'll collect them and see what I can find.
There's another very simple heuristic that my flatmate just came up with that could be very powerful: if a search term matches the end of a word, it should be weighted much higher than a match at the beginning of a word, since German compound nouns are usually endocentric with the head at the end. So Buttermilch is a type of Milch (as are Heumilch and Bergbauernmilch), while Teebutter and Joghurtbutter are types of Butter.
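A minimal sketch of that heuristic, to make the ranking effect concrete. The weight values and the function names (tokenScore, nameScore) are made up for illustration and are not taken from the heissepreise code; the thread only says that suffix matches should be weighted "much higher".

```javascript
// Hypothetical weights: suffix matches should clearly outrank prefix matches.
const SUFFIX_WEIGHT = 3;
const PREFIX_WEIGHT = 1;
const INFIX_WEIGHT = 0.5;

// Score a single product-name token against a search term.
function tokenScore(token, term) {
    const t = token.toLowerCase();
    const q = term.toLowerCase();
    if (q.length === 0 || !t.includes(q)) return 0;
    // "Buttermilch" ends with "milch"; exact matches also land here.
    if (t.endsWith(q)) return SUFFIX_WEIGHT;
    if (t.startsWith(q)) return PREFIX_WEIGHT;
    return INFIX_WEIGHT;
}

// Score a product name as the best score over its tokens.
function nameScore(name, term) {
    const tokens = name.split(/[\s-]+/).filter((tok) => tok.length > 0);
    return Math.max(0, ...tokens.map((tok) => tokenScore(tok, term)));
}

const names = ["Teebutter", "Buttermilch", "Heumilch", "Joghurtbutter"];
const ranked = names
    .map((name) => ({ name, score: nameScore(name, "butter") }))
    .sort((a, b) => b.score - a.score);
console.log(ranked);
```

With these weights, a search for "butter" ranks Teebutter and Joghurtbutter above Buttermilch, and Heumilch gets a score of 0.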
from heissepreise.
One more idea I'll throw out there before I forget about it:
We could add a list of keywords to each item during the canonicalisation process that contains not only terms from keyword-like data sources (category names etc., see my reply from before), but also their synonyms (like Erdäpfel:Kartoffeln, Paradeiser:Tomaten, Joghurt:Jogurt, ...).
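As a rough sketch of what such a keyword list could look like: the synonym pairs below are the ones from this comment, while the item shape ({ name, categoryNames }) and the function name buildKeywords are assumptions for illustration only.

```javascript
// Synonym groups taken from the examples in this thread (lowercased).
const SYNONYM_GROUPS = [
    ["erdäpfel", "kartoffeln"],
    ["paradeiser", "tomaten"],
    ["joghurt", "jogurt"],
];

// Lookup from each word to every word in its group.
const SYNONYMS = new Map();
for (const group of SYNONYM_GROUPS) {
    for (const word of group) SYNONYMS.set(word, group);
}

// Collect keywords from the item name and its category names, then add synonyms.
// The item shape { name, categoryNames } is hypothetical.
function buildKeywords(item) {
    const sources = [item.name, ...(item.categoryNames ?? [])];
    const keywords = new Set();
    for (const source of sources) {
        // Split on anything that is not a letter or digit (Unicode-aware for umlauts).
        for (const token of source.toLowerCase().split(/[^\p{L}\p{N}]+/u)) {
            if (token.length === 0) continue;
            keywords.add(token);
            for (const synonym of SYNONYMS.get(token) ?? []) keywords.add(synonym);
        }
    }
    return [...keywords];
}

console.log(buildKeywords({ name: "Bio Erdäpfel festkochend", categoryNames: ["Gemüse"] }));
```

A search for "Kartoffeln" could then match this item through its keyword list even though the product name only says "Erdäpfel".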
from heissepreise.
Oh wow, you're right about the Billa group ids. Jesus, how did I miss that. Also, Spar has a pretty similar hierarchy system. I think we could map the first level of Spar to one or more 2nd levels of Billa, then use kNN to decide what Billa category a thing ends up in. DM also has categories we can map to 1st or 2nd level Billa.
I think this is a winning strategy.
- For each store:
  1. Get the set of possible categories and their hierarchy, if available.
  2. Map each category to one or more 1st or 2nd level categories in the Billa hierarchy.
This is a one-time thing, with some checks for new categories appearing when fetching new data.
Then:
- For each item:
  - If the item's category maps 1:1 onto a Billa category, we are done.
  - Else get all the vectors for the categories the item's category is mapped to, find the most similar vector, and assign its Billa category to the item.
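The "checks for new categories" part could be as simple as the following sketch. The mapping table, the canonical category ids and the item shape ({ name, category }) are placeholders invented for this example; the real mapping would be filled in by hand per store.

```javascript
// Hypothetical hand-written mapping: store category -> canonical category ids.
// The canonical ids are placeholders, not actual Billa group ids.
const storeToCanonical = new Map([
    ["Gemüse", ["canonical-vegetables"]],
    ["Obst", ["canonical-fruit"]],
    ["Milch, Joghurt & Co", ["canonical-milk", "canonical-yoghurt"]],
]);

// Return the store categories in freshly fetched items that have no mapping yet.
function findUnmappedCategories(items, mapping) {
    const unmapped = new Set();
    for (const item of items) {
        if (item.category != null && !mapping.has(item.category)) {
            unmapped.add(item.category);
        }
    }
    return [...unmapped].sort();
}

const fetched = [
    { name: "Karotten 1kg", category: "Gemüse" },
    { name: "Grillkohle", category: "Grillen" },
    { name: "Unbekannt", category: null },
];
console.log(findUnmappedCategories(fetched, storeToCanonical));
```

Running a check like this after each fetch would surface store categories that still need a manual mapping.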
Your flatmate's idea is neat! I've played with the synonyms idea for Erdäpfel today already. The issue with adding words to bag-of-words vectors is that weighting things "higher" is non-trivial. It's not a constant factor across all vectors.
from heissepreise.
MPREIS also has categories we can easily map. Lidl doesn't. Hofer doesn't. Guess those we cover with kNN, which should be pretty OK, since we got lots of well categorized items from the other stores, if the mapping works.
from heissepreise.
I suppose we add a mapToClosestCategories(item) to each store, which returns zero, one or more mappings from the store category to the "canonical" Billa categories.
A single mapping then consists of the 2D index into globalCategories in store/utils.js. Some code in analysis.js is then responsible for picking the best fitting canonical category, either by a direct match (one mapping returned), a kNN match against the vectors for all returned mappings (> 1 mapping returned), or a kNN match against all canonical category vectors (no mapping returned).
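The three cases could be sketched roughly like this. To keep the example short, it makes several assumptions that are not from the actual code: mappings are plain string ids instead of 2D indices into globalCategories, every canonical category is represented by a single vector, and kNN is reduced to k = 1 with cosine similarity.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Most similar candidate (k = 1), or null if there are no candidates.
function nearest(itemVector, candidates) {
    let best = null;
    let bestSim = -Infinity;
    for (const candidate of candidates) {
        const sim = cosine(itemVector, candidate.vector);
        if (sim > bestSim) {
            best = candidate;
            bestSim = sim;
        }
    }
    return best;
}

// mappings: canonical category ids returned by the store's mapping function.
// canonical: all canonical categories as { id, vector } (hypothetical shape).
function pickCategory(itemVector, mappings, canonical) {
    if (mappings.length === 1) return mappings[0]; // direct match
    const candidates =
        mappings.length > 1
            ? canonical.filter((c) => mappings.includes(c.id)) // only the mapped categories
            : canonical; // no mapping: search all canonical categories
    const best = nearest(itemVector, candidates);
    return best ? best.id : null;
}
```

A store without categories would always pass an empty mappings array and so fall through to the last case.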
I think this will work pretty well. Just need to put in the grunt work of mapping each store's categories to the canonical ones.
from heissepreise.
Hofer actually has categories (2 levels).
Hofer categories
"Obst & Gemüse",
"Gemüse",
"Obst",
"Brot & Gebäck",
"BACKBOX",
"Brot & Gebäck",
"Süße Backwaren",
"Kühlwaren",
"Milch, Joghurt & Co",
"Käse & Aufstriche",
"Schnelle Küche",
"Desserts und Süßspeisen",
"Eier",
"Teig- & Backwaren",
"Fleisch, Wurst & Fisch",
"Wurst & Aufschnitt",
"Fleisch",
"Fisch & Meeresfrüchte",
"Vorratsschrank",
"Mehl & Backwaren",
"Süße Aufstriche",
"Müsli & Cerealien",
"Konserven",
"Fertiggerichte & Suppen",
"Essig, Öl & Saucen",
"Reis, Teigwaren & Getreide",
"Gewürze",
"Nüsse & Trockenfrüchte",
"Zucker & Süßungsmittel",
"Süßes & Salziges",
"Süßes",
"Salziges",
"Tiefkühlwaren",
"Fleisch",
"Gemüse & Obst",
"Desserts, Backwaren & Eis",
"Pizza & Fertiggerichte",
"Fisch & Meeresfrüchte",
"Vegetarisch & Vegan ",
"Pflanzliche Produkte",
"Margarine & pflanzliche Fette",
"Fleischersatz- Produkte",
"Getränke",
"Spirituosen",
"Wein & Sekt",
"Bier & Radler",
"Kaffee, Tee & Co",
"Alkoholfreie Getränke",
"Drogerie",
"Pflege- & Hygieneartikel",
"Babyartikel",
"Hausapotheke & Sportnahrung",
"Damensocken & Feinstrümpfe",
"Haushalt",
"Waschen & Reinigen",
"Hygieneartikel",
" Servietten, Kerzen & Co",
"Haushaltsfolien",
"Tragetaschen",
"Batterien & Akkus",
"Tierbedarf",
"Hundenahrung",
"Katzenzubehör",
"Katzennahrung",
from heissepreise.
There is also a thing called FTS (full text search) where you can order results by "ranking". I used this for https://inflation.43z.one with sqlite. It works very well. If I search for "butter" and order by ranking, I get exactly the results one would expect. Not sure if this info helps you in any way but just wanted to bring it to your attention.
from heissepreise.
@h43z what text data do you index? Product name + description? For bm25 or tfidf based ranking, the product names alone should be insufficient.
from heissepreise.
I put everything in the index: name, description, grammage.
from heissepreise.
I've quickly implemented a BM25 based search index POC, see b6a9ec8.
You can try it on the command line via node site/js/bm25.js, assuming you have a data/latest-canonical.json.br file. Just wait for it to index the items, then enter your search queries.
It does standard BM25 + German Porter stemmer and some massaging of the tokens. It's reasonably fast indexing just the product names.
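For readers unfamiliar with BM25, here is a stripped-down sketch of the scoring over product names. It is not the code from site/js/bm25.js: the stemmer and token massaging are left out, and the parameters k1 = 1.2 and b = 0.75 are just the commonly used defaults, assumed here.

```javascript
const K1 = 1.2; // term-frequency saturation (assumed default)
const B = 0.75; // length normalisation strength (assumed default)

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
    return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter((t) => t.length > 0);
}

// Build per-document term frequencies, document frequencies and the average length.
function buildIndex(names) {
    const docs = names.map((name) => {
        const tokens = tokenize(name);
        const tf = new Map();
        for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
        return { name, length: tokens.length, tf };
    });
    const df = new Map();
    for (const doc of docs) {
        for (const term of doc.tf.keys()) df.set(term, (df.get(term) ?? 0) + 1);
    }
    const totalLength = docs.reduce((sum, d) => sum + d.length, 0);
    const avgLength = docs.length > 0 ? totalLength / docs.length : 0;
    return { docs, df, avgLength };
}

// Score every document against the query and return matches, best first.
function search(index, query) {
    const N = index.docs.length;
    const terms = tokenize(query);
    const results = [];
    for (const doc of index.docs) {
        let score = 0;
        for (const term of terms) {
            const f = doc.tf.get(term) ?? 0;
            if (f === 0) continue;
            const n = index.df.get(term);
            const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
            const norm = f + K1 * (1 - B + (B * doc.length) / index.avgLength);
            score += (idf * f * (K1 + 1)) / norm;
        }
        if (score > 0) results.push({ name: doc.name, score });
    }
    return results.sort((a, b) => b.score - a.score);
}

const index = buildIndex(["Kartoffeln festkochend 2kg", "Erdäpfel mehlig", "Kartoffeln"]);
console.log(search(index, "kartoffeln"));
```

Without a stemmer or synonym handling, the query "kartoffeln" only finds the two items that literally contain that token; the "Erdäpfel" item is not returned.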
I've tested it against a few queries like kartoffel, kartoffeln and it yields results similar to your sqlite FTS results. Both suffer from the classic FTS issues. E.g. search for "Kartoffeln" and Billa's "Erdäpfel" will not show up (they do in my case due to magic). Compound words (Komposita), which are super common in German, are not handled well. Synonyms aren't resolved either.
I can add this as a sorting option, but I don't think the results will be improved by much tbh.
For the use cases we aim for (help find the cheapest X, do data analysis), I think a category filter is still better.
from heissepreise.
We have categories for most relevant Austrian stores now, created through manual mappings. Closing this out.
from heissepreise.