
Comments (14)

simmac commented on July 17, 2024

I have also spent some time thinking about this and had a similar thought process. I was also thinking about letting an LLM do the categorization (probably via the GPT API rather than ChatGPT), but yeah, it's probably too expensive. Maybe you could get better results via the API and a custom prompt; I don't know what you have tried so far, and I haven't tried anything yet.

I also had the idea of using the product categories. Billa returns the categories in data.articleGroupIds, Spar has similar multi-level categories in mastervalues.categories, PENNY and MPREIS seem to have a similar system. I've thought about mapping those categories to a canonical category-system, but e.g. Hofer doesn't have any categories in their online-store as far as I can see. But maybe we can combine this approach with your clustering approach?

One other thing that could improve clustering or search is to give more weight to tokens that also appear in those category names or in keyword-like data fields, such as category-facet in SPAR (this is one of the best keyword sources from what I've seen and pretty much gives each product a very good generic name) or SEOName in Hofer.

from heissepreise.

badlogic commented on July 17, 2024


simmac commented on July 17, 2024

> Are Spar's categories actually hierarchical? I checked briefly today. They seemed to be unordered bags of phrases.

Yea, I'm not sure, I haven't looked that closely into them. Maybe I'll collect them and see what I can find.

There's another very simple heuristic that my flatmate just came up with that could be very powerful: if a search term matches the end of a word, it should be weighted much higher than a match at the beginning of a word, since German compound nouns are usually endocentric, with the head at the end. So Buttermilch is a type of Milch (as are Heumilch and Bergbauernmilch), while Teebutter and Joghurtbutter are types of Butter.
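A minimal sketch of this heuristic, assuming a simple per-word score. The function name `scoreMatch` and the weights 3/2/1 are made up for illustration and are not part of heissepreise:

```javascript
// Hypothetical scoring helper; the weights are illustrative assumptions.
function scoreMatch(productName, term) {
    const needle = term.toLowerCase();
    let best = 0;
    // Split the name into lowercase words made of German letters only.
    for (const word of productName.toLowerCase().split(/[^a-zäöüß]+/)) {
        if (!word) continue;
        let score = 0;
        if (word === needle) score = 3;            // the term is the whole word
        else if (word.endsWith(needle)) score = 2; // term is the head, e.g. "milch" in "buttermilch"
        else if (word.includes(needle)) score = 1; // term is a modifier, e.g. "butter" in "buttermilch"
        best = Math.max(best, score);
    }
    return best;
}

const names = ["Buttermilch 500ml", "Teebutter 250g", "Butter", "Joghurtbutter"];
const ranked = names
    .map((name) => ({ name, score: scoreMatch(name, "butter") }))
    .sort((x, y) => y.score - x.score);
console.log(ranked);
```

For the query "butter", this ranks Butter first, Teebutter and Joghurtbutter next, and Buttermilch last.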


simmac commented on July 17, 2024

One more idea I'll throw out there before I forget about it:

We could add a list of keywords to each item during the canonicalisation process that not only contains terms from keyword-like data sources (category names etc., see my reply from before), but also their synonyms (like Erdäpfel:Kartoffeln, Paradeiser:Tomaten, Joghurt:Jogurt, ...).
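Roughly, such a keyword list could be built like this. It is only a sketch: the synonym table holds just the pairs mentioned above, and `expandKeywords` is a hypothetical helper, not an existing function in the project:

```javascript
// Hypothetical synonym table using only the pairs mentioned above.
const synonymGroups = [
    ["erdäpfel", "kartoffeln"],
    ["paradeiser", "tomaten"],
    ["joghurt", "jogurt"],
];

// Map every term to the full group it belongs to.
const synonyms = new Map();
for (const group of synonymGroups) {
    for (const term of group) synonyms.set(term, group);
}

// Returns the given keyword-like terms (lowercased) plus all known synonyms.
function expandKeywords(terms) {
    const keywords = new Set();
    for (const term of terms) {
        const key = term.toLowerCase();
        keywords.add(key);
        for (const synonym of synonyms.get(key) ?? []) keywords.add(synonym);
    }
    return [...keywords];
}

console.log(expandKeywords(["Erdäpfel", "festkochend"]));
```

A search for "kartoffeln" could then also hit an item whose source data only says "Erdäpfel".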


badlogic commented on July 17, 2024

Oh wow, you're right about the Billa group ids. Jesus, how did I miss that. Also, Spar has a pretty similar hierarchy system. I think we could map the first level of Spar to one or more 2nd levels of Billa, then use kNN to decide what Billa category a thing ends up in. DM also has categories we can map to 1st or 2nd level Billa.

I think this is a winning strategy.

  1. For each store
    • Get the set of possible categories and their hierarchy, if available.
    • Map each category to one or more 1st or 2nd level categories in the Billa hierarchy.

This is a one-time thing, with some checks for new categories appearing when fetching new data.
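As a rough sketch, the one-time mapping plus the check for new categories could look like this. The mapping entries, the canonical names on the right-hand side, and the `category` field on items are all assumptions made for illustration:

```javascript
// Hypothetical mapping from one store's category paths to canonical categories.
// Both the keys and the canonical names are made up for illustration.
const storeToCanonical = new Map([
    ["Obst & Gemüse/Obst", ["Obst"]],
    ["Obst & Gemüse/Gemüse", ["Gemüse"]],
    ["Kühlwaren/Milch, Joghurt & Co", ["Milchprodukte"]],
]);

// Returns the store categories found in freshly fetched items that the
// mapping does not cover yet, so they can be mapped by hand.
function findUnmappedCategories(items, mapping) {
    const unmapped = new Set();
    for (const item of items) {
        if (!mapping.has(item.category)) unmapped.add(item.category);
    }
    return [...unmapped];
}

const fetched = [
    { name: "Apfel", category: "Obst & Gemüse/Obst" },
    { name: "Hundefutter", category: "Tierbedarf/Hundenahrung" },
];
console.log(findUnmappedCategories(fetched, storeToCanonical));
```

Running the check after each fetch would flag "Tierbedarf/Hundenahrung" here as a category that still needs a manual mapping.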

Then:

  1. For each item
    • If the item's category maps 1:1 onto a Billa category, we are done
    • Else get all the vectors for the categories the item's category is mapped to, find the most similar vector, assign its Billa category to the item

Your flatmate's idea is neat! I've played with the synonyms idea for Erdäpfel today already. The issue with adding words to bag-of-words vectors is that weighting things "higher" is non-trivial. It's not a constant factor across all vectors.


badlogic commented on July 17, 2024

MPREIS also has categories we can easily map. Lidl doesn't. Hofer doesn't. Guess those we cover with kNN, which should be pretty OK, since we got lots of well categorized items from the other stores, if the mapping works.


badlogic commented on July 17, 2024

I suppose we add a mapToClosestCategories(item) to each store, which returns zero, one, or more mappings from the store category to the "canonical" Billa categories.

A single mapping then consists of the 2D index into globalCategories in store/utils.js. Some code in analysis.js is then responsible for picking the best-fitting canonical category: either by a direct match (one mapping returned), a kNN match against the vectors for all returned mappings (> 1 mapping returned), or a kNN match against all canonical category vectors (no mapping returned).
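The three cases could be sketched as follows. This is a simplification under stated assumptions, not the code in analysis.js: mappings are plain category names instead of 2D indices, each canonical category has a single made-up bag-of-words vector, and the kNN step is reduced to picking the single nearest vector by cosine similarity:

```javascript
// Bag-of-words vector as a Map from token to count.
function vectorize(text) {
    const vector = new Map();
    for (const token of text.toLowerCase().split(/[^a-zäöüß]+/).filter(Boolean)) {
        vector.set(token, (vector.get(token) ?? 0) + 1);
    }
    return vector;
}

// Cosine similarity between two sparse vectors; 0 if either one is empty.
function cosine(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (const [token, count] of a) {
        dot += count * (b.get(token) ?? 0);
        normA += count * count;
    }
    for (const count of b.values()) normB += count * count;
    return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Hypothetical canonical categories, each with one representative vector.
const canonical = {
    "Milchprodukte": vectorize("milch joghurt butter käse"),
    "Obst": vectorize("apfel banane birne"),
    "Gemüse": vectorize("kartoffeln tomaten gurke"),
};

// mappings: canonical category names the store returned for this item.
function pickCategory(itemName, mappings) {
    if (mappings.length === 1) return mappings[0]; // direct match
    // Several mappings: compare only against those; none: compare against all.
    const candidates = mappings.length > 1 ? mappings : Object.keys(canonical);
    const itemVector = vectorize(itemName);
    let best = null, bestScore = -1;
    for (const name of candidates) {
        const score = cosine(itemVector, canonical[name]);
        if (score > bestScore) { best = name; bestScore = score; }
    }
    return best; // falls back to the first candidate if nothing overlaps
}

console.log(pickCategory("Bio Apfel", ["Obst", "Gemüse"]));
console.log(pickCategory("Frische Butter", []));
```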

I think this will work pretty well. Just need to put in the grunt work of mapping each store's categories to the canonical ones.


mhochsteger commented on July 17, 2024

Hofer actually has categories (2 levels).

Hofer categories
"Obst & Gemüse",
    "Gemüse",
    "Obst",
"Brot & Gebäck",
    "BACKBOX",
    "Brot & Gebäck",
    "Süße Backwaren",
"Kühlwaren",
    "Milch, Joghurt & Co",
    "Käse & Aufstriche",
    "Schnelle Küche",
    "Desserts und Süßspeisen",
    "Eier",
    "Teig- & Backwaren",
"Fleisch, Wurst & Fisch",
    "Wurst & Aufschnitt",
    "Fleisch",
    "Fisch & Meeresfrüchte",
"Vorratsschrank",
    "Mehl & Backwaren",
    "Süße Aufstriche",
    "Müsli & Cerealien",
    "Konserven",
    "Fertiggerichte & Suppen",
    "Essig, Öl & Saucen",
    "Reis, Teigwaren & Getreide",
    "Gewürze",
    "Nüsse & Trockenfrüchte",
    "Zucker & Süßungsmittel",
"Süßes & Salziges",
    "Süßes",
    "Salziges",
"Tiefkühlwaren",
    "Fleisch",
    "Gemüse & Obst",
    "Desserts, Backwaren & Eis",
    "Pizza & Fertiggerichte",
    "Fisch & Meeresfrüchte",
"Vegetarisch & Vegan ",
    "Pflanzliche Produkte",
    "Margarine & pflanzliche Fette",
    "Fleischersatz- Produkte",
"Getränke",
    "Spirituosen",
    "Wein & Sekt",
    "Bier & Radler",
    "Kaffee, Tee & Co",
    "Alkoholfreie Getränke",
"Drogerie",
    "Pflege- & Hygieneartikel",
    "Babyartikel",
    "Hausapotheke & Sportnahrung",
    "Damensocken & Feinstrümpfe",
"Haushalt",
    "Waschen & Reinigen",
    "Hygieneartikel",
    " Servietten, Kerzen & Co",
    "Haushaltsfolien",
    "Tragetaschen",
    "Batterien & Akkus",
"Tierbedarf",
    "Hundenahrung",
    "Katzenzubehör",
    "Katzennahrung",


badlogic commented on July 17, 2024


h43z commented on July 17, 2024

There is also a thing called FTS (full text search) where you can order results by "ranking". I used this for https://inflation.43z.one with sqlite. It works very well. If I search for "butter" and order by ranking, I get exactly the results one would expect. Not sure if this info helps you in any way but just wanted to bring it to your attention.


badlogic commented on July 17, 2024

@h43z what text data do you index? Product name + description? For bm25 or tfidf based ranking, the product names alone should be insufficient.


h43z commented on July 17, 2024

I put everything in the index: name, description, grammage.


badlogic commented on July 17, 2024

I've quickly implemented a BM25 based search index POC, see b6a9ec8

You can try it on the command line via node site/js/bm25.js assuming you have a data/latest-canonical.json.br file. Just wait for it to index the items, then enter your search queries.

It does standard BM25 + German Porter stemmer and some massaging of the tokens. It's reasonably fast indexing just the product names.

I've tested it against a few queries like kartoffel, kartoffeln and it yields results similar to your sqlite FTS results. Both suffer from the classic FTS issues. E.g. search for "Kartoffeln" and Billa's "Erdäpfel" will not show up (they do in my case due to magic). Compound nouns (Komposita), which are super common in German, are not handled well. Synonyms aren't resolved either.
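For readers who want to see the mechanics, here is a minimal sketch of standard Okapi BM25 over product names. It is not the code from the commit above: there is no stemming or token massaging, and the function names, the tokenizer, the sample product names, and the parameters k1 = 1.2 and b = 0.75 are assumptions chosen for illustration:

```javascript
// Lowercase and split on anything that is not a German letter or digit.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-zäöüß0-9]+/).filter(Boolean);
}

function buildIndex(docs) {
    const docTokens = docs.map(tokenize);
    const docFreq = new Map(); // token -> number of documents containing it
    for (const tokens of docTokens) {
        for (const token of new Set(tokens)) {
            docFreq.set(token, (docFreq.get(token) ?? 0) + 1);
        }
    }
    const avgLength = docTokens.reduce((sum, t) => sum + t.length, 0) / docs.length;
    return { docs, docTokens, docFreq, avgLength };
}

// Scores every document against the query and returns matches, best first.
function search(index, query, k1 = 1.2, b = 0.75) {
    const n = index.docs.length;
    const terms = tokenize(query);
    const results = index.docTokens.map((tokens, i) => {
        let score = 0;
        for (const term of terms) {
            const tf = tokens.filter((t) => t === term).length;
            if (tf === 0) continue;
            const df = index.docFreq.get(term);
            const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
            const lengthNorm = 1 - b + b * (tokens.length / index.avgLength);
            score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
        }
        return { doc: index.docs[i], score };
    });
    return results.filter((r) => r.score > 0).sort((x, y) => y.score - x.score);
}

const index = buildIndex([
    "Kartoffeln festkochend 2kg",
    "Erdäpfel mehlig 2kg",
    "Kartoffelpuffer",
]);
console.log(search(index, "kartoffeln"));
```

With these three sample names, the query "kartoffeln" only returns the first one. Neither the synonym "Erdäpfel" nor the compound "Kartoffelpuffer" matches, which is exactly the kind of limitation described above.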

I can add this as a sorting option, but I don't think the results will be improved by much tbh.

For the use cases we aim for (help find the cheapest X, do data analysis), I think a category filter is still better.


badlogic commented on July 17, 2024

We have categories for most relevant Austrian stores now, created through manual mappings. Closing this out.

