In much of our work in the public service, some of us are tasked with making sense of current macro- and micro-trends. Doing so takes significant man-hours, and some departments have even hired dedicated headcount for it. The resulting digests are then sent to all officers in the department, but there is no visibility into readership. Our team therefore built a solution and is serving an alpha version to selected public officers as early users.
The goal of this project is to expand the current application, which covers only tech news, to geopolitics-related news. Users pick their preferred topics (not keywords), and a set of 10 articles is sent to them, ranked by their topic preferences and trendiness.
Why geopolitics? This has been a recurring theme in our discussions with agencies. There are also many movements underway, such as de-dollarization, that may be signals of a changing world order.
- Scrape data from reputable sources on international-relations news (text first); we are also considering how to use video commentaries as an additional data source
- Clean the data - will explore using spaCy
- Understand the similarities and differences between articles based on the words in their text
- Extract keywords/phrases and perform named-entity recognition with spaCy, then group articles by topic modelling
- I am thinking of using BERT on the key phrases to group them into three levels of topics: L1 (Categories), L2 (Topics), and L3 (Keywords)
- I will also compare BERT against other transformer or large language models such as GPT
- The L1 and L2 topics are exposed to users, who can select the topics they want to read every morning
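Before spaCy preprocessing and BERT embeddings are in place, the "similarity by words" step above can be sketched with plain TF-IDF and cosine similarity. This is a stdlib-only stand-in, not the project's implementation; all names (`tfidf_vectors`, `cosine`) and the sample tokens are illustrative.

```python
# Sketch: article similarity from word overlap via TF-IDF + cosine.
# A stand-in for the eventual spaCy/BERT pipeline, stdlib only.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight}."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy tokenized articles: two on trade policy, one on sport.
docs = [
    ["tariff", "china", "trade", "export"],
    ["tariff", "china", "semiconductor", "export"],
    ["football", "league", "final"],
]
vecs = tfidf_vectors(docs)
```

With these toy documents, the two trade articles score higher against each other than either does against the sports article, which is exactly the grouping signal the topic-modelling step needs.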
Besides users' preferences, I will weight the ranking by a few more factors:
- Source: based on the 'credibility' of the source
- Hot news/trends: based on how fresh the news is, by datetime and by Google Trends
- Geography: perhaps prioritize by country. This is the hard part, and it depends on the entities we can extract from the text
This is based on the users' preferences. Currently we have only 253 users, with 15 topics all on tech, so there might be a cold-start problem.
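One common cold-start mitigation, sketched here as an assumption rather than a decided approach: blend each user's own topic score with global topic popularity, trusting the personal signal more as that user's interaction count grows. `blended_score` and the form of `alpha` are illustrative.

```python
# Sketch: cold-start blending of per-user and global topic scores.
def blended_score(user_score, global_score, n_interactions, k=10):
    """With few interactions, lean on global popularity; k is the
    interaction count at which both signals weigh equally."""
    alpha = n_interactions / (n_interactions + k)
    return alpha * user_score + (1 - alpha) * global_score
```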
This is tougher - what is the best way to show a particular topic trending over time? I need to do some research.
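As one starting point for that research, a simple trend signal is to compare today's mention count for a topic against a trailing baseline using a z-score. The window size and the data below are illustrative assumptions, not project choices.

```python
# Sketch: flag a topic as trending when today's mention count sits
# well above its trailing-window baseline.
from statistics import mean, stdev

def trend_zscore(daily_counts, window=7):
    """daily_counts: mention counts per day, most recent last.
    Returns how many standard deviations today sits above the
    trailing-window baseline (0.0 if the baseline is flat)."""
    history, today = daily_counts[-window - 1:-1], daily_counts[-1]
    sd = stdev(history)
    if sd == 0:
        return 0.0
    return (today - mean(history)) / sd
```

A spike from a steady baseline yields a large positive score, while a flat series yields zero, so a threshold on the z-score gives a first-cut "trending" flag.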
- Number of clicks on articles recommended in their emails
- Increase in the number of users after the addition of the geopolitical scope
- There may not be many 'free' international journals; many are behind paywalls, so not all websites can be scraped. Need to check with Shao Quan-fu
- The scraper might break over time as websites change how they structure their content (which happens often)
- During named-entity recognition, it may be hard for the model to gauge a country's importance from its name alone
- If this is productionized without a human in the loop, we might send controversial news to users, which is a real risk