
myblog-search

Application for searching my blog articles at https://blog.seanlee.site/

There are 147 articles: 106 in Chinese and 41 in English. Every English article has a Chinese counterpart, i.e. there are pairs of English and Chinese articles on the same topic. The search bar provided by blogger.com does not recognize this relationship, so the results are duplicated when a search keyword exists in both languages. This application avoids that duplication in search results.

The search application was based on the Vespa Text Search Tutorial at https://docs.vespa.ai/en/tutorials/text-search.html but rewired to use Elasticsearch (https://www.elastic.co/elasticsearch/). It is deployed on Google Kubernetes Engine at https://search.seanlee.site/?query=%E9%AD%9A

The back end is a single-node Elasticsearch server. It is deployed as a StatefulSet with persistent volumes for storing the search index.

The middleware is a stateless Golang program that appends the parameters Elasticsearch/Vespa needs to return search results in JSON format.

The front end is a stateless reverse proxy based on NGINX. It forwards queries to the middleware and renders search results with Vue.js.
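The reverse-proxy layer could look roughly like the following NGINX snippet; the location paths, document root, and upstream service name `middleware` are assumptions, not the actual configuration:

```nginx
# Serve the static Vue.js front end and proxy /search/ to the middleware.
server {
    listen 80;

    location / {
        root /usr/share/nginx/html;   # Vue.js bundle that renders results
        index index.html;
    }

    location /search/ {
        proxy_pass http://middleware:8080;  # Kubernetes service name (assumed)
    }
}
```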

The last component of this search application is a crawler that downloads the blog articles in ATOM format, converts them into Elasticsearch/Vespa JSON document format with a Golang program, and then feeds the documents into the back end. It is deployed as a Kubernetes CronJob with a static persistent volume to retain the downloaded blog feed. The retained feed is used to request only the recently updated entries instead of the full feed, and it can also be used to rebuild/refeed the search index.

The following is the data flow of this search application:

HTTP client --> NGINX --> Middleware --> Vespa <-- Crawler <-- Blog

Docker Compose is also used, but only in the local development environment. It is for practicing and comparing the functional differences between Kubernetes and Docker Compose.
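A minimal Docker Compose sketch of the same topology might look like the following; the service names, ports, image tags, and build paths are assumptions, not the actual compose file:

```yaml
# Assumed service names and ports; the real compose file may differ.
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node        # single-node back end
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data  # persistent search index
  middleware:
    build: ./middleware                   # stateless Golang program
    depends_on:
      - elasticsearch
  nginx:
    image: nginx:stable
    ports:
      - "8080:80"
    depends_on:
      - middleware
volumes:
  esdata:
```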

myblog-search's People

Contributors: sean1975

myblog-search's Issues

Incorrect match for invisible URL in blog content

In Chinese blog content, there is a link to the article's English counterpart. The link contains an invisible URL, but the URL is indexed and can be matched by a query.

For example, query http://localhost:8080/search/?query=time+management matches two blog articles:
http://diaryofsean.blogspot.com/2020/05/time-management.html and http://diaryofsean.blogspot.com/2020/05/blog-post.html

The first article is the correct match because its title is "Time Management". The second one is the Chinese article that contains a link to the first. The link contains "time" and "management", so it matches the query, but that is not the intent of the query. Such invisible URLs and other HTML tags should not be indexed.
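One possible fix is to strip markup at index time: Elasticsearch provides a built-in `html_strip` character filter that removes tags, including their `href` URLs, before tokenization. A sketch of such index settings, sent as the body of an index-creation request (the index name `blog`, analyzer name, and field name `body` are assumptions):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "body_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "body_analyzer" }
    }
  }
}
```

With this analyzer, the invisible URL inside the `<a href="...">` tag is dropped before indexing, while the visible anchor text is kept.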

Hyperlinks from Chinese to English articles are indexed as document body

There is a special hyperlink (HTML tag) at the beginning of each Chinese article. The hyperlink lets English readers navigate to the corresponding English article. The following is an example of the special hyperlink in a Chinese article.

<div style="text-align: right;">
  <a href="http://diaryofsean.blogspot.com/2020/10/congratulations-your-application-has.html">English version</a>
</div>

The above hyperlink is not supposed to be indexed as document body. The consequence is that when searching for the word "English", almost all Chinese articles are matched. For example, the search results for https://search.seanlee.site/search/?query=English are all Chinese articles.

This special hyperlink may be useful for establishing the relationship between Chinese and English articles.

Duplicate Chinese terms are returned from search results

When a Chinese term is tokenized into multiple tokens, all of those tokens are returned when a search matches any one of them. For example, the search results for "布里斯本" (the Australian city name "Brisbane") contain all tokens of "布里斯本", which are "布", "里斯", and "里斯本". (This tokenization result is incorrect and is to be resolved in #6.)

As a result, the search result incorrectly displays and highlights "布里斯里斯本", whereas the expected result is "布里斯本".

Stemming for English is not working

English query words do not match the title/body of English documents when the query words and the title/body are in different forms. For example, the query word "reply" does not match "replied" in the body.

This started happening after adding the Chinese tokenizer Jieba to fix #1.
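If the root cause is that a single analyzer (Jieba) now handles both languages, one possible direction — an assumption, not the actual fix — is to keep English text in separate fields mapped to Elasticsearch's built-in `english` analyzer, which includes a stemmer so "reply" and "replied" reduce to the same token. The field names `title_en`/`body_en` below are hypothetical:

```json
{
  "properties": {
    "title_en": { "type": "text", "analyzer": "english" },
    "body_en":  { "type": "text", "analyzer": "english" }
  }
}
```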
