
myblog-search

Application for searching my blog articles at https://blog.seanlee.site/

There are 147 articles: 106 in Chinese and 41 in English. Every English article has a Chinese counterpart, i.e. there are pairs of English and Chinese articles on the same topic. The search bar provided by blogger.com does not recognize this relationship, so the results are duplicated when a search keyword exists in both languages. This application avoids that duplication in search results.

The search application was based on the Vespa Text Search Tutorial at https://docs.vespa.ai/en/tutorials/text-search.html but rewired to use Elasticsearch (https://www.elastic.co/elasticsearch/). It is deployed on Google Kubernetes Engine at https://search.seanlee.site/?query=%E9%AD%9A

The back end is a single-node Elasticsearch server. It is deployed as a StatefulSet with persistent volumes for storing the search index.

The middleware is a stateless Golang program that appends the parameters Elasticsearch/Vespa needs to return search results in JSON format.

The front end is a stateless reverse proxy based on NGINX. It forwards queries to the middleware and renders search results with Vue.js.
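The reverse-proxy layer could look roughly like the following NGINX snippet; the location paths, document root, and upstream service name `middleware` are assumptions, not the actual configuration:

```nginx
# Serve the static Vue.js front end and proxy /search/ to the middleware.
server {
    listen 80;

    location / {
        root /usr/share/nginx/html;   # Vue.js bundle that renders results
        index index.html;
    }

    location /search/ {
        proxy_pass http://middleware:8080;  # Kubernetes service name (assumed)
    }
}
```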

The last component of this search application is a crawler that downloads the blog articles in ATOM format, converts them into Elasticsearch/Vespa JSON document format with a Golang program, and then feeds the documents into the back end. It is deployed as a Kubernetes CronJob with a static persistent volume to retain the downloaded blog feed. The retained feed is used to request only the recently updated entries instead of the full feed, and it can also be used to rebuild/refeed the search index.

The following is the data flow of this search application:

HTTP client --> NGINX --> Middleware --> Vespa <-- Crawler <-- Blog

Docker Compose is also used, but only in the local development environment. It is for practicing and comparing the functional differences between Kubernetes and Docker Compose.
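A minimal Docker Compose sketch of the same topology might look like the following; the service names, ports, image tags, and build paths are assumptions, not the actual compose file:

```yaml
# Assumed service names and ports; the real compose file may differ.
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node        # single-node back end
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data  # persistent search index
  middleware:
    build: ./middleware                   # stateless Golang program
    depends_on:
      - elasticsearch
  nginx:
    image: nginx:stable
    ports:
      - "8080:80"
    depends_on:
      - middleware
volumes:
  esdata:
```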

myblog-search's People

Contributors: sean1975

myblog-search's Issues

Incorrect match for invisible URL in blog content

In Chinese blog content, there is a link to the article's English counterpart. The link contains an invisible URL, but the URL is indexed and can be matched by a query.

For example, query http://localhost:8080/search/?query=time+management matches two blog articles:
http://diaryofsean.blogspot.com/2020/05/time-management.html and http://diaryofsean.blogspot.com/2020/05/blog-post.html

The first article is the correct match because its title is "Time Management". The second one is the Chinese article that contains a link to the first. The link contains "time" and "management", so it matches the query, but that is not the intent of the query. Such invisible URLs and other HTML tags should not be indexed.
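One possible fix is to strip markup at index time: Elasticsearch provides a built-in `html_strip` character filter that removes tags, including their `href` URLs, before tokenization. A sketch of such index settings, sent as the body of an index-creation request (the index name `blog`, analyzer name, and field name `body` are assumptions):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "body_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "body_analyzer" }
    }
  }
}
```

With this analyzer, the invisible URL inside the `<a href="...">` tag is dropped before indexing, while the visible anchor text is kept.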

Hyperlinks from Chinese to English articles are indexed as document body

There is a special hyperlink (HTML tag) at the beginning of each Chinese article. The hyperlink lets English readers navigate to the corresponding English article. The following is an example of the special hyperlink in a Chinese article.

<div style="text-align: right;">
  <a href="http://diaryofsean.blogspot.com/2020/10/congratulations-your-application-has.html">English version</a>
</div>

The above hyperlink is not supposed to be indexed as document body. The consequence is that when searching for the word "English", almost all Chinese articles are matched. For example, the search results for https://search.seanlee.site/search/?query=English are all Chinese articles.

This special hyperlink may be useful for establishing the relationship between Chinese and English articles.

Duplicate Chinese terms are returned from search results

When a Chinese term is tokenized into multiple tokens, all of those tokens are returned when a search matches any one of them. For example, the search results for "布里斯本" (the Australian city name "Brisbane") contain all tokens of "布里斯本", which are "布", "里斯", and "里斯本". (This tokenization result is incorrect and is to be resolved in #6.)

As a result, the search result incorrectly displays and highlights "布里斯里斯本", whereas the expected result is "布里斯本".

Stemming for English is not working

English query words do not match the title/body of English documents when the query words and the title/body are in different forms. For example, the query word "reply" does not match "replied" in the body.

This started happening after adding the Chinese tokenizer Jieba to fix #1.
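If the root cause is that a single analyzer (Jieba) now handles both languages, one possible direction — an assumption, not the actual fix — is to keep English text in separate fields mapped to Elasticsearch's built-in `english` analyzer, which includes a stemmer so "reply" and "replied" reduce to the same token. The field names `title_en`/`body_en` below are hypothetical:

```json
{
  "properties": {
    "title_en": { "type": "text", "analyzer": "english" },
    "body_en":  { "type": "text", "analyzer": "english" }
  }
}
```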
