fetchIt

A Wikipedia XML dump Search Engine

Introduction

fetchIt is a search engine designed for Wikipedia XML dumps. It was developed as my semester project for the course Data Structures and Algorithms. The system is written in Python.

Dataset

The dataset is a Wikipedia dump in XML format. The total size of the dataset was 1.5 GB, containing nearly 450,000 pages and 125,000 articles. (https://meta.m.wikimedia.org/wiki/Data_dump_torrents)

Scalability

The engine is scalable and can work with much larger datasets containing nearly 1,000,000 pages. The indexer was initially tested successfully on a 69 GB dataset, but due to time constraints I had to pick a smaller one. With a few optimizations the engine could efficiently index and search a dataset of up to 70 GB containing millions of pages; as of now, however, it works with datasets of 7 GB or less.

System Features

  • Index size is 1/4 of the total dataset size
  • Query time is effectively < 1 s
  • Relevance sorting of final results
  • Clickable titles for the full text of the article (raw XML)
  • Total results returned and query time information
  • Independent and reusable project components

Design

The system is designed to be easy to set up, and the design is robust enough to index datasets of different sizes. The searcher is completely independent of the dataset: given an index created by the indexer (a particular format of index files), the searcher can look documents up on its own. As of now, however, it can only handle XML-formatted datasets below 7 GB in size.

System Architecture

The system consists of three independent modules.

generateDocumentRepository

This module generates the repository of documents from the dataset. It produces “.dat” files, each containing 100 articles; 270 articles, for example, would yield a total of 3 files in the repository. The repository follows a fixed naming pattern, which makes it easy to return a particular document when it is requested by its ID. The repository is used later to return the text of an article. The module can create a repository from a wiki XML dump of any size. It also generates a ‘titles’ file containing the titles of all the articles.
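
A minimal sketch of how such a repository might be produced. The chunking into 100-article files and the ‘titles’ file follow the description above; the function name, the exact on-disk layout, and the file naming (per the dXXXX00 convention described under Repository below) are assumptions, not the project's actual code:

```python
import os

def write_repository(articles, out_dir="repository"):
    # Hypothetical sketch: `articles` is a list of (title, raw_xml) pairs,
    # implicitly numbered 0..N-1 by document ID.
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(articles), 100):
        chunk = articles[start:start + 100]  # one ".dat" file per 100 articles
        # file named per the dXXXX00 convention described under "Repository"
        path = os.path.join(out_dir, f"d{start + 100}.dat")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(xml for _, xml in chunk))
    # the "titles" file lists every article title, in document-ID order
    with open(os.path.join(out_dir, "titles"), "w", encoding="utf-8") as f:
        f.writelines(title + "\n" for title, _ in articles)
```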

Indexer

The indexer takes a dataset and creates its index files. A total of 2^8 = 256 barrels are created, each containing the word IDs that fall in its range. Word IDs are generated with the CRC32 algorithm, which has a collision rate of less than 0.001%, so each word gets a unique ID in the form of a 4-byte integer. The first byte of the ID determines which range a word falls in, and the word is then inserted into the respective barrel along with its hitlist. Before registering a word, the indexer stems it (e.g. “running” becomes “run”) and filters out stop words (words that do not distinguish documents from each other, such as “is”, “am”, “are”). The indexer also creates a lexicon that registers every word found in the dataset together with its pointer into the index files.
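
A sketch of the ID and barrel computation, assuming Python's built-in `zlib.crc32` is the CRC32 in question (the function names are illustrative):

```python
import zlib

def word_id(word: str) -> int:
    # CRC32 maps each (stemmed) word to a unique 4-byte integer ID
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF

def barrel_number(wid: int) -> int:
    # the first (most significant) byte selects one of 2^8 = 256 barrels
    return wid >> 24
```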

Searcher

The searcher is used to search for documents in the repository. It uses the index files created by the indexer, sorts the results by relevance, and tries to return as many results as it finds. It supports single word, multi word, and phrase queries.

Major Data Structures

Repository

The repository contains “.dat” files, each holding 100 articles. The files follow a naming convention of the form ‘dXXXX00.dat’, where XXXX is a variable-length string that encodes the range of document IDs contained in the file. A file named d135600.dat, for example, contains the documents with IDs ranging from 135500 to 135599.
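
Under that reading of the convention (each file named after the exclusive upper bound of its 100-ID block, an assumption consistent with the example above), mapping a document ID to its file is one line:

```python
def repository_file(doc_id: int) -> str:
    # docs 135500-135599 live in d135600.dat, docs 0-99 in d100.dat, etc.
    return f"d{(doc_id // 100 + 1) * 100}.dat"
```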

TemporaryIndex

The temporary index (a modified form of forward index) contains the same barrels as the final sorted index. It stores words followed by their hitlists, but the words are not sorted by their IDs; instead, document IDs determine the order in which the words are written. Because of this, a word can be registered more than once in the temporary index if it is found again in another document. In each pass, 1,000 documents are parsed and their words written into the respective barrels; if a word is found again in the next pass, it is registered again with a different hitlist than before. All hitlists are sorted by document ID.
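
A sketch of that write path for one parsed document (here the barrels are in-memory lists rather than files, and the names are illustrative):

```python
import zlib

def append_to_temp_index(doc_id, word_hits, barrels):
    # word_hits: {word: hitlist} for one parsed document; barrels: 256 lists.
    # Entries are appended in document order, so a word found in several
    # documents is registered once per document, each time with its own hitlist.
    for word, hits in word_hits.items():
        wid = zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF
        barrels[wid >> 24].append((word, doc_id, hits))
```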

SortedIndex (Reverse Index)

A sorted index contains the same barrels as the temporary index, but the words have been sorted by their word IDs and the hitlists of repeated occurrences of a word have been merged. The hitlists are also sorted by number of word occurrences in descending order rather than by document ID. This makes one word queries trivial and simplifies multi word lookups when they are called upon by a phrase query. One further difference is that the temporary index contains the actual words, whereas the sorted index contains their IDs.
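
A sketch of turning one temporary-index barrel into its sorted form (an in-memory approximation; the real indexer works on index files):

```python
import zlib
from collections import defaultdict

def sort_barrel(entries):
    # entries: (word, doc_id, hitlist) tuples from one temporary-index barrel
    merged = defaultdict(list)
    for word, doc_id, hits in entries:
        wid = zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF  # word -> word ID
        merged[wid].append((doc_id, hits))
    # words sorted by ID; each word's postings sorted by hit count, descending
    return [
        (wid, sorted(docs, key=lambda d: len(d[1]), reverse=True))
        for wid, docs in sorted(merged.items())
    ]
```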

Lexicon

The lexicon contains the IDs of all the words found in the dataset (excluding stop words) and their pointers into the index files. The words are sorted by their IDs. Each pointer is a byte offset into the barrel that contains the word. The lexicon is very small, only a few MB in size.
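
A sketch of a lexicon lookup; the barrel file naming and the one-entry-per-line layout here are assumptions, not the project's documented format:

```python
import zlib

def fetch_entry(word: str, lexicon: dict, index_dir: str = "index") -> str:
    # lexicon maps word ID -> byte offset inside that word's barrel file;
    # the first byte of the ID picks the barrel, the offset seeks to the entry
    wid = zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF
    with open(f"{index_dir}/barrel_{wid >> 24}.dat", "rb") as f:
        f.seek(lexicon[wid])
        return f.readline().decode("utf-8")  # one "<wordID>|<hitlists>" entry
```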

Titles

Titles contains the titles of all the articles found in the dataset, sorted by document ID.

Hitlist

Hitlists constitute the major part of the index size, so they needed to be registered as efficiently as possible. A hitlist consists of a word position and a hit type (title, subtitle, category, or plain text). To understand the hitlist format, consider the following sample:

51966817|0-3.269,3.311/162-3.762/194-3.2/244-3.201
  • 51966817 is the word ID (here, of the word “transit”), followed by a ‘|’
  • Then come the hitlists for every document the word was found in, separated by ‘/’
  • Take the first sequence of hits, i.e. ‘0-3.269,3.311’
  • 0 is the document ID, followed by a ‘-’
  • After the ‘-’ comes the actual sequence of hits, separated by ‘,’
  • The digit before the ‘.’ is the hit type and the digits after the ‘.’ are the hit position of the word in that document (a parsing sketch follows this list)
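
Putting that together, a small parser for this format (the function name and return shape are illustrative):

```python
def parse_entry(line: str):
    # "51966817|0-3.269,3.311/162-3.762/..." ->
    # (word_id, {doc_id: [(hit_type, position), ...]})
    wid, rest = line.strip().split("|")
    postings = {}
    for doc_part in rest.split("/"):
        doc_id, hits = doc_part.split("-")
        postings[int(doc_id)] = [
            (int(h.split(".")[0]), int(h.split(".")[1]))
            for h in hits.split(",")
        ]
    return int(wid), postings
```

Applied to the sample above, this yields word ID 51966817 with, for example, two hits of type 3 at positions 269 and 311 in document 0.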

In the initial design, hitlists were encoded in binary: 4 bytes for the word ID, 6 bits for the document ID, then 1 byte to indicate the length of the hitlist sequence for a document, and 2 bytes for each atomic hit, with the first 2 bits indicating the type and the remaining 14 bits the position. Due to time limitations, however, we could not design a searcher for this encoding in time. We plan to design an optimized searcher later.
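
The 2-byte atomic hit is fully specified above; a sketch of just that part of the encoding (the function names are illustrative):

```python
import struct

def pack_hit(hit_type: int, position: int) -> bytes:
    # first 2 bits: hit type (0-3); remaining 14 bits: position (0-16383)
    assert 0 <= hit_type < 4 and 0 <= position < (1 << 14)
    return struct.pack(">H", (hit_type << 14) | position)

def unpack_hit(raw: bytes) -> tuple:
    (value,) = struct.unpack(">H", raw)
    return value >> 14, value & 0x3FFF  # (hit_type, position)
```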

Searching

The engine supports searching for any number of words with minimal slowdown. Searching first takes the typed query, filters out the stop words, and stems the remaining words. It then categorizes the query as either a single word query or a phrase query.
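
A sketch of that preprocessing step; the stop-word set here is a tiny placeholder, and `stemmer` stands in for the PorterStemmer package named under Prerequisites:

```python
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "of", "to"}  # placeholder set

def preprocess(query: str, stemmer) -> list:
    # drop stop words, stem what remains; stemmer: any callable word -> stem
    terms = [w.lower() for w in query.split() if w.lower() not in STOP_WORDS]
    return [stemmer(w) for w in terms]

def classify(terms: list) -> str:
    # one remaining term -> single word query; more -> phrase query
    return "single" if len(terms) == 1 else "phrase"
```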

One Word Queries

One word queries are trivial: the index files already contain the hitlists sorted by word occurrences. The document list is extracted and returned after relevance sorting (explained later).

Phrase Queries

Phrase queries are more complex: we have to look not only at the queried words but also at their proximity in the document. We do this in an elegant way by reducing the word positions of the n-th query word by n (incrementally) and taking the intersection of all the position sets; a non-empty set indicates a hit for the phrase. After all the documents with phrase hits are found, we run a single word query on each individual term of the query and append those results. The results are again sorted by relevance. Documents with hits for the entire phrase are sorted in the same manner as for a single word: we treat the entire phrase as one atomic word.
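
The position-shifting trick is compact enough to show directly (a minimal sketch; the position lists would come from each query term's hitlist within one document):

```python
def phrase_positions(position_lists):
    # position_lists[i]: positions of the i-th query term in one document.
    # Shifting the i-th term's positions back by i aligns consecutive words,
    # so a non-empty intersection marks an occurrence of the whole phrase.
    shifted = [{p - i for p in positions} for i, positions in enumerate(position_lists)]
    return set.intersection(*shifted)
```

For a two-word phrase, `phrase_positions([{5, 40}, {6, 12}])` returns `{5}`: the first word at position 5 is immediately followed by the second at position 6.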

Relevance Sorting

fetchIt applies the following relevance ordering:

  • Title hits are highest on the priority scale
  • Subtitle (second-level title) hits come second
  • Category hits come third
  • Text hits are lowest on the scale; however, all text hits are sorted by the well-known tf–idf (short for term frequency – inverse document frequency) statistical measure commonly used in information retrieval. The tf–idf score of every document is calculated and the documents are sorted in descending order (a sketch follows this list). More on tf–idf can be read here (https://www.elephate.com/blog/what-is-tf-idf/)
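
One common formulation of the tf–idf score; several variants exist, so this is a sketch rather than necessarily the exact formula the project uses:

```python
import math

def tf_idf(term_count: int, doc_length: int,
           num_docs: int, docs_with_term: int) -> float:
    # term frequency: occurrences of the term in this document, normalized
    tf = term_count / doc_length
    # inverse document frequency: rarer terms carry more weight
    idf = math.log(num_docs / docs_with_term)
    return tf * idf
```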

User Interface

The project has a very simple user interface built with Python's tkinter library.

How to Use

Prerequisites

  • Python Interpreter
  • Python PorterStemmer package

Procedure

  • Run generateDocumentRepository.py (runtime varies with the size of the dataset)
  • Then run Indexer.py (runtime depends on the size of the dataset; for a dataset of about 1.5 GB it takes around 20 minutes)
  • All the necessary files will be generated
  • Then run main.py and search as desired

Acknowledgments
