Code Monkey home page Code Monkey logo

simple-dictionary-lookup's Introduction

Dictionary

PROBLEM

We were asked to implement a single-file dictionary to lookup a word and show its definition in less than O(N) (N is number of words) with some constraints:

  • A word is small string (1024 characters long) which is about 1KB at maximum
  • A word's definition is a long string (1GB long)

SOLUTION

Since the word is small and its explanation is too big to fit in memory, at the first time we read the file, we will put all the words in a dictionary structure with key is word and the value is the address of word's explanation in the file. So, whenever we check if a word is already in dictionary, we will read to tha address in dictionary file and show the result.

In order to do that, we need to organize the way we store our values:

  • For each word adding, we will need:
    1. word: 1024 bytes to store the word
    2. explanation: 4 bytes to store the size of explanation
    3. explanation_size: sizeof(explanation) in bytes

When we read through the file byte to byte, we will read first (1024+4) bytes for word and its explanation_size and store the address (index) position (address) of that word.

  • To lookup the word, we will read explanation_size bytes from address.

  • To delete a word, we set word and explanation_size to null (both in files and our in-memory cache).

  • To update a word, we check if the new explanation_size is still less or equal than current explanation_size:

    • if it is, we set current explanation = new explanation without updating the explanation_size
    • it it isn't, delete the current word and add a new word

Another problem arises when we update/delete an item by doing this way. It will create holes when we remove an item.

To fix this, let's create a command (called defragment :) ), to move all using words to the start of the file to fill up those holes.

IMPLEMENTATION

We will separate the tasks into multiple classes/functions:

  • FileReader: is used to read a dictionary files

    • ReadFile(): reads the file and returns a list of Word Data
    • ReadAtAddress(start, length): get the contains of file at index start with length length
  • FileWriter: is used to create/write to a dictionary

    • WriteWord(word, explanation): writes word and explanation to the end of the file
    • WriteAtAddress(start, word, explanation): updates word at position start
  • MemoryCache: is used to store Word Data and its position in memory for quick access, the inner cache used map<string, WordData>

    • Get(word)
    • Add(word, explanation)
    • Update(word, explanation)
    • Delete(word)

DEMO

  • Run go run main.go generate to generate file data from raw_dict.txt
  • Run go run main.go to get explanation of some words in raw_dict.txt

PSEUDO CODE (FOR MISSING PARTS IN DEMO):

  • Add a word:
func add(word, explanation) {
  append_to_file(file_name, new {
    word: word
    explanation_size: of(explanation)
    explanation: explanation
  })
}
  • To read entire file:
address := 0
while !eof {
  word := read_word()
  explanation_size := read_size()
  dict[word].add(new {
    word: word
    address: address
    explanation_size: explanation_size
  })
  address += explanation_size
}
  • To lookup a word:
if dict[word] == null:
  return null;
print read_from_address_with_size({
  address: dict[word].address + 1024+4
  size: dict[word].explanation_size
})
  • To delete a word:
if dict[word] == null:
  return null;
print clear_to_bit_0({
  address: dict[word].address
  size: 1024+4+dict[word].explanation_size
})
  • To update a word:
if dict[word] == null:
  throw 'no word existing in dict';
old_item := dict[word]
if old_item.explanation_size < sizeof(new_explanation) then
  old_item.explanation = new_explanation
else
  delete(word)
  add(word, new_explanation)

CONCLUSION

The algorithm above should lookup the word in O(1) since we're using data structure dictionary as a memory caching to store word and address of the explanation.

Reading the file and build the cache will only cost us O(N) where N is the size of the file.

Other functions (add, update, delete) will take O(1)

Although the task doesn't ask the about the Defragment Command. A good implementation should take O(N) since we're only iterating and copy all words to fill the empty holes.

simple-dictionary-lookup's People

Contributors

vietky avatar vietky-ct avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.