
phonetic_search's Introduction

Phonetic search algorithms in Python

This repository contains a few phonetic search / indexing algorithms implemented in Python.

Unless otherwise noted, these are all (C) Copyright 2015, Mads Olsgaard, released under the BSD 3-Clause license.


This repository also contains two corpus files:

  1. names.csv
  2. badwords.csv

names.csv is a list of first and last names collected from the 1990 US census. It contains 155,947 unique names.

Source: http://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html

badwords.csv is a collection of English swearwords collected online. Words have not been checked for offensiveness or correctness.

Sources: mostly words from noswearing.com and Google's official list of bad words.

Both corpora are considered public domain, and free to use.

phonetic_search's People

Contributors

olsgaard, walkingintopeople

phonetic_search's Issues

Phonix needs to return tuples and encode the first character as both letter and digit

According to steps b, c and l, the first character is retained as a letter while the entire word is processed to completion, and the retained letter is then prefixed to the code at the end.

Characters dropped in steps f and g are to be stored and processed in step m. These are the "ending sounds", which are used to assess the likelihood of a candidate when searching. So phonix must return the key as a tuple.
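A minimal sketch of how such a tuple key might be used when searching. It assumes the key is a `(main_code, ending_code)` pair, that the main code selects candidates, and that a matching ending code simply moves a candidate to the front. That ranking rule is a guess at what "likelihood" means here, not something taken from the paper. `build_index`, `search` and `toy_key` are hypothetical names, and `toy_key` is only a stand-in so the example runs without a real Phonix implementation.

```python
from collections import defaultdict


def toy_key(name):
    """Stand-in key function (NOT Phonix): first letter and last letter."""
    name = name.upper()
    return (name[:1], name[-1:])


def build_index(names, key_func):
    """Group names by main code, keeping each name's ending code alongside it."""
    index = defaultdict(list)
    for name in names:
        main, ending = key_func(name)
        index[main].append((name, ending))
    return index


def search(index, query, key_func):
    """Return names sharing the query's main code, ending-code matches first."""
    main, ending = key_func(query)
    candidates = index.get(main, [])
    # sorted() is stable: False (ending matches) sorts before True (no match),
    # and the original order is kept inside each group.
    ranked = sorted(candidates, key=lambda cand: cand[1] != ending)
    return [name for name, _ in ranked]


index = build_index(["Jensen", "Jones", "Johnson"], toy_key)
print(search(index, "Jonson", toy_key))
```

With a real key function in place of `toy_key`, the index and lookup code would stay the same, since both only rely on the key being a two-element tuple.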


Algorithm as outlined in T.N. Gadd (1990), "PHONIX: The algorithm", Program, Vol. 24, Iss. 4, pp. 363-366:

a) Perform phonetic substitutions (see Appendix):
- only the specified characters are dropped, e.g. the 'v' (vowel) is not dropped when 'N' is substituted for 'PN' under the rule 'PNv';
- the parameters are applied in the specified order;
- all occurrences of one substitution are processed before proceeding to the next;
- the result of a substitution may create new target strings for substitution by subsequent parameters.

b) Retain the first character for the retrieval code.

c) Replace it by 'v' if it is A, E, I, O, U or Y.

d) Where names end in ES, drop the E.

e) Append an E where names end in A,I,O,U or Y.

f) Drop the last character regardless.

g) Drop the new last character if not A,E,I,O,U or Y.

h) Repeat g) until a vowel (including Y) is found. This results in a word or name without its ending-sound.

i) Strip all occurrences of A,E,I,O,U,Y,H and W.

j) Remove one of all duplicate successive consonants.

k) Replace ALL consonants by their numeric values.

l) Prefix the retrieval code with the retained first character (may be a lowercase 'v', see above).

m) Repeat i), j) and k) on the characters removed as stripped ending-sounds
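The steps above can be sketched roughly as follows. This is one reading of the outline, not the repository's implementation. Two things are assumptions that do not come from this page: `EXAMPLE_RULES` is a tiny illustrative rule list (the paper's Appendix holds the real one, and any position constraints on rules are not modelled), and `CONSONANT_VALUES` is the consonant table commonly quoted for Phonix. Step j is interpreted as collapsing runs of the same letter. `phonix_key` and `encode_fragment` are hypothetical names.

```python
import re

VOWELS = "AEIOUY"

# Assumption: illustrative rules only, NOT the Appendix from the paper.
EXAMPLE_RULES = [
    (r"PN(?=[AEIOUY])", "N"),  # PN before a vowel becomes N; the vowel is kept
    (r"PH", "F"),
]

# Assumption: consonant values as commonly quoted for Phonix.
CONSONANT_VALUES = {
    "B": "1", "P": "1",
    "C": "2", "G": "2", "J": "2", "K": "2", "Q": "2",
    "D": "3", "T": "3",
    "L": "4",
    "M": "5", "N": "5",
    "R": "6",
    "F": "7", "V": "7",
    "S": "8", "X": "8", "Z": "8",
}


def encode_fragment(fragment, values=CONSONANT_VALUES):
    """Steps i) to k) applied to an upper-case fragment."""
    letters = re.sub(r"[AEIOUYHW]", "", fragment)      # i) strip vowels, H and W
    letters = re.sub(r"(.)\1+", r"\1", letters)        # j) collapse repeated letters
    # k) map consonants to digits; characters without a value are skipped
    return "".join(values.get(ch, "") for ch in letters)


def phonix_key(name, rules=EXAMPLE_RULES, values=CONSONANT_VALUES):
    """Return (retrieval_code, ending_sound_code) for a name."""
    word = name.upper()
    # a) apply each rule to all occurrences before moving on to the next rule
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    if not word:
        return ("", "")
    # b), c) retain the first character, or 'v' if it is a vowel
    first = "v" if word[0] in VOWELS else word[0]
    # d) names ending in ES lose the E
    if word.endswith("ES"):
        word = word[:-2] + "S"
    # e) names ending in A, I, O, U or Y gain an E
    if word[-1] in "AIOUY":
        word += "E"
    # f) drop the last character regardless, remembering it as the ending sound
    stem, ending = word[:-1], word[-1:]
    # g), h) keep dropping until the stem ends in a vowel
    while stem and stem[-1] not in VOWELS:
        stem, ending = stem[:-1], stem[-1] + ending
    # i) to l) for the stem, m) for the stripped ending sound
    return (first + encode_fragment(stem, values), encode_fragment(ending, values))


for example in ("Jones", "Smith", "Anna"):
    print(example, phonix_key(example))
```

Note that the whole stem, including its first consonant, is encoded as digits and the retained letter is prefixed afterwards, which is the "both letter and digit" behaviour the issue asks for.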
