
Comments (4)

missinglink commented on July 17, 2024

An interim solution might be to ignore the house numbers we cannot confidently parse. This would at least ensure that the interpolation doesn't get confused by invalid parses.
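
A minimal sketch of that interim approach, assuming the parser signals an unparseable value by returning NaN. `strictParse`, `dropUnparseable` and the `housenumber` field are hypothetical names used only for illustration; they are not the project's actual code.

```javascript
// Hypothetical stand-in for the real house number parser: it accepts only
// plain positive integers and returns NaN for everything else.
function strictParse(raw) {
  if (typeof raw !== 'string') { return NaN; }
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) { return NaN; }
  const value = parseInt(trimmed, 10);
  return value > 0 ? value : NaN;
}

// Split records into those we can parse confidently and those we skip.
// The `housenumber` field name is an assumption made for this example.
function dropUnparseable(records, parse) {
  const kept = [];
  const skipped = [];
  for (const record of records) {
    const parsed = parse(record.housenumber);
    if (Number.isNaN(parsed)) {
      skipped.push(record);
    } else {
      kept.push({ ...record, parsed });
    }
  }
  return { kept, skipped };
}

const result = dropUnparseable(
  [{ housenumber: '12' }, { housenumber: 'S/N' }, { housenumber: '2-4' }],
  strictParse
);
console.log(result.kept.length, result.skipped.length);
```

Keeping the skipped records in their own list, rather than discarding them silently, makes it easy to log them and see which formats are worth supporting next.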

from interpolation.

missinglink commented on July 17, 2024

These are the funky Queens house numbers:

[image: numbers]


missinglink commented on July 17, 2024

Here's a dump of the 100 most common house numbers we are failing to parse:

$ grep reliably /data/builds/current/conflate.err | sort | uniq -c | sort -nr | head -n 100
  58371 could not reliably parse housenumber S/N
  24728 could not reliably parse housenumber -9
  18754 could not reliably parse housenumber V
  18740 could not reliably parse housenumber 2-4
  18200 could not reliably parse housenumber 1-3
  12296 could not reliably parse housenumber 6-8
  11968 could not reliably parse housenumber 5-7
  11141 could not reliably parse housenumber 9-11
  10707 could not reliably parse housenumber 8-10
  10546 could not reliably parse housenumber 7-9
  10534 could not reliably parse housenumber 10-12
  10162 could not reliably parse housenumber 14-16
   9975 could not reliably parse housenumber 3-5
   9338 could not reliably parse housenumber 4-6
   9175 could not reliably parse housenumber 12-14
   9049 could not reliably parse housenumber 17-19
   8872 could not reliably parse housenumber 13-15
   8854 could not reliably parse housenumber 18-20
   8575 could not reliably parse housenumber 16-18
   8388 could not reliably parse housenumber 22-24
   8352 could not reliably parse housenumber 19-21
   8114 could not reliably parse housenumber 1-5
   8080 could not reliably parse housenumber 11-13
   7981 could not reliably parse housenumber 15-17
   7825 could not reliably parse housenumber 2-6
   7414 could not reliably parse housenumber 21-23
   6998 could not reliably parse housenumber 20-22
   6441 could not reliably parse housenumber 27-29
   6379 could not reliably parse housenumber 26-28
   6352 could not reliably parse housenumber 23-25
   6161 could not reliably parse housenumber 25-27
   6073 could not reliably parse housenumber 24-26
   5814 could not reliably parse housenumber 29-31
   5754 could not reliably parse housenumber 30-32
   5504 could not reliably parse housenumber 34-36
   5249 could not reliably parse housenumber 32-34
   5228 could not reliably parse housenumber 28-30
   5207 could not reliably parse housenumber 31-33
   4988 could not reliably parse housenumber 2-8
   4748 could not reliably parse housenumber 33-35
   4633 could not reliably parse housenumber 35-37
   4592 could not reliably parse housenumber 1-7
   4482 could not reliably parse housenumber 42-44
   4430 could not reliably parse housenumber 1/A
   4424 could not reliably parse housenumber 1/1
   4422 could not reliably parse housenumber 40-42
   4413 could not reliably parse housenumber 38-40
   4398 could not reliably parse housenumber 2/1
   4377 could not reliably parse housenumber 37-39
   4356 could not reliably parse housenumber 2/A
   4343 could not reliably parse housenumber 36-38
   4302 could not reliably parse housenumber 39-41
   4286 could not reliably parse housenumber 8-12
   4179 could not reliably parse housenumber 1/2
   4149 could not reliably parse housenumber 2/2
   4124 could not reliably parse housenumber 41-43
   4089 could not reliably parse housenumber 2-10
   3720 could not reliably parse housenumber 1-9
   3716 could not reliably parse housenumber 46-48
   3699 could not reliably parse housenumber 43-45
   3630 could not reliably parse housenumber 48-50
   3543 could not reliably parse housenumber 45-47
   3540 could not reliably parse housenumber 52-54
   3524 could not reliably parse housenumber 44-46
   3515 could not reliably parse housenumber 6-10
   3409 could not reliably parse housenumber 1/3
   3381 could not reliably parse housenumber 7-11
   3371 could not reliably parse housenumber 51-53
   3345 could not reliably parse housenumber 49-51
   3344 could not reliably parse housenumber 11-15
   3322 could not reliably parse housenumber 2/3
   3321 could not reliably parse housenumber 47-49
   3290 could not reliably parse housenumber 3/A
   3229 could not reliably parse housenumber 57-59
   3188 could not reliably parse housenumber 2/4
   3166 could not reliably parse housenumber 2/5
   3166 could not reliably parse housenumber 1/5
   3134 could not reliably parse housenumber 1/4
   3133 could not reliably parse housenumber 4/A
   3132 could not reliably parse housenumber 53-55
   3131 could not reliably parse housenumber 17-21
   3094 could not reliably parse housenumber 3-7
   3085 could not reliably parse housenumber 10-14
   3081 could not reliably parse housenumber 50-52
   3021 could not reliably parse housenumber 14-18
   2991 could not reliably parse housenumber 9-13
   2974 could not reliably parse housenumber 1/6
   2962 could not reliably parse housenumber 2/6
   2860 could not reliably parse housenumber 16-20
   2847 could not reliably parse housenumber 5/A
   2827 could not reliably parse housenumber 54-56
   2824 could not reliably parse housenumber 6/A
   2824 could not reliably parse housenumber 1/7
   2818 could not reliably parse housenumber 2/7
   2815 could not reliably parse housenumber 63-65
   2801 could not reliably parse housenumber 2/8
   2788 could not reliably parse housenumber 55-57
   2788 could not reliably parse housenumber 1/8
   2772 could not reliably parse housenumber 5-9
   2744 could not reliably parse housenumber 15-19
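
To see which formats dominate a dump like this, the lines can be grouped by the shape of the failed value. The sketch below is only illustrative: the category names and patterns are assumptions, and the sample lines are copied from the output above.

```javascript
// Classify a failed house number by its shape. The category names and
// patterns are assumptions for this example, not part of the project.
function classify(housenumber) {
  if (!/\d/.test(housenumber)) { return 'no digits'; }
  if (/^\d+-\d+$/.test(housenumber)) { return 'hyphenated pair'; }
  if (/^\d+\/[0-9A-Za-z]+$/.test(housenumber)) { return 'slash pair'; }
  return 'other';
}

// Sum the counts per category from "<count> ... housenumber <value>" lines.
function tally(lines) {
  const totals = {};
  for (const line of lines) {
    const match = /^\s*(\d+) could not reliably parse housenumber (.*)$/.exec(line);
    if (!match) { continue; }
    const category = classify(match[2].trim());
    totals[category] = (totals[category] || 0) + Number(match[1]);
  }
  return totals;
}

const sample = [
  '  58371 could not reliably parse housenumber S/N',
  '  24728 could not reliably parse housenumber -9',
  '  18754 could not reliably parse housenumber V',
  '  18740 could not reliably parse housenumber 2-4',
  '   4430 could not reliably parse housenumber 1/A',
];
console.log(tally(sample));
```

Run over the full list, a tally of this kind would show how much of the failure volume comes from hyphenated pairs compared with slash-separated values.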


missinglink commented on July 17, 2024

I've done a bunch of work here to improve the parser and support more house number formats than we originally did.

There is still room for improvement, but all the low-hanging fruit has been addressed, so I'm going to close this ticket.

  test('housenumber: invalid', function(t) {
    t.true(isNaN(analyze.housenumber(/not a string/)), 'invalid type');
    t.true(isNaN(analyze.housenumber('no numbers')), 'no numbers');
    t.true(isNaN(analyze.housenumber('')), 'blank');
    t.true(isNaN(analyze.housenumber('0')), 'zero');
    t.true(isNaN(analyze.housenumber('0/0')), 'zero');
    t.true(isNaN(analyze.housenumber('NULL')), 'null');
    t.true(isNaN(analyze.housenumber('S/N')), 'no numbers');
    t.true(isNaN(analyze.housenumber('-9')), 'no house number');
    t.true(isNaN(analyze.housenumber('V')), 'no numbers');
    t.true(isNaN(analyze.housenumber('2-40')), 'possible range; possibly not');
    t.true(isNaN(analyze.housenumber('2/1')), 'ambiguous house/apt');
    t.true(isNaN(analyze.housenumber('1 flat b')), 'apartment synonyms');
    t.true(isNaN(analyze.housenumber('4--')), 'unrecognised delimiter');
    t.true(isNaN(analyze.housenumber('11-19')), 'large ranges');
    t.true(isNaN(analyze.housenumber('1-4')), 'ranges containing single digits');
    t.true(isNaN(analyze.housenumber('22/26')), 'invalid range delimiter');
    t.end();
  });

  test('housenumber: valid', function(t) {
    t.false(isNaN(analyze.housenumber('1')), 'regular');
    t.false(isNaN(analyze.housenumber(' 2  A ')), 'spaces');
    t.false(isNaN(analyze.housenumber('3Z')), 'unusually high apartment');
    t.false(isNaN(analyze.housenumber('4/-')), 'null apartment');
    t.false(isNaN(analyze.housenumber('5/5')), 'same house/apt number');
    t.false(isNaN(analyze.housenumber('6-6')), 'same house/apt number');
    t.false(isNaN(analyze.housenumber('22-26')), 'small ranges');
    t.end();
  });
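
For readers who want to see the rules these tests imply in one place, here is a rough sketch of a parser that would satisfy them. It is not the project's `analyze.housenumber`: the name `parseHouseNumber`, the `MAX_RANGE_SPAN` threshold and the returned values are all assumptions, since the thread shows only which inputs yield NaN.

```javascript
// Largest gap between the two ends of a range that this sketch accepts.
// The value is an assumption chosen so that '22-26' passes and '11-19' fails.
const MAX_RANGE_SPAN = 4;

// Convert a digit string to a number, treating zero as invalid.
function positive(digits) {
  const value = parseInt(digits, 10);
  return value > 0 ? value : NaN;
}

function parseHouseNumber(input) {
  if (typeof input !== 'string') { return NaN; }

  // Drop all whitespace and normalise case: ' 2  A ' becomes '2A'.
  const text = input.replace(/\s+/g, '').toUpperCase();
  let match;

  // A number with an optional single-letter suffix: '1', '2A', '3Z'.
  if ((match = /^(\d+)[A-Z]?$/.exec(text))) {
    return positive(match[1]);
  }

  // A number followed by an empty apartment: '4/-'.
  if ((match = /^(\d+)\/-$/.exec(text))) {
    return positive(match[1]);
  }

  // Two numbers joined by '/' or '-'.
  if ((match = /^(\d+)([\/-])(\d+)$/.exec(text))) {
    const low = parseInt(match[1], 10);
    const high = parseInt(match[3], 10);

    // The same number on both sides: '5/5', '6-6'.
    if (low === high) { return positive(match[1]); }

    // A small hyphenated range where both ends have at least two digits.
    const isRange = match[2] === '-' && match[1].length > 1 && match[3].length > 1;
    if (isRange && high > low && high - low <= MAX_RANGE_SPAN) {
      return low; // returning the lower bound is an arbitrary choice here
    }
  }

  // Anything else is treated as not reliably parseable.
  return NaN;
}

console.log(parseHouseNumber('22-26'), parseHouseNumber('2/1'));
```

The sketch rejects anything it does not explicitly recognise, which matches the conservative approach suggested earlier in the thread: a skipped house number is less harmful to interpolation than a wrong one.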
