vladimarius / pyap
Python address detector and parser
License: MIT License
Thanks for the amazing code.
It's really helpful for me.
But it can't find short addresses like:
Roselle, NJ
Delta, BC
Waminster PA
Hello Vlad! Congratulations on this amazing work, I really appreciate it. However, I noticed that the parser has problems with some French Canadian addresses.
Examples
This address doesn't parse at all
This one parses, but the result is wrong: occupancy is set to None and city is set to bureau 100 Longueuil
{'full_address': '2545, rue De Lorimier, bureau 100 Longueuil, QC, J4K3P7, Canada', 'full_street': '2545, rue De Lorimier', 'street_number': '2545', 'street_type': 'rue', 'street_name': 'De Lorimier', 'route_id': None, 'post_direction': None, 'floor': None, 'building_id': None, 'occupancy': None, 'postal_box': None, 'city': 'bureau 100 Longueuil', 'region1': 'QC', 'postal_code': 'J4K3P7', 'country_id': 'CA'}
Here are some other examples of addresses that do not parse but seem valid to me, probably because they contain hyphens, parentheses, or something else:
Could you please take a look? My knowledge of regular expressions is very limited.
Hi, thanks for the fantastic package. I'm finding it really useful.
Occasionally, some addresses aren't identified in natural text, and I've traced this to their use of "Unit \d+" to denote the unit of a condo. E.g.:
5625 NW 109th Ave, Unit 65, Doral, FL, 33178
451 Ives Dairy Road, Unit 204-1, Miami, Florida 33179
I searched the repo and couldn't tell whether this variant has been mentioned before. If there's a reason this isn't implemented, sorry for missing that. If this is a possible improvement, I'd be happy to make the PR; just let me know.
Until then, I might just replace all "Unit" with "Room" or one of the other current variants. Thanks!
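The interim workaround described above could be sketched as a small preprocessing step. This is only a sketch: `replace_unit_designator` is a hypothetical helper, not part of pyap, and it assumes that "Room" is one of the designators the parser already recognizes, as the post suggests.

```python
import re

# Hypothetical helper (not part of pyap): rewrite "Unit <number>" to a
# designator that the parser is assumed to already recognize.
_UNIT_BEFORE_NUMBER = re.compile(r"\bUnit(?=\s+\d)", re.IGNORECASE)


def replace_unit_designator(text, replacement="Room"):
    """Replace 'Unit' only when it is directly followed by a number."""
    return _UNIT_BEFORE_NUMBER.sub(replacement, text)


sample = "5625 NW 109th Ave, Unit 65, Doral, FL, 33178"
cleaned = replace_unit_designator(sample)
print(cleaned)
```

Any address parsed from the rewritten text would contain "Room" rather than "Unit", so the original wording would have to be restored afterwards if it matters.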
For example, 123 Main St Kingston ON will parse, but 123 Main St Kingston Ont. will not. This longer abbreviation is an extremely common format, especially for PEI and NWT. Also, YK is probably a more common abbreviation than YT, yet YK does not parse.
Likewise, French Canadian speakers would probably use TNL, IPE, etc. Quebec should also accept the two-letter abbreviations PQ and QB.
Postal | English | French |
---|---|---|
AB | Alta. | Alb. |
BC | B.C. | C.-B. |
MB | Man. | Man. |
NB | N.B. | N.-B. |
NL | N.L. | T.-N.-L. |
NS | N.S. | N.-É. |
ON | Ont. | Ont. |
PE | P.E.I. | Î.-P.-É. |
QC | Que. | Qc / P.Q. |
SK | Sask. | Sask. |
NT | N.W.T. | T.N.-O. |
NU | Nvt. | Nt. |
YT | Yuk. | YK |
State Name | USPS Abbreviation | Traditional Abbreviation |
---|---|---|
Alabama | AL | Ala. |
Alaska | AK | Alaska |
Arizona | AZ | Ariz. |
Arkansas | AR | Ark. |
California | CA | Calif. |
Colorado | CO | Colo. |
Connecticut | CT | Conn. |
Delaware | DE | Del. |
Florida | FL | Fla. |
Georgia | GA | Ga. |
Hawaii | HI | Hawaii |
Idaho | ID | Idaho |
Illinois | IL | Ill. |
Indiana | IN | Ind. |
Iowa | IA | Iowa |
Kansas | KS | Kans. |
Kentucky | KY | Ky. |
Louisiana | LA | La. |
Maine | ME | Maine |
Maryland | MD | Md. |
Massachusetts | MA | Mass. |
Michigan | MI | Mich. |
Minnesota | MN | Minn. |
Mississippi | MS | Miss. |
Missouri | MO | Mo. |
Montana | MT | Mont. |
Nebraska | NE | Neb. or Nebr. |
Nevada | NV | Nev. |
New Hampshire | NH | N.H. |
New Jersey | NJ | N.J. |
New Mexico | NM | N.Mex. |
New York | NY | N.Y. |
North Carolina | NC | N.C. |
North Dakota | ND | N.Dak. |
Ohio | OH | Ohio |
Oklahoma | OK | Okla. |
Oregon | OR | Ore. or Oreg. |
Pennsylvania | PA | Pa. |
Rhode Island | RI | R.I. |
South Carolina | SC | S.C. |
South Dakota | SD | S.Dak. |
Tennessee | TN | Tenn. |
Texas | TX | Tex. or Texas |
Utah | UT | Utah |
Vermont | VT | Vt. |
Virginia | VA | Va. |
Washington | WA | Wash. |
West Virginia | WV | W.Va. |
Wisconsin | WI | Wis. or Wisc. |
Wyoming | WY | Wyo. |
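One way to make use of the tables above without touching the parser is to normalize traditional abbreviations to postal codes before parsing. The sketch below is an assumption about how that could be done, not pyap's behaviour: `PROVINCE_ALIASES` and `normalize_provinces` are hypothetical names, and only a handful of English entries from the Canadian table are included.

```python
import re

# A few English abbreviations copied from the Canadian table above.
# Extending this to the French column or to US states is left open.
PROVINCE_ALIASES = {
    "Alta.": "AB",
    "Ont.": "ON",
    "Que.": "QC",
    "Sask.": "SK",
    "N.W.T.": "NT",
}


def normalize_provinces(text):
    """Replace traditional abbreviations with two-letter postal codes."""
    for alias, code in PROVINCE_ALIASES.items():
        # Lookarounds keep the alias from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        text = re.sub(pattern, code, text)
    return text


normalized = normalize_provinces("123 Main St Kingston Ont.")
print(normalized)
```

A blind substitution like this can also rewrite ordinary words that happen to look like an abbreviation, so in free text it is probably safer to apply it only near a postal code or at the end of a candidate address.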
Sources
When I use pyap.parse() on the address below, the full address is formatted so that the newline character \n is replaced by a comma and a space. I wonder if there is a way to also get the extracted but unformatted address. This might be useful if, say, a user would like to get the span of an address in the original text from which it was extracted. Thanks!
import pyap

address = """14234 Wilshire Blvd
Los Angeles, CA 90011"""
# full_address joins the two lines with ", " instead of keeping the newline
print(pyap.parse(address, country='US')[0].full_address)
# 14234 Wilshire Blvd, Los Angeles, CA 90011
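Until the library exposes the raw match, the span could be recovered by searching the original text for the tokens of the formatted address. The sketch below uses only the standard library; `find_original_span` is a hypothetical helper, and it assumes the formatting step changes nothing except whitespace and commas between tokens.

```python
import re


def find_original_span(text, full_address):
    """Return (start, end) of the address in the original text, or None."""
    # Split the formatted address into tokens, dropping commas and spaces.
    tokens = [t for t in re.split(r"[\s,]+", full_address) if t]
    if not tokens:
        return None
    # Allow any run of whitespace or commas between consecutive tokens.
    pattern = r"[\s,]+".join(re.escape(t) for t in tokens)
    match = re.search(pattern, text)
    return match.span() if match else None


original = "Ship to 14234 Wilshire Blvd\nLos Angeles, CA 90011 by Friday."
span = find_original_span(original, "14234 Wilshire Blvd, Los Angeles, CA 90011")
print(repr(original[span[0]:span[1]]))
```

If the same address occurs more than once in the text, this returns only the first occurrence.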
I'm using pyap to separate addresses from names. Thank you for the work done; it is an amazing package.
It usually works great, but when two names precede the address it seems to wrongly treat "and ExampleName" as part of the street.
Input: "ExampleName and ExampleName 111 Rock Rd Pittsboro, NC 1111"
Interestingly, using the "&" symbol seems to not cause this issue and works as expected.
It seems that for an address like N8780 Something Blvd Cape Canaveral, FL 11111, a house number that contains a letter is parsed only from after the letter onward, so the street address becomes 8780 Something Blvd. Upon further testing, if a letter is added in the middle of the house number, everything before the last letter in the house number is dropped.
Unable to parse this address. Canadian parsing is fine, but US parsing is very poor:
"1607 23rd Street NW, Washington, DC 20008"
Hi Vladimarius,
Do you know if any work has been done on French addresses? I have to parse some French addresses. libpostal seems nice, but I have problems installing it and I have found no other reliable solution. Do you know of other parsers for international addresses?
Thanks for your work!
Romain
Hi,
Thanks for your great work! When I try to install it, I get this error:
(venv) User@SOZ-MBP16 sample_project % pip install pypa
Collecting pypa
  Downloading pyPA-1.0rc.tar.gz (38 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/User/PycharmProjects/sample_project/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/setup.py'"'"'; __file__='"'"'/private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-pip-egg-info-tkpdibrw
         cwd: /private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/
    Complete output (6 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/setup.py", line 17
        except ImportError, e:
                          ^
    SyntaxError: invalid syntax
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/bc/5a/2964cadcb8bc8d875768a16f023abb328deb895fad65fb1406dd3abc6219/pyPA-1.0rc.tar.gz#sha256=8c5f32fed2f192bd2c07912f17e3f770ec3e09ebae2aef4091171dcdca875c72 (from https://pypi.org/simple/pypa/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Do you know what the problem is?
I am running macOS with Python 3.8.2.
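Two details in the log above are worth noting. The command was `pip install pypa`, and the archive that was downloaded is pyPA-1.0rc, so the failing setup.py belongs to a package whose name differs from pyap. The line the traceback points at, `except ImportError, e:`, uses the Python 2 comma form, which is a SyntaxError on Python 3. The snippet below only illustrates the Python 3 spelling of that clause; the module name is a placeholder chosen to make the import fail.

```python
# Python 2 allowed:  except ImportError, e:
# Python 3 requires the "as" form shown here.
try:
    import a_module_that_does_not_exist  # placeholder name, expected to fail
except ImportError as e:
    caught = e  # keep a reference; "e" itself is cleared after the block

print(type(caught).__name__)
```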
pyap doesn't seem to capture addresses containing PO boxes, Unit (number), or Floor (number) / (number)th Floor.
Hi, dear vladimarius,
Thank you for publishing this program. I want to use it for European countries that have a different street address format,
for example:
AUSTRIA
"""Mr J Brownhall
264 High Street
ALLAMBIE NSW 2100
AUSTRALIA"""
In the US code, you implemented the street format like this:
full_street = r"""
(?:
(?P<full_street>
{street_number}?,?\ ?
{street_name}?,?\ ?
(?:[\ \,]{street_type})\,?\ ?
{post_direction}?\,?\ ?
{floor}?\,?\ ?
{building}?\,?\ ?
{occupancy}?\,?\ ?
{po_box}?
)
)""".format(street_name=street_name,
street_number=street_number,
street_type=street_type,
post_direction=post_direction,
floor=floor,
building=building,
occupancy=occupancy,
po_box=po_box,
)
I run into trouble when I swap the street number and the street name:
full_street = r"""
(?:
(?P<full_street>
{street_name}?,?\ ?
{street_number}
(?:[\ ,]{street_type}),?\ ?
{post_direction}?,?\ ?
{floor}?,?\ ?
{building}?,?\ ?
{occupancy}?,?\ ?
{po_box}?
)
)""".format(street_name=street_name,
street_number=street_number,
street_type=street_type,
post_direction=post_direction,
floor=floor,
building=building,
occupancy=occupancy,
po_box=po_box,
)
The error is:
raise source.error(msg, len(condname) + 1)
re.error: unknown group name 'street_number' at position 142 (line 8, column 28)
Could you possibly help me?
I would be grateful.
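The `len(condname) + 1` frame in the traceback comes from the part of Python's regex parser that handles conditional groups of the form `(?(name)yes|no)`. One plausible reading, which is an assumption here because the sub-patterns are not shown, is that one of the fragments contains a conditional on `street_number`; a conditional like that must appear after the group it refers to. A minimal sketch with made-up group names that reproduces the same kind of error:

```python
import re

# Group "num" is defined before the conditional that tests it: compiles fine.
ok = re.compile(r"(?P<num>\d+)?(?(num)[A-Z]|[a-z])")

# Same pieces in the opposite order: the conditional refers to a group
# the parser has not seen yet, so compilation fails.
try:
    re.compile(r"(?(num)[A-Z]|[a-z])(?P<num>\d+)?")
except re.error as exc:
    message = str(exc)

print(message)
```

If that reading is right, moving `{street_number}` after `{street_name}` would also require adjusting whichever fragment carries the conditional.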
Sometime between 0.2.0 and 0.3.0, non-string data started getting stored in addresses. If I iterate through the values in address.as_dict(), street numbers are now type int instead of type string (which broke some stuff I was using it for).
I'm not sure if this is intentional. I wouldn't expect users to want numerical parts of addresses to be stored as numbers, but the change is not backwards compatible, so I thought I'd bring it up.
Cheers!
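For code that relied on the older string values, a small shim over the dictionary could restore them. This is a sketch only: `stringify_values` is a hypothetical helper, and the sample dictionary is a hand-written stand-in for the output of `address.as_dict()`.

```python
def stringify_values(address_dict):
    """Return a copy with every non-None value converted to str."""
    return {
        key: (value if value is None else str(value))
        for key, value in address_dict.items()
    }


# Hand-written stand-in for the output of address.as_dict().
parsed = {"street_number": 2545, "street_name": "De Lorimier", "floor": None}
as_strings = stringify_values(parsed)
print(as_strings)
```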
Hi Vlad! Thanks for this amazing work. However, I noticed that the parser has problems with some of the US addresses.
Examples
The following addresses don't parse at all.
20555 Devonshire Street #116 Chatsworth CA 91311
260-C North El Camino Real Encinitas CA 92024-2852
623 H St NW Floor 3 Washington DC 20001
Could you please take a look?
I'll fork and attempt to fix.
I also wanted to say thank you for an absolutely amazing package BTW. Seriously this is incredible work and well executed.
In the output of parse(some_text=text), the address objects have fields match_start and match_end, which should refer to the location of the start and end of addresses. Rather, they refer to the start and end of addresses in the string AddressParser._normalize_string(text).
Hello,
Here is a list of ~200 addresses "from the wild" that pyap does not match.
It seems that when the city name includes Saint or Mount / Mountain, the parser does not properly identify the city. Take the two examples:
In the first example, the city will be Pleasant and "Mount" will be part of the street. In the second example, "Augustine" will be the city and "St." will be part of the street. If you convert "St." to "Saint", it works great.
The app catches words like 'Drug' thinking it's a driveway. Shouldn't the div be changed from [\.\ ,]? to [\.\ ,] in line 177 of data.py?
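The effect of the optional delimiter can be seen in isolation. The patterns below are simplified stand-ins, not the actual expression in data.py, and the word-boundary variant is an extra suggestion rather than something from the post.

```python
import re

optional_delim = re.compile(r"Dr[\.\ ,]?")  # delimiter may be absent
required_delim = re.compile(r"Dr[\.\ ,]")   # delimiter must be present
word_boundary = re.compile(r"Dr\b\.?")      # alternative: end of word, then optional dot

for sample in ("Drug", "Dr.", "Dr"):
    print(
        sample,
        bool(optional_delim.match(sample)),
        bool(required_delim.match(sample)),
        bool(word_boundary.match(sample)),
    )
```

Requiring the delimiter stops "Drug" from matching, but it also rejects a street type that ends the string, such as a bare "Dr". The word-boundary form handles both cases in this simplified setting.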
This doesn't work well. Even if I only change your example to:
225 E. John Carpenter Fwy, Suite 1500 Irving, Texas 75062
it isn't detected.