vladimarius / pyap

Python address detector and parser

License: MIT License

Python 100.00%

pyap's Issues

It doesn't identify short addresses

Thanks for the amazing code; it's really helpful for me.
However, it can't find short addresses like:

Roselle, NJ
Delta, BC
Waminster PA

Parsing French Canadian Address with Occupancy

Hello Vlad! Congratulations on this amazing work, I really appreciate it. However, I noticed that the parser has problems with some French Canadian addresses.

Examples
This address doesn't parse at all

2545, rue De Lorimier, bureau 100, Longueuil, QC, J4K3P7

This one parses, but the result is wrong: occupancy is set to None and city is set to "bureau 100 Longueuil"

2545, rue De Lorimier, bureau 100 Longueuil, QC, J4K3P7
{'full_address': '2545, rue De Lorimier, bureau 100 Longueuil, QC, J4K3P7, Canada', 'full_street': '2545, rue De Lorimier', 'street_number': '2545', 'street_type': 'rue', 'street_name': 'De Lorimier', 'route_id': None, 'post_direction': None, 'floor': None, 'building_id': None, 'occupancy': None, 'postal_box': None, 'city': 'bureau 100 Longueuil', 'region1': 'QC', 'postal_code': 'J4K3P7', 'country_id': 'CA'}

Here are some other examples of addresses that do not parse but seem valid to me, probably because they contain hyphens, parentheses or something else:

110-395 Rue des Érables, Salaberry-de-Valleyfield, QC, J6T6T5, Canada
1095 Rue de la Visitation, Saint-Charles-Borromée, QC, J6E0W7, Canada
461 Rue Dufferin, Salaberry-de-Valleyfield, QC, J6S2B3, Canada
200-1345 Boul Dagenais Ouest (Sainte-Rose), Laval, QC, H7L5Z9, Canada
3149 Boul Dagenais Ouest (Fabreville), Laval, QC, H7P1T8, Canada
655 Rue Boucher, Saint-Jean-sur-Richelieu, QC, J3B8P4, Canada
101-2575 32e Avenue (LaSalle), Montréal, QC, H8T3G9, Canada
3875 Boul Sainte-Rose (Laval-Ouest), Laval, QC, H7R1V2, Canada
1840 32e Avenue (Lachine), Montréal, QC, H8T3M6, Canada
123 Rue Huot, Notre-Dame-de-l'Île-Perrot, QC, J7V7M4, Canada
1468 Boul Monseigneur-Langlois, Salaberry-de-Valleyfield, QC, J6S1C2, Canada
93 Ave Conrad-Gosselin, Saint-Jean-sur-Richelieu, QC, J2X0A1, Canada
795 Ave de Grande-Île, Salaberry-de-Valleyfield, QC, J6S3N9, Canada
525 Rue Gadbois, Saint-Jean-sur-Richelieu, QC, J3A1V1, Canada
400 Rue Croisetière, Saint-Jean-sur-Richelieu, QC, J2X0E5, Canada
695 DU PONT, Terrebonne, QC, J6W1A2, Canada
5150 Boul Dagenais Ouest (Laval-Ouest), Laval, QC, H7R1L8, Canada
1250 Boul Dagenais Ouest (Fabreville), Laval, QC, H7L5E3, Canada
1889 Boul Dagenais Ouest (Sainte-Rose), Laval, QC, H7L5A3, Canada
149 MTEE DU MOULIN, Laval, QC, H7N3Y8, Canada
3675 Boul Dagenais Ouest (Fabreville), Laval, QC, H7P5C9, Canada
398 Boul Curé-Labelle (Chomedey), Laval, QC, H7V2S3, Canada
3251 Boul Dagenais Ouest (Fabreville), Laval, QC, H7P1V3, Canada
196 Rue St-Louis, Saint-Jean-sur-Richelieu, QC, J3B1Y1, Canada
2525B Rue du Pont, Marieville, QC, J3M0C5, Canada
1585 du Chevrotin, Richelieu, QC, J3L4Y3, Canada
91 Ave Conrad-Gosselin, Saint-Jean-sur-Richelieu, QC, J2X0A1, Canada
1000 Boul du Séminaire N, Saint-Jean-sur-Richelieu, QC, J3A1E5, Canada
A-1645A Aut Jean-Noël-Lavoie, Laval, QC, H7L3W3, Canada

Could you please take a look? My knowledge of regular expressions is very limited.

Matching condominium units for US addresses

Hi, thanks for the fantastic package. I'm finding it really useful.

Occasionally, some addresses aren't identified in natural text, and I've traced this to their use of "Unit \d+" to denote the unit of a condo. E.g.:

5625 NW 109th Ave, Unit 65, Doral, FL, 33178
451 Ives Dairy Road, Unit 204-1, Miami, Florida 33179

I searched the repo and couldn't tell whether this variant has been mentioned before. If there's a reason this isn't implemented, sorry for missing that. If this is a possible improvement, I'd be happy to make the PR; just let me know.

Until then, I might just replace all "Unit" with "Room" or one of the other current variants. Thanks!
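
A minimal sketch of that workaround, assuming "Room" is accepted by the parser as the issue suggests. The helper name unit_to_room is hypothetical and not part of pyap; it only rewrites "Unit" when a number follows, so words such as "United" are left alone.

```python
import re

# Matches the standalone word "Unit" only when followed by whitespace and a digit.
_UNIT = re.compile(r"\bUnit\b(?=\s+\d)")


def unit_to_room(text):
    """Hypothetical pre-processing step: swap 'Unit <n>' for 'Room <n>'."""
    return _UNIT.sub("Room", text)


print(unit_to_room("5625 NW 109th Ave, Unit 65, Doral, FL, 33178"))
print(unit_to_room("451 Ives Dairy Road, Unit 204-1, Miami, Florida 33179"))
```

The rewritten text would then be passed to pyap.parse() in place of the original; note that any offsets the parser reports would refer to the rewritten string.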

Support traditional (non-postal) province/state abbreviations

For example, 123 Main St Kingston ON will parse, but 123 Main St Kingston Ont. will not. This longer abbreviation is extremely common formatting, especially for PEI and NWT. Also, YK is probably a more common abbreviation than YT, yet YK does not parse.

Likewise, French Canadian speakers would probably use TNL, IPE, etc. Quebec should also have the two-letter abbreviations PQ and QB.

Postal	English	French
AB	Alta.	Alb.
BC	B.C.	C.-B.
MB	Man.	Man.
NB	N.B.	N.-B.
NL	N.L.	T.-N.-L
NS	N.S.	N.-É.
ON	Ont.	Ont.
PE	P.E.I	Î.-P.-É
QC	Que.	Qc / P.Q.
SK	Sask.	Sask.
NT	N.W.T.	T.N.-O
NU	Nvt.	Nt.
YT	Yuk.	YK

State Name	USPS Abbreviation	Traditional Abbreviation
Alabama	AL	Ala.
Alaska	AK	Alaska
Arizona	AZ	Ariz.
Arkansas	AR	Ark.
California	CA	Calif.
Colorado	CO	Colo.
Connecticut	CT	Conn.
Delaware	DE	Del.
Florida	FL	Fla.
Georgia	GA	Ga.
Hawaii	HI	Hawaii
Idaho	ID	Idaho
Illinois	IL	Ill.
Indiana	IN	Ind.
Iowa	IA	Iowa
Kansas	KS	Kans.
Kentucky	KY	Ky.
Louisiana	LA	La.
Maine	ME	Maine
Maryland	MD	Md.
Massachusetts	MA	Mass.
Michigan	MI	Mich.
Minnesota	MN	Minn.
Mississippi	MS	Miss.
Missouri	MO	Mo.
Montana	MT	Mont.
Nebraska	NE	Neb. or Nebr.
Nevada	NV	Nev.
New Hampshire	NH	N.H.
New Jersey	NJ	N.J.
New Mexico	NM	N.Mex.
New York	NY	N.Y.
North Carolina	NC	N.C.
North Dakota	ND	N.Dak.
Ohio	OH	Ohio
Oklahoma	OK	Okla.
Oregon	OR	Ore. or Oreg.
Pennsylvania	PA	Pa.
Rhode Island	RI	R.I.
South Carolina	SC	S.C.
South Dakota	SD	S.Dak.
Tennessee	TN	Tenn.
Texas	TX	Tex. or Texas
Utah	UT	Utah
Vermont	VT	Vt.
Virginia	VA	Va.
Washington	WA	Wash.
West Virginia	WV	W.Va.
Wisconsin	WI	Wis. or Wisc.
Wyoming	WY	Wyo.
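
Until such abbreviations are supported, one possible workaround is to normalize them before parsing. The sketch below is an assumption-laden illustration, not pyap code: the mapping covers only a handful of entries from the tables above (with "P.E.I." written with a trailing period, which the table omits), and the helper name normalize_region is hypothetical.

```python
import re

# Partial mapping taken from the tables above; extend as needed.
TRADITIONAL_TO_POSTAL = {
    "Alta.": "AB", "B.C.": "BC", "Man.": "MB", "N.B.": "NB", "N.L.": "NL",
    "N.S.": "NS", "Ont.": "ON", "P.E.I.": "PE", "Que.": "QC", "Sask.": "SK",
    "N.W.T.": "NT", "Nvt.": "NU", "Yuk.": "YT",
    "Calif.": "CA", "Fla.": "FL", "Tex.": "TX",
}

# Longest keys first so "N.W.T." is tried before shorter alternatives.
_ALTERNATIVES = "|".join(
    re.escape(key) for key in sorted(TRADITIONAL_TO_POSTAL, key=len, reverse=True)
)
# The lookarounds keep the match from starting or ending inside a longer token.
_REGION = re.compile(r"(?<![\w.])(" + _ALTERNATIVES + r")(?!\w)")


def normalize_region(text):
    """Replace known traditional abbreviations with postal codes."""
    return _REGION.sub(lambda m: TRADITIONAL_TO_POSTAL[m.group(1)], text)


print(normalize_region("123 Main St Kingston Ont."))
```

A blanket replacement like this can produce false positives (for example "Man." at the end of a sentence), so it is probably safest to apply it only to text already suspected to be an address.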


Is there a way to return the raw, unformatted addresses

When I use pyap.parse() on the address below, the full address is reformatted: the newline character \n is replaced by a comma and a space. I wonder if there is a way to also get the extracted but unformatted address. This would be useful if, say, a user wants the span of an address in the original text it was extracted from. Thanks!

address = """14234 Wilshire Blvd
Los Angeles, CA 90011"""

pyap.parse(address, country='US')[0].full_address
#14234 Wilshire Blvd, Los Angeles, CA 90011
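
One way to recover the original span without changes to pyap is to search the source text for the formatted address while tolerating different separators. This is only a sketch: find_original_span is a hypothetical helper, and it assumes the formatting step changes nothing except whitespace and commas between tokens.

```python
import re


def find_original_span(text, formatted_address):
    """Return (start, end) of the address in text, or None if not found."""
    # Split the formatted address into tokens, dropping commas and whitespace.
    tokens = [tok for tok in re.split(r"[\s,]+", formatted_address) if tok]
    if not tokens:
        return None
    # Allow any run of whitespace or commas between consecutive tokens.
    pattern = r"[\s,]+".join(re.escape(tok) for tok in tokens)
    match = re.search(pattern, text)
    return match.span() if match else None


text = "Ship to 14234 Wilshire Blvd\nLos Angeles, CA 90011 by Friday."
span = find_original_span(text, "14234 Wilshire Blvd, Los Angeles, CA 90011")
print(span, repr(text[span[0]:span[1]]))
```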

"and" causing issues

I'm using pyap to separate addresses from names. Thank you for the work done; it is an amazing package.
It usually works great, but when two names precede the address, it seems to wrongly treat "and ExampleName" as part of the street.

Input: "ExampleName and ExampleName 111 Rock Rd Pittsboro, NC 1111"

Interestingly, using the "&" symbol seems to not cause this issue and works as expected.

House numbers with letters in them do not parse correctly

It seems that for an address like N8780 Something Blvd Cape Canaveral, FL 11111, a house number containing a letter is parsed only from after that letter onward, so the street address becomes 8780 Something Blvd. On further testing, if a letter is added in the middle of the house number, the part of the number before the last letter is dropped.

US address parsing issue

pyap is unable to parse this address. Canadian parsing works fine, but US parsing is very poor.

"1607 23rd Street NW, Washington, DC 20008"

French addresses

Hi Vladimarius,
Do you know if any work has been done on French addresses? I have to parse some French addresses. libpostal seems nice, but I have problems installing it and I have found no other reliable solution. Do you know of other parsers for international addresses?
Thanks for your work!
Romain

Can't install pypa in python 3.8.2

Hi,

Thanks for your great work! When I try to install, I get this error:

(venv) User@SOZ-MBP16 sample_project % pip install pypa
Collecting pypa
  Downloading pyPA-1.0rc.tar.gz (38 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/User/PycharmProjects/sample_project/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/setup.py'"'"'; __file__='"'"'/private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-pip-egg-info-tkpdibrw
         cwd: /private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/
    Complete output (6 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/6k/6tmz3rc91759jdxs80_fjd180000gp/T/pip-install-4z09r1ik/pypa_857e7edb458246b0bbdc50f992f13718/setup.py", line 17
        except ImportError, e:
                          ^
    SyntaxError: invalid syntax
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/bc/5a/2964cadcb8bc8d875768a16f023abb328deb895fad65fb1406dd3abc6219/pyPA-1.0rc.tar.gz#sha256=8c5f32fed2f192bd2c07912f17e3f770ec3e09ebae2aef4091171dcdca875c72 (from https://pypi.org/simple/pypa/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Do you know what the problem is?
I am running Python 3.8.2 on macOS.

Certain Addresses not captured

pyap doesn't seem to capture any addresses containing PO Boxes, "Unit (number)", "Floor (number)" or "(number)th Floor".

How to change the order of street number and street name in the regex rules

Hi, dear vladimarius,

Thank you for publishing this program. I want to use it for European countries that have a different street address format,
for example:
AUSTRIA
"""Mr J Brownhall
264 High Street
ALLAMBIE NSW 2100
AUSTRALIA"""

The US code implements the street format like this:
full_street = r"""
    (?:
        (?P<full_street>
            {street_number}?,?\ ?
            {street_name}?,?\ ?
            (?:[\ \,]{street_type})\,?\ ?
            {post_direction}?\,?\ ?
            {floor}?\,?\ ?
            {building}?\,?\ ?
            {occupancy}?\,?\ ?
            {po_box}?
        )
    )""".format(street_name=street_name,
                street_number=street_number,
                street_type=street_type,
                post_direction=post_direction,
                floor=floor,
                building=building,
                occupancy=occupancy,
                po_box=po_box,
                )

I run into trouble when I try to swap the street number and street name:
full_street = r"""
    (?:
        (?P<full_street>
            {street_name}?,?\ ?
            {street_number}
            (?:[\ ,]{street_type}),?\ ?
            {post_direction}?,?\ ?
            {floor}?,?\ ?
            {building}?,?\ ?
            {occupancy}?,?\ ?
            {po_box}?
        )
    )""".format(street_name=street_name,
                street_number=street_number,
                street_type=street_type,
                post_direction=post_direction,
                floor=floor,
                building=building,
                occupancy=occupancy,
                po_box=po_box,
                )

The error is:
raise source.error(msg, len(condname) + 1)
re.error: unknown group name 'street_number' at position 142 (line 8, column 28)

Could you possibly help me?
I would be grateful.
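
A possible explanation, offered as a guess rather than a diagnosis: Python's re module raises "unknown group name" at that point when a conditional group of the form (?(name)...) refers to a named group that has not been defined earlier in the pattern. If one of the sub-patterns contains a conditional on street_number (an assumption not checked against pyap's source), moving {street_number} after it would trigger exactly this error. The sub-patterns below are hypothetical stand-ins, not pyap's real ones.

```python
import re

# Hypothetical stand-ins for the real sub-patterns.
street_number = r"(?P<street_number>\d+)"
# The conditional emits a space only if street_number matched earlier.
street_name = r"(?P<street_name>(?(street_number)\ )[A-Za-z]+)"


def try_compile(pattern):
    """Return 'ok' if the pattern compiles, else the error message."""
    try:
        re.compile(pattern, re.VERBOSE)
        return "ok"
    except re.error as exc:
        return "error: {}".format(exc)


# Number first: the group exists before the conditional refers to it.
number_first = r"{street_number}?{street_name}".format(
    street_number=street_number, street_name=street_name)
# Name first: the conditional now precedes the group definition.
name_first = r"{street_name}\ ?{street_number}".format(
    street_number=street_number, street_name=street_name)

print(try_compile(number_first))
print(try_compile(name_first))
```

If this is indeed the cause, reordering the placeholders alone will not be enough; the sub-pattern holding the conditional would also need to be rewritten so it no longer depends on a group defined later.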

address.as_dict() values are no longer guaranteed to be strings

Sometime between 0.2.0 and 0.3.0, non-string data started getting stored in addresses. If I iterate through the values in address.as_dict(), street numbers are now type int instead of type string (which broke some stuff I was using it for).

I'm not sure if this is intentional or not. I wouldn't expect users to want numerical parts of addresses to be stored numerically, but it's not backwards compatible, so I thought I'd bring it up.

Cheers!
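
For code that relied on the old behaviour, a small compatibility shim could convert the values back to strings. This is only a sketch: stringify_values is a hypothetical helper, and the sample dict stands in for the output of address.as_dict(), which has more keys.

```python
def stringify_values(address_dict):
    """Return a copy where every value that is not None is a str."""
    return {
        key: (value if value is None else str(value))
        for key, value in address_dict.items()
    }


# Stand-in for address.as_dict(); only a few illustrative keys.
sample = {"street_number": 2545, "street_name": "De Lorimier", "occupancy": None}
print(stringify_values(sample))
```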

Parsing issue with some US addresses

Hi Vlad! Thanks for this amazing work. However, I noticed that the parser has problems with some US addresses.

Examples
The following addresses don't parse at all.

20555 Devonshire Street #116 Chatsworth CA 91311
260-C North El Camino Real Encinitas CA 92024-2852
623 H St NW Floor 3 Washington DC 20001

Could you please take a look?

st / avenue etc. are required to be capitalized in the regex.

I'll fork and attempt to fix.

I also wanted to say thank you for an absolutely amazing package, by the way. Seriously, this is incredible work and well executed.

`match_start` and `match_end` refer to the normalized text, not the original

In the output of parse(some_text=text), the address objects have fields match_start and match_end, which should refer to the start and end of each address in the original text. Instead, they refer to the start and end of the address in the string AddressParser._normalize_string(text).
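
A general technique for this kind of problem is to record, while normalizing, which original index each normalized character came from, so that match offsets can be translated back. The sketch below uses a made-up normalization (collapsing whitespace runs into one space); what AddressParser._normalize_string actually does is not shown in this issue, and both function names are hypothetical.

```python
def normalize_with_offsets(text):
    """Collapse whitespace runs to one space and keep an index map."""
    chars = []
    offsets = []  # offsets[i] = index in text of normalized character i
    i = 0
    while i < len(text):
        if text[i].isspace():
            run_start = i
            while i < len(text) and text[i].isspace():
                i += 1
            chars.append(" ")
            offsets.append(run_start)
        else:
            chars.append(text[i])
            offsets.append(i)
            i += 1
    offsets.append(len(text))  # sentinel so an exclusive end can be mapped
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Translate a (start, end) span in normalized text to the original."""
    return offsets[start], offsets[end]


text = "Visit  14234 Wilshire Blvd\n\nLos Angeles, CA 90011 now"
normalized, offsets = normalize_with_offsets(text)
needle = "14234 Wilshire Blvd Los Angeles, CA 90011"
start = normalized.index(needle)
orig_start, orig_end = to_original_span(offsets, start, start + len(needle))
print(repr(text[orig_start:orig_end]))
```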

A small corpus of unmatched addresses using v0.3.1

Hello,

Here is a list of ~200 addresses "from the wild" that pyap does not match.

St. (Saint) and Mount / Mountain in City name failing to properly parse

It seems that when the city name includes Saint or Mount / Mountain, the parser does not correctly identify the city. Take these two examples:

111 Example Name Mount Pleasant SC 11111
111 Example Name St. Augustine SC 11111

In the first example, the city will be "Pleasant" and "Mount" will be part of the street. In the second example, "Augustine" will be the city and "St." will be part of the street. If you convert "St." to "Saint", it works great.

Street type issue

The app catches words like 'Drug' thinking it's a driveway. Shouldn't the div be changed from [\.\ ,]? to [\.\ ,] in line 177 of data.py ?
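
The effect of an optional versus required delimiter can be seen with plain re and a single street type. The patterns here are simplified stand-ins, not the ones in data.py. They also show a caveat with the proposed change: a strictly required delimiter misses a street type at the very end of the string, so allowing end-of-string as an alternative may be needed.

```python
import re

# Simplified stand-ins for a street-type rule, using only "Dr".
optional_div = re.compile(r"\bDr[. ,]?")          # delimiter optional
required_div = re.compile(r"\bDr[. ,]")           # delimiter required
div_or_end = re.compile(r"\bDr(?:[. ,]|$)")       # delimiter or end of string

samples = ["Rite Aid Drug Store", "12 Sunset Dr, Miami", "12 Sunset Dr"]
for sample in samples:
    print(
        sample,
        bool(optional_div.search(sample)),
        bool(required_div.search(sample)),
        bool(div_or_end.search(sample)),
    )
```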

Doesn't work well.

This doesn't work well. Even if I only change your example slightly, to:

225 E. John Carpenter Fwy, Suite 1500 Irving, Texas 75062

It isn't detected.
