psolin / cleanco


Company Name Processor written in Python

License: MIT License

Python 100.00%

cleanco's Issues

Company extensions ending in punctuation

It currently removes "Inc" from the end of a name but is unable to remove "Inc." or "Inc..". Trailing punctuation, possibly repeated, should be treated as optional at the end of a company extension.
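
A minimal sketch of one way to do this, assuming a regular expression anchored at the end of the name. The SUFFIXES list and the strip_suffix helper are hypothetical and only illustrate the idea; they are not cleanco's actual code or data.

```python
import re

# Hypothetical sample of legal suffixes; the real list lives in cleanco's term data.
SUFFIXES = ["inc", "ltd", "llc"]

# A suffix preceded by whitespace or a comma and followed by any run of
# trailing punctuation or whitespace, anchored at the end of the name.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")[\s.,;]*$",
    re.IGNORECASE,
)


def strip_suffix(name: str) -> str:
    """Remove one trailing legal suffix together with any punctuation after it."""
    return _SUFFIX_RE.sub("", name).strip()


for raw in ("Acme Inc", "Acme Inc.", "Acme Inc..", "Acme, Inc."):
    print(repr(raw), "->", repr(strip_suffix(raw)))  # each prints 'Acme'
```

Because the suffix must be preceded by whitespace or a comma and followed only by punctuation, a name such as "Acme Incline" is left untouched.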

Estonian entity types mostly missing

Estonian legal entity types such as OÜ, MTÜ, AS, UÜ, TÜ are missing – only FIE seems to be supported.

Could try adding them myself – is termdata.py the only place that needs to have these added in?

use spaces for indentation (change tabs to spaces)?

By convention, spaces are nowadays used, see PEP8: https://www.python.org/dev/peps/pep-0008/ - but not at the expense of consistency.

Clean_name to remove all items after a comma

I like the idea of this and think there is a lot of use to it. I think it would be more useful if it removed everything in the company name string after (and including) a ','. I'd add this to the clean_name function, similar to how you handle hyphens.

add translated legal entity names

The legal terms are to some extent translatable across jurisdictions. It would be useful if the user could ask for the business types in their own native language.

For example, as a Finnish person, limited liability company is known as "osakeyhtiö" ("oy") to me, whilst a public(ly traded) limited liability company would be called "julkinen osakeyhtiö" ("oyj") in Finland.

error in Belgium

Hi guys,

there is an error in one of the Belgian types:
the correct type is CVBA: coöperatieve vennootschap met beperkte aansprakelijkheid (CVBA)

However, in your covered terms it is written as 'cbva':
https://pydoc.net/cleanco/1.3/termdata/

thanks !

Handle prefixed (and in-middle), possibly multiple terms

In Finland, you sometimes see the format "Oy Corporation Ab" where "Oy" refers to limited liability (in Finnish) and "Ab" the same (in Swedish, the other official language of Finland).

In other words, the abbreviations can also appear in front of the company name - or both before and after.
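
As a rough sketch of the idea, assuming the name is split on whitespace and legal terms are dropped from both ends until none remain. TERMS and strip_terms_both_ends are hypothetical names used only for illustration, not part of cleanco.

```python
# Hypothetical term set for illustration; not cleanco's actual data.
TERMS = {"oy", "ab", "oyj"}


def strip_terms_both_ends(name: str) -> str:
    """Drop legal terms from the start and the end of a name, repeatedly."""
    parts = name.split()
    # Keep at least one part so that a name is never emptied completely.
    while len(parts) > 1 and parts[0].lower().rstrip(".") in TERMS:
        parts.pop(0)
    while len(parts) > 1 and parts[-1].lower().rstrip(".") in TERMS:
        parts.pop()
    return " ".join(parts)


print(strip_terms_both_ends("Oy Corporation Ab"))  # Corporation
```

The len(parts) > 1 guard is a design choice: a name that consists only of legal terms keeps its last remaining part instead of becoming an empty string.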

brackets handled incorrectly

When clean_name() is used in the following way:

>>> cleanco('company (country) Pvt. Ltd.').clean_name()
'company (country'

it strips more than just the organisation type: the closing bracket is removed as well.
The expected output would be: company (country)

use ISO 20275 data from GLEIF

See https://www.gleif.org/en. There's a lot of data that would help improve the legal affix database of cleanco.

Polish legal endings

Hi guys,
many of the Polish company names I receive carry the full legal endings, such as:
spółka z ograniczoną odpowiedzialnością
spółka Jawna
spółka komandytowa
spółka akcyjna
spółka cywilna
spółka komandytowa
spółka z ograniczoną odpowiedzialnością

Would it be possible to add them on the list?

Remove old API with 2.2

Suggest we drop it in 2.2, whenever that will be out. See README for description and disclosure of deprecation plans.

add more test data (company names)

@psolin , would you have any lists of company names that you want to see tested?

still alive?

Hello there!

Thanks for this incredible package! I am just wondering: is this package still alive? I haven't seen any update for about a year.

Thanks!

broken handling of dots within suffixes

Sigh. It seems c.clean_name() fails for any suffix with dots within it, or something like that:

>>> c = cleanco("Company l.p.")
>>> c.clean_name()
'Company l.p'
>>> c = cleanco("Company l.p.p.")
>>> c.clean_name()
'Company l.p.p'

support unicode, not just ascii

If a name ends with umlaut char such as 'ä', cleaning fails. To fix, re.search needs to be called with the re.UNICODE flag.
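
For illustration, a small sketch of how the flag affects matching. It runs on Python 3, where str patterns are Unicode-aware by default and re.ASCII restores the ASCII-only behaviour; on Python 2 the re.UNICODE flag had to be passed explicitly. The helper name ends_with_word_char is hypothetical.

```python
import re


def ends_with_word_char(name: str, flags: int = 0) -> bool:
    """Return True if the last character of the name counts as a word character."""
    return re.search(r"\w$", name, flags) is not None


name = "Oulun Jää"

print(ends_with_word_char(name))              # True: Unicode-aware by default on Python 3
print(ends_with_word_char(name, re.UNICODE))  # True: the explicit flag Python 2 needed
print(ends_with_word_char(name, re.ASCII))    # False: 'ä' is not an ASCII word character
```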

Package for PyPI

Seems like a good next step. If my tests with this software prove that it is a good fit for my project, I will gladly put in the work to push it up to PyPI.

Getting rid of abbreviations

Just wanted to have some thoughts on this. They seem like they could go beyond the scope of the project.

optimize (is horribly slow)

Due to the way cleanco currently works, quite intensive operations are taking place every time a name is cleaned (a class is instantiated every time a name is cleaned; see what happens in __init__).

This should be optimized so that the operations only take place once.
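
One possible shape for this, sketched under the assumption that the expensive part is flattening and sorting the term data. TERMS_BY_COUNTRY, prepared_suffixes and this clean_name function are hypothetical stand-ins, not cleanco's real implementation.

```python
from functools import lru_cache

# Hypothetical stand-in for cleanco's term data.
TERMS_BY_COUNTRY = {
    "Finland": ["oy", "oyj", "ab"],
    "United Kingdom": ["ltd", "plc", "co. ltd."],
}


@lru_cache(maxsize=1)
def prepared_suffixes() -> tuple:
    """Flatten, deduplicate and sort the suffixes longest-first; runs only once."""
    unique = {term for terms in TERMS_BY_COUNTRY.values() for term in terms}
    return tuple(sorted(unique, key=lambda term: (-len(term), term)))


def clean_name(name: str) -> str:
    """Strip the first matching suffix, reusing the cached suffix list."""
    lowered = name.lower()
    for suffix in prepared_suffixes():
        if lowered.endswith(" " + suffix):
            return name[: -len(suffix) - 1].rstrip(" ,")
    return name


for raw in ("Acme Co. Ltd.", "Nokia Oyj", "Example plc"):
    print(clean_name(raw))
print(prepared_suffixes.cache_info().misses)  # the preparation ran exactly once
```

With the preparation cached, cleaning a name costs only the loop over the prepared suffixes, regardless of how many names are processed.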

Move data away from main class (into country-specific modules?)

To support, for example, abbreviation expansion for languages other than English, it would be better if the data was split into submodules rather than kept embedded in the class.

For example, add a "data" subpackage to contain language modules with names from the ISO 639-1 standard. So current ones would be in module "data/uk.py".

If this is ok, I can provide an implementation.

test against all suffixes / prefixes

We have a nice database of suffixes/prefixes. We should have a test that runs cleanco.clean_name() against the full database.

optimization and simplification suggestions

switch to function-based API

it makes no sense to instantiate a class for each cleaned name; it's overcomplex, extra work and unnecessary, especially when most of setup code is now outside the class

switch to working on whitespace-separated name parts rather than full strings

In effect we would check for example in case of suffix for business_name.split()[-1] == term rather than business_name.endswith(' ' + term). Of course the splitting would be done just once in the beginning.

at the moment, the class is splitting and rejoining the name already, to get rid of extra whitespaces
at the moment, the code already looks for a prefix/suffix that's padded by a single whitespace, so in effect it's the same

If we can just handle the fact that some legal terms are "multi-part" (whitespace-separated), this would simplify the code and make it run faster since for example we'd only have to work on the last whitespace-separated name part for suffix, and just the first for prefix. There are other cases, too.

We would not have to presort the data, either.
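
A sketch of how the part-based approach could handle multi-part terms, assuming each term is stored as a tuple of lowercase parts and looked up in a set. SUFFIX_TERMS, MAX_PARTS and strip_suffix_parts are hypothetical names for illustration only.

```python
# Hypothetical terms; multi-part ones are stored as tuples of lowercase parts.
SUFFIX_TERMS = {
    ("ltd",),
    ("limited",),
    ("pty", "ltd"),
    ("pty", "limited"),
    ("co.", "ltd."),
}
MAX_PARTS = max(len(term) for term in SUFFIX_TERMS)


def strip_suffix_parts(name: str) -> str:
    """Strip one legal suffix by comparing whitespace-separated name parts."""
    parts = name.split()  # split once; this also collapses extra whitespace
    lowered = [part.lower() for part in parts]
    # Try the longest candidate first and always leave at least one name part.
    for n in range(min(MAX_PARTS, len(parts) - 1), 0, -1):
        if tuple(lowered[-n:]) in SUFFIX_TERMS:
            return " ".join(parts[:-n])
    return " ".join(parts)


print(strip_suffix_parts("Example Example Pty Limited"))  # Example Example
print(strip_suffix_parts("Hello World   Ltd"))            # Hello World
```

The set lookup removes the need to presort the data; only the number of trailing parts to try (at most MAX_PARTS) has to be known in advance.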

don't use both legal and countrywise suffixes in clean_name

there are a lot of duplicates, it should be enough to use just either (preferably countrywise data since that would allow dropping off countries easily)

Croatian companies

A minor thing:

'Croatia': ['d.d.', 'd.d.o.', 'obrt'],

should be:

'Croatia': ['d.d.', 'd.o.o.', 'obrt'],

d.o.o = "drustvo ogranicene odgovornosti"; there is no d.d.o (but d.d. is OK, as it stands for "dionicko drustvo")

add travis ci

so that any changes are automatically tested. We should have better tests first, though.

Readme clean_name different than code

Hi,

I like what you are doing with this module! I went to run x.clean_name() and received an error saying "cleanco instance has no attribute 'clean_name'".

I looked through your code and noticed that your actual method/attribute is cleanname(). Should be fixed in one of the locations. To me clean_name() makes more sense.

Error is on line 207 of cleanco.py. Thanks.

Problems parsing company names with punctuations

Hello,

Very nice module, but it doesn't always cope with the real, human-entered company names we deal with a lot. Below are some obvious examples where the name is not parsed:

LIBGAS,LTD -> LIBGAS,LTD
AIRDAS USA,LLC -> AIRDAS USA,LLC
GF LOGISTICS.INC -> GF LOGISTICS.INC
HAKUTATZ.TECH.CO.,LTD. -> HAKUTATZ.TECH.CO.,LTD

Thanks

Add some proper tests

Use unittest, or py.test or nose, whatever you prefer. I would recommend using py.test with https://pypi.python.org/pypi/hypothesis/

Acronyms for the legal entity should be included

Abbreviations of the legal entity types should be added so that names are classified correctly:

In [3]: matches("Relience Private Limited", classification_sources)                                       
Out[3]: 
['Hong Kong',
 'Israel',
 'New Zealand',
 'Pakistan',
 'United Kingdom',
 'United States of America']

Here there should have been India in the output.

drop Python2 support

Incorrect detection of "Pty Limited" Suffix

>>> cleanco("Example Example Pty Ltd").clean_name() # CORRECT
'Example Example'
>>> cleanco("Example Example Pty Limited").clean_name() # Not so good
'Example Example Pty'

To give you a view of the scope of the problem: I'm working on normalising a database of around 900k company names which have been typed into an application over a 10 year period. The database contains primarily companies from anglophone countries. Of these, around 580 have a company name like this.

Do you see this as a problem also? If so, I'm happy to put together a patch.

Drop Python 3.5, add GH actions, drop Travis CI, start preparations for 2.1

Done. Need someone to update changelog.

Use ISO3166 country names

This makes it easier to map the country-specific codes to country data in other systems. The names can be found for example in the python "iso3166" package.

Add support for case, whitespace & separator normalization

I understand this may fall outside the scope, but it would be very convenient if cleanco also had this kind of simple normalization built-in:

standardizing lettercase (e.g., all lowercase)
standardizing separators (e.g., commas must be followed by spaces)
standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
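
The three steps could be sketched roughly as follows. The function name normalize_name is hypothetical, and the choice of lowercase and of ", " as the canonical comma form is an assumption taken from the examples in the list above.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase a name, normalize comma spacing and collapse whitespace runs."""
    name = name.lower()
    # A comma gets no space before it and exactly one space after it.
    name = re.sub(r"\s*,\s*", ", ", name)
    # Any run of whitespace (tabs, newlines, multiple spaces) becomes one space.
    name = re.sub(r"\s+", " ", name)
    return name.strip()


print(normalize_name("LIBGAS,LTD"))            # libgas, ltd
print(normalize_name("  Acme   Corp ,Inc  "))  # acme corp, inc
```

Running this before suffix matching would let a name like "LIBGAS,LTD" be handled by the same rules as "Libgas, Ltd".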

test for cyrillic (Russian)

Recent code commits introduced improved Unicode support. However there are no tests to demonstrate it works.

switch to setuptools

Done with not being able to use "python setup.py develop" ....

fix multipart term checking

The recent 2.0 work ignored multi-part terms, i.e. terms such as "co. ltd." that contain whitespace. A significant minority of the terms are multi-part, so this regression needs to be fixed.

Remove build directory from version control

That should not be included.

Cleanco not properly identifying Czech companies

Hello, thank you for your work on cleanco.

Cleanco does not seem to work properly for Czech companies, e.g.:

>>> c = cleanco("Company s.r.o.")
>>> c.type() is None
True
>>> c.country() is None
True
>>> c = cleanco("Company a.s.")
>>> c.type() is None
True
>>> c.country() is None
True

Although I see that 's.r.o.' and 'a.s.' are present in termdata.py in the right places.

One more detail, there is also the possibility to use 'spol. s r.o.' instead of 's.r.o.' in Czech Republic - they are equivalent. 'spol. s r.o.' is not present in termdata.py.

cleanco('AMBA').clean_name() is empty

Not working for 'p.c.'

I am trying to parse a business name that has p.c. as its extension, but x.type() returns None.

Example:

x = cleanco("dentistry for children, louis a. pollina, d.d.s., p.c.")
x.type()

Returns None.

Add SE and AG

Thanks for the great library!
Could we add SE (https://en.wikipedia.org/wiki/Societas_Europaea), "AKTIENGESELLSCHAFT" (the long form of AG) and EG (Erwerbsgesellschaft), and see if it is possible to also treat a dash as a separator between a company name and its type? Examples below:

SE:
('ALBA SE';'NEW YORKER SE') --> currently: ('ALBA SE';'NEW YORKER SE'), correct: ('ALBA';'NEW YORKER')

AKTIENGESELLSCHAFT:
'WIELAND-WERKE AKTIENGESELLSCHAFT' --> currently: ('WIELAND-WERKE AKTIENGESELLSCHAFT'), correct: ('WIELAND-WERKE AKTIENGESELLSCHAFT')

EG:
'REWE DORTMUND GROSSHANDEL EG'

-AG
'DEUTSCHE VERSICHERUNGS-AG' --> currently: ('DEUTSCHE VERSICHERUNGS-AG'), correct: ('DEUTSCHE VERSICHERUNGS-AG')

improve string comparisons

The current implementation just does case-insensitive matching. Comparisons are however much more complex in Unicode world. See for example:

https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison

http://www.unicode.org/reports/tr15/#Normalization_Forms_Table
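
A minimal sketch of a more robust comparison, assuming compatibility decomposition (NFKD) combined with str.casefold() is acceptable for legal terms. The helper names canonical and same_term are hypothetical.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return a caseless, normalized form of the text for comparison purposes."""
    # Normalize, casefold, then normalize again, because casefolding can
    # produce characters that are no longer in normalized form.
    folded = unicodedata.normalize("NFKD", text).casefold()
    return unicodedata.normalize("NFKD", folded)


def same_term(a: str, b: str) -> bool:
    """Compare two strings caselessly, ignoring composed/decomposed differences."""
    return canonical(a) == canonical(b)


print(same_term("O\u00dc", "ou\u0308"))  # True: composed 'Ü' vs 'u' + combining diaeresis
print(same_term("Straße", "STRASSE"))    # True: casefold maps 'ß' to 'ss'
print(same_term("oy", "oyj"))            # False
```

Note that this keeps diacritics significant, so "osakeyhtiö" and "osakeyhtio" still compare as different.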

SRL missing

Hi guys
fantastic job.. but one important ending is missing: "SRL" without final dot
name = cleanco("Unimarkt Handelsgesellschaft SRL").clean_name(prefix=True, suffix=True, middle=True, multi=True)
print(name)
output
Unimarkt Handelsgesellschaft SRL

stripping of suffix fails if it ends in full stop.

Same happens if the ending character is a comma. Example:

>>> from cleanco import cleanco
>>> y='posnegansett properties, llc.'
>>> ya=cleanco(y)
>>> print ya.type()
['Limited Liability Company']
>>> print ya.clean_name()
posnegansett properties, llc

The result should not have the 'llc' suffix.

Add license

Under what license is this library distributed? GPL2? BSD? Something else? Please can we have a LICENSE file added?

country logic does not work for terms ending with '.'

business_name = "Some Big Pharma sh.a."
x = cleanco(business_name)

print(x.business_name)
print(x.string_stripper(x.business_name))
print(x.clean_name())
print(x.country())

prints:

Some Big Pharma sh.a.
Some Big Pharma sh.a
Some Big Pharma
None

sh.a. is in the Albania terms:

cleanco/termdata.py

Line 46 in 56ff654

'Albania': ['sh.a.', 'sh.p.k.'],

It is not being recognized as Albanian because the . at the end of sh.a. is removed in:

cleanco/cleanco.py

Line 56 in 56ff654

business_name = self.string_stripper(business_name)
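
One way around this, sketched under the assumption that trailing punctuation is stripped from the terms as well as from the name before they are compared. TERMS_BY_COUNTRY below is a small hypothetical excerpt (the Albania entry is the one quoted above), and countries_for is not a cleanco function.

```python
# Hypothetical excerpt of the country term data.
TERMS_BY_COUNTRY = {
    "Albania": ["sh.a.", "sh.p.k."],
    "Croatia": ["d.d.", "d.o.o.", "obrt"],
}


def _strip_punct(text: str) -> str:
    """Remove trailing dots, commas and spaces."""
    return text.rstrip(" .,")


def countries_for(name: str) -> list:
    """Return countries whose terms match the end of the name."""
    stripped = _strip_punct(name.lower())
    return sorted(
        country
        for country, terms in TERMS_BY_COUNTRY.items()
        # Strip the term the same way as the name, so 'sh.a.' matches 'sh.a'.
        if any(stripped.endswith(" " + _strip_punct(term)) for term in terms)
    )


print(countries_for("Some Big Pharma sh.a."))  # ['Albania']
print(countries_for("Some Big Pharma sh.a"))   # ['Albania']
```

Applying the same stripping to both sides keeps the comparison symmetric, so it does not matter whether the user typed the final dot.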

tox and python setup.py test support

Need two things:

make tests runnable by 'python setup.py test'
support multiple version (2.7 & 3.5) testing using tox

Inconsistent parsing

The following two equivalent names parse differently when presumably they should give the same result:

cleanco('Hello World Company Limited').clean_name()

cleanco('Hello World Company Ltd').clean_name()