psolin / cleanco
Company Name Processor written in Python
License: MIT License
Currently it removes "Inc" from the end of a name but is unable to remove "Inc.." or "Inc.". Please make multiple trailing punctuation marks optional at the end of the company extension.
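One way to make the trailing punctuation optional is to let the pattern absorb any run of dots, commas or spaces after the term. This is only a minimal sketch and not cleanco's actual code; `strip_suffix` and the `SUFFIXES` list are hypothetical names used for illustration.

```python
import re

# Hypothetical term list for illustration; the real terms live in termdata.py.
SUFFIXES = ["inc", "ltd", "llc"]

# A separator (whitespace or comma), then one term, then any trailing
# run of whitespace, dots or commas up to the end of the string.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")[\s.,]*$",
    re.IGNORECASE,
)

def strip_suffix(name):
    """Remove one trailing legal suffix, tolerating 'Inc', 'Inc.' and 'Inc..'."""
    return _SUFFIX_RE.sub("", name).strip()
```

Because the pattern requires a separator before the term, a name that merely ends in the same letters (for example "Princ") is left untouched.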
Estonian legal entity types such as OÜ, MTÜ, AS, UÜ, TÜ are missing – only FIE seems to be supported.
I could try adding them myself. Is termdata.py the only place where they need to be added?
By convention, spaces are nowadays used, see PEP8: https://www.python.org/dev/peps/pep-0008/ - but not at the expense of consistency.
I like the idea of this and think there is a lot of use for it. I think it would be more useful if it removed everything in the company name string after (and including) a ','. I'd add this to the clean_name function, similar to how you handle hyphens.
The legal terms are to some extent translatable across jurisdictions. It would be useful if user could ask for the business types in their own native language.
For example, as a Finnish person, limited liability company is known as "osakeyhtiö" ("oy") to me, whilst a public(ly traded) limited liability company would be called "julkinen osakeyhtiö" ("oyj") in Finland.
Hi guys,
there is an error in one of the Belgian types.
The correct type is CVBA: coöperatieve vennootschap met beperkte aansprakelijkheid (CVBA).
However, in your covered terms it is written as 'cbva':
https://pydoc.net/cleanco/1.3/termdata/
Thanks!
In Finland, you sometimes see the format "Oy Corporation Ab" where "Oy" refers to limited liability (in Finnish) and "Ab" the same (in Swedish, the other official language of Finland).
In other words, the abbreviations can also appear in front of the company name - or both before and after.
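A rough sketch of how both positions could be handled, assuming a plain set of lowercase terms; `TERMS` and `strip_affixes` are made-up names for illustration and not part of cleanco.

```python
# Hypothetical terms for illustration only.
TERMS = {"oy", "ab", "oyj"}

def strip_affixes(name):
    """Remove legal terms found at the start and/or the end of the name."""
    parts = name.split()
    # Always keep at least one part so a name is never reduced to nothing.
    while len(parts) > 1 and parts[0].lower() in TERMS:
        parts.pop(0)
    while len(parts) > 1 and parts[-1].lower() in TERMS:
        parts.pop()
    return " ".join(parts)
```

With this approach "Oy Corporation Ab" and "Corporation Oy" both reduce to "Corporation".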
When clean_name() is used in the following way:
>>> cleanco('company (country) Pvt. Ltd.').clean_name()
'company (country'
it strips more than the organisation type: the closing parenthesis is removed as well.
The expected output would be: company (country)
See https://www.gleif.org/en. There's a lot of data that would help improve the legal affix database of cleanco.
Hi guys,
many of the Polish companies I receive have full legal endings such as:
spółka z ograniczoną odpowiedzialnością
spółka jawna
spółka komandytowa
spółka akcyjna
spółka cywilna
spółka komandytowa
spółka z ograniczoną odpowiedzialnością
Would it be possible to add them on the list?
Suggest we drop it in 2.2, whenever that will be out. See README for description and disclosure of deprecation plans.
@psolin , would you have any lists of company names that you want to see tested?
Hello there!
Thanks for this incredible package! I am just wondering: is this package still alive? I haven't seen any updates for about a year.
Thanks!
Sigh. It seems c.clean_name() fails for any suffix with dots within it, or something like that:
>>> c = cleanco("Company l.p.")
>>> c.clean_name()
'Company l.p'
>>> c = cleanco("Company l.p.p.")
>>> c.clean_name()
'Company l.p.p'
If a name ends with umlaut char such as 'ä', cleaning fails. To fix, re.search needs to be called with the re.UNICODE flag.
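For reference, a small sketch of passing the flag. The pattern and the `last_word` helper are assumptions made for illustration and are not taken from cleanco. On Python 3, `str` patterns are already Unicode-aware, so the flag is redundant there; it matters on Python 2.

```python
import re

# Pattern for "the last word of the name". \w only matches non-ASCII
# letters such as 'ä' when the pattern is Unicode-aware.
LAST_WORD = re.compile(r"(\w+)\W*$", re.UNICODE)

def last_word(name):
    """Return the final word of name, or None if there is no word at all."""
    match = LAST_WORD.search(name)
    return match.group(1) if match else None
```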
Seems like a good next step. If my tests with this software prove that it is a good fit for my project, I will gladly put in the work to push it up to PyPI.
Just wanted to have some thoughts on this. They seem like they could go beyond the scope of the project.
Due to the way cleanco currently works, quite intensive operations take place every time a name is cleaned (a class is instantiated every time a name is cleaned; see what happens in __init__).
This should be optimized so that the operations take place only once.
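The general idea could look something like the sketch below: prepare the term list a single time at import and reuse it in a plain function. The names `TERMS_BY_TYPE`, `prepare_terms` and `clean` are hypothetical and the data is a tiny stand-in, not the real term data.

```python
# Hypothetical, tiny stand-in for the term data in termdata.py.
TERMS_BY_TYPE = {
    "Limited": ["ltd", "ltd.", "limited"],
    "Corporation": ["inc", "inc.", "corp"],
}

def prepare_terms(terms_by_type):
    """Flatten and sort the terms once, longest first."""
    terms = {t for group in terms_by_type.values() for t in group}
    return sorted(terms, key=lambda t: (-len(t), t))

# Built a single time at import, then shared by every call to clean().
PREPARED_TERMS = prepare_terms(TERMS_BY_TYPE)

def clean(name, terms=PREPARED_TERMS):
    """Strip the longest matching suffix from name (ASCII terms assumed)."""
    lowered = name.lower()
    for term in terms:
        if lowered.endswith(" " + term):
            return name[: -len(term)].rstrip()
    return name
```

Cleaning many names then costs one loop over the prepared list per name, with no per-name setup.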
To support, for example, abbreviation expansion for languages other than English, it would be better if the data were split into submodules rather than kept embedded in the class.
For example, add a "data" subpackage to contain language modules with names from the ISO 639-1 standard. So current ones would be in module "data/uk.py".
If this is ok, I can provide an implementation.
We have a nice database of suffixes/prefixes. We should have a test that runs cleanco.clean_name() against the full database.
In effect we would check, for example in the case of a suffix, for business_name.split()[-1] == term rather than business_name.endswith(' ' + term). Of course the splitting would be done just once in the beginning.
If we can just handle the fact that some legal terms are "multi-part" (whitespace-separated), this would simplify the code and make it run faster since for example we'd only have to work on the last whitespace-separated name part for suffix, and just the first for prefix. There are other cases, too.
We would not have to presort the data, either.
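A minimal sketch of the split-based check, including one possible way to deal with multi-part terms: index each term by its tuple of whitespace-separated parts and compare against the last n parts of the name. `TERMS` and `strip_suffix_by_parts` are illustrative names, and the terms are sample values only.

```python
# Hypothetical sample terms; some contain whitespace ("multi-part").
TERMS = ["ltd", "co. ltd.", "pty ltd", "inc."]

# Index terms by their tuple of parts; a set lookup needs no presorting.
TERM_PARTS = {tuple(term.split()) for term in TERMS}
MAX_PARTS = max(len(parts) for parts in TERM_PARTS)

def strip_suffix_by_parts(business_name):
    """Drop a trailing legal term by comparing whole name parts."""
    parts = business_name.split()  # split once, up front
    lowered = [p.lower() for p in parts]
    # Try the longest candidate first so "pty ltd" wins over "ltd",
    # and never consume the whole name.
    for n in range(min(MAX_PARTS, len(parts) - 1), 0, -1):
        if tuple(lowered[-n:]) in TERM_PARTS:
            return " ".join(parts[:-n])
    return business_name
```

The same lookup works for prefixes by comparing `lowered[:n]` instead of `lowered[-n:]`.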
A minor thing:
'Croatia': ['d.d.', 'd.d.o.', 'obrt'],
should be:
'Croatia': ['d.d.', 'd.o.o.', 'obrt'],
d.o.o = "drustvo ogranicene odgovornosti"; there is no d.d.o (but d.d. is OK, as it stands for "dionicko drustvo")
so that any changes are automatically tested against. We should have better tests first, though.
Hi,
I like what you are doing with this module! I went to run x.clean_name() and received an error saying "cleanco instance has no attribute 'clean_name'".
I looked through your code and noticed that your actual method/attribute is cleanname(). This should be fixed in one of the two places. To me, clean_name() makes more sense.
Error is on line 207 of cleanco.py. Thanks.
Hello,
Very nice module, but it doesn't always cope well with the real, human-entered company names we deal with a lot. Below are some obvious examples where the name is not parsed:
LIBGAS,LTD -> LIBGAS,LTD
AIRDAS USA,LLC -> AIRDAS USA,LLC
GF LOGISTICS.INC -> GF LOGISTICS.INC
HAKUTATZ.TECH.CO.,LTD. -> HAKUTATZ.TECH.CO.,LTD
Thanks
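Names like the ones above could be handled by treating commas and dots as separators before looking for a suffix. The following is only a sketch under that assumption; `SUFFIXES`, `split_tokens` and `strip_trailing_suffixes` are hypothetical names. Note that it is lossy: punctuation inside the remaining name is replaced by spaces.

```python
import re

# Hypothetical suffix list for illustration only.
SUFFIXES = {"ltd", "llc", "inc", "co"}

def split_tokens(name):
    """Split on whitespace, commas and dots so 'USA,LLC' yields two tokens."""
    return [tok for tok in re.split(r"[\s,.]+", name) if tok]

def strip_trailing_suffixes(name):
    """Drop suffix tokens from the end of the name."""
    tokens = split_tokens(name)
    # Pop suffix tokens from the end, but always keep at least one token.
    while len(tokens) > 1 and tokens[-1].lower() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```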
Use unittest, or py.test or nose, whatever you prefer. I would recommend using py.test with https://pypi.python.org/pypi/hypothesis/
Abbreviations of legal entity types should be added so that names are classified correctly:
In [3]: matches("Relience Private Limited", classification_sources)
Out[3]:
['Hong Kong',
'Israel',
'New Zealand',
'Pakistan',
'United Kingdom',
'United States of America']
India should have been in the output here.
>>> cleanco("Example Example Pty Ltd").clean_name() # CORRECT
'Example Example'
>>> cleanco("Example Example Pty Limited").clean_name() # Not so good
'Example Example Pty'
To give you a view of the scope of the problem: I'm working on normalising a database of around 900k company names which have been typed into an application over a 10-year period. The database contains primarily companies from anglophone countries. Of these, around 580 have a company name like this.
Do you see this as a problem also? If so, I'm happy to put together a patch.
Done. Need someone to update changelog.
This makes it easier to map the country-specific codes to country data in other systems. The names can be found for example in the python "iso3166" package.
I understand this may fall outside the scope, but it would be very convenient if cleanco also had this kind of simple normalization built-in:
Recent code commits introduced improved Unicode support. However there are no tests to demonstrate it works.
Done with not being able to use "python setup.py develop" ....
The recent 2.0 work ignored multi-part terms, i.e. terms such as "co. ltd." that contain whitespace. A significant minority of the terms are multi-part, so this regression needs to be fixed.
That should not be included.
Hello, thank you for your work on cleanco.
Cleanco does not seem to work properly for Czech companies, e.g.:
>>> c = cleanco("Company s.r.o.")
>>> c.type() is None
True
>>> c.country() is None
True
>>> c = cleanco("Company a.s.")
>>> c.type() is None
True
>>> c.country() is None
True
Although I see that 's.r.o.' and 'a.s.' are present in termdata.py in the right places.
One more detail: in the Czech Republic it is also possible to use 'spol. s r.o.' instead of 's.r.o.'; they are equivalent. 'spol. s r.o.' is not present in termdata.py.
I am trying to parse a business name which contains p.c. as an extension, but when I use x.type() it returns None.
Ex:-
Thanks for the great library!
Could we add SE (https://en.wikipedia.org/wiki/Societas_Europaea), "AKTIENGESELLSCHAFT" (the long form of AG) and EG (Erwerbsgesellschaft), and see if it is possible to also treat a dash as a separator between a company type and its name? Examples below:
SE:
('ALBA SE';'NEW YORKER SE') --> currently: ('ALBA SE';'NEW YORKER SE'), correct: ('ALBA';'NEW YORKER')
AKTIENGESELLSCHAFT:
'WIELAND-WERKE AKTIENGESELLSCHAFT' --> currently: ('WIELAND-WERKE AKTIENGESELLSCHAFT'), correct: ('WIELAND-WERKE AKTIENGESELLSCHAFT')
EG:
'REWE DORTMUND GROSSHANDEL EG'
-AG
'DEUTSCHE VERSICHERUNGS-AG' --> currently: ('DEUTSCHE VERSICHERUNGS-AG'), correct: ('DEUTSCHE VERSICHERUNGS-AG')
The current implementation just does case-insensitive matching. However, comparisons are much more complex in the Unicode world. See for example:
https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison
http://www.unicode.org/reports/tr15/#Normalization_Forms_Table
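A short sketch of what a more robust comparison might look like, following the approach commonly suggested for caseless matching: case-fold and then normalize both sides. The helper names are made up for illustration, and this simplified form (NFKD applied after `casefold()`) is an assumption that may not cover every edge case described in the Unicode reports.

```python
import unicodedata

def normalize_caseless(text):
    """Return a form suited to caseless, normalization-insensitive comparison."""
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    """Compare two strings ignoring case and composed/decomposed differences."""
    return normalize_caseless(left) == normalize_caseless(right)
```

Unlike `lower()`, `casefold()` maps "ß" to "ss", and the normalization step makes a precomposed "ö" compare equal to "o" followed by a combining diaeresis.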
Hi guys
fantastic job.. but one important ending is missing: "SRL" without final dot
name = cleanco("Unimarkt Handelsgesellschaft SRL").clean_name(prefix=True, suffix=True, middle=True, multi=True)
print(name)
output
Unimarkt Handelsgesellschaft SRL
Same happens if the ending character is a comma. Example:
>>> from cleanco import cleanco
>>> y='posnegansett properties, llc.'
>>> ya=cleanco(y)
>>> print ya.type()
['Limited Liability Company']
>>> print ya.clean_name()
posnegansett properties, llc
The result should not have the 'llc' suffix.
Under what license is this library distributed? GPL2? BSD? Something else? Please can we have a LICENSE file added?
business_name = "Some Big Pharma sh.a."
x = cleanco(business_name)
print(x.business_name)
print(x.string_stripper(x.business_name))
print(x.clean_name())
print(x.country())
prints:
Some Big Pharma sh.a.
Some Big Pharma sh.a
Some Big Pharma
None
sh.a. is in the Albania terms:
Line 46 in 56ff654
It is not being recognized as Albanian because the . at the end of sh.a. is removed in:
Line 56 in 56ff654
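One possible way around this is to compare terms and names with the final dot removed on both sides, so the lookup no longer depends on whether the name was stripped first. This is a sketch only; `TERMS_BY_COUNTRY` is a hypothetical one-entry excerpt and `countries_for` is not a cleanco function.

```python
# Hypothetical excerpt of the country terms; the real data is in termdata.py.
TERMS_BY_COUNTRY = {"Albania": ["sh.a."]}

def countries_for(name):
    """Look up countries, matching terms with or without their final dot."""
    # Drop trailing spaces, commas and dots from the name before comparing.
    bare = name.lower().rstrip(" ,.")
    found = []
    for country, terms in TERMS_BY_COUNTRY.items():
        for term in terms:
            # Compare against the term without its own trailing dot.
            if bare.endswith(" " + term.rstrip(".")):
                found.append(country)
                break
    return found
```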
Need two things:
The following two equivalent names are parsed differently when presumably they should give the same result:
cleanco('Hello World Company Limited').clean_name()
cleanco('Hello World Company Ltd').clean_name()
In many countries, a "public" limited (liability) company has a distinction that its shares are publicly traded or -tradable. We don't have this distinction in cleanco currently.
The non-determinism was fixed in #54 in June, but the latest release (2.0.1) was in April. Is it possible to release a version fixing the non-determinism?