Comments (5)
I'm also not able to extract a country for "USA" using geotext. It looks like "UK" also does not produce a match:
import geotext
text = "UK"
geo_text = geotext.GeoText(text)
dict(geo_text.country_mentions)
returns {} (an empty dict). @elyase, would these be easy fixes?
from geotext.
Check the demo data carefully. GeoText does not use synonyms in its lookup.
from geotext.
Check the demo data carefully
Are you talking about the data in geotext/data? I do see data that would seem to allow for mapping USA and UK to countries:
geotext/geotext/data/countryInfo.txt, line 285 in 21a8a7f
geotext/geotext/data/countryInfo.txt, line 8 in 21a8a7f
geotext/geotext/data/countryInfo.txt, line 128 in 21a8a7f
GeoText does not use synonyms in its lookup
First, isn't USA the official ISO 3166 3-letter code for the United States? So it is not a synonym. Also, if this issue is caused by excluding synonyms, perhaps that's the wrong design decision?
from geotext.
The data in geotext/data does not reflect what GeoText is looking for in a text. It only takes a small part of it for the lookup. So yes, the data allows for more, but GeoText is prohibiting it.
I talked about it briefly in issue-22. If you want synonyms, I tried another approach over at flashgeotext. Not sure I cover all the synonyms for country names, but some. And, I leave it to you to bring your own data/add data if something is missing.
from geotext.
Hi guys, we don't include ISO codes because the approach used in GeoText (rule-based regex) relies on high-precision rules, so you can be almost certain that a match is correct when there is one. The drawback is that we lose some recall.
While there are several ways this can be improved, there is always a fine line in the precision/recall tradeoff. For example, if you take a look at the ISO list you will see that many of the codes are tokens that are found everywhere, even when they don't represent a country. Some of them, like USA, even have meanings in other languages ("usa" is a form of the verb "to use" in Spanish). I have long wanted to improve the regex using a data-based approach, but I am missing data with representative negative examples (such as text where USA should not be extracted).
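To make the false-positive risk above concrete, here is a minimal sketch of a naive lookup that accepts any token matching an ISO code regardless of case. It does not use GeoText at all; the two-entry `ISO_ALPHA3` table and the `naive_country_codes` helper are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, tiny stand-in for a full ISO 3166-1 alpha-3 table.
ISO_ALPHA3 = {"USA": "United States", "GBR": "United Kingdom"}


def naive_country_codes(text):
    """Return every word that matches an ISO code, ignoring case."""
    tokens = re.findall(r"\w+", text)
    return [token for token in tokens if token.upper() in ISO_ALPHA3]


# An English sentence that really mentions the country.
print(naive_country_codes("She moved to the USA last year."))

# A Spanish sentence ("She uses the car") that does not mention any country,
# yet the case-insensitive lookup still reports a hit on the verb "usa".
print(naive_country_codes("Ella usa el coche."))
```

The second call shows the kind of negative example the comment refers to: a looser rule gains recall on "USA" but also matches ordinary words.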
So I prefer the approach of providing basic functionality with high precision and leaving the responsibility of extending recall to users, e.g. what @lisiq did (preprocessing the data). What we could do is improve the API to make it easier to add your own exceptions.
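One way to read the preprocessing suggestion is to expand known abbreviations into full country names before handing the text to GeoText. The sketch below is an assumption about what such a step could look like, not what @lisiq actually did; `ABBREVIATIONS` and `expand_abbreviations` are hypothetical names, and the mapping is one you would maintain yourself.

```python
import re

# Hypothetical user-maintained mapping from abbreviations to full names.
ABBREVIATIONS = {"USA": "United States", "UK": "United Kingdom"}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whole-word, case-sensitive abbreviations with full names."""
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


expanded = expand_abbreviations("I moved from the UK to the USA.")
print(expanded)

# The expanded text could then be passed to GeoText, for example:
#   from geotext import GeoText
#   GeoText(expanded).country_mentions
# This assumes the full names are among those GeoText recognizes.
```

Keeping the match case-sensitive and bounded by word boundaries leaves lowercase "usa" and longer tokens such as "USAF" untouched, which preserves some of the precision the rule-based approach is aiming for.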
That said, flashgeotext from @iwpnd looks great. Please try it out and let me know if we should join efforts there.
from geotext.