
geograpy3's Introduction

geograpy3

Join the discussion at https://github.com/somnathrakshit/geograpy3/discussions

geograpy3 is a fork of geograpy2, which is itself a fork of geograpy. It inherits most of geograpy2 but solves several problems, such as UTF-8 support, place names with multiple words, and confusion over homonyms. Unlike geograpy2, geograpy3 is compatible with Python 3.

Since geograpy3 0.0.2, cities, countries and regions are matched against a database derived from the corresponding Wikidata entries.

What it is

geograpy extracts place names from a URL or text, and adds context to those names -- for example distinguishing between a country, region or city.

The extraction is a two-step process. The first step is a Natural Language Processing task that analyzes a text for potential mentions of geographic locations. In the second step, the words that represent such locations are looked up using the Locator.

If you already know that your content has geographic information you might want to use the Locator interface directly.

Examples/Tutorial

Install & Setup

Grab the package using pip (this will take a few minutes)

pip install geograpy3

geograpy3 uses NLTK for entity recognition, so you'll also need to download the models we're using. Fortunately there's a command that'll take care of this for you.

geograpy-nltk

Getting the source code

git clone https://github.com/somnathrakshit/geograpy3
cd geograpy3
scripts/install

Basic Usage

Import the module, give some text or a URL, and presto.

import geograpy
url = 'https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url=url)

Now you have access to information about all the places mentioned in the linked article.

  • places.countries contains a list of country names
  • places.regions contains a list of region names
  • places.cities contains a list of city names
  • places.other lists everything that wasn't clearly a country, region or city

Note that the other list might be useful for shorter texts, to pull out information like street names, points of interest, etc., but at the moment it is a bit messy when scanning longer texts that contain adjectival forms of proper nouns (like "Russian" instead of "Russia").

But Wait, There's More

In addition to listing the names of discovered places, you'll also get some information about the relationships between places.

  • places.country_regions regions broken down by country
  • places.country_cities cities broken down by country
  • places.address_strings city, region, country strings useful for geocoding

Last But Not Least

While a text might mention many places, it's probably focused on one or two, so geograpy3 also breaks down countries, regions and cities by number of mentions.

  • places.country_mentions
  • places.region_mentions
  • places.city_mentions

Each of these returns a list of tuples. The first item in the tuple is the place name and the second item is the number of mentions. For example:

[('Russian Federation', 14), ('Ukraine', 11), ('Lithuania', 1)]
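
As a rough illustration of how a list of this shape can be built, the sketch below counts place names with `collections.Counter`. It is a minimal stand-in and not geograpy3's internal implementation; the `names` list and the `count_mentions` helper are made up for this example.

```python
from collections import Counter

# Hypothetical sample data: one entry per mention found in a text.
names = [
    "Ukraine",
    "Russian Federation",
    "Ukraine",
    "Lithuania",
    "Russian Federation",
    "Russian Federation",
]


def count_mentions(place_names):
    """Return (name, count) tuples, most frequently mentioned first."""
    return Counter(place_names).most_common()


mentions = count_mentions(names)
print(mentions)
```

The first tuple in the result is then the place the text is most likely focused on.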

If You're Really Serious

You can of course use each of Geograpy's modules on their own. For example:

from geograpy import extraction

e = extraction.Extractor(url='https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay')
e.find_geoEntities()

# You can now access all of the places found by the Extractor
print(e.places)

Place context is handled in the places module. For example:

from geograpy import places

pc = places.PlaceContext(['Cleveland', 'Ohio', 'United States'])

pc.set_countries()
print(pc.countries)  # ['United States']

pc.set_regions()
print(pc.regions)  # ['Ohio']

pc.set_cities()
print(pc.cities)  # ['Cleveland']

print(pc.address_strings)  # ['Cleveland, Ohio, United States']

And of course all of the other information shown above (country_regions etc) is available after the corresponding set_ method is called.

Stackoverflow

Credits

geograpy3 uses the following excellent libraries:

  • NLTK for entity recognition
  • newspaper for text extraction from HTML
  • jellyfish for fuzzy text match
  • pylodstorage for storage and retrieval of tabular data from SQL and SPARQL sources

geograpy3 uses the following data sources:

Hat tip to Chris Albon for the name.


geograpy3's Issues

Geograpy3 for other languages

Is your feature request related to a problem? Please describe.
I would like to use geograpy3 in my current project, but as I'm working with sentences in German, it doesn't work...

Describe the solution you'd like
Support other languages as well.

Similarity matching is too error prone

The similarity matching might need to be adjusted to avoid identifying United States as a region.

region_match("United States", "United States Minor Outlying Islands") is evaluated to true
(region_match()→fuzzy_match()→jaro_winkler_similarity() with a max distance threshold of 0.8)
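
To see why this pair passes a 0.8 threshold, here is a plain textbook Jaro-Winkler implementation in pure Python. It is a sketch for illustration only: jellyfish's implementation may differ in detail, and `strict_match` with its `min_length_ratio` parameter is a hypothetical tweak, not something geograpy3 provides.

```python
def jaro(s1, s2):
    """Plain Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for the first unmatched equal character inside the window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def strict_match(s1, s2, threshold=0.8, min_length_ratio=0.8):
    """Hypothetical stricter check: also require similar string lengths."""
    ratio = min(len(s1), len(s2)) / max(len(s1), len(s2), 1)
    return ratio >= min_length_ratio and jaro_winkler(s1, s2) >= threshold


score = jaro_winkler("United States", "United States Minor Outlying Islands")
print(round(score, 4))
```

All 13 characters of the shorter name match in order, so the score is dominated by the shared prefix and ends up above 0.8 even though the longer name is almost three times as long. Adding a length-ratio condition, as in `strict_match`, is one possible way to reject such pairs.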

Not able to fetch places when passing url, but text works

Hey there,

Thanks so much for the awesome module! I installed it on python3.
Overall worked great so far. However I am not able to fetch the places info when I pass a url. But when I pass a text, it does an excellent job.

WITH URL:

import geograpy
url='https://www.bbc.com/news/uk-13426353'
places = geograpy.get_geoPlace_context(url = url)
print(places)
countries=[]
regions=[]
cities=[]
other=[]

WITH TEXT:

import geograpy
text='During the 70-day torch relay, it will pass through towns and cities including Bristol, Cardiff, Liverpool, Belfast, Glasgow, Aberdeen, Newcastle, Manchester, Sheffield, Nottingham, Oxford, Southampton and Dover.'
places = geograpy.get_geoPlace_context(text=text)
print(places)
countries=['South Africa', 'Australia', 'New Zealand', 'United Kingdom', 'Ireland', 'United States', 'Canada']
regions=[]
cities=['Newcastle', 'Belfast', 'Sheffield', 'Cardiff', 'Oxford', 'Southampton', 'Nottingham', 'Bristol']
other=[]

Not sure if I missed out something really silly.

Thanks for your help in advance.

Monkins

drag and drop service for extraction from files

Is your feature request related to a problem? Please describe.
There should be a web-based service using a drop-target zone to allow drag & drop (or at least upload) of files of different formats. For a start, e.g. text, PDF, maybe DOC, DOCX and other office formats.

Describe the solution you'd like
stepwise process:

  • extract text from document
  • process potential location references
  • create a list of results in a selectable format e.g. csv, json, xml, sparql

Describe alternatives you've considered

  • RESTful interface
  • Content Negotiation
  • Command-Line Tool


[BUG]locs.db version check needed

If locs.db is outdated, a pip install --upgrade will not fix it.
Workaround, e.g.:

rm /Users/wf/Library/Python/3.8/lib/python/site-packages/geograpy/locs.db

Adding version information and checking it would improve this behavior. The problem does not affect first-time installs, though.
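
One way such a version check could look is sketched below, using SQLite's built-in `PRAGMA user_version` header field. This is an assumption about a possible design, not how geograpy3 handles locs.db; `EXPECTED_DB_VERSION`, `db_is_current` and `stamp_version` are hypothetical names.

```python
import os
import sqlite3

# Hypothetical data version that the installed package expects.
EXPECTED_DB_VERSION = 2


def db_is_current(db_path, expected=EXPECTED_DB_VERSION):
    """Return True if the SQLite file exists and carries the expected version."""
    if not os.path.isfile(db_path):
        return False
    conn = sqlite3.connect(db_path)
    try:
        (version,) = conn.execute("PRAGMA user_version").fetchone()
    finally:
        conn.close()
    return version == expected


def stamp_version(db_path, version=EXPECTED_DB_VERSION):
    """Write the version into the database header after (re)building it."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA statements do not accept bound parameters, hence the int() cast.
        conn.execute(f"PRAGMA user_version = {int(version)}")
        conn.commit()
    finally:
        conn.close()
```

On startup, a caller could delete and rebuild the file whenever `db_is_current` returns False, which would make the manual `rm` unnecessary.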

[BUG]Regression: testGetCitiesOfRegion fails on nightly build

ERROR: testGetCitiesOfRegion (tests.test_wikidata.TestWikidata)
Test getting cities based on region wikidata id
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/hd/luxio/var/lib/jenkins/jobs/geograpy3/workspace/tests/test_wikidata.py", line 105, in testGetCitiesOfRegion
    biggestCity=cities[0]
IndexError: list index out of range

Score for each location?

Would it be possible to have a score for each location found?
Sometimes it would be good to know why certain locations are in the results, for example:

import geograpy
places = geograpy.get_geoPlace_context(text="This sentence mention UK as country and London as city.")
places.countries
['United Kingdom', 'United States', 'Canada']
places.cities
['London']
places = geograpy.get_geoPlace_context(text="Jin Yin-tan Hospital, Wuhan, China.")
places.countries
['China', 'Mexico', 'United States']
places.cities
['China']

Something like the following scores could be interesting:
[('United Kingdom',0.99), ('United States',0.56), ('Canada',0.45)]

A confidence score could help to filter out such results.

[BUG] warning on loading the JSON Files

Describe the bug
pyLodStorage complains about list types in the attributes

To Reproduce
Initialize the Location Context

Expected behavior
No warnings.

The labels are a technical concept of wikidata and might be translated to something more general like "shortName". Having multiple shortnames is an exotic case and can be covered with the existing handling of e.g. wrongly written names.

grep labels regions_geograpy3.json -A2
            "labels": [
                "AK"
            ],
--
            "labels": [
                "AL"
            ],
--
            "labels": [
                "AR"
            ],
--
            "labels": [
                "AS"
            ],
--
            "labels": [
                "AZ"
            ],
--
            "labels": [
                "CA",
                "CA"
--
            "labels": [
                "CO"
            ],
--
            "labels": [
                "CT"
            ],
--
            "labels": [
                "DC"
            ],
--
            "labels": [
                "DE"
            ],
--
            "labels": [
                "FL"
            ],
--
            "labels": [
                "GA"
            ],
--
            "labels": [
                "GU"
            ],
--
            "labels": [
                "HI"
            ],
--
            "labels": [
                "IA"
            ],
--
            "labels": [
                "ID"
            ],
--
            "labels": [
                "IL"
            ],
--
            "labels": [
                "IN"
            ],
--
            "labels": [
                "KS"
            ],
--
            "labels": [
                "KY"
            ],
--
            "labels": [
                "LA"
            ],
--
            "labels": [
                "MA"
            ],
--
            "labels": [
                "MD"
            ],
--
            "labels": [
                "ME"
            ],
--
            "labels": [
                "MI"
            ],
--
            "labels": [
                "MN"
            ],
--
            "labels": [
                "MO"
            ],
--
            "labels": [
                "MP"
            ],
--
            "labels": [
                "MS"
            ],
--
            "labels": [
                "MT"
            ],
--
            "labels": [
                "NC"
            ],
--
            "labels": [
                "ND"
            ],
--
            "labels": [
                "NE"
            ],
--
            "labels": [
                "NH"
            ],
--
            "labels": [
                "NJ"
            ],
--
            "labels": [
                "NM"
            ],
--
            "labels": [
                "NV"
            ],
--
            "labels": [
                "NY"
            ],
--
            "labels": [
                "OH"
            ],
--
            "labels": [
                "OK"
            ],
--
            "labels": [
                "OR"
            ],
--
            "labels": [
                "PA"
            ],
--
            "labels": [
                "PR"
            ],
--
            "labels": [
                "RI"
            ],
--
            "labels": [
                "SC"
            ],
--
            "labels": [
                "SD"
            ],
--
            "labels": [
                "TN"
            ],
--
            "labels": [
                "TX"
            ],
--
            "labels": [
                "UM"
            ],
--
            "labels": [
                "UT"
            ],
--
            "labels": [
                "VA"
            ],
--
            "labels": [
                "VT"
            ],
--
            "labels": [
                "WA"
            ],
--
            "labels": [
                "WI"
            ],
--
            "labels": [
                "WV"
            ],
--
            "labels": [
                "WY"
            ],

Improve performance by avoiding ORM loading of all data

Is your feature request related to a problem? Please describe.
Loading the LocationContext as objects into main memory currently takes 4 seconds, which is completely unnecessary for most use cases where only a few lookups are needed.

Describe the solution you'd like
Allow lazy loading via properties, and look up entries without hash tables, as the SQLite-based solution did a few weeks ago.
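
The idea can be sketched as follows, assuming a SQLite file with a `countries` table that has `name` and `iso` columns. The class name, table layout and method are hypothetical and only illustrate lazy loading via a property; they are not geograpy3's actual LocationContext API.

```python
import sqlite3
from functools import cached_property


class LazyLocationLookup:
    """Hypothetical helper: opens the database on first use, queries per lookup."""

    def __init__(self, db_path):
        # Nothing is loaded here, so construction is cheap.
        self.db_path = db_path

    @cached_property
    def connection(self):
        # Created on first access only; later accesses reuse the same connection.
        conn = sqlite3.connect(self.db_path)
        conn.row_factory = sqlite3.Row
        return conn

    def country_by_name(self, name):
        """Fetch a single record by name instead of loading all objects."""
        row = self.connection.execute(
            "SELECT name, iso FROM countries WHERE name = ?", (name,)
        ).fetchone()
        return dict(row) if row else None
```

With an index on the `name` column, each lookup stays fast without any objects being held in memory up front.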

When will it support text written in Chinese?


Get closest location

Given a location, find the closest other location from a list of locations.
E.g. find the closest country for a city. (e.g. to check that the city is actually located in the country)
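
A brute-force version of this lookup can be sketched with the haversine great-circle distance. The function names are made up for this example and the coordinates are rounded, approximate values used only for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def closest(origin, candidates):
    """origin: (lat, lon); candidates: dict mapping name -> (lat, lon)."""
    return min(candidates, key=lambda name: haversine_km(*origin, *candidates[name]))


# Approximate coordinates, for illustration only.
vienna = (48.21, 16.37)
others = {
    "Berlin": (52.52, 13.40),
    "Paris": (48.86, 2.35),
    "Budapest": (47.50, 19.04),
}
print(closest(vienna, others))
```

This compares the origin against every candidate, which is fine for short lists; a spatial index becomes worthwhile once the list is large.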

Disambiguate via population, gdp data

When looking for non-unique cities, allow disambiguation via population and/or GDP per capita data.

Thus Vienna and Paris would find the major cities in Austria and France by default given the much higher population of these cities if no other disambiguation information is available.

[BUG]AttributeError: 'NoneType' object has no attribute 'name' on "Pristina, Kosovo"

Describe the bug

geograpy.get_geoPlace_context(text="Pristina, Kosovo")

leads to a Python error.

To Reproduce
Steps to reproduce the behavior:

def testIssue(self):
        '''
        test Issue
        '''    
        locality="Pristina, Kosovo"
        gp=geograpy.get_geoPlace_context(text=locality)
        if self.debug:
            print("  %s" % gp.countries)
            print("  %s" % gp.regions)
            print("  %s" % gp.cities)

File "/Users/wf/Documents/pyworkspace/geograpy3/geograpy/places.py", line 189, in set_cities
country_name = country.name
AttributeError: 'NoneType' object has no attribute 'name'

Expected behavior
Python should not choke on this although the political result may be disputed.

[BUG]wikidataid is not unique and labels are not handled as lists

Describe the bug
The wikidataid for countries, regions and cities should be unique. There might be multiple labels, but not multiple entries under the same id.

To Reproduce
Use the primary key support of pyLodStorage.

Expected behavior
Primary key should be set and index should be automatically created.
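
The expected behavior can be illustrated with plain `sqlite3` (not the pyLodStorage API, which is not shown here). The table layout and the `insert_country` helper are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Declaring a PRIMARY KEY makes SQLite enforce uniqueness and index the column.
conn.execute("CREATE TABLE countries (wikidataid TEXT PRIMARY KEY, name TEXT)")


def insert_country(connection, wikidataid, name):
    """Insert a row; return False if the wikidataid is already present."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO countries VALUES (?, ?)", (wikidataid, name)
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(insert_country(conn, "Q183", "Germany"))      # first entry is accepted
print(insert_country(conn, "Q183", "Deutschland"))  # duplicate id is rejected
```

Multiple labels for the same id would then have to live in a separate column or table rather than as duplicate rows.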

[BUG] ModuleNotFoundError: No module named 'lodstorage'

Describe the bug

When trying to use geograpy, I'm getting the following error:

.../geograpy/wikidata.py:7: in <module>
    from lodstorage.sparql import SPARQL
E   ModuleNotFoundError: No module named 'lodstorage'

To Reproduce

  1. pipenv install geograpy3
  2. NLTK_DATA=./nltk_data/ geograpy-nltk
  3. In my Python program:
from pathlib import Path

import nltk
nltk.data.path.append(Path(__file__).parent.parent.parent.parent / 'nltk_data')
import geograpy  # here it blows up with ModuleNotFoundError

print(geograpy.get_geoPlace_context(text='Prague'))

Expected behavior

I'm changing path to the NLTK data, but I don't think that's related. It blows up on not being able to find a module lodstorage, which indeed I can't see anywhere in the package nor in the requirements. Where does it come from? What is it?

Environment (please complete the following information):

  • OS: macOS Mojave
  • Python Version: 3.7.5

[BUG] location.db access fails within read-only docker container

Describe the bug
Geograpy fails with the following error if used on read-only docker containers:

Traceback (most recent call last):
  File "/app/index.py", line 152, in <module>
    places = geograpy.get_geoPlace_context(
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/__init__.py", line 24, in get_geoPlace_context
    places=get_place_context(url, text, labels=Labels.geo, debug=debug)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/__init__.py", line 46, in get_place_context
    pc = PlaceContext(e.places)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/places.py", line 29, in __init__
    super().__init__()
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/locator.py", line 184, in __init__
    self.sqlDB=SQLDB(self.db_file,errorDebug=True)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/lodstorage/sql.py", line 41, in __init__
    self.c=sqlite3.connect(dbname,detect_types=sqlite3.PARSE_DECLTYPES,check_same_thread=check_same_thread)
sqlite3.OperationalError: unable to open database file

To Reproduce
Use geograpy as part of a script inside of a read-only docker container (but with writable /tmp folder)

import nltk
import geograpy
nltk.downloader.download('maxent_ne_chunker')
nltk.downloader.download('words')
nltk.downloader.download('treebank')
nltk.downloader.download('maxent_treebank_pos_tagger')
nltk.downloader.download('punkt')
# since 2020-09
nltk.downloader.download('averaged_perceptron_tagger')
places = geograpy.get_geoPlace_context(text)

Expected behavior
Should not fail or there should be a way to configure the used directories and files.
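
A configurable directory could be resolved along the following lines. This is a sketch of one possible approach, not existing geograpy3 behavior: the environment variable name `GEOGRAPY3_DATA_DIR` and the function `resolve_data_dir` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_dir(env_var="GEOGRAPY3_DATA_DIR"):
    """Return the first writable directory from a list of candidates."""
    candidates = []
    configured = os.environ.get(env_var)  # hypothetical override
    if configured:
        candidates.append(Path(configured))
    # Per-user directory, then the system temp directory as a fallback.
    candidates.append(Path(os.path.expanduser("~")) / ".geograpy3")
    candidates.append(Path(tempfile.gettempdir()) / "geograpy3")
    for candidate in candidates:
        try:
            candidate.mkdir(parents=True, exist_ok=True)
        except OSError:
            continue  # e.g. read-only file system
        if os.access(candidate, os.W_OK):
            return candidate
    raise RuntimeError("no writable data directory found")
```

In a read-only container with a writable /tmp, the first two candidates would fail and the temp directory would be used.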

Environment (please complete the following information):

  • Python 3.9.6
  • Docker: python:3-slim
  • geograpy3==0.1.24

Additional context
Maybe use /tmp for temporary files instead of "random folder".
I also tried setting GEOGRAPY_DB=/tmp/loc.db but this fails with:

Traceback (most recent call last):
  File "/app/index.py", line 152, in <module>
    places = geograpy.get_geoPlace_context(
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/__init__.py", line 24, in get_geoPlace_context
    places=get_place_context(url, text, labels=Labels.geo, debug=debug)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/__init__.py", line 46, in get_place_context
    pc = PlaceContext(places)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/places.py", line 32, in __init__
    self.setAll()
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/places.py", line 87, in setAll
    self.set_countries()
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/places.py", line 98, in set_countries
    country=self.getCountry(place)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/geograpy/locator.py", line 1162, in getCountry
    countryRecords=self.sqlDB.query(query,params)
  File "/home/friendlyusername/.local/lib/python3.9/site-packages/lodstorage/sql.py", line 183, in query
    query = cur.execute(sqlQuery,params)
sqlite3.OperationalError: no such table: countries

geograpyIssue32

see ushahidi/geograpy#32

def testGeograpyIssue32(self):
        '''
        test https://github.com/ushahidi/geograpy/issues/32
        '''
        url = "https://www.politico.eu/article/italy-incurable-economy/" 
        places = geograpy.get_geoPlace_context(url = url) 
        print(places)
        self.assertEqual(['Rome', 'Brussels', 'Italy'],places.cities)

with result:

countries=['Censis', 'German Democratic Republic', 'Brussels', 'League', 'France', 'Italian', 'Italy', 'Deliveroo', 'Italians', 'Rome', 'European', 'EU', 'United States', 'Belgium']
regions=['Censis', 'Brussels', 'League', 'France', 'Germany', 'Italian', 'Italy', 'Deliveroo', 'Italians', 'Rome', 'European', 'EU']
cities=['Rome', 'Brussels', 'Italy']
other=[]

which is not optimal yet but better.

correctMisspelling should be False by default for Locator

Describe the bug
When correctMisspelling is used unwanted side effects happen.

To Reproduce
locate "Amsterdam, Netherlands" with correctMisspelling = True
Result: None, because the correction kicks in and wrongly changes the Netherlands to Netherlands Antilles.

Expected behavior
Amsterdam (NH(North Holland) - NL(Netherlands))

geograpy.locateCity("Berlin") is returning US instead of DE

Hi Admin,
first of all many thanks for this great library.

Here my issue:

import geograpy

def extract_country(input):
    city=geograpy.locateCity(input)
    country=city.country.iso
    return country

if __name__ == "__main__":
    print(extract_country("Berlin"))

As a result I get US.

Shouldn't it return DE instead?

Many thanks

[BUG]OperationalError: no such table: countries

Describe the bug
[OperationalError: no such table: countries]

---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
in
1 import geograpy
2 url = 'https://en.wikipedia.org/wiki/2012_Summer_Olympics'
----> 3 places = geograpy.get_geoPlace_context(url=url)

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/geograpy/__init__.py in get_geoPlace_context(url, text, debug)
22 PlaceContext: the place context
23 '''
---> 24 places=get_place_context(url, text, labels=Labels.geo, debug=debug)
25 return places
26

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/geograpy/__init__.py in get_place_context(url, text, labels, debug)
44 e.find_entities(labels=labels)
45 places=e.places
---> 46 pc = PlaceContext(places)
47 pc.setAll()
48 return pc

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/geograpy/places.py in __init__(self, place_names, setAll, correctMisspelling)
30 self.places = self.normalizePlaces(place_names)
31 if setAll:
---> 32 self.setAll()
33
34 def __str__(self):

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/geograpy/places.py in setAll(self)
85 Set all context information
86 '''
---> 87 self.set_countries()
88 self.set_regions()
89 self.set_cities()

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/geograpy/places.py in set_countries(self)
96 countries = []
97 for place in self.places:
---> 98 country=self.getCountry(place)
99 if country is not None:
100 countries.append(country.name)

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/geograpy/locator.py in getCountry(self, name)
1160 params=(name,name,)
1161 country = None
-> 1162 countryRecords=self.sqlDB.query(query,params)
1163 if len(countryRecords)==1:
1164 country=Country.fromRecord(countryRecords[0])

~/.conda/envs/WikiCOVID/lib/python3.7/site-packages/lodstorage/sql.py in query(self, sqlQuery, params)
181 cur=self.c.cursor()
182 if params is not None:
--> 183 query = cur.execute(sqlQuery,params)
184 else:
185 query = cur.execute(sqlQuery)

OperationalError: no such table: countries

To Reproduce
Steps to reproduce the behavior:

  1. Run the example on local, remote, colab, jupyter notebook

Expected behavior
The code should return a list of countries.


Environment (please complete the following information):

  • OS: MAC / UBUNTU / COLAB
  • Python Version 3.9 / 3 / ?


Jenkins nightly build fails

Traceback (most recent call last):
  File "geograpy/bin/geograpy-nltk", line 3, in <module>
    import nltk
ImportError: No module named nltk

env python still defaults to Python 2 on some systems ...

create BallTree of List of Locations

This is base functionality for efficient distance calculation and matching, since geocoordinates cannot be naturally sorted by lat/lon to achieve closest-neighbour comparisons.

API for lookup

Given a set of 1 to 3 (maybe 4) words, return a map/dict of tuples with city, region and country information (each with label and Wikidata Q-ID) and the probability of matching (ordering is good enough for a start). For the time being the probability may be calculated from the population; later we'll use the probability distribution of conference corpus entries that have been successfully matched.

Comparison of population is done on the "lowest" level. E.g. Athens, Greece is preferred to Athens, Georgia, USA since Athens, Greece has a population of 600 thousand while the population of Athens, Georgia is 5 times lower.
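
A population-based ordering could be sketched as below. The `rank_candidates` function and the record layout are hypothetical; the population figures are the rough numbers from the paragraph above (600 thousand, and one fifth of that), not authoritative data.

```python
def rank_candidates(candidates):
    """Order candidates by population and attach a population-share weight.

    candidates: list of dicts with 'city', 'region', 'country', 'population'.
    Returns a list of (candidate, weight) tuples, largest population first.
    """
    total = sum(c["population"] for c in candidates)
    ranked = sorted(candidates, key=lambda c: c["population"], reverse=True)
    return [(c, c["population"] / total if total else 0.0) for c in ranked]


# Rough figures taken from the description above, for illustration only.
athens = [
    {"city": "Athens", "region": "Georgia", "country": "United States",
     "population": 120_000},
    {"city": "Athens", "region": "Attica", "country": "Greece",
     "population": 600_000},
]

for candidate, weight in rank_candidates(athens):
    print(candidate["country"], round(weight, 2))
```

The weight is only a population share, so it orders the candidates but should not be read as a calibrated probability.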

United Kingdom not recognised as a country.

United Kingdom isn't matched as a country.

>>> import geograpy
>>> geograpy.get_geoPlace_context(text='United Kingdom').countries
[]
>>> geograpy.get_geoPlace_context(text='UK').countries
[]
>>> geograpy.get_geoPlace_context(text='Great Britain').countries
[]
>>> geograpy.get_geoPlace_context(text='GB').countries
[]
>>> geograpy.get_geoPlace_context(text='United States').countries
['United States']

Expected behavior
Should match some or all variations of United Kingdom.

Version
geograpy3 0.1.27

Brazil is not recognized as a country

Describe the bug

I'm not sure whether I'm using geograpy incorrectly or whether this is an issue. Brazil is not extracted at all when using get_place_context with an expression, and it is considered to be a US region when passed on its own.

Additionally, Aorus, which is definitely not a place name, is recognized as a geoEntity when using the extractor, instead of Brazil.

>>> geograpy.get_place_context(text='Aorus league 2021 Brazil').countries
[]

>>> geograpy.get_place_context(text='Aorus league 2021 Brazil').other
['Aorus']

>>> e = Extractor(text='Aorus League 2021 #1 Brazil')
>>> e.find_geoEntities()
['Aorus']

>>> e = Extractor(text='Brazil')
>>> e.find_geoEntities()
['Brazil']

>>> geograpy.get_place_context(text='Brazil').countries
['Brazil', 'United States']

>>> geograpy.get_place_context(text='Brazil').regions
['Brazil']

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Python Version: 3.8

Update Database

Hi,
I was thinking about ways to update the CSV database, or even to use other data sources such as OSM names. Is there any way to modify the data source used by this library? I know it reads data from CSV files, but what is the standard for creating those CSV files?
Thanks

geograpy.get_place_context uses a label set including PERSON and ORGANIZATION by default - offer an alternative

def testGetGeoPlace(self):
        '''
        test geo place handling
        '''
        url='http://www.bbc.com/news/world-europe-26919928'
        places=geograpy.get_place_context(url=url)
        if self.debug:
            print(places)

gives the result:

countries=['Russian', 'Russian Federation', 'Czech Republic', 'China', 'Milos Zeman', 'Nato', 'Crimea', 'Steve Rosenberg', 'Oleksandr Turchynov', 'AFP', 'Ukrainian', 'Council', 'European', 'Arsen Avakov', 'Central African Republic', 'Andriy Deshchytsya', 'John Kerry', 'Vitaly Yarema', 'Afghanistan', 'Kharkiv', 'Daniel Sandford', 'BBC Moscow', 'Moscow', 'Sergei Lavrov', 'US Miscellaneous Pacific Islands', 'Brussels', "Donetsk 'republic", 'Related Topics Russia Nato Ukraine', 'Viktor Yanukovych', 'Luhansk', 'European Union', 'Interim', 'Ekho Moskvy', 'Footage', 'Kiev', 'Valentyn Nalyvaychenko', 'Eastern Ukraine', 'Security Service', 'First', 'Crimean', 'Donetsk', 'Online', 'Donetsk Region People', 'Lithuania', 'Andriy Parubiy', 'Arseniy Yatsenyuk', 'Turchynov', 'Ukraine', 'Ukrainian National Security', 'Belgium', 'United States']
regions=['Russian', 'Milos Zeman', 'Nato', 'Crimea', 'Steve Rosenberg', 'Oleksandr Turchynov', 'AFP', 'Czech', 'Ukrainian', 'Council', 'European', 'Arsen Avakov', 'Andriy Deshchytsya', 'John Kerry', 'Vitaly Yarema', 'Kharkiv', 'Daniel Sandford', 'BBC Moscow', 'Moscow', 'Sergei Lavrov', 'Brussels', 'State', "Donetsk 'republic", 'Related Topics Russia Nato Ukraine', 'Viktor Yanukovych', 'Luhansk', 'European Union', 'Republic', 'Interim', 'Ekho Moskvy', 'Footage', 'Kiev', 'Valentyn Nalyvaychenko', 'Eastern Ukraine', 'Security Service', 'US', 'First', 'Russia', 'Crimean', 'Donetsk', 'Online', 'Donetsk Region People', 'Lithuania', 'Andriy Parubiy', 'Arseniy Yatsenyuk', 'Turchynov', 'People', 'Ukraine', 'Ukrainian National Security']
cities=['Moscow', 'Donetsk', 'Brussels', 'Kharkiv', 'Republic', 'Council', 'Russia']
other=[]

add new geograpy.get_geoPlace_context

def testGetGeoPlace(self):
        '''
        test geo place handling
        '''
        url='http://www.bbc.com/news/world-europe-26919928'
        places=geograpy.get_geoPlace_context(url=url)
        if self.debug:
            print(places)
        self.assertEqual(['Moscow', 'Donetsk', 'Brussels', 'Kharkiv', 'Russia'],places.cities)

using only the GPE label so that the result is:

countries=['Ukrainian', 'Moscow', 'Lithuania', 'Brussels', 'Crimean', 'Online', 'Luhansk', 'Kharkiv', 'Donetsk', 'European', 'Russian Federation', 'Kiev', 'Russian', 'Crimea', 'Ukraine', 'Belgium', 'United States']
regions=['Ukrainian', 'Moscow', 'Lithuania', 'Brussels', 'Crimean', 'Russia', 'Online', 'Luhansk', 'Kharkiv', 'Donetsk', 'European', 'Kiev', 'Russian', 'Crimea', 'Ukraine']
cities=['Moscow', 'Donetsk', 'Brussels', 'Kharkiv', 'Russia']
other=[]

Make docs available via release process

scripts/doc already does some basic work but might need some love e.g. fixing the error

WARNING: autodoc: failed to import module 'setup'; the following exception was raised:
Traceback (most recent call last):
File "/Users/wf/Documents/pyworkspace/geograpy3/setup.py", line 5, in
with open('README.md', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/wf/Library/Python/3.8/lib/python/site-packages/sphinx/ext/autodoc/importer.py", line 66, in import_module
return importlib.import_module(modname)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 783, in exec_module
File "", line 219, in _call_with_frames_removed
File "/Users/wf/Documents/pyworkspace/geograpy3/setup.py", line 9, in
long_description = open('README.md').read()
FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

and linking back to the pypi information. The result might also show up in readthedocs.

[BUG] Test fails with SQL error attempt to write a readonly database in line 1 on Ubuntu

SQL error attempt to write a readonly database in line 1:
	CREATE TABLE cityPops(city TEXT,cityLabel TEXT,cityPop FLOAT,geoNameId TEXT,country TEXT,countryLabel TEXT,countryIsoCode TEXT,countryPopulation FLOAT)
SQL error no such table: cityPops in line 2:
	INSERT INTO "cityPops" VALUES('http://www.wikidata.org/entity/Q2734','Skierniewice',47837.0,'759123','http://www.wikidata.org/entity/Q36','Poland','PL',38454576.0)
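The first message can be reproduced with the standard sqlite3 module alone by opening a database in read-only mode and then issuing a CREATE TABLE. The sketch below only demonstrates how the message arises; the actual cause on Ubuntu (for example, a database file or directory without write permission) is an assumption that would need to be checked on the affected machine. The function name `create_table` and the shortened column list are illustrative.

```python
import sqlite3
from pathlib import Path


def create_table(db_path, table, read_only=False):
    """Try to create `table` in the database at `db_path`.

    Returns "ok" on success or the SQLite error message on failure.
    """
    mode = "ro" if read_only else "rwc"  # rwc = read, write, create if missing
    uri = f"{Path(db_path).resolve().as_uri()}?mode={mode}"
    conn = sqlite3.connect(uri, uri=True)
    try:
        conn.execute(f"CREATE TABLE {table}(city TEXT, cityLabel TEXT)")
        conn.commit()
        return "ok"
    except sqlite3.OperationalError as err:
        return str(err)
    finally:
        conn.close()
```

Because the CREATE TABLE fails, the table never exists, which is why the following INSERT in the log reports "no such table: cityPops".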

Tweaking country recognition

Hi,
sometimes additional countries are being recognized by geograpy that were missing in the input:

   > text="France, Hungary, Poland, Spain, United Kingdom"
   > print(f"{text}  vs. {str(sorted(geograpy.get_geoPlace_context(text=text).countries))}")
   > France, Hungary, Poland, Spain, United Kingdom  vs. ['France', 'Hungary', 'Poland', 'Spain', 'United Kingdom', 'United States']

or standard country mentions are not picked up at all:

  > text="Bulgaria 3, Croatia 2, Czech Republic 1, Hungary 3"
  > Bulgaria 3, Croatia 2, Czech Republic 1, Hungary 3  vs. ['Bulgaria', 'Croatia']

I assume this is caused by suboptimal NER results from the underlying NLTK library.
Could you please recommend a process to improve the precision and reliability?
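One possible workaround for the first problem (extra countries), independent of geograpy itself, is to post-filter the returned names and keep only those that literally occur in the input text. This is a minimal sketch, not a fix for the underlying recognition step. The function name `keep_literal_mentions` is hypothetical, and the filter would also drop legitimate matches that were found through an alias or adjective (for example "UK" or "Russian").

```python
import re


def keep_literal_mentions(text, candidates):
    """Return the candidates whose exact name appears in `text` as whole words."""
    kept = []
    for name in candidates:
        # \b anchors prevent matching a name inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            kept.append(name)
    return kept


text = "France, Hungary, Poland, Spain, United Kingdom"
found = ["France", "Hungary", "Poland", "Spain", "United Kingdom", "United States"]
print(keep_literal_mentions(text, found))
```

For the example above, 'United States' is dropped because it does not occur in the text. The filter does nothing for the second problem, where mentions such as 'Czech Republic' are missed entirely.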

[BUG] downloads are done on every call instead of just once

Describe the bug
The OpenResearch tests show multiple download actions:

Downloading /Users/wf/.geograpy3/countries_geograpy3.json.gz from https://raw.githubusercontent.com/wiki/somnathrakshit/geograpy3/data/countries_geograpy3.json.gz ... this might take a few seconds
unzipping /Users/wf/.geograpy3/countries_geograpy3.json from /Users/wf/.geograpy3/countries_geograpy3.json.gz
warning: unsupported type <class 'list'> for field labels
Downloading /Users/wf/.geograpy3/regions_geograpy3.json.gz from https://raw.githubusercontent.com/wiki/somnathrakshit/geograpy3/data/regions_geograpy3.json.gz ... this might take a few seconds
unzipping /Users/wf/.geograpy3/regions_geograpy3.json from /Users/wf/.geograpy3/regions_geograpy3.json.gz
warning: unsupported type <class 'list'> for field labels
Downloading /Users/wf/.geograpy3/cities_geograpy3.json.gz from https://raw.githubusercontent.com/wiki/somnathrakshit/geograpy3/data/cities_geograpy3.json.gz ... this might take a few seconds
unzipping /Users/wf/.geograpy3/cities_geograpy3.json from /Users/wf/.geograpy3/cities_geograpy3.json.gz

To Reproduce
Run the OpenResearch tests.

Expected behavior
The download should happen only once.
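The expected behavior can be sketched as a download-once guard that checks for an existing, non-empty local file before fetching. This is an illustration of the idea, not the library's actual download code. The name `ensure_downloaded` and its `fetch` parameter are assumptions introduced so that the logic can be exercised without network access.

```python
import os
import urllib.request


def ensure_downloaded(url, target_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `target_path` unless a non-empty file is already there.

    Returns True if a download was performed and False if the cached file was reused.
    """
    if os.path.isfile(target_path) and os.path.getsize(target_path) > 0:
        return False
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    fetch(url, target_path)
    return True
```

A guard like this does not detect stale or partially written files; a real implementation might additionally compare sizes or checksums, or write to a temporary file and rename it once the download has completed.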

[BUG] Country by name disambiguation

Examples of wrong location lookups:

San Juan, Puerto Rico -> San Juan (J(San Juan) - AR(Argentina))
Puebla, Mexico -> Mexico (MO(Missouri) - US(United States))
Newcastle, UK  -> Newcastle (NSW(New South Wales) - AU(Australia))
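In each of these lookups the part after the comma names the intended country, yet a same-named city elsewhere is returned. One way to use that hint, sketched under the assumption that several candidate matches are available for a city name, is to prefer the candidate whose country name or ISO code equals the last comma-separated part of the query. The `Candidate` class and `pick_by_country_hint` function are hypothetical and do not reflect the library's internal data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    city: str
    country: str
    iso: str
    region: str = ""


def pick_by_country_hint(query: str, candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the candidate whose country matches the part after the last comma."""
    parts = [p.strip() for p in query.split(",")]
    if len(parts) < 2:
        return None  # no country hint in the query
    city, hint = parts[0].casefold(), parts[-1].casefold()
    for cand in candidates:
        same_city = cand.city.casefold() == city
        same_country = hint in (cand.country.casefold(), cand.iso.casefold())
        if same_city and same_country:
            return cand
    return None


candidates = [
    Candidate("San Juan", "Argentina", "AR", region="J"),
    Candidate("San Juan", "Puerto Rico", "PR"),
]
print(pick_by_country_hint("San Juan, Puerto Rico", candidates))
```

Abbreviations such as "UK" would additionally need an alias table, since "UK" matches neither the name "United Kingdom" nor the ISO code "GB".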
