
locationtagger's Introduction

locationtagger

version 0.0.1

Detect and extract locations (Countries, Regions/States & Cities) from text or URL. Also, find relationships among countries, regions & cities.


About Project

In the field of Natural Language Processing, many algorithms have been developed for different types of syntactic & semantic analysis of textual data. NER (Named Entity Recognition) is one of the most frequently needed tasks in real-world text mining problems, and it relies on grammar-based rules & statistical modelling approaches. An entity extracted by NER can be the name of a person, place, organization or product. locationtagger goes one step further: it tags & filters out the place names (locations) among all the entities found with NER.

The approach followed is shown in the diagram below:

https://github.com/kaushiksoni10/locationtagger/blob/master/locationtagger/data/diagram.jpg?raw=true Approach


Install and Setup

(Environment: python >= 3.5)

Install the package using pip:

pip install locationtagger

Before installing the package, install the libraries listed below:

nltk

spacy

newspaper3k

pycountry
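As a quick sanity check before installing, the sketch below reports which of the libraries above are not yet importable. The helper name missing_modules is hypothetical and not part of locationtagger; the only assumption about the libraries themselves is their import names (newspaper3k is imported as newspaper).

```python
import importlib.util

# Import names for the libraries listed above. newspaper3k is imported
# as "newspaper"; the other three import under their package names.
REQUIRED_MODULES = ["nltk", "spacy", "newspaper", "pycountry"]


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Install these first:", ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

This only checks that the modules can be located; it does not verify that the nltk & spacy data downloads have been done.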

After installing these packages, some nltk & spacy modules still need to be downloaded. Run the commands given in /locationtagger/bin/locationtagger-nltk-spacy from an IPython shell or Jupyter notebook.


Usage

After installing the package, import the module and pass some text or a URL as input.

Text as input

import locationtagger

text = "Unlike India and Japan, A winter weather advisory remains in effect through 5 PM along and east of a line from Blue Earth, to Red Wing line in Minnesota and continuing to along an Ellsworth, to Menomonie, and Chippewa Falls line in Wisconsin."

entities = locationtagger.find_locations(text = text)


Now we can grab all the place names present in the above text,

entities.countries

['India', 'Japan']

entities.regions

['Minnesota', 'Wisconsin']

entities.cities

['Ellsworth', 'Red Wing', 'Blue Earth', 'Chippewa Falls', 'Menomonie']


Apart from the places extracted from the text, we can also find the countries that these extracted cities & regions belong to,

entities.country_regions

{'United States': ['Minnesota', 'Wisconsin']}

entities.country_cities

{'United States': ['Ellsworth', 'Red Wing', 'Blue Earth', 'Chippewa Falls', 'Menomonie']}


Since "United States" is not present in the text but was inferred from its relations to the cities & regions that are, we can find it in other_countries,

entities.other_countries

['United States']
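The relationship between these attributes can be illustrated with plain Python. The sketch below is not locationtagger's implementation; infer_other_countries is a hypothetical helper that reproduces the idea using the outputs shown above: any country that appears as a key in the relation dictionaries but was not found in the text counts as an "other" country.

```python
def infer_other_countries(countries, country_regions, country_cities):
    """Collect countries implied by regions/cities but absent from the text."""
    seen = set(countries)
    inferred = []
    for relation in (country_regions, country_cities):
        for country in relation:
            if country not in seen:
                seen.add(country)  # avoid listing the same country twice
                inferred.append(country)
    return inferred


# Values copied from the example outputs above.
countries = ['India', 'Japan']
country_regions = {'United States': ['Minnesota', 'Wisconsin']}
country_cities = {'United States': ['Ellsworth', 'Red Wing', 'Blue Earth',
                                    'Chippewa Falls', 'Menomonie']}

print(infer_other_countries(countries, country_regions, country_cities))
# ['United States']
```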


If we want to look more closely at the cities found in the text, we can find which regions of the world each of them may fall in,

entities.region_cities

{'Maine': ['Ellsworth'], 'Minnesota': ['Red Wing', 'Blue Earth'], 'Wisconsin': ['Ellsworth', 'Chippewa Falls', 'Menomonie'], 'Pennsylvania': ['Ellsworth'], 'Michigan': ['Ellsworth'], 'Illinois': ['Ellsworth'], 'Kansas': ['Ellsworth'], 'Iowa': ['Ellsworth']}
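A city name such as Ellsworth maps to many candidate regions. One possible way to narrow the candidates, sketched below, is to prefer regions that were themselves mentioned in the text. The function narrow_regions is hypothetical and not part of locationtagger; it works only on the dictionary and list shown above.

```python
def narrow_regions(region_cities, mentioned_regions):
    """Map each city to its candidate regions, preferring mentioned ones."""
    # Invert region -> cities into city -> candidate regions.
    candidates = {}
    for region, cities in region_cities.items():
        for city in cities:
            candidates.setdefault(city, []).append(region)

    narrowed = {}
    for city, regions in candidates.items():
        preferred = [r for r in regions if r in mentioned_regions]
        # Fall back to every candidate when no mentioned region matches.
        narrowed[city] = preferred or regions
    return narrowed


# Values copied from the example outputs above.
region_cities = {
    'Maine': ['Ellsworth'],
    'Minnesota': ['Red Wing', 'Blue Earth'],
    'Wisconsin': ['Ellsworth', 'Chippewa Falls', 'Menomonie'],
    'Pennsylvania': ['Ellsworth'], 'Michigan': ['Ellsworth'],
    'Illinois': ['Ellsworth'], 'Kansas': ['Ellsworth'], 'Iowa': ['Ellsworth'],
}
regions = ['Minnesota', 'Wisconsin']

narrowed = narrow_regions(region_cities, regions)
print(narrowed['Ellsworth'])  # ['Wisconsin']
```

This is a heuristic: a text that mentions Wisconsin could still be referring to Ellsworth, Maine.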


The regions inferred this way are collected in other_regions (in this example the list also includes Minnesota and Wisconsin, which do appear in the text),

entities.other_regions

['Maine', 'Minnesota', 'Wisconsin', 'Pennsylvania', 'Michigan', 'Illinois', 'Kansas', 'Iowa']


Most of the words that nltk & spacy pick up from the original text as named entities end up in cities, regions & countries. The remaining words (not recognized as place names) are stored in other.

entities.other

['winter', 'PM', 'Chippewa']

URL as Input

Similarly, it can grab places from URLs too,

URL = 'https://edition.cnn.com/2020/01/14/americas/staggering-number-of-human-rights-defenders-killed-in-colombia-the-un-says/index.html'
entities2 = locationtagger.find_locations(url = URL)


The outputs we get are as follows. Countries:

entities2.countries

['Switzerland', 'Colombia']


Regions:

entities2.regions

['Geneva']


Cities:

entities2.cities

['Geneva', 'Colombia']
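In this example the same name can land in more than one category: Geneva is listed as both a region and a city, and Colombia as both a country and a city. A small sketch for flagging such overlaps, using only the lists shown above (overlapping_names is a hypothetical helper, not part of locationtagger):

```python
def overlapping_names(countries, regions, cities):
    """Return names that appear in more than one of the three categories."""
    country_set, region_set, city_set = set(countries), set(regions), set(cities)
    overlaps = (
        (country_set & region_set)
        | (country_set & city_set)
        | (region_set & city_set)
    )
    return sorted(overlaps)


# Values copied from the example outputs above.
url_countries = ['Switzerland', 'Colombia']
url_regions = ['Geneva']
url_cities = ['Geneva', 'Colombia']

print(overlapping_names(url_countries, url_regions, url_cities))
# ['Colombia', 'Geneva']
```

Flagged names are worth reviewing by hand, since the category lists alone do not say which reading the page intends.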


To check how many times each place is mentioned, or which places are mentioned most often across the whole page, we can use the mention counts. They give an idea of which location the page is talking about.

Most commonly mentioned countries:

entities2.country_mentions

[('Colombia', 3), ('Switzerland', 1), ('United States', 1), ('Mexico', 1)]


And most commonly mentioned cities:

entities2.city_mentions

[('Colombia', 3), ('Geneva', 1)]
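The mention lists are (name, count) pairs, so picking out the place a page is mostly about takes only a few lines. The sketch below assumes nothing beyond that list-of-tuples shape; dominant_location is a hypothetical helper, not part of locationtagger.

```python
def dominant_location(mentions):
    """Return (name, share of all mentions) for the most mentioned place."""
    if not mentions:
        return None
    total = sum(count for _, count in mentions)
    # max() keeps the first pair on ties, matching the list order.
    name, count = max(mentions, key=lambda pair: pair[1])
    return name, count / total


# Values copied from the example output above.
country_mentions = [('Colombia', 3), ('Switzerland', 1),
                    ('United States', 1), ('Mexico', 1)]

print(dominant_location(country_mentions))
# ('Colombia', 0.5)
```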


Credits

locationtagger uses data from the following source for country, region & city lookups,

GEOLITE2 free downloadable database

Apart from the well-known NLP libraries NLTK & spacy, locationtagger uses the following libraries,

pycountry

newspaper3k


locationtagger's Issues

Some U.S. Addresses Not Recognized

The tagger will not recognize even the city and state in the following valid U.S. address:
826 E. Route 66 Glendora, CA
Adding a comma after 66 (which is not a solution) enables the tagger to find the city.

Old-city names are not recognised. E.g. Mumbai is recognised, but not Bombay

import locationtagger

sample_text2 = "Didma Rangoon calcutta Kolkata and missing India Yangon ,Bombay, Mumbai"

# extracting entities
place_entity = locationtagger.find_locations(text = sample_text2)

# getting all countries
print("The countries in text : ")
print(place_entity.countries)

# getting all states
print("The states in text : ")
print(place_entity.regions)

# getting all cities
print("The cities in text : ")
print(place_entity.cities)

# other regions
print("Other regions")
print(place_entity.other_regions)
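Until old names are supported, one possible workaround is to rewrite them before tagging. The sketch below is only an illustration: the ALIASES table and normalize_place_names are hypothetical and not part of locationtagger, and it assumes the modern names are the ones the tagger recognizes (the issue confirms this only for Mumbai).

```python
import re

# Hypothetical alias table: old name -> modern name.
ALIASES = {
    "Bombay": "Mumbai",
    "Calcutta": "Kolkata",
    "Rangoon": "Yangon",
}


def normalize_place_names(text, aliases=ALIASES):
    """Replace whole-word, case-insensitive matches of old place names."""
    if not aliases:
        return text
    lookup = {old.lower(): new for old, new in aliases.items()}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(old) for old in aliases) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


print(normalize_place_names("Rangoon calcutta and Bombay"))
# Yangon Kolkata and Mumbai
```

The normalized string can then be passed to locationtagger.find_locations(text = ...) as in the usage examples.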

Iran, is not recognized as a country

When I ran the location tagger with "I am living in Iran", Iran was never detected as a country (Iraq, Afghanistan and Italy all are).
If it is replaced with "Islamic republic of Iran" then it works... why?
