normalize_country's Introduction

NormalizeCountry

Convert country names and codes to a standard.

Overview

require "normalize_country"

NormalizeCountry("America")                       # "United States"
NormalizeCountry("United States of America")      # "United States"
NormalizeCountry("USA",  :to => :official)        # "United States of America"
NormalizeCountry("Iran", :to => :official)        # "Islamic Republic of Iran"
NormalizeCountry("U.S.", :to => :alpha2)          # "US"
NormalizeCountry("U.S.", :to => :numeric)         # "840"
NormalizeCountry("US",   :to => :fifa)            # "USA"
NormalizeCountry("US",   :to => :emoji)           # "🇺🇸"
NormalizeCountry("US",   :to => :shortcode)       # ":flag-us:"
NormalizeCountry("Iran", :to => :alpha3)          # "IRN"
NormalizeCountry("Iran", :to => :ioc)             # "IRI"
NormalizeCountry("DPRK", :to => :short)           # "North Korea"
NormalizeCountry("North Korea", :to => :iso_name) # "Korea, Democratic People's Republic Of"

# Or
NormalizeCountry.convert("U.S.", :to => :alpha2)  # "US"

# Set the default
NormalizeCountry.to = :alpha3
NormalizeCountry.convert("Mexico")                 # "MEX"
NormalizeCountry.convert("United Mexican States")  # "MEX"

Installation

Rubygems (part of Ruby):

gem install normalize_country

Bundler:

gem "normalize_country"

Supported Conversions

In addition to trying to convert from common, non-standardized names and abbreviations, NormalizeCountry will convert to/from the following:

:alpha2

ISO 3166-1 alpha-2

:alpha3

ISO 3166-1 alpha-3

:emoji

The country’s emoji

:fifa

FIFA (International Federation of Association Football)

:ioc

International Olympic Committee

:iso_name

Country name used by ISO 3166-1

:numeric

ISO 3166-1 numeric code

:official

The country’s official name

:short

A shortened version of the country’s name, commonly used when speaking and/or writing (US English)

:shortcode

Emoji shortcode

A list of valid formats can be obtained by calling NormalizeCountry.formats.

Obtaining an Array or Hash

NormalizeCountry.to_a                              # Defaults to NormalizeCountry.to
NormalizeCountry.to_a(:ioc)                        # Array of IOC codes in ascending order
NormalizeCountry.to_h(:ioc)                        # :ioc => NormalizeCountry.to
NormalizeCountry.to_h(:ioc, :to => :numeric)       # :ioc => :numeric

Conversion Utility

A small script is included that can convert country names contained in a DB table or a set of XML or CSV files:

shell > normalize_country -h
usage: normalize_country [options] SOURCE
    -h, --help                       Show this message
    -f, --format FORMAT              The format of SOURCE
    -t, --to CONVERSION              Convert country names to this format (see docs for valid formats)
    -l, --location LOCATION          The location of the conversion

Some examples

normalize_country -t alpha2 -l 'Country Name' -f csv data.csv
normalize_country -t numeric -l countries.code -f db postgres://usr:pass@localhost/conquests
normalize_country -t fifa -l "//teams[@sport = 'fútbol americano']//country" -f xml data.xml

If the format is xml or csv you can specify a directory instead of a filename:

normalize_country -t alpha2 -l 'Country Name' -f csv /home/sshaw/capital-losses/2008

With a format of csv it will read all files with an extension of csv or tsv. For csv and xml the original file(s) will be overwritten with new file(s) containing the converted country names.

To convert an XML file with namespaces just include the namespace prefix defined in the file in the XPath query (LOCATION).

The db format’s SOURCE argument must be a Sequel connection string. Here LOCATION is in the format table.column, which will be updated with the converted name.

Random Country Data for Your Tests

Random data generating gems like Faker and RandomData don’t generate much country data. If you’d like to use this gem to do so I suggest checking out this gist: gist.github.com/sshaw/6068404

Faulty/Missing/Erroneous Country Names

Please submit a patch or open an issue.

Why?

This code was, to some extent, part of a larger project that allowed users to perform a free-text search by country. Country names were stored in the DB by their ISO names.

Several years later at work we had to extract country names from a web service that didn’t standardize them. Sometimes they used UK, other times U.K. It then occurred to me that this code could be useful outside of the original project. The web service was fixed but, nevertheless…

Somewhat Similar Gems

Upon further investigation I’ve found the following:

  • Carmen: ISO country names and states/subdivisions

  • countries: ISO country names, states/subdivisions, currency, E.164 phone numbers and language translations

  • country_codes: ISO country names and currency data

  • i18n_data: ISO country names in different languages, includes alpha codes

  • ModelUN: Similar to this gem but with less support for conversion, it does include US states

See Also

normalize_country's Issues

Problems with punctuation and order

I've a dataset which has some problematic names. Specifically:

  • Palestine, State of
  • Côte d’Ivoire

The en.yml data file contains these relevant entries:

PS:
  aliases:
  - Palestinian Territories
  - Palestinian Territory
  alpha2: PS
  alpha3: PSE
  fifa: PLE
  ioc: PLE
  iso_name: Palestinian Territory, Occupied
  numeric: "275"
  official: State of Palestine
  short: Palestine
  emoji: "\U0001F1F5\U0001F1F8"
  shortcode: ":flag-ps:"
CI:
  alpha2: CI
  alpha3: CIV
  fifa: CIV
  ioc: CIV
  iso_name: Côte D'Ivoire
  numeric: "384"
  official: Republic of Côte D'Ivoire
  short: Ivory Coast
  emoji: "\U0001F1E8\U0001F1EE"
  shortcode: ":flag-ci:"

So it's a "close but no cigar" situation in both cases. I'm not sure how to solve this.

I'm wondering if the library should erase punctuation and flatten to ASCII when comparing? This would handle the different choice of apostrophe and any missing/altered accents in Côte D'Ivoire, but perhaps that goes too far. I can't currently think of country names it would break, but that's not to say there aren't any. And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with D'Ivoire (French).
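A minimal sketch of the "erase punctuation and flatten to ASCII" idea, using only Ruby's standard String methods. The helper name `comparison_key` is hypothetical and is not part of normalize_country; it only illustrates how two spellings could be reduced to the same key before comparing. Treating dashes as spaces and deleting other punctuation is one possible choice, not the only one.

```ruby
# Hypothetical helper, not part of normalize_country's API.
# Reduces a name to a key that ignores accents, case and punctuation.
def comparison_key(name)
  name.unicode_normalize(:nfd)       # split "ô" into "o" + combining accent
      .gsub(/\p{Mn}/, "")            # drop the combining accents
      .gsub(/\p{Pd}/, " ")           # treat hyphens and dashes as word breaks
      .gsub(/[^\p{Alnum}\s]/, "")    # erase remaining punctuation, both apostrophe styles included
      .downcase
      .split
      .join(" ")                     # collapse runs of whitespace
end

comparison_key("Côte d’Ivoire") == comparison_key("Côte D'Ivoire") # both spellings give the same key
```

With this choice "U.S." and "US" also share a key, and a hyphenated name gets the same key as its spaced form. Names that differ by a whole word would still need aliases.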

There are other names with an apostrophe. These are going to be problematic, considering the general populace's facility with using punctuation. Likewise punctuation as in Bosnia-Herzegovina, Guinea-Bissau or accents as in Åland Islands, and just alternative spellings like Faeroes.

Palestine, State of does what some of the other names do, putting the main name first and any qualifiers like "State of" after a comma. But it doesn't match in this case. I think this is harder; removing punctuation is one thing, re-arranging word order is another.
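One way the word-order problem could be approached is to generate a second candidate with the qualifier moved back in front, and try both when looking a name up. This is only a sketch; `name_candidates` is a hypothetical helper, not something the gem provides.

```ruby
# Hypothetical helper, not part of normalize_country's API.
# For "Main name, qualifier" input, also offer "qualifier Main name".
def name_candidates(name)
  main, qualifier = name.strip.split(/\s*,\s*/, 2)
  return [name] if qualifier.nil? || qualifier.empty?
  [name, "#{qualifier} #{main}"]
end

name_candidates("Palestine, State of") # the second candidate is "State of Palestine"
```

The second candidate matches the official value shown in the en.yml entry above, so a lookup that tries each candidate in turn would find it.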

I see elsewhere in en.yml there are aliases. Perhaps that's a better solution, adding a lot of aliases?

Add emoji as a target

NormalizeCountry("United States", :to => :emoji) # returns 🇺🇸
NormalizeCountry("Brazil", :to => :emoji) # returns nil
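For entries that have no emoji value, the flag could in principle be derived from the alpha-2 code, since flag emoji are pairs of Unicode regional indicator symbols that mirror the two letters (the PS entry in en.yml is U+1F1F5 U+1F1F8, that is, P then S). The sketch below assumes that approach; `flag_emoji` is a hypothetical helper and not how the gem is known to work.

```ruby
# Hypothetical helper, not part of normalize_country's API.
# Maps each letter of an alpha-2 code to its regional indicator symbol.
def flag_emoji(alpha2)
  code = alpha2.to_s.upcase
  return nil unless code.match?(/\A[A-Z]{2}\z/)
  base = 0x1F1E6 # REGIONAL INDICATOR SYMBOL LETTER A
  code.each_char.map { |c| base + (c.ord - "A".ord) }.pack("U*")
end

flag_emoji("BR") # the two code points U+1F1E7 U+1F1F7
```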

Handling obsolete, historic codes like ISO 3166-1:AN

The country code AN (Netherlands Antilles) was removed in 2010 per ISO 3166-1 Newsletter VI-8.

I'm not sure what the best way would be to handle such cases.

The easiest option would be to remove that entry from the list (as it isn't part of the standard anymore) and be done with it.

On the other hand it might be feasible to introduce an obsoleted_by property referencing the document mentioned above and add an optional parameter to relevant methods whether to include or exclude obsolete records. Or maybe even show obsolete records only? The default behavior should most likely reflect the current standard and exclude obsolete records and show only the active ones.
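The second option might look roughly like the sketch below. Everything in it is hypothetical: the `obsoleted_by` key, the `obsolete:` parameter and the sample data are stand-ins for illustration, not part of the gem or its data file.

```ruby
# Illustrative records only; `obsoleted_by` is the proposed, hypothetical key.
def sample_countries
  {
    "AN" => { short: "Netherlands Antilles", obsoleted_by: "ISO 3166-1 Newsletter VI-8" },
    "NL" => { short: "Netherlands" }
  }
end

# obsolete: :exclude (default, current standard only), :include, or :only
def select_countries(countries, obsolete: :exclude)
  case obsolete
  when :exclude then countries.reject { |_, c| c.key?(:obsoleted_by) }
  when :include then countries.dup
  when :only    then countries.select { |_, c| c.key?(:obsoleted_by) }
  else raise ArgumentError, "unknown obsolete option: #{obsolete.inspect}"
  end
end

select_countries(sample_countries).keys                  # only the active record, "NL"
select_countries(sample_countries, obsolete: :only).keys # only the withdrawn record, "AN"
```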

ISO_3166-1_alpha-2 CW (Curaçao) is missing

thx for your great work so far!
can you plz add Curaçao to your list?

CW:
  aliases:
  - Curacao
  alpha2: CW
  alpha3: CUW
  fifa: CUW
  ioc:
  iso_name: Curaçao
  numeric: "533"
  official: Curaçao
  short: Curaçao
  emoji:
