
countrycode's Introduction



countrycode standardizes country names, converts them into ~40 different coding schemes, and assigns region descriptors. Scroll down for more details or visit the countrycode CRAN page.

If you use countrycode in your research, we would be very grateful if you could cite our paper:

Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman (2018). countrycode: An R package to convert country names and country codes. Journal of Open Source Software, 3(28), 848, https://doi.org/10.21105/joss.00848

Why countrycode?

The Problem

Different data sources use different coding schemes to represent countries (e.g. CoW or ISO). This poses two main problems: (1) some of these coding schemes are less than intuitive, and (2) merging these data requires converting from one coding scheme to another, or from long country names to a coding scheme.

The Solution

The countrycode function can convert to and from 40+ different country coding schemes, and to 600+ variants of country names in different languages and formats. It uses regular expressions to convert long country names (e.g. Sri Lanka) into any of those coding schemes or country names. It can create new variables with various regional groupings.
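
As a minimal sketch of the basic call, assuming the package is installed (the `requireNamespace` guard is only there so the snippet runs cleanly when it is not):

```r
# Minimal usage sketch; the guard skips the calls when countrycode is absent.
if (requireNamespace("countrycode", quietly = TRUE)) {
  library(countrycode)

  # Long country name -> ISO 3-character code (matched by regular expression)
  iso <- countrycode("Sri Lanka", origin = "country.name", destination = "iso3c")
  print(iso)  # "LKA"

  # One coding scheme -> another
  print(countrycode(c("LKA", "FRA"), origin = "iso3c", destination = "iso2c"))

  # Code -> regional grouping
  print(countrycode("LKA", origin = "iso3c", destination = "continent"))
}
```

The valid values for `origin` and `destination` are listed in `?codelist`.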

Installation

From the R console, type:

install.packages("countrycode")

To install the latest development version, you can use the remotes package:

library(remotes)
install_github('vincentarelbundock/countrycode')

Supported codes

To get an up-to-date list of supported country codes, install the package and type ?codelist. These include:

  • 600+ variants of country names in different languages and formats.
  • AR5
  • Continent and region identifiers.
  • Correlates of War (numeric and character)
  • European Central Bank
  • EUROCONTROL - The European Organisation for the Safety of Air Navigation
  • Eurostat
  • Federal Information Processing Standard (FIPS)
  • Food and Agriculture Organization of the United Nations
  • Global Administrative Unit Layers (GAUL)
  • Geopolitical Entities, Names and Codes (GENC)
  • Gleditsch & Ward (numeric and character)
  • International Civil Aviation Organization
  • International Monetary Fund
  • International Olympic Committee
  • ISO (2/3-character and numeric)
  • Polity IV
  • United Nations
  • United Nations Procurement Division
  • Varieties of Democracy
  • World Bank
  • World Values Survey
  • Unicode symbols (flags)


countrycode's Issues

Various old and colonial names aren't matched

This particular issue is probably a neverending job, but here are some I came across:

  • Gold Coast
  • Upper Volta
  • Portuguese Guinea
  • Basutoland
  • Northern Rhodesia
  • Southern Rhodesia
  • Rhodesia
  • The Argentine
  • Dutch Guiana
  • Bohemia
  • Czechia
  • French Republic
  • Gaul
  • Hellas
  • Bessarabia
  • Bassarabia
  • Rumania
  • Roumania
  • Mesopotamia
  • Trucial States
  • Formosa
  • New Hebrides

Missing Sint Maarten and Curaçao

I was working with some WTO trade data and it seems that Sint Maarten and Curaçao are missing from the data. I'm willing to fix it and open a PR if you would like.

FAOstat code is incomplete (island nations)

I am working with some FAOSTAT data and wanted to convert some FAO codes to iso3c codes using your package. Great work by the way, it is very convenient. However, when I did a random check I noticed that some codes were not converted. Having a closer look and comparing the data table with the FAOSTAT country classification (http://faostat.fao.org/site/371/DesktopDefault.aspx?PageID=371) I found that a number of countries have missing data while the FAO code is in fact available. Most of them are tiny islands so not that important but it was striking that the value for Singapore is missing in the package.

I attach an excel sheet (still have to get a GitHub account) with the additions so you can update the package.

South Sudan

I think there might be a minor problem with “South Sudan,” which seceded from “Sudan” earlier this year.

The country.name and regex columns (at least) should be different between the two Sudans.

regex: CONGO, REPUBLIC OF

Email bug report:


Thank you for the countrycode package. There is one issue with it. In
countrycode_data, the row for "CONGO, REPUBLIC OF" has an empty cell in the
"regex" column, which is why all countries above it match it when one uses
regex.
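
The behaviour described in this report follows from how regular expressions work: an empty pattern matches every string. A minimal base-R sketch, using a hypothetical two-row dictionary rather than the real countrycode_data, and a guard that is an assumption rather than the package's own code:

```r
# An empty pattern matches every input string.
stopifnot(grepl("", "Afghanistan"), grepl("", "Brazil"))

# Hypothetical two-row dictionary; the second row has an empty regex cell.
dict <- data.frame(
  name  = c("BRAZIL", "CONGO, REPUBLIC OF"),
  regex = c("brazil", ""),
  stringsAsFactors = FALSE
)

# Return the names of all rows whose regex matches the input string.
match_rows <- function(x, dict) {
  hits <- vapply(dict$regex, function(p) grepl(p, x, ignore.case = TRUE),
                 logical(1))
  dict$name[hits]
}

print(match_rows("Brazil", dict))       # both rows match

# Guard (an assumption): skip rows with an empty or missing pattern.
safe_dict <- dict[!is.na(dict$regex) & nzchar(dict$regex), ]
print(match_rows("Brazil", safe_dict))  # only "BRAZIL"
```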

No 'wb' to 'iso3c' matches for Andorra, Romania, Timor-Leste and Congo, the Democratic Republic

Running version 0.18 from CRAN I get these results:

> countrycode(c("AND","ROU","TLS","COD"), origin="wb", destination="iso3c")
[1] NA NA NA NA

For these World Bank codes I should be getting the same iso3c codes.

Additionally, the following do not match:

  • CHI: Channel Islands
  • IMN: Isle of Man
  • KSV: Kosovo
  • PSE: West Bank and Gaza

According to the Wikipedia ISO 3166 article, these codes don't seem to be official so I don't know what is the proper handling of these codes.

region assignment

I've found a few country codes that don't assign regions correctly.

library(countrycode)
library(cshapes)
cshp.data <- cshp()
#convert country data set to country year data set
base <- cshapes2yearly(cshp.data, vars = "COWCODE", useGW = T)
names(base) <- c("ccode.a", "year", "cowcode")
base <- subset(base, year >= 1975 & year <= 2008)
base$region <- countrycode(base$cowcode, "cown", "region")

I am not sure why, but no region is assigned for the following ccodes:
260, 265, 315, 345, 347, 678, 680, 713, 731, 816, 971, 972, 973

[Request] Short country names (for legibility, graph labels)

It would be helpful to have short country names as an option. For example:

  • "Bolivia, Plurinational State Of" = "Bolivia"
  • "Korea, Democratic People'S Republic Of" = "Korea, DPR"

Please let me know if this option can be implemented soon, or whether you're interested in me contributing this option.

(North/South) Vietnam

I believe COW project uses 816 for Vietnam instead of 817, which was assigned to South Vietnam.

Possibility to add custom regional mappings

It would be good to let users add custom regional mappings, e.g. by providing a data frame containing an iso3c country code vector and a regional map. That way, only regional mappings of broader interest would need to be included in the CRAN package, whereas specialized mappings could be provided on demand.

Added benefit of this would be that one could translate custom regional mappings into more general ones.
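
A minimal base-R sketch of what such a user-supplied mapping could look like. The table, the group labels, and the helper name `to_custom_region` are all hypothetical and not part of the package:

```r
# Hypothetical custom mapping supplied by the user: iso3c code -> region label.
custom_map <- data.frame(
  iso3c  = c("FRA", "DEU", "BRA", "ARG"),
  region = c("group_a", "group_a", "group_b", "group_b"),
  stringsAsFactors = FALSE
)

# Look up each code in the user's table; codes absent from it become NA.
to_custom_region <- function(codes, map) {
  map$region[match(codes, map$iso3c)]
}

print(to_custom_region(c("BRA", "FRA", "JPN"), custom_map))
# "group_b" "group_a" NA
```

Keeping the mapping in a plain data frame means it can be shared separately from the package and merged with other mappings by iso3c code.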

New Zealand

Hi,

It looks like the Perl regex that matches the Aland Islands also captures New Zealand, despite the !new instruction:

> countrycode("New Zealand", origin = "country.name", destination = "country.name", warn = T)
[1] "NEW ZEALAND"
Warning message:
 In countrycode("New Zealand", origin = "country.name", destination = "country.name",  :
  Some strings were matched more than once: New Zealand,ALAND ISLANDS,NEW ZEALAND

Any idea how the regex might be modified to effectively match only New Zealand?
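
One possible direction, sketched in base R with a hypothetical pattern (not the package's actual regex): require a word boundary before "aland", so that the pattern cannot match in the middle of a longer word.

```r
# Hypothetical pattern, not the package's actual regex: require a word
# boundary before "aland" so that it cannot match inside a longer word.
aland_pattern <- "\\baland"

inputs <- c("Aland Islands", "New Zealand", "Bechuanaland", "Nyasaland")
hits <- grepl(aland_pattern, inputs, ignore.case = TRUE, perl = TRUE)
print(setNames(hits, inputs))
# Only "Aland Islands" matches; in the other strings "aland" follows a letter.
```

This sketch does not handle the accented spelling "Åland"; a real pattern would need an extra alternative for it.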

Serbia not included

I was working on data on Central Bank Governors and they include Serbia. When using

countrycode(county, "country.name", "cown")

the package does not return the correlates of war code 340.

country.name regex double match warning

I was wondering, however, whether it is always possible to avoid exact matching. I suspect there are some cases where double matching occurs, i.e. a country is matched twice. Maybe you could introduce a check such as:
for (j in matches) {
  if (!is.na(destination.vector[j])) warning("country ", as.vector(sourcevar), " matched twice: ", destination.vector[j], " and ", Destination_code)
  destination.vector[j] <- Destination_code
}

with this, I get for example:

countrycode("nigeria", "country.name", "iso3c")
[1] "NGA"
Warning message:
In countrycode("nigeria", "country.name", "iso3c") :
country nigeria matched twice: NER and NGA

or is it intentional?
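
A self-contained base-R sketch of such a check, together with one way to remove the ambiguity. The two-entry dictionary and the function names are hypothetical; the package's real patterns may differ:

```r
# Hypothetical mini-dictionary; "niger" alone also matches "nigeria".
patterns <- c(NER = "niger", NGA = "nigeria")

# Return the codes of every pattern that matches a single input string.
matching_codes <- function(x, patterns) {
  hits <- vapply(patterns,
                 function(p) grepl(p, x, ignore.case = TRUE, perl = TRUE),
                 logical(1))
  names(patterns)[hits]
}

# Warn when one string matches more than one pattern.
check_double_match <- function(x, patterns) {
  codes <- matching_codes(x, patterns)
  if (length(codes) > 1) {
    warning("country ", x, " matched more than once: ",
            paste(codes, collapse = ", "))
  }
  codes
}

print(suppressWarnings(check_double_match("nigeria", patterns)))  # "NER" "NGA"

# One possible fix: a negative lookahead keeps "niger" from matching "nigeria".
fixed <- c(NER = "niger(?!ia)", NGA = "nigeria")
print(check_double_match("nigeria", fixed))  # "NGA"
print(check_double_match("Niger", fixed))    # "NER"
```

Lookaheads require `perl = TRUE` in `grepl`.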

Welcome to the Jungle: Several variations of the Congos misidentified

There's a nonzero chance that I myself have made a typo in these lists. Beware. :-)

These are all synonyms for Congo-Brazzaville (the small one), and their matched values:

  • Republic of the Congo -> Congo-Brazzaville (Correct.)
  • Congo, Rep. -> Congo-Brazzaville (Correct.)
  • ROC -> NA (Correct, since it might be Taiwan.)
  • Congo-Brazzaville -> Congo-Kinshasa (Wrong.)
  • French Congo -> Congo-Brazzaville (Correct.)

These are all synonyms for Congo-Kinshasa (the BIG one), and their matched values:

  • Democratic Republic of the Congo -> Congo-Kinshasa (Correct.)
  • Congo, Dem. Rep. -> Congo-Kinshasa (Correct.)
  • DR Congo -> Congo-Kinshasa (Correct.)
  • DRC -> NA
  • DROC -> NA
  • RDC -> NA
  • Congo-Kinshasa -> Congo-Brazzaville (Wrong.)
  • Congo-Zaire -> Congo-Kinshasa (Correct.)
  • Belgian Congo -> Congo-Brazzaville (Wrong.)
  • Republic of the Congo-Léopoldville -> Congo-Brazzaville (Wrong.)
  • Congo Free State -> Congo-Brazzaville (Wrong.)

Polity IV “scode” not implemented

Polity IV uses "scode", supposedly (and mostly) similar to COW codes, but there appear to be differences:

  • ETI -> NA (Should match Ethiopia--historical issue?)
  • FJI -> NA (Should match Fiji)
  • IVO -> NA (Should match Ivory Coast)
  • KOS -> NA (Should match Kosovo)
  • MNT -> NA (Should match Montenegro)
  • PKS -> NA (Should match Pakistan--historical issue?)
  • RUM -> NA (Should match Romania)
  • SER -> NA (Should match Serbia)
  • VIE -> NA (Should match Vietnam)
  • ZAI -> NA (Should match Democratic Republic of Congo)

Add functionality of the Stata package ``kountry``

This package, for example, allows the user to easily add spellings of each country's long name. The package could also provide wrapper functions to extract continent, region, and convert long names to country codes.

Error message for origin code

- A more explicit message for "Origin code not supported"

You could perhaps use:
if (!origin %in% o_codes){stop("Origin code not supported. Should be one of: ", paste(o_codes, collapse=", "))}
if (!destination %in% d_codes){stop("Destination code not supported. Should be one of: ", paste(d_codes, collapse=", "))}

Note that the more "classic" alternative is to put these values in the function definition itself, such as:
countrycode <- function (sourcevar, origin=c("cowc", "cown", "fips04", "imf", "iso2c", "iso3c", "iso3n", "country.name"), destination, nomatch=FALSE){

o_codes<-match.arg(origin)

which performs the check automatically and raises a standard error message.
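
A runnable sketch of the `match.arg` approach. `convert_stub` is a hypothetical stand-in that only validates and returns its argument; it is not the package's function:

```r
# Sketch of argument validation with match.arg(); convert_stub is a
# hypothetical stand-in for the real function and only returns the choice.
convert_stub <- function(origin = c("cowc", "cown", "fips04", "imf", "iso2c",
                                    "iso3c", "iso3n", "country.name")) {
  origin <- match.arg(origin)  # errors if origin is not one of the choices
  origin
}

print(convert_stub("iso3c"))  # "iso3c"

# An unsupported code triggers the standard error listing the valid choices.
msg <- tryCatch(convert_stub("bogus"), error = function(e) conditionMessage(e))
print(msg)
```

Two behaviours of `match.arg` are worth keeping in mind: when the argument is omitted it silently picks the first choice, and it accepts unambiguous partial matches.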

Official country names that are misidentified or not identified

  • Democratic People's Republic of Korea -> South Korea
  • Republic of Guinea -> NA
  • Hellenic Republic -> NA
  • United Mexican States -> United States
  • Republic of the Union of Myanmar -> Reunion
  • Independent State of Samoa -> NA
  • Republic of South Sudan -> Sudan
  • Swiss Confederation -> NA

[Request] Google Maps API-compatible country names

This package is terrific; thanks so much for creating and maintaining it.

Any chance you could add country names that are fully compatible with Google Maps API as an option for origin and destination? Maybe call it "google.names"?

This comes up because I am trying to use ggmap's get_map(source = "google", location = [countryname]) to map and then plot points in countries called from a data frame that has used countrycode to create those names from iso3c codes. Most of the names work fine, but I've found a number of exceptions that return wacky results (e.g., a map of Tasmania when location = "Tanzania, United Republic of"). I'm probably not the only person doing this, and I think this request overlaps with a previous one for an option for short-form names.

Also, I don't think it would require many variations on the existing name list. I went through and tried mapping ones that seemed like they could be problematic, and these were the only ones that seemed to require attention (with working alternatives after the -->):

Tanzania, United Republic of --> Tanzania
Bolivia, Plurinational State of --> Bolivia
Congo, the Democratic Republic of the --> Democratic Republic of the Congo; DRC
Korea, Democratic People's Republic of --> North Korea
Korea, Republic of --> South Korea
Macedonia, the former Yugoslav Republic of --> Macedonia
Syrian Arab Republic --> Syria
Venezuela, Bolivarian Republic of --> Venezuela

Oh, and one last note: I tried using the iso3c codes as the location in the call to get_map(), but that doesn't work either.

Congo CGO -> COG

Is it possible that there is a mistake with Congo, which should have the iso3 code COG and not CGO?

Could there be a wiki page explaining the differences between the various country code systems?

I have been trying to Google the difference between the various systems -- however the information is very scattered and unsystematic. At the moment, that leaves my choice of, say, iso2c vs iso3c largely arbitrary.

I suspect that in the process of writing this package, you have gained a working understanding of the systems. A wiki page would greatly help the package user choosing the appropriate system.

countrycode_data: Rep. of Congo & DRC

Update: countrycode_data

The entry for DR Congo should be corrected to:

c("CONGO, THE DEMOCRATIC REPUBLIC OF THE", "DRC", 490, "CG", 636, "CD", "COD", 180, "._dem.congo.|._congo.dem.|._congo.dr.|._dr.congo.|.zaire.|._congo._br.*", "Middle Africa", "Africa")

A new entry for the Republic of the Congo should be added (I'm not sure about the correct regex expression(s)):

c("CONGO, REPUBLIC OF THE", "CON", 484, "CF", 634, "CG", "CGO", 178, "regex expression", "Middle Africa", "Africa")

Return bad match object

Often, when some values are not matched from origin to destination, I would like to be able to see which ones they are. The function already prints them, but it might be good to export them for the user. Below is a short explanatory script with the new lines of code. Note that I use a small trick of exporting under a different name each time; I am not sure that is useful (perhaps it should become a function argument itself?).

if(length(nomatch) > 0){
dest_li_nam<-"counCode_unmatched"
if(dest_li_nam%in%ls(envir=.GlobalEnv)) dest_li_nam<- paste(dest_li_nam, paste(sample(letters, size=3), collapse=""),sep="_")
assign(dest_li_nam, nomatch, envir=.GlobalEnv)
warning("Some values were not matched: (exported as ", dest_li_nam, "): ", paste(nomatch, collapse=", "), "\n")
}

### EXAMPLE

So, the typical case:

library(WDI)
library(countrycode)
ex<-WDI(indicator="HH.DHS.YRS.15UP.GIN")

With the new function:

a<-countrycode(ex$iso2c, "iso2c", "iso3c",warn=TRUE)
Warning message:
In countrycode(ex$iso2c, "iso2c", "iso3c", warn = TRUE) :
Some values were not matched: 1A, 1W, 4E, 7E, 8S, CW, EU, JG, KV, NA, OE, S3, SS, SX, XC, XD, XE, XJ, XL, XM, XN, XO, XP, XQ, XR, XS, XT, XU, Z4, Z7, ZF, ZG, ZJ, ZQ

aa<-countrycode2(ex$iso2c, "iso2c", "iso3c",warn=TRUE)
Warning message:
In countrycode2(ex$iso2c, "iso2c", "iso3c", warn = TRUE) :
Some values were not matched: (exported as counCode_unmatched_ahk): 1A, 1W, 4E, 7E, 8S, CW, EU, JG, KV, NA, OE, S3, SS, SX, XC, XD, XE, XJ, XL, XM, XN, XO, XP, XQ, XR, XS, XT, XU, Z4, Z7, ZF, ZG, ZJ, ZQ

Now it is easy to see where the problem is!

unique(subset(ex, iso2c%in%counCode_unmatched_ahk, "country",drop=TRUE))
[1] "Arab World"
[2] "Caribbean small states"
[3] "East Asia & Pacific (developing only)"
[4] "East Asia & Pacific (all income levels)"
[5] "Europe & Central Asia (developing only)"
[6] "Europe & Central Asia (all income levels)"
[7] "Euro area"
[8] "European Union"
[9] "High income"
[10] "Heavily indebted poor countries (HIPC)"
[11] "Latin America & Caribbean (developing only)"
[12] "Latin America & Caribbean (all income levels)"
[13] "Least developed countries: UN classification"
[14] "Low income"
[15] "Lower middle income"
[16] "Low & middle income"
[17] "Middle East & North Africa (all income levels)"
[18] "Middle income"
[19] "Middle East & North Africa (developing only)"
[20] "North America"
[21] "High income: nonOECD"
[22] "High income: OECD"
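
As an alternative to assigning into the global environment, the unmatched values could travel with the result itself. The sketch below attaches them as an attribute; `convert_with_unmatched` and the toy lookup vector are hypothetical and only illustrate the idea:

```r
# Sketch: return the result with the unmatched inputs attached as an
# attribute. convert_with_unmatched and the toy dictionary are hypothetical.
convert_with_unmatched <- function(sourcevar, dict) {
  out <- unname(dict[sourcevar])  # NA where the code is not in dict
  unmatched <- unique(sourcevar[is.na(out) & !is.na(sourcevar)])
  if (length(unmatched) > 0) {
    warning("Some values were not matched: ",
            paste(unmatched, collapse = ", "))
  }
  attr(out, "unmatched") <- unmatched
  out
}

dict <- c(FR = "FRA", DE = "DEU")  # toy iso2c -> iso3c lookup
res <- suppressWarnings(convert_with_unmatched(c("FR", "1A", "DE", "XC"), dict))
print(attr(res, "unmatched"))  # "1A" "XC"
```

This avoids side effects outside the function and makes the random-suffix naming trick unnecessary, at the cost of the caller having to read the attribute explicitly.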

Regex for Yugoslavia looks weird

I haven't experienced any issue, but I noticed that the regex to match Yugoslavia is odd:

.*yugoslavia.*|.*yugoslavia.*

If that's intentional, feel free to close out this bug.

Trinidad regex

Also, I notice that the regex for Trinidad contains "tabago". Should it not be "tobago"?
subset(countrycode_data, regex==grep("rinidad", countrycode_data$regex, value=TRUE))

Various non-official names that are misidentified

Misidentification is significantly worse than nonidentification, so I've erred on the side of including uncommon names on this list.

  • DPRK -> Cambodia
  • Byelorussia -> Russia
  • British Honduras -> Honduras
  • Bechuanaland -> Aland
  • Nyasaland -> Aland
  • British East Africa -> South Africa
  • East Africa Protectorate -> South Africa
  • [Moldavian, Ukrainian, etc.] Soviet Socialist Republic -> All become Russian Federation instead of the corresponding post-Soviet state
  • East Pakistan -> Pakistan
  • Chinese Taipei -> Thailand
  • Taipei -> Thailand

"Republic of China" (Taiwan's official name) matches to China

While the country name recognition is great, I started playing around with it and noticed a geopolitically relevant flaw in a specific edge case: the territory officially named the "Republic of China" is also unofficially known as Taiwan, but countrycode() matches the string to the (People's Republic of) China instead.

Example:

> countrycode("Republic of China", "country.name", "iso2c")
[1] "CN"
> countrycode("Republic of China", "country.name", "country.name")
[1] "China"
> countrycode("TW", "iso2c", "country.name")
[1] "Taiwan, Province of China"
> countrycode("Taiwan", "country.name", "country.name")
[1] "Taiwan, Province of China"

Expected behavior would be to map all of those strings to TW or Taiwan, Province of China, as appropriate.

I am using version 0.18, the latest (I believe) version from CRAN.

World Bank Income Class mapping

First of all, many thanks for providing this great package! I have been using it a lot recently and found that it would be great if there was an option to map countries to their income group as classified by the World Bank (i.e. High Income, Upper-Middle Income, Lower-Middle Income and Low Income).

destination.vector<-NULL

Also, a few suggestions:

  • destination.vector<-NULL
    for (z in 1:length(SOURCEVAR)){destination.vector<-c(NA,destination.vector)}

    can also be written simply as: destination.vector <- rep(NA, length(SOURCEVAR))
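
A quick base-R check that the two forms give the same result, plus an edge case where they differ. The three-element input is an arbitrary example:

```r
SOURCEVAR <- c("France", "Germany", "Japan")

# Original approach: grow the vector one NA at a time.
destination.vector <- NULL
for (z in 1:length(SOURCEVAR)) {
  destination.vector <- c(NA, destination.vector)
}

# Vectorised equivalent.
preallocated <- rep(NA, length(SOURCEVAR))
stopifnot(identical(destination.vector, preallocated))

# Edge case: for empty input, 1:length(x) is c(1, 0), so the loop runs twice.
empty <- character(0)
looped <- NULL
for (z in 1:length(empty)) looped <- c(NA, looped)
print(length(looped))                  # 2
print(length(rep(NA, length(empty))))  # 0
```

Besides being shorter, the `rep` form therefore also behaves correctly on zero-length input.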

Congo (again!)

I notice, however, that there is no regex for Congo:
subset(countrycode_data, iso3c=="CGO","regex")

As a consequence, any country appearing before Congo in countrycode_data is assigned to Congo:

for(i in countrycode_data[1:50,"country.name"]) print(countrycode(i, ORIGIN="country.name", DESTINATION="iso3c"))

country-year

Some coding schemes use different codes for "similar" countries over the years. For example:

Ethiopia is one state until some date with the code 320, then it splits into Ethiopia 321 and Eritrea 322 for the subsequent years.
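
A minimal base-R sketch of a country-year lookup that would handle this case. The codes are the ones quoted in the example above; the 1993 cut-off year, the table layout, and the helper name `code_for` are assumptions made for illustration:

```r
# Hypothetical country-year dictionary. Codes 320/321/322 are the ones quoted
# in the text above; the 1993 cut-off year is an assumption for illustration.
country_years <- data.frame(
  name  = c("Ethiopia", "Ethiopia", "Eritrea"),
  code  = c(320, 321, 322),
  start = c(-Inf, 1993, 1993),
  end   = c(1992, Inf, Inf),
  stringsAsFactors = FALSE
)

# Return the code valid for a given name and year, or NA if none applies.
code_for <- function(name, year, dict = country_years) {
  hit <- dict$name == name & dict$start <= year & year <= dict$end
  if (any(hit)) dict$code[which(hit)[1]] else NA_real_
}

print(code_for("Ethiopia", 1980))  # 320
print(code_for("Ethiopia", 2000))  # 321
print(code_for("Eritrea", 1980))   # NA
```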

Unmatched values reporting

It would be very useful if you could add a check for unmatched values, for example:

potential.nas<-subset(data.frame(SOURCEVAR,destination.vector), !is.na(SOURCEVAR),"destination.vector",drop=TRUE)
if(any(is.na(potential.nas))) {
  naValues<-subset(data.frame(SOURCEVAR,destination.vector), !is.na(SOURCEVAR)&is.na(destination.vector),"SOURCEVAR",drop=TRUE)
  warning("Some values not matched: ", paste(naValues,collapse=", "),"\n")
}

Common name matched more than once

Hi,

When attempting to match the following names, some strings are matched more than once, even though the full country names make it obvious that the second match is the more accurate one:

1: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: Dem. Rep. of the Congo,CG,CD

2: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: Hong Kong, China,CN,HK

3: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: Macao, China,CN,MO

4: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: South Sudan,SS,SD

Nice work, by the way.
