vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.

Home Page: https://vincentarelbundock.github.io/countrycode

License: GNU General Public License v3.0


countrycode's Introduction

countrycode standardizes country names, converts them into ~40 different coding schemes, and assigns region descriptors. Scroll down for more details or visit the countrycode CRAN page.

If you use countrycode in your research, we would be very grateful if you could cite our paper:

Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman, (2018). countrycode: An R package to convert country names and country codes. Journal of Open Source Software, 3(28), 848, https://doi.org/10.21105/joss.00848

Why `countrycode`?

The Problem

Different data sources use different coding schemes to represent countries (e.g. CoW or ISO). This poses two main problems: (1) some of these coding schemes are less than intuitive, and (2) merging these data requires converting from one coding scheme to another, or from long country names to a coding scheme.

The Solution

The countrycode function can convert to and from 40+ different country coding schemes, and to 600+ variants of country names in different languages and formats. It uses regular expressions to convert long country names (e.g. Sri Lanka) into any of those coding schemes or country names. It can create new variables with various regional groupings.
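
As a minimal usage sketch of the conversions described above, assuming the package is installed and that the `country.name`, `iso3c`, and `continent` codes are available in the installed version (check `?codelist`):

```r
# Minimal usage sketch; skipped quietly if countrycode is not installed.
if (requireNamespace("countrycode", quietly = TRUE)) {
  library(countrycode)

  countries <- c("Sri Lanka", "Canada", "Germany")

  # Long country names are matched with regular expressions.
  iso3 <- countrycode(countries, origin = "country.name", destination = "iso3c")

  # Codes can in turn be mapped to a regional grouping.
  continent <- countrycode(iso3, origin = "iso3c", destination = "continent")

  print(data.frame(countries, iso3, continent))
}
```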

Installation

From the R console, type:

install.packages("countrycode")

To install the latest development version, you can use the remotes package:

library(remotes)
install_github('vincentarelbundock/countrycode')

Supported codes

To get an up-to-date list of supported country codes, install the package and type ?codelist. These include:

600+ variants of country names in different languages and formats.
AR5
Continent and region identifiers.
Correlates of War (numeric and character)
European Central Bank
EUROCONTROL - The European Organisation for the Safety of Air Navigation
Eurostat
Federal Information Processing Standard (FIPS)
Food and Agriculture Organization of the United Nations
Global Administrative Unit Layers (GAUL)
Geopolitical Entities, Names and Codes (GENC)
Gleditsch & Ward (numeric and character)
International Civil Aviation Organization
International Monetary Fund
International Olympic Committee
ISO (2/3-character and numeric)
Polity IV
United Nations
United Nations Procurement Division
Varieties of Democracy
World Bank
World Values Survey
Unicode symbols (flags)


countrycode's Issues

Serbia and Montenegro is not in there separately

Could consider adding this with, for example, the reserved iso3c "SCG"

Various old and colonial names aren't matched

This particular issue is probably a neverending job, but here are some I came across:

Gold Coast
Upper Volta
Portuguese Guinea
Basutoland
Northern Rhodesia
Southern Rhodesia
Rhodesia
The Argentine
Dutch Guiana
Bohemia
Czechia
French Republic
Gaul
Hellas
Bessarabia
Bassarabia
Rumania
Roumania
Mesopotamia
Trucial States
Formosa
New Hebrides

Add instructions for ad hoc merge of new codes in the docs for countrycode_data

Add something like this:

library(countrycode)
countrycode::countrycode_data = merge(countrycode::countrycode_data, newcodes)
countrycode(x, 'newcode', 'iso3c')
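
In recent versions of the package, a similar ad hoc mapping can be sketched without touching the built-in dictionary, by passing a data frame through the `custom_dict` argument of `countrycode()`. The `newcode` column and its values below are invented for illustration, and older releases may not have this argument.

```r
# Hypothetical dictionary: the `newcode` values are invented for this sketch.
newcodes <- data.frame(
  iso3c   = c("CAN", "MEX", "USA"),
  newcode = c("N01", "N02", "N03"),
  stringsAsFactors = FALSE
)

if (requireNamespace("countrycode", quietly = TRUE)) {
  # custom_dict replaces the built-in dictionary for this call only.
  converted <- countrycode::countrycode(
    c("N03", "N01"),
    origin = "newcode",
    destination = "iso3c",
    custom_dict = newcodes
  )
  print(converted)
}
```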

Could be coded more efficiently

Could make use of factor variables to avoid duplication of work

Missing Sint Maarten and Curaçao

I was working with some WTO trade data and it seems that Sint Maarten and Curaçao are missing from the data. I'm willing to fix it and open up a PR if you would like.

FAOstat code is incomplete (island nations)

I am working with some FAOSTAT data and wanted to convert some FAO codes to iso3c codes using your package. Great work by the way, it is very convenient. However, when I did a random check I noticed that some codes were not converted. Having a closer look and comparing the data table with the FAOSTAT country classification (http://faostat.fao.org/site/371/DesktopDefault.aspx?PageID=371) I found that a number of countries have missing data while the FAO code is in fact available. Most of them are tiny islands so not that important but it was striking that the value for Singapore is missing in the package.

I attach an Excel sheet (I still have to get a GitHub account) with the additions so you can update the package.

South Sudan

I think there might be a minor problem with “South Sudan,” which seceded from “Sudan” earlier this year.

The country.name and regex columns (at least) should be different between the two Sudans.

regex: CONGO, REPUBLIC OF

Email bug report:

Thank you for the countrycode package. There is one issue with it: in
countrycode_data, the row for "CONGO, REPUBLIC OF" has an empty cell in the
"regex" column, which is why all countries above it match it when one uses
regex.
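
The effect described in this report can be reproduced in base R: an empty pattern matches every string. The toy dictionary below is a stand-in for illustration, not the package's actual data.

```r
# Toy dictionary: the second row has an empty regex, as in the bug report.
toy_dict <- data.frame(
  country.name = c("ALBANIA", "CONGO, REPUBLIC OF", "FRANCE"),
  regex        = c("albania", "", "france"),
  stringsAsFactors = FALSE
)

# An empty pattern matches every input string.
matches_empty <- grepl(toy_dict$regex[2], c("albania", "france", "peru"))
print(matches_empty)

# A defensive check that flags rows with a missing or empty regex.
bad_rows <- toy_dict$country.name[is.na(toy_dict$regex) | toy_dict$regex == ""]
print(bad_rows)
```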

No 'wb' to 'iso3c' matches for Andorra, Romania, Timor-Leste and Congo, the Democratic Republic

Running version 0.18 from CRAN I get these results:

> countrycode(c("AND","ROU","TLS","COD"), origin="wb", destination="iso3c")
[1] NA NA NA NA

For these World Bank codes I should be getting the same iso3c codes.

Additionally, the following do not match:

CHI: Channel Islands
IMN: Isle of Man
KSV: Kosovo
PSE: West Bank and Gaza

According to the Wikipedia ISO 3166 article, these codes don't seem to be official so I don't know what is the proper handling of these codes.

region assignment

I've found a few country codes that don't assign regions correctly.

library(countrycode)
library(cshapes)
cshp.data <- cshp()
#convert country data set to country year data set
base <- cshapes2yearly(cshp.data, vars = "COWCODE", useGW = T)
names(base) <- c("ccode.a", "year", "cowcode")
base <- subset(base, year >= 1975 & year <= 2008)
base$region <- countrycode(base$cowcode, "cown", "region")

I am not sure why, but it does not parse the following ccodes:
260, 265, 315, 345, 347, 678, 680, 713, 731, 816, 971, 972, 973

[Request] Short country names (for legibility, graph labels)

It would be helpful to have short country names as an option. For example:

"Bolivia, Plurinational State Of" = "Bolivia"
"Korea, Democratic People'S Republic Of" = "Korea, DPR"

Please let me know if this option can be implemented soon, or whether you're interested in me contributing this option.

Add UN and FAO codes

Could you add UN (and eventually FAO, for which I work) codes to this list?

(North/South) Vietnam

I believe COW project uses 816 for Vietnam instead of 817, which was assigned to South Vietnam.

Possibility to add custom regional mappings

It would be good to give users the possibility to add custom regional mappings, e.g. by providing a data frame containing an iso3c country code vector and a regional map. This way, only regional mappings of broader interest would need to be included in the CRAN package, whereas specialized mappings could be provided on demand.

Added benefit of this would be that one could translate custom regional mappings into more general ones.
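
A rough base R sketch of what such a user-supplied mapping could look like, including the roll-up from a custom region into a more general one. All table contents and the `to_region()` helper are hypothetical.

```r
# Hypothetical user-supplied mapping from iso3c codes to a custom region.
custom_regions <- data.frame(
  iso3c  = c("DEU", "FRA", "KEN", "NGA"),
  region = c("Western Europe", "Western Europe", "East Africa", "West Africa"),
  stringsAsFactors = FALSE
)

# Hypothetical roll-up of custom regions into broader ones.
region_rollup <- c(
  "Western Europe" = "Europe",
  "East Africa"    = "Africa",
  "West Africa"    = "Africa"
)

to_region <- function(codes, mapping) {
  # match() returns NA for codes absent from the mapping.
  mapping$region[match(codes, mapping$iso3c)]
}

codes <- c("FRA", "NGA", "BRA")
fine  <- to_region(codes, custom_regions)
broad <- unname(region_rollup[fine])
print(data.frame(codes, fine, broad))
```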

New Zealand

Hi,

It looks like the Perl regex to match the Aland Islands also captures New Zealand, despite the !new instruction:

> countrycode("New Zealand", origin = "country.name", destination = "country.name", warn = T)
[1] "NEW ZEALAND"
Warning message:
 In countrycode("New Zealand", origin = "country.name", destination = "country.name",  :
  Some strings were matched more than once: New Zealand,ALAND ISLANDS,NEW ZEALAND

Any idea how the regex might be modified to effectively match only New Zealand?
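
One possible direction, sketched here with a hypothetical pattern rather than the package's actual regex, is a Perl negative lookbehind that only accepts "aland" when no letter precedes it. The sketch ignores the accented spelling "Åland".

```r
# Hypothetical pattern: "aland" only counts when no letter precedes it.
aland_pattern <- "(?<![a-z])aland"

is_aland <- function(x) {
  grepl(aland_pattern, tolower(x), perl = TRUE)
}

inputs <- c("Aland Islands", "New Zealand", "Bechuanaland")
print(setNames(is_aland(inputs), inputs))
```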

Serbia not included

I was working on data on Central Bank Governors and they include Serbia. When using

countrycode(county, "country.name", "cown")

the package does not return the correlates of war code 340.

eq guinea confusion

TW is not matched to a region of the world

TW is not matched to a region of the world (Eastern Asia) nor continent (Asia)

country.name regex double match warning

I was wondering, however, whether it is always possible to avoid exact matching. I suspect there are some cases where double matching occurs, i.e. a country is matched twice. Maybe you could introduce a check such as:
for (j in matches) {
  if(!is.na(destination.vector[j])) warning("country ", as.vector(sourcevar), " matched twice: ", destination.vector[j], " and ", Destination_code)
  destination.vector[j] <- Destination_code
}

with this, I get for example:

countrycode("nigeria", "country.name", "iso3c")
[1] "NGA"
Warning message:
In countrycode("nigeria", "country.name", "iso3c") :
country nigeria matched twice: NER and NGA

or is it intentional?
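
The same idea can be sketched outside the package by testing every pattern against every input and reporting inputs hit by more than one pattern. The dictionary below uses deliberately loose, made-up patterns, and `find_double_matches()` is a hypothetical helper.

```r
# Toy dictionary with deliberately loose patterns (not the package's regexes).
toy_dict <- data.frame(
  iso3c = c("NER", "NGA", "FRA"),
  regex = c("niger", "nigeria", "france"),
  stringsAsFactors = FALSE
)

find_double_matches <- function(x, dict) {
  # One row per input string, one column per dictionary entry.
  hits <- vapply(
    dict$regex,
    function(p) grepl(p, x, ignore.case = TRUE),
    logical(length(x))
  )
  hits <- matrix(hits, nrow = length(x))
  ambiguous <- rowSums(hits) > 1
  # For each ambiguous input, list every code whose pattern matched.
  out <- lapply(which(ambiguous), function(i) dict$iso3c[hits[i, ]])
  names(out) <- x[ambiguous]
  out
}

doubles <- find_double_matches(c("nigeria", "France"), toy_dict)
print(doubles)
```

In this toy case, anchoring the first pattern (for example `^niger$`) would remove the ambiguity.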

Welcome to the Jungle: Several variations of the Congos misidentified

There's a nonzero chance that I myself have made a typo in these lists. Beware. :-)

These are all synonyms for Congo-Brazzaville (the small one), and their matched values:

Republic of the Congo -> Congo-Brazzaville (Correct.)
Congo, Rep. -> Congo-Brazzaville (Correct.)
ROC -> NA (Correct, since it might be Taiwan.)
Congo-Brazzaville -> Congo-Kinshasa (Wrong.)
French Congo -> Congo-Brazzaville (Correct.)

These are all synonyms for Congo-Kinshasa (the BIG one), and their matched values:

Democratic Republic of the Congo -> Congo-Kinshasa (Correct.)
Congo, Dem. Rep. -> Congo-Kinshasa (Correct.)
DR Congo -> Congo-Kinshasa (Correct.)
DRC -> NA
DROC -> NA
RDC -> NA
Congo-Kinshasa -> Congo-Brazzaville (Wrong.)
Congo-Zaire -> Congo-Kinshasa (Correct.)
Belgian Congo -> Congo-Brazzaville (Wrong.)
Republic of the Congo-Léopoldville -> Congo-Brazzaville (Wrong.)
Congo Free State -> Congo-Brazzaville (Wrong.)

Polity IV “scode” not implemented

Polity IV uses "scode", supposedly (and mostly) similar to COW codes, but there appear to be differences:

ETI -> NA (Should match Ethiopia--historical issue?)
FJI -> NA (Should match Fiji)
IVO -> NA (Should match Ivory Coast)
KOS -> NA (Should match Kosovo)
MNT -> NA (Should match Montenegro)
PKS -> NA (Should match Pakistan--historical issue?)
RUM -> NA (Should match Romania)
SER -> NA (Should match Serbia)
VIE -> NA (Should match Vietnam)
ZAI -> NA (Should match Democratic Republic of Congo)

Add functionality of the Stata package ``kountry``

This package, for example, allows the user to easily add spellings of each country's long name. The package could also provide wrapper functions to extract continent, region, and convert long names to country codes.

regional aggregates codes

WB, FAO, IMF

Error message for origin code

- A more explicit message for "Origin code not supported"

You could perhaps use:
if (!origin %in% o_codes){stop("Origin code not supported. Should be one of: ", paste(o_codes, collapse=", "))}
if (!destination %in% d_codes){stop("Destination code not supported. Should be one of: ", paste(d_codes, collapse=", "))}

Note that the more "classic" alternative is to put these values in the function definition itself, such as:
countrycode <- function (sourcevar, origin=c("cowc", "cown", "fips04", "imf", "iso2c", "iso3c", "iso3n", "country.name"), destination, nomatch=FALSE){

o_codes<-match.arg(origin)

which performs the check automatically, with a standard error message.
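
A self-contained sketch of the `match.arg()` approach, using a stub function with a hypothetical name and a shortened list of codes:

```r
# Stub with a hypothetical name; it only validates `origin` and returns it.
validate_origin <- function(origin = c("cowc", "cown", "iso2c", "iso3c", "country.name")) {
  # match.arg() stops with a standard error listing the allowed values.
  match.arg(origin)
}

ok  <- validate_origin("iso3c")
bad <- tryCatch(validate_origin("bogus"), error = function(e) e)

print(ok)
print(conditionMessage(bad))
```

Note that `match.arg()` also accepts unambiguous partial matches (for example `"iso3"` for `"iso3c"`), which may or may not be desirable here.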

Taiwan is not supposed to be called "Taiwan, Province Of China"

On the same issue of Taiwan's name as this open issue, but from another angle: Taiwan is not a province of China (as an indignant Taiwanese colleague of mine pointed out) :)

Official country names that are misidentified or not identified

Democratic People's Republic of Korea -> South Korea
Republic of Guinea -> NA
Hellenic Republic -> NA
United Mexican States -> United States
Republic of the Union of Myanmar -> Reunion
Independent State of Samoa -> NA
Republic of South Sudan -> Sudan
Swiss Confederation -> NA

[Request] Google Maps API-compatible country names

This package is terrific; thanks so much for creating and maintaining it.

Any chance you could add country names that are fully compatible with Google Maps API as an option for origin and destination? Maybe call it "google.names"?

This comes up because I am trying to use ggmap's get_map(source = "google", location = [countryname]) to map and then plot points in countries called from a data frame that has used countrycode to create those names from iso3c codes. Most of the names work fine, but I've found a number of exceptions that return wacky results (e.g., a map of Tasmania when location = "Tanzania, United Republic of"). I'm probably not the only person doing this, and I think this request overlaps with a previous one for an option for short-form names.

Also, I don't think it would require many variations on the existing name list. I went through and tried mapping ones that seemed like they could be problematic, and these were the only ones that seemed to require attention (with working alternatives after the -->):

Tanzania, United Republic of --> Tanzania
Bolivia, Plurinational State of --> Bolivia
Congo, the Democratic Republic of the --> Democratic Republic of the Congo; DRC
Korea, Democratic People's Republic of --> North Korea
Korea, Republic of --> South Korea
Macedonia, the former Yugoslav Republic of --> Macedonia
Syrian Arab Republic --> Syria
Venezuela, Bolivarian Republic of --> Venezuela

Oh, and one last note: I tried using the iso3c codes as the location in the call to get_map(), but that doesn't work either.

Congo CGO -> COG

Is it possible there is a mistake with Congo, which should have the iso3 code COG and not CGO?

Two incorrect official short names

Lao People'S Democratic Republic
(Capital "S")
Libyan Arab Jamahiriya
(Just "Libya" since 2011)

Could there be a wiki page explaining the difference between the different country code systems?

I have been trying to Google the difference between the various systems -- however the information is very scattered and unsystematic. At the moment, that leaves my choice of, say, iso2c vs iso3c largely arbitrary.

I suspect that in the process of writing this package, you have gained a working understanding of the systems. A wiki page would greatly help the package user choosing the appropriate system.

countrycode_data: Rep. of Congo & DRC

Update: countrycode_data

The entry for DR Congo should be corrected to:

A new entry for the Republic of the Congo should be added (I'm not sure about the correct regular expression(s)):

c("CONGO, REPUBLIC OF THE", "CON", 484, "CF", 634, "CG", "CGO", 178, "regex expression", "Middle Africa", "Africa")

Return bad match object

Often, when some values are not matched between origin and destination, I would like to be able to see which ones they are. The function already prints them, but it might be good to export them for the user. I've put a short explanatory script below, with the new lines of code. Note that I use a small trick of exporting under a different name each time; I'm not sure that is useful (perhaps the name should become an argument of the function itself?).

if(length(nomatch) > 0){
  dest_li_nam<-"counCode_unmatched"
  if(dest_li_nam%in%ls(envir=.GlobalEnv)) dest_li_nam<- paste(dest_li_nam, paste(sample(letters, size=3), collapse=""),sep="_")
  assign(dest_li_nam, nomatch, envir=.GlobalEnv)
  warning("Some values were not matched: (exported as ", dest_li_nam, "): ", paste(nomatch, collapse=", "), "\n")
}

### EXAMPLE

So, a typical case:

library(WDI)
library(countrycode)
ex<-WDI(indicator="HH.DHS.YRS.15UP.GIN")

With the new function:

a<-countrycode(ex$iso2c, "iso2c", "iso3c",warn=TRUE)
Warning message:
In countrycode(ex$iso2c, "iso2c", "iso3c", warn = TRUE) :
Some values were not matched: 1A, 1W, 4E, 7E, 8S, CW, EU, JG, KV, NA, OE, S3, SS, SX, XC, XD, XE, XJ, XL, XM, XN, XO, XP, XQ, XR, XS, XT, XU, Z4, Z7, ZF, ZG, ZJ, ZQ

aa<-countrycode2(ex$iso2c, "iso2c", "iso3c",warn=TRUE)
Warning message:
In countrycode2(ex$iso2c, "iso2c", "iso3c", warn = TRUE) :
Some values were not matched: (exported as counCode_unmatched_ahk): 1A, 1W, 4E, 7E, 8S, CW, EU, JG, KV, NA, OE, S3, SS, SX, XC, XD, XE, XJ, XL, XM, XN, XO, XP, XQ, XR, XS, XT, XU, Z4, Z7, ZF, ZG, ZJ, ZQ

Now it is easy to see where the problem is!

unique(subset(ex, iso2c%in%counCode_unmatched_ahk, "country",drop=TRUE))
[1] "Arab World"
[2] "Caribbean small states"
[3] "East Asia & Pacific (developing only)"
[4] "East Asia & Pacific (all income levels)"
[5] "Europe & Central Asia (developing only)"
[6] "Europe & Central Asia (all income levels)"
[7] "Euro area"
[8] "European Union"
[9] "High income"
[10] "Heavily indebted poor countries (HIPC)"
[11] "Latin America & Caribbean (developing only)"
[12] "Latin America & Caribbean (all income levels)"
[13] "Least developed countries: UN classification"
[14] "Low income"
[15] "Lower middle income"
[16] "Low & middle income"
[17] "Middle East & North Africa (all income levels)"
[18] "Middle income"
[19] "Middle East & North Africa (developing only)"
[20] "North America"
[21] "High income: nonOECD"
[22] "High income: OECD"

Submit 0.17 to CRAN

Regex for Yugoslavia looks weird

I haven't experienced any issue, but I noticed that the regex to match Yugoslavia is odd:

.*yugoslavia.*|.*yugoslavia.*

If that's intentional, feel free to close out this bug.
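
A quick base R check, on made-up inputs, suggests the duplicated alternation and the surrounding `.*` are redundant for `grepl()`-style matching:

```r
x <- c("Yugoslavia", "Former Yugoslavia (SFRY)", "Slovenia")

redundant <- grepl(".*yugoslavia.*|.*yugoslavia.*", x, ignore.case = TRUE)
simple    <- grepl("yugoslavia", x, ignore.case = TRUE)

print(identical(redundant, simple))
```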

Trinidad regex

Also, I notice that for Trinidad the regex contains "tabago"; should it not be "tobago"?
subset(countrycode_data, regex==grep("rinidad", countrycode_data$regex, value=TRUE))

Country Code for "Kosovo"

Hey,

According to https://countrycode.org/kosovo, the country code for Kosovo is 'XKX', which does not seem to be coded within countrycode_data.

Edit: iso3 code!

Various non-official names that are misidentified

Misidentification is significantly worse than nonidentification, so I've erred on the side of including uncommon names on this list.

DPRK -> Cambodia
Byelorussia -> Russia
British Honduras -> Honduras
Bechuanaland -> Aland
Nyasaland -> Aland
British East Africa -> South Africa
East Africa Protectorate -> South Africa
[Moldavian, Ukrainian, etc] Soviet Socialist Republic -> All become Russian Federation instead of the corresponding post-Soviet state
East Pakistan -> Pakistan
Chinese Taipei -> Thailand
Taipei -> Thailand

"Republic of China" (Taiwan's official name) matches to China

While the country name recognition is great, I started playing around with it and noticed a geopolitically relevant flaw in a specific edge case: The territory officially named the "Republic of China" is also unofficially known as Taiwan, but countrycode() matches the string to the (People's Republic of) China instead.

Example:

> countrycode("Republic of China", "country.name", "iso2c")
[1] "CN"
> countrycode("Republic of China", "country.name", "country.name")
[1] "China"
> countrycode("TW", "iso2c", "country.name")
[1] "Taiwan, Province of China"
> countrycode("Taiwan", "country.name", "country.name")
[1] "Taiwan, Province of China"

Expected behavior would be to map all of those strings to TW or Taiwan, Province of China, as appropriate.

I am using version 0.18, the latest (I believe) version from CRAN.

Regex for Sint Maarten is incorrect

Should be ".*sint.maarten.*"; currently it is ".*sint.maartin.*".

World Bank Income Class mapping

First of all, many thanks for providing this great package! I have been using it a lot recently and found that it would be great if there was an option to map countries to their income group as classified by the World Bank (i.e. High Income, Upper-Middle Income, Lower-Middle Income and Low Income).

Christmas Island is not matched to a region

Country code CX. Should be Oceania

Add World Bank codes

I was wondering whether you would consider a minor extension: The World Bank uses three-letter country codes that are almost the same as “iso3c” – but with a few annoying exceptions.

Their codes are at:

http://siteresources.worldbank.org/DATASTATISTICS/Resources/CLASS.XLS

I have also attached a data frame with the World Bank “list of economies” by country name and code.

Accents don't seem to be handled right

> countrycode("Curaçao", "country.name", "iso3c")
[1] NA # Should work
> countrycode("Curacao", "country.name", "iso3c")
[1] "CUW" # works
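
A possible user-side workaround, not necessarily how the package handles this, is to fold accented characters to ASCII before matching. The replacement table and the `fold_accents()` helper below are illustrative and far from complete.

```r
# Small illustrative replacement table; extend as needed.
accented <- c("\u00e7", "\u00e9", "\u00f4")
plain    <- c("c", "e", "o")

fold_accents <- function(x) {
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }
  x
}

folded <- fold_accents(c("Cura\u00e7ao", "R\u00e9union", "C\u00f4te d'Ivoire"))
print(folded)
```

The folded strings could then be passed to `countrycode()` in place of the originals.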

destination.vector<-NULL

Also, a few suggestions:

destination.vector<-NULL
for (z in 1:length(SOURCEVAR)){destination.vector<-c(NA,destination.vector)}

can also be written simply as: destination.vector <- rep(NA, length(SOURCEVAR))
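
A small comparison of the two forms, wrapped in hypothetical helper functions. Besides being shorter, `rep()` also behaves sensibly for empty input, where `1:length(SOURCEVAR)` expands to `c(1, 0)`.

```r
build_with_loop <- function(SOURCEVAR) {
  destination.vector <- NULL
  for (z in 1:length(SOURCEVAR)) {
    destination.vector <- c(NA, destination.vector)
  }
  destination.vector
}

build_with_rep <- function(SOURCEVAR) rep(NA, length(SOURCEVAR))

x <- c("FRA", "DEU", "ITA")
print(identical(build_with_loop(x), build_with_rep(x)))

# With empty input, 1:length(x) is c(1, 0), so the loop runs twice.
print(length(build_with_loop(character(0))))
print(length(build_with_rep(character(0))))
```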

Some common short names that aren't matched

U.S.
U.S.A.
U.K.
FYROM
United Arab Em. (Barro-Lee data uses this variation. Be careful not to match "United Arab Republic".)
Emirates
UAE
U.A.E.

Congo (again!)

I notice however that there is no regex for Congo:
subset(countrycode_data, iso3c=="CGO","regex")

which has the consequence that any country appearing before Congo in countrycode_data will be assigned to Congo:

for(i in countrycode_data[1:50,"country.name"]) print(countrycode(i, ORIGIN="country.name", DESTINATION="iso3c"))

country-year

Some coding schemes use different codes for "similar" countries over the years. For example:

Ethiopia is one state with the code 320 until some date; it then splits into Ethiopia (321) and Eritrea (322) for the subsequent years.
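
One way to picture a year-aware lookup is a table with validity periods per code. In the sketch below the codes follow the example above, while the 1993 cut-off year, the table layout, and the `lookup_code()` helper are assumptions made for illustration.

```r
# Hypothetical code table: codes follow the example above, and the
# 1993 cut-off year is an assumption made for this sketch.
code_periods <- data.frame(
  country = c("Ethiopia", "Ethiopia", "Eritrea"),
  code    = c(320, 321, 322),
  start   = c(-Inf, 1993, 1993),
  end     = c(1992, Inf, Inf),
  stringsAsFactors = FALSE
)

lookup_code <- function(country, year, periods) {
  mapply(function(cn, yr) {
    # Keep the row whose validity period contains the requested year.
    hit <- periods$country == cn & periods$start <= yr & periods$end >= yr
    if (any(hit)) periods$code[hit][1] else NA_real_
  }, country, year, USE.NAMES = FALSE)
}

codes <- lookup_code(
  c("Ethiopia", "Ethiopia", "Eritrea", "Eritrea"),
  c(1980, 2000, 1980, 2000),
  code_periods
)
print(codes)
```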

Unmatched values reporting

It would be very useful if you could add a check for unmatched values, for example:

potential.nas<-subset(data.frame(SOURCEVAR,destination.vector), !is.na(SOURCEVAR),"destination.vector",drop=TRUE)
if(any(is.na(potential.nas))) {
  naValues<-subset(data.frame(SOURCEVAR,destination.vector), !is.na(SOURCEVAR)&is.na(destination.vector),"SOURCEVAR",drop=TRUE)
  warning("Some values not matched: ", paste(naValues,collapse=", "),"\n")
}
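
A self-contained variant of this check is sketched below, with a toy lookup standing in for the conversion step. The `report_unmatched()` helper is hypothetical; it returns the unmatched values invisibly so that callers can inspect them without anything being written to the global environment.

```r
# Toy lookup standing in for the conversion step.
toy_lookup <- c(FR = "FRA", DE = "DEU")

report_unmatched <- function(SOURCEVAR, destination.vector) {
  # Unmatched: the input was present but the converted value is missing.
  unmatched <- unique(SOURCEVAR[!is.na(SOURCEVAR) & is.na(destination.vector)])
  if (length(unmatched) > 0) {
    warning("Some values were not matched: ", paste(unmatched, collapse = ", "))
  }
  invisible(unmatched)
}

src  <- c("FR", "XX", NA, "DE", "XX")
dest <- unname(toy_lookup[src])
unmatched <- suppressWarnings(report_unmatched(src, dest))
print(unmatched)
```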

Common name matched more than once

Hi,

When attempting to match the following names, some strings are matched more than once, even though it should be obvious from the full country names that the second match is the more accurate one:

1: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: Dem. Rep. of the Congo,CG,CD

2: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: Hong Kong, China,CN,HK

3: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: Macao, China,CN,MO

4: In countrycode(country, "country.name", "iso2c", warn = TRUE) :
Some strings were matched more than once: South Sudan,SS,SD

Nice work, by the way.