
wikipediatrend's Introduction

Public Subject Attention via Wikipedia Page View Statistics

Status

Build: AppVeyor · Coverage: Codecov

lines of R code: 474, lines of test code: 160

Version

2.1.6 (2020-06-03 12:43:18)

Description

Public subject attention via Wikipedia page view statistics.

License

GPL (>= 2)

Authors

Peter Meissner [aut, cre], R Core Team [cph]

Credits

  • Parts of the package’s code have been shamelessly copied and modified from R base package written by R core team. This concerns the wp_date() generic and its methods and is detailed in the help files.

Citation

citation("wikipediatrend")

Meissner P (2020). wikipediatrend: Public Subject Attention via Wikipedia Page View Statistics. R package version 2.1.6.

BibTeX for citing

toBibtex(citation("wikipediatrend"))

Installation

Stable version from CRAN:

install.packages("wikipediatrend")

Latest development version from Github:

devtools::install_github("petermeissner/wikipediatrend")

Usage

starting up …

library(wikipediatrend)
## 
##   [wikipediatrend]
##     
##   Note:
##     
##     - Data before 2016-01-01 
##       * is provided by petermeissner.de and
##       * was prepared in a project commissioned by the Hertie School of Governance (Prof. Dr. Simon Munzert)
##       * and supported by the Daimler and Benz Foundation.
##     
##     - Data from 2016-01-01 onwards 
##       * is provided by the Wikipedia Foundation
##       * via its pageviews package and API.
## 

getting some data …

trend_data <- 
  wp_trend(
    page = c("Der_Spiegel", "Die_Zeit"), 
    lang = c("de", "en"), 
    from = "2007-01-01",
    to   = Sys.Date()
  )

having a look …

trend_data
##      language article     date       views
## 2    en       die_zeit    2007-12-10    74
## 1    de       der_spiegel 2007-12-10   798
## 4    en       die_zeit    2007-12-11    35
## 3    de       der_spiegel 2007-12-11   710
## 5    de       der_spiegel 2007-12-12   770
## 9114 en       die_zeit    2020-05-31   209
## 9116 en       die_zeit    2020-06-01   174
## 9115 de       der_spiegel 2020-06-01  1498
## 9118 en       die_zeit    2020-06-02   208
## 9117 de       der_spiegel 2020-06-02  1252
## 
## ... 9108 rows of data not shown

having another look …

plot(
  trend_data[trend_data$views < 2500, ]
)
## `geom_smooth()` using formula 'y ~ x'

Usage 2

getting some data …

trend_data <- 
  wp_trend(
    page = 
      c(
        "Climate_crisis", 
        "2019–20_coronavirus_pandemic",
        "Donald_Trump",
        "Syria",
        "Crimea",
        "Influenza"
      ), 
    lang = "en", 
    from = "2007-01-01",
    to   = Sys.Date()
  )
## Warning in wpd_get_exact(page = page, lang = lang, from = from, to = to, : Unable to retrieve data for url:
## http://petermeissner.de:8880/article/exact/en/2019–20_coronavirus_pandemic. Status: error.

having a look …

trend_data
##       language article        date       views  
## 1     en       climate_crisis 2007-12-10       0
## 2     en       crimea         2007-12-10    1051
## 5     en       syria          2007-12-10    3205
## 4     en       influenza      2007-12-10    4153
## 3     en       donald_trump   2007-12-10    5050
## 22723 en       climate_crisis 2020-06-02     103
## 22726 en       influenza      2020-06-02    3437
## 22724 en       crimea         2020-06-02    3681
## 22727 en       syria          2020-06-02    4969
## 22725 en       donald_trump   2020-06-02  916742
## 
## ... 22717 rows of data not shown

having another look …

options(scipen = 1000000)

plot(trend_data) + 
  ggplot2::scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1202 rows containing non-finite values (stat_smooth).
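The two "infinite values" warnings come from zero view counts: log10(0) is -Inf, which ggplot2 then drops as non-finite. A minimal, self-contained sketch of the filtering step, with synthetic data standing in for the wp_trend() result:

```r
# Synthetic stand-in for a wp_trend() result; zero-view days occur in
# the real data too (see the climate_crisis counts above).
trend_data <- data.frame(
  date  = as.Date("2020-06-01") + 0:4,
  views = c(0, 103, 0, 3437, 916742)
)

# log10(0) is -Inf, so drop zero-view rows before a log-scaled plot:
plottable <- trend_data[trend_data$views > 0, ]
stopifnot(all(is.finite(log10(plottable$views))))
```

With the real data, plot(trend_data[trend_data$views > 0, ]) + ggplot2::scale_y_log10() should avoid the warnings in the same way.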

wikipediatrend's People

Contributors

bitdeli-chef, petermeissner, reisfe, simonmunzert


wikipediatrend's Issues

Failed to connect to petermeissner.de port 8880: Connection refused

Hi Peter,

awesome package, I'm glad that wikipediatrend is back. I tried it a few weeks ago and it worked great, except for some weird encoding issues. I wanted to replicate this and post an issue but now I receive the following error message:

trend_data <- 
  wp_trend(
    page = c("Der_Spiegel", "Die_Zeit"), 
    lang = c("de", "en"), 
    from = "2007-01-01",
    to   = Sys.Date()
  )

Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to petermeissner.de port 8880: Connection refused

I'll get back to the encoding issue once this is solved. Thx!

Timeout error, cannot execute any wp_trend command

Hello,

I am not able to get any wp_trend results; the timeout error is always the same. Example:

res <- wp_trend(page=c("Der_Spiegel", "Die_Zeit"), lang=c("de", "en"))

yields

Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: [petermeissner.de:8880] Connection timed out after 10002 milliseconds

I am using R 3.5.1 on macOS 10.15.2, wikipediatrend_2.1.4

Thank you in advance!

unused arguments in wp_trend

I’ve installed the development version (on 03/05/2015) of wikipediatrend, but I couldn’t run the code that worked in the CRAN version:

bkomo <- wp_trend(  page      = "Bronisław Komorowski", 
                  from      = "2015-01-01", 
                  lang      = "pl", 
                  friendly  = T, 
                  userAgent = T)

I was getting the following error:

Error in wp_trend(page = "Bronislaw Komorowski", from = "2015-01-01",  : 
  unused arguments (friendly = T, userAgent = T)

I removed the last two arguments (friendly, userAgent) and it worked, but still with no meaningful results (the non-Latin characters issue filed earlier).

Missing data from stats.grok.se

I just noticed that for the months 2008.01 and 2008.07 stats.grok.se does not return data for several days (1 day in the case of 2008.01 and 21 days for 2008.07). I checked the Wikipedia pagecounts dump files and they exist.
Right now, wikipediatrend keeps re-downloading those months even if the data have been fetched before, because it detects missing days. This is the right behaviour if the problem with stats.grok.se is temporary. If it is permanent, then wikipediatrend should flag this somehow and not download those months again.

"to" date specification

I've been exploring your wikipediatrend R package and have found that it doesn't appear to matter whether I specify a "to" date in the code. Am I doing something incorrectly? When I run the code below it only retrieves data up to some point in January 2016. Are there data available past that date?

For example:

library(wikipediatrend)
library(ggplot2)
library(scales)

cryptowiki <-
  wp_trend(
    page = "Cryptosporidium",
    from = "2015-01-01",
    to   = "2016-09-03"
  )

ggplot(cryptowiki, aes(x = date, y = count, group = page, color = page)) +
  geom_line(size = 1) +
  theme_bw() +
  scale_x_date(breaks = date_breaks("1 month")) +
  theme(axis.text.x = element_text(angle = -90, size = rel(0.7))) +
  labs(x = "Date", y = "Frequency",
       title = "Wikipedia Search Volume for Cryptosporidiosis")

Same issue #17 but with higher versions

Hi Peter,

with the following sessionInfo:

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1

## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
## [4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] rvest_0.3.1          xml2_0.1.2           jsonlite_0.9.19      AnomalyDetection_1.0 wikipediatrend_1.1.7
## [6] Rcpp_0.12.3          nlme_3.1-124         ggplot2_2.0.0       

## loaded via a namespace (and not attached):
##  [1] lattice_0.20-33  digest_0.6.9     bitops_1.0-6     R6_2.1.2         grid_3.2.3       plyr_1.8.3      
##  [7] gtable_0.1.2     magrittr_1.5     scales_0.3.0     httr_1.1.0       stringi_1.0-1    curl_0.9.6      
## [13] labeling_0.3     tools_3.2.3      stringr_1.0.0    RCurl_1.95-4.7   munsell_0.4.3    colorspace_1.2-6

I got the same problem as in issue #17:

Warning message:
In value[[3L]](cond) : [wp_jsons_to_df()]
Could not extract data from server response. Data for one month will be missing.

What could be wrong? No packages were installed from GitHub.

Consistent zero counts for Dec 31 2008

Dear Peter,

thank you very much for providing access to the "older" wiki pageview stats by means of an API. Too bad that Wikipedia itself has not managed to include them in their own API so far.
I am creating an ado-file for Stata right now. While doing so I found some characteristics of the API's responses that I would like to understand better.
As I see in your own examples, there are zero counts for some days even for terms that are heavily requested. I would see this as a nuisance, but some of these zero counts seem to be consistent. One example is the zero count for Dec 31 2008. Since 2008 is a leap year, the entry should be in the 366th position of the page_view_count field of the JSON response. There is an entry there, but it seems to be consistently zero (I checked with Angela_Merkel, Albert_Einstein, Bazooka, and Lothar_Matthäus for various languages (de, fr, en)).

Any insights would be extremely helpful.

Many regards
Ulrich Kohler
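The 366th-position arithmetic above checks out; base R alone confirms that Dec 31 2008 is day 366 (a quick sketch, independent of the API):

```r
# Day-of-year index for Dec 31 in a leap vs. a non-leap year.
doy <- function(d) as.integer(format(as.Date(d), "%j"))
doy("2008-12-31")  # 366 -- 2008 is a leap year
doy("2009-12-31")  # 365
```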

The wikipediatrend package currently has no server providing any page view information...

It appears that this problem is back (see #32) - or perhaps I misunderstood the solution.

I just installed the latest version of wikipediatrend from github

devtools::install_github("petermeissner/wikipediatrend")

And when I try to load the library and use it I get this:

> library(wikipediatrend)
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin18.2.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.5/lib/libopenblasp-r0.3.5.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] wikipediatrend_2.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        rstudioapi_0.9.0  knitr_1.21        magrittr_1.5      usethis_1.4.0     devtools_2.0.1   
 [7] pkgload_1.0.2     R6_2.4.0          hellno_0.0.1      rlang_0.3.1       tools_3.5.3       pkgbuild_1.0.2   
[13] xfun_0.5          sessioninfo_1.1.1 cli_1.0.1         withr_2.1.2       remotes_2.0.2     htmltools_0.3.6  
[19] rprojroot_1.3-2   yaml_2.2.0        assertthat_0.2.0  digest_0.6.18     crayon_1.3.4      bookdown_0.9     
[25] processx_3.2.1    callr_3.1.1       fs_1.2.6          ps_1.3.0          curl_3.3          testthat_2.0.1   
[31] glue_1.3.0        memoise_1.1.0     evaluate_0.13     rmarkdown_1.11    blogdown_0.10     compiler_3.5.3   
[37] backports_1.1.3   desc_1.2.0        prettyunits_1.0.2
> trend_data <- 
+   wp_trend(
+     page = c("Der_Spiegel", "Die_Zeit"), 
+     lang = c("de", "en"), 
+     from = "2007-01-01",
+     to   = Sys.Date()
+   )

The wikipediatrend package currently has no server providing any page view information. 
Use package pageviews for recent (2016+) information. 
Older information hopefully available again soon.

grepl error when querying Russian wikipedia

I found another bug when querying Russian wikipedia.

I used the following code:

bkomoRU <- wp_trend(  page      = "Президент Польши", 
                      from      = "2015-01-01", 
                      lang      = "ru")

and I got the following error:

Error: grepl("\\w", page) is not TRUE

I'm not sure what's the reason for that.

This is an awesome library

Wikimedia's traffic analyst here: thank you so much for writing this!

(No actual issue, but this was the best way I could think of for saying thanks ;p)

stats.grok.se server down

At the moment (maybe forever) the stats.grok.se server which provides the data for the tool is down, so the package will not work.

wp_trend() - restrictive page and lang recycling

While ...

wp_trend(page=c("Der_Spiegel", "Die_Zeit"), lang=c("de", "en")) 
##      language article     date       views
## 2    en       die_zeit    2007-12-10    74
## 1    de       der_spiegel 2007-12-10   798
## 4    en       die_zeit    2007-12-11    35
## 3    de       der_spiegel 2007-12-11   710
## 5    de       der_spiegel 2007-12-12   770
## 9010 en       die_zeit    2020-04-09   387
## 9012 en       die_zeit    2020-04-10   401
## 9011 de       der_spiegel 2020-04-10  1485
## 9014 en       die_zeit    2020-04-11   254
## 9013 de       der_spiegel 2020-04-11  1106
## 
## ... 9004 rows of data not shown

... works correctly, when I try to search for only one page in multiple languages I get an error:

wp_trend("Der_Spiegel", lang = c("de", "en"))
## Error in wp_trend("Der_Spiegel", lang = c("de", "en")) : 
##   length(page) == length(lang) | length(lang) == 1 is not TRUE 

... so it is impossible to compare many languages for a single topic.
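Until the recycling rule is relaxed, one workaround is to repeat the single title yourself so that page and lang have equal length, which satisfies the length(page) == length(lang) check quoted in the error (a sketch of the expansion step only; the wp_trend() call itself still needs the server):

```r
# Expand the single page title to match the number of languages:
langs <- c("de", "en")
pages <- rep("Der_Spiegel", length(langs))
stopifnot(length(pages) == length(langs))
# then: wp_trend(page = pages, lang = langs)
```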

Could not extract data from server response. Data for one month will be missing.

I'm trying to download some info, which usually works great (thx, btw). However, there's a new error message:

http://stats.grok.se/json/fr/201510/Pascale_Bruderer
Warning message:
In value[[3L]](cond) : [wp_jsons_to_df()]
Could not extract data from server response. Data for one month will be missing.

I updated rvest and wikipediatrend, but the error remains...

library(wikipediatrend)
wp_trend(page= "Pascale_Bruderer", 
                     lang= "fr",
                     from= "2015-10-10",  
                     to= "2015-10-10",
                     file="test.csv")

 sessionInfo()
 R version 3.2.2 (2015-08-14)
 Platform: x86_64-redhat-linux-gnu (64-bit)
 Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     

 other attached packages:
 [1] wikipediatrend_1.1.7

 loaded via a namespace (and not attached):
 [1] httr_1.0.0       R6_2.1.1         magrittr_1.5     tools_3.2.2      rstudioapi_0.3.1 RCurl_1.95-4.7  
 [7] curl_0.9.3       stringi_0.5-5    jsonlite_0.9.16  stringr_1.0.0    bitops_1.0-6   

Feature Request :datasource

Could you please add this additional datasource? There is a way to download the CSV directly, so no JSON-to-CSV conversion is required in R.

Error in order(res$date) : argument 1 is not a vector

For some Wikipedia pages I receive the following error message when setting the from argument to a date before 2016-01-01. For instance:

trend_data <- wp_trend(
    page = "Anton_von_Aretin_(Politiker)", 
    lang = c("de"), 
    from = "2007-01-01",
    to   = Sys.Date()
  )
Error in order(res$date) : argument 1 is not a vector

or

trend_data <- 
  wp_trend(
    page = "Hana_Kordová_Marvanová", 
    lang = c("cs"), 
    from = "2007-01-01",
    to   = Sys.Date()
  )
Error in order(res$date) : argument 1 is not a vector

The respective Wikipedia pages exist and carry the same title:
https://de.wikipedia.org/wiki/Anton_von_Aretin_(Politiker)
https://cs.wikipedia.org/wiki/Hana_Kordov%C3%A1_Marvanov%C3%A1

This can be resolved by setting the from argument to 2016-01-01 or later, i.e., using the pageviews API. However, both Wikipedia pages existed before that date.
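Until the pre-2016 backend handles these pages, the resolution described above can be automated with a tryCatch() fallback. fetch below is a stand-in for the real wp_trend() call, so this is a pattern sketch, not package API:

```r
# Try the full range first; on an error such as
# "argument 1 is not a vector", retry from the pageviews era.
fetch_with_fallback <- function(fetch) {
  tryCatch(fetch(from = "2007-01-01"),
           error = function(e) fetch(from = "2016-01-01"))
}

# Stub standing in for wp_trend(page = ..., from = from):
fetch <- function(from) {
  if (from < "2016-01-01") stop("argument 1 is not a vector")
  paste("data from", from)
}

fetch_with_fallback(fetch)  # "data from 2016-01-01"
```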

Total views counts

Can total view counts (sum of all pages) for a language for a certain period be extracted? This is quite important for normalizing the counts of a page through time.
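Once total project views are obtained (for 2016 onward the pageviews package provides project-level counts; the column names below are illustrative assumptions, not package output), the normalization itself is a date-wise join and division:

```r
# Synthetic article-level and project-level daily counts:
article <- data.frame(date = as.Date("2020-06-01") + 0:2,
                      views = c(1498, 1252, 1300))
totals  <- data.frame(date = as.Date("2020-06-01") + 0:2,
                      total_views = c(2.1e8, 2.0e8, 2.2e8))

# Share of all page views that went to this article on each day:
merged <- merge(article, totals, by = "date")
merged$share <- merged$views / merged$total_views
```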

Wikipedia trend and jump in counts in January 2016

Dear all,

I am a postdoc using the "wikipediatrend" package to conduct time series analysis of Wikipedia views. I noticed that data extracted with this package show a jump in January 2016, at least when the observations come from it.wikipedia.org. This jump is widespread across all the pages from it.wikipedia.org.
An example with three different searches ("alien species", "beer", "democracy"):

[figure: page view time series for the three pages, showing the jump in January 2016]

I do not understand why this change occurs. Maybe something changed in the way Wikipedia recorded visits to the various pages in 2016? For example, maybe it did not record views from mobile until December 2015.

Kind regards,

Jacopo Cerri

wp_linked_pages generates an internal error if there are no linked pages

Hi Peter,

Really enjoying the package and the ability to access the historical page views.

I found that if I use wp_linked_pages to find other-language pages on a page which doesn't have any, an internal error is generated.

e.g.
wikipediatrend::wp_linked_pages("Sheerness_Lifeboat_Station", lang="en")

The error message is:

Error in `[<-.data.frame`(`*tmp*`, lang_df$lang == "x-default", , value = c("Sheerness_Lifeboat_Station",  : 
  replacement has 2 rows, data has 0

I thought it would perhaps be useful to know, in case you want to adapt it to return a more informative warning message.

Thanks

Alan

non-Latin characters are changed in wp_trend

There is an issue when wp_trend is dealing with non-Latin characters.

I tried querying Wikipedia using the following code:

bkomo <- wp_trend(  page      = "Bronisław Komorowski", 
                  from      = "2015-01-01", 
                  lang      = "pl", 
                  friendly  = T, 
                  userAgent = T)

wp_trend seems to struggle with ‘ł’ and turns it into ‘l’, which brings back values for the wrong page.
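One thing worth testing (an assumption, not a confirmed fix) is whether passing a percent-encoded title sidesteps the transliteration; base R's utils::URLencode() preserves ‘ł’ as its UTF-8 bytes rather than dropping the diacritic:

```r
# Percent-encoding keeps the non-Latin character intact instead of
# transliterating it (shown for a UTF-8 locale):
title   <- "Bronis\u0142aw Komorowski"
encoded <- utils::URLencode(title, reserved = TRUE)
# encoded now contains percent-escapes where the original had "ł"
stopifnot(!identical(encoded, title))
```

Whether wp_trend() accepts a pre-encoded title without re-encoding it is an open question; if it does not, this at least isolates where the character is lost.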

Wikipedia Page Views Error

library(wikipediatrend)
data <- wp_trend(page = 'cera_care',
                  from = '2017-01-01',
                  to = Sys.Date())

Warning in value[3L] :
Unable to retrieve data via {pageviews}.
Error: The date(s) you used are valid, but we either do not have data for those date(s), or the project you asked for is not loaded yet. Please check https://wikimedia.org/api/rest_v1/?doc for more information.. Params: project = 'en', article='Cera_care', start='2017010100', end='2020041800', user_type='all', platform='all'.
Error in order(res$date, res$language, res$article) :
argument 1 is not a vector
