referer-parser's People

Contributors

alexanderdean, benfradet, blazy2k9, danm, donspaulding, emilssolmanis, eyepulp, fblundun, jethron, jhirbour, jobartim44, kaibinhuang, kingo55, lstrojny, mkatrenik, mleuthold, ramin, raulgenially, rgraff, rzats, saj1th, shuttie, silviucpp, swijnands, tiborb, tombar, tsileo, ukutaht, yoloseem

referer-parser's Issues

Bug in Python module

I got the error:

from referer_parser import Referer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\referer_parser\__init__.py", line 32, in <module>
    REFERERS = load_referers(JSON_FILE)
  File "C:\Python27\lib\site-packages\referer_parser\__init__.py", line 21, in load_referers
    params = list(map(text_type.lower, config['parameters']))
TypeError: descriptor 'lower' requires a 'unicode' object but received a 'str'

I fixed it temporarily by modifying the load_referers function in __init__.py:

if 'parameters' in config:
    p = config['parameters']
    if text_type == unicode:
        # decode each parameter individually; calling unicode() on the
        # list itself would just stringify the whole list
        p = [unicode(x) for x in p]
    params = list(map(text_type.lower, p))

Python 2.7, OS: Windows 7

Java referer-parser doesn't work on Hadoop

Unfortunately the bump to httpclient 4.3.3 has broken referer-parser on Hadoop.

Specifically it's this line of code:

81b88ff#diff-729b6a9a457c5e4a0b244bf130a1e08eR192

This is the error on Hadoop:

Caused by: java.lang.NoSuchMethodError: org.apache.http.client.utils.URLEncodedUtils.parse(Ljava/lang/String;Ljava/nio/charset/Charset;)Ljava/util/List;
    at com.snowplowanalytics.refererparser.Parser.extractSearchTerm(Parser.java:205)
    at com.snowplowanalytics.refererparser.Parser.parse(Parser.java:154)
    at com.snowplowanalytics.refererparser.Parser.parse(Parser.java:116)
    at com.snowplowanalytics.refererparser.scala.Parser$.parse(Parser.scala:153)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.RefererParserEnrichment.extractRefererDetails(RefererParserEnrichment.scala:107)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2$$anonfun$apply$5.apply(EnrichmentManager.scala:291)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2$$anonfun$apply$5.apply(EnrichmentManager.scala:277)
    at scala.Option.foreach(Option.scala:236)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2.apply(EnrichmentManager.scala:277)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2.apply(EnrichmentManager.scala:277)
    at scalaz.Validation$class.foreach(Validation.scala:126)
    at scalaz.Success.foreach(Validation.scala:329)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$.enrichEvent(EnrichmentManager.scala:277)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1$$anonfun$apply$2.apply(EtlJob.scala:70)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1$$anonfun$apply$2.apply(EtlJob.scala:70)
    at scalaz.std.OptionFunctions$class.cata(Option.scala:157)
    at scalaz.std.option$.cata(Option.scala:209)
    at scalaz.syntax.std.OptionOps$class.cata(OptionOps.scala:9)
    at scalaz.syntax.std.ToOptionOps$$anon$1.cata(OptionOps.scala:103)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1.apply(EtlJob.scala:70)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1.apply(EtlJob.scala:70)
    at scalaz.Validation$class.flatMap(Validation.scala:141)
    at scalaz.Success.flatMap(Validation.scala:329)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$.toCanonicalOutput(EtlJob.scala:69)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$7.apply(EtlJob.scala:170)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$7.apply(EtlJob.scala:169)
    at com.twitter.scalding.MapFunction.operate(Operations.scala:58)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
    ... 11 more

Hadoop bundles an old version of httpclient which doesn't have parse(String, Charset). There is talk of Hadoop removing that dependency, but in any case both we and EMR use an oldish version of Hadoop.

The problem is not using 4.3.3 per se, but using parse(String, Charset). /cc @squeed

Re-factor Ruby library

The current architecture is very OO and convoluted.

Move towards an architecture similar to the Scala version: a Parser module which performs one-time instantiation, and then a static parse() method which returns a Referer object for a given URL.

@Tombar has moved it in the right direction with a public parse() method (Tombar@a280d9e).
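The proposed shape, sketched here in Python purely for illustration (the `Referer` fields and the `_REFERERS` table are hypothetical, not the library's actual API):

```python
from collections import namedtuple
from urllib.parse import urlparse

# Illustrative result object; the real library's fields differ.
Referer = namedtuple("Referer", ["known", "source", "host"])

# One-time instantiation: in the real library this table would be
# loaded from referers.yml / referers.json at module load time.
_REFERERS = {
    "google.com": "Google",
    "www.google.com": "Google",
}

def parse(url):
    """Static entry point: URL in, Referer out."""
    host = urlparse(url).netloc
    source = _REFERERS.get(host)
    return Referer(known=source is not None, source=source, host=host)
```

The point of the shape is that the database is loaded exactly once at module load, while parse() itself stays cheap and stateless.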

Create explicit tests to express the recursive check logic

Basically, the v1 library only looked for exact matches, so even www.google.com vs google.com would fail to match.

That approach is actually flawed:

  1. Social networks often let users have their own subdomain, and you obviously can't list all of them
  2. Yahoo! puts its search engine on load-balanced subdomains, and that's an unpredictable list

So this version instead tries multiple lookups:

  • It tries the host, then the host + path, then the host + one-level path
  • Then strips off a subdomain and tries again

We should write some explicit tests into the Specs2 test suite to check that this recursion logic works correctly.

/cc @donspaulding

Update the PHP library with new internal domain tests

referer_tests.json now has two new tests for custom internal domains functionality:
https://github.com/snowplow/referer-parser/blob/feature/json-tests/resources/referer-tests.json#L235-L248

The Referer-Parser is configured with a list of domains which should be counted as internal:
https://github.com/snowplow/referer-parser/blob/feature/json-tests/java-scala/src/test/scala/com/snowplowanalytics/refererparser/scala/JsonParseTest.scala#L41

The PHP version of the Referer-Parser already has support for internal hosts, so it should be possible to get it working with the new tests.

When done, please update sync_data.py to automatically copy the master copy of referer-tests.json into the PHP subfolder: https://github.com/snowplow/referer-parser/blob/23e3fd9f3bfaa8947fcb456ed8fbdb22f271dabc/sync_data.py#L58

phar version?

Can someone with time create a phar version, or open a pull request adding one using box? This is above my head, but I cannot use the library without a phar... I would be extremely grateful if anyone has the time.

http://box-project.org/

Make JSON the standard format loaded by the libraries

Two reasons for this:

  1. JSON seems faster to load than YAML in most languages (see tobie/ua-parser#117)
  2. More languages have JSON handling built-in than YAML (e.g. Python)

This would involve updating the Java/Scala and Ruby ports, and making @donspaulding's build_json.py a standard part of the distribution for all ports.

Important note: we would still keep the master copy of the database in YAML for readability/editability purposes.
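For instance, a port in a language with built-in JSON support could then load the database with nothing but the standard library. A sketch, using a deliberately simplified layout (the real referers database is richer):

```python
import json

# Simplified, assumed layout of a generated referers.json:
# medium -> source name -> {domains, parameters}
doc = json.loads(
    '{"search": {"Google": {"domains": ["google.com"], "parameters": ["q"]}}}'
)

# Invert into a flat domain -> source lookup table.
referers = {
    domain: name
    for name, cfg in doc["search"].items()
    for domain in cfg["domains"]
}
print(referers)  # {'google.com': 'Google'}
```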

Can't use composer - possibility for a phar file?

Came across this tonight. I am interested in the PHP port for parsing domains and search terms out of URL logs. The problem is I cannot use composer on my current server (a WHM/cPanel setup)... I came across this situation once before; however, that project also offered a phar file with everything bundled, which I could use without problems and which eliminated the need for composer.

Any chance of this happening?

Strange Yahoo search data

I'm seeing some strange Yahoo data showing up as "search".

A client ran a homepage takeover on Yahoo a few weeks back which sent a lot of traffic from the Yahoo homepage, with hostname "au.yahoo.com".

I know this isn't search traffic, so when I queried it in Snowplow, this hostname had no real keywords.

In contrast, the hostname "au.search.yahoo.com" had quite a few search terms.

Is this a case of not provided?

Query string not being extracted from below

Potentially because the &url= isn't being escaped? (Why isn't it being escaped?)

sa=t&rct=j&q=g+star&source=web&cd=3&ved=0CEEQFjAA&url=http://www.gstars.co.uk/?ito=GAG5362963510&itc=GAC19854885430&itkw=g-stars&itawnw=search&ei=8eMQUt_hAvTSpgLjqQE&usg=AFQjCNFFNpW7yF9pcqCfOpYvqafYS94p_Q
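The failure mode can be reproduced with Python's standard query-string parser (a neutral illustration, not the library's actual code path): because the embedded URL is not percent-encoded, splitting on & truncates it and its own parameters leak out.

```python
from urllib.parse import parse_qs

qs = ("sa=t&rct=j&q=g+star&source=web&cd=3&ved=0CEEQFjAA"
      "&url=http://www.gstars.co.uk/?ito=GAG5362963510&itc=GAC19854885430"
      "&itkw=g-stars&itawnw=search&ei=8eMQUt_hAvTSpgLjqQE"
      "&usg=AFQjCNFFNpW7yF9pcqCfOpYvqafYS94p_Q")

params = parse_qs(qs)
# The embedded URL is cut off at the first unescaped '&' ...
print(params["url"])    # ['http://www.gstars.co.uk/?ito=GAG5362963510']
# ... and its parameters become top-level keys of the outer query string.
print("itc" in params)  # True
```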

Java & Scala: make tests JSON-driven

referer-parser could take a cue from ua-parser by adding a YAML (or JSON) file of test cases consisting of example referer URLs for each of the referers in search.yml, together with the corresponding parse results.

This would greatly increase the confidence level in the consistency of future ports to other programming languages.

Add tracking of keyword ranks

Suggestion from Peter O'Neill based on the blog post A new method to track keyword ranking using Google Analytics.

On occasion, Google search exposes the position of the keyword that drove the click to your website in the page_referrer as a cd= parameter. We should extract this from the referrer_url so that it can be stored in SnowPlow, and used to track:

  1. The average position of keywords over periods of time. (Is search engine rank for particular terms getting better or worse?)
  2. Grouping keywords together around, e.g., specific products and categories, and creating a performance index for those buckets. (As per the blog post.)

In order to implement this, we'll need to:

  • Extend the search engines YAML so it includes not just the query string parameter to identify the keywords, but also the parameter to identify the location of the search result. (Where this is available.)

Then in SnowPlow we'll need an additional mkt_xxx field to store the result, e.g. mkt_rank.
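A minimal sketch of the extraction step (the function name and the None fallback are assumptions, not the agreed design):

```python
from urllib.parse import urlparse, parse_qs

def extract_rank(referer_url):
    """Return the keyword rank from a Google referer's cd= parameter,
    or None when Google doesn't expose it."""
    qs = parse_qs(urlparse(referer_url).query)
    values = qs.get("cd")
    if values and values[0].isdigit():
        return int(values[0])
    return None

# e.g. a Google referer exposing position 7 for the clicked result
print(extract_rank("http://www.google.com/url?sa=t&q=snowplow&cd=7"))  # 7
```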

Java: ParserTest failed

org.junit.ComparisonFailure: Internal subdomain HTTP medium
Expected: internal
Actual:   unknown

at org.junit.Assert.assertEquals(Assert.java:125)
at com.snowplowanalytics.refererparser.ParserTest.refererTests(ParserTest.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

Update npm module.

Hi guys. Would you be so kind as to update the npm module? It's a bit outdated ('0.0.2': '2013-08-16T20:29:36.351Z').

Should return better result

Hi, all, please see this referer:
http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images

I've added the parameters as_q, as_epq and as_eq to my local referers.yml
and ran:

irb(main):002:0> require 'referer-parser'
=> true
irb(main):003:0> ref = "http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
=> "http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
irb(main):004:0> st = RefererParser::Referer.new(ref, 'referers.yml')
=> #<RefererParser::Referer:0x25feb227 @search_term="", @known=true, @referer="Google", @uri=#<URI::HTTP:0x746231ed URL:http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images>, @search_parameter="as_q">
irb(main):005:0> st.search_term
=> ""

I think "carbonite offer code" would be the better result.
https://github.com/snowplow/referer-parser/blob/master/ruby/lib/referer-parser/referer.rb#L85
If the referer contains two or more known parameters, I would prefer to return the first search term that is not nil or empty, rather than the value of whichever parameter matches first.
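A sketch of the suggested selection logic, in Python for illustration (the PARAMETERS list and the function name are hypothetical):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical parameter list for Google, including the advanced-search ones
PARAMETERS = ["q", "as_q", "as_epq", "as_eq"]

def search_term(referer_url):
    """Return the first non-empty value across all known parameters,
    rather than the (possibly empty) value of the first matching one."""
    qs = parse_qs(urlparse(referer_url).query)
    for param in PARAMETERS:
        for value in qs.get(param, []):
            if value.strip():
                return value
    return None

ref = ("http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code"
       "&as_oq=&as_eq=&num=100")
print(search_term(ref))  # carbonite offer code
```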

Add support for search engines that use subdomains for LB

Some search engines operate load balancing etc. on subdomains, leading to referers which can't be found in search.yml. For example, a Yahoo! referer URL might be "us.yhs4.search.yahoo.com"

The most performant way of supporting this is probably this algo:

  1. Lookup the full domain in search.yml. Found? Finish
  2. Not found? Strip off first sub-portion ("us."). Lookup. Found? Finish
  3. Not found? Strip off next sub-portion ("yhs4"). Lookup. Found? Finish
  4. Continue till found or no parts left!

This should be a lot faster than switching to a regexp-based approach.
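The four steps above can be sketched as follows (a minimal illustration; the real database and return type differ):

```python
def lookup_referer(host, referers):
    """Look up host in the referer database, stripping leading subdomain
    labels one at a time until a match is found or no labels remain."""
    labels = host.split(".")
    while labels:
        candidate = ".".join(labels)
        if candidate in referers:
            return referers[candidate]
        labels.pop(0)  # strip "us.", then "yhs4.", ...
    return None

referers = {"search.yahoo.com": "Yahoo! Search"}
print(lookup_referer("us.yhs4.search.yahoo.com", referers))  # Yahoo! Search
```

Each iteration is a single hash lookup, which is why this beats scanning a list of regexps.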

Add test for domain + path search engine

A test to ensure that a google.com/product referer returns "Google Product Search", not "Google".

/cc @donspaulding as I wasn't sure that this was handled in the Python version.

The relevant line in the Ruby is:

https://github.com/snowplow/referer-parser/blob/master/ruby/lib/referer-parser/referers.rb#L28

The relevant line in the Java is:

https://github.com/snowplow/referer-parser/blob/master/java-scala/src/main/java/com/snowplowanalytics/refererparser/Parser.java#L78
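For reference, the lookup order the test should exercise can be sketched in Python (the database entries here are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical database entries: the path-specific entry must win
REFERERS = {
    "google.com/product": "Google Product Search",
    "google.com": "Google",
}

def lookup(url):
    """Try host + full path, then host + first path level, then the bare
    host, so that the most specific entry wins."""
    u = urlparse(url)
    host, path = u.netloc, u.path
    segments = [s for s in path.split("/") if s]
    first_level = ("/" + segments[0]) if segments else ""
    for candidate in (host + path, host + first_level, host):
        if candidate in REFERERS:
            return REFERERS[candidate]
    return None

print(lookup("http://google.com/product?q=gopro"))  # Google Product Search
print(lookup("http://google.com/search?q=gopro"))   # Google
```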
